We use backpropagation to reduce the cost function, which measures the model's error. Now we will discuss how the cost function can be reduced by using gradient descent.
Gradient descent is an optimization technique. It is used to train deep learning and other neural-network-based models by minimizing the cost function.
What is Bias?
Bias is like the intercept added to a linear equation. It is an additional parameter in the neural network that adjusts the output along with the weighted sum of the inputs to the neuron. Moreover, the bias value allows you to shift the activation function to the right or left.
Output = Sum (weights * Inputs) + bias
The output is calculated by multiplying the inputs by their weights, adding the bias, and then passing the result through an activation function such as the sigmoid function. Here, bias acts as a constant that helps the model fit the given data. The steepness of the sigmoid depends on the weights of the inputs.
A simpler way to understand bias is through the constant c of a linear function:
y = mx + c
Here is the formula for the weighted sum, which we derive as:
(W1*X1) + (W2*X2) + ... + (Wn*Xn) + b
Each input from X1 to Xn is multiplied by its weight, the products are added together, and the constant bias value b is added to the total. The resulting value is then passed to the activation function.
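The weighted sum and bias described above can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `neuron_output` and the sample inputs and weights are made up for this example, and the sigmoid is assumed as the activation function.

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    # Weighted sum: (W1*X1) + (W2*X2) + ... + (Wn*Xn) + b
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Pass the weighted sum through the activation function.
    return sigmoid(weighted_sum)

# Hypothetical sample values, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]

print(neuron_output(inputs, weights, bias=0.0))
print(neuron_output(inputs, weights, bias=1.0))  # a larger bias shifts the output upward
```

With the same inputs and weights, changing only the bias changes the output, which is the shifting effect mentioned earlier.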
Stochastic vs Batch Gradient Descent
What is Gradient Descent (GD)?
Gradient descent is a process that is applied during the backpropagation phase. In that phase, the goal is to repeatedly compute the gradient of the cost function J(W) with respect to each weight w and move the weights in the opposite direction of that gradient, updating consistently until we reach a minimum of J(W), ideally the global minimum.
Gradient descent is an iterative algorithm that searches for the combination of weights with the minimum error. Instead of trying every possible combination, it uses the slope of the cost function to decide how to adjust the weights at each step.
It is applied to all the weights of the network together, with the aim of finding the set of weights that gives the lowest overall error.
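A minimal sketch of this update loop is shown below. It assumes a made-up one-weight cost function J(w) = (w - 3)^2, whose minimum is at w = 3. The names `cost`, `gradient`, `gradient_descent`, and `learning_rate`, as well as the chosen values, are illustrative assumptions rather than anything from a specific library.

```python
def cost(w):
    # Hypothetical cost function J(w) = (w - 3)^2, minimal at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of J(w) with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w_start, learning_rate=0.1, steps=100):
    w = w_start
    for _ in range(steps):
        # Move the weight a small step against the gradient.
        w -= learning_rate * gradient(w)
    return w

w_final = gradient_descent(w_start=-5.0)
print(w_final, cost(w_final))
```

Each step moves the weight downhill by an amount proportional to the slope, so the steps get smaller as the weight approaches the minimum.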
Let’s assume a real-life scenario
As we can see in the above image, ŷ (y-hat) is the predicted value, and the cost function uses it to measure the loss. The lines on the Y-axis show all the values of the neural network. Suppose we want to apply a brute-force algorithm to this problem, and let's assume we have 25 synapses.
If we apply brute force to 25 synapses, we have to try every combination of weight values. With a thousand candidate values for each synapse, there are 1000^25 = 10^75 combinations, which would take far too long to evaluate even on a high-end computer. That is why we use the GD technique to find the minimum much more quickly.
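The size of the brute-force search can be checked with a short calculation. The sketch below assumes 1,000 candidate values per synapse, which is the reading that yields 10^75 for 25 synapses. The evaluation speed is a purely hypothetical figure chosen to show the scale, not a measurement of any real machine.

```python
candidates_per_weight = 1000   # assumed number of values tried per synapse
num_weights = 25               # synapses in the example

combinations = candidates_per_weight ** num_weights
print(combinations == 10 ** 75)

# Hypothetical machine that evaluates 10^15 weight combinations per second.
evaluations_per_second = 10 ** 15
seconds_per_year = 60 * 60 * 24 * 365

years_needed = combinations // (evaluations_per_second * seconds_per_year)
print(years_needed)
```

Even under this generous assumption about speed, the number of years is astronomically large, which is why an exhaustive search is not an option.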
Gradient Descent in 3D and 2D Graph
Gradient descent can be visualized on a 3D graph as well as on a 2D graph. In the 3D graph, we start at a peak of the cost surface and try to find the best and shortest way down to the lowest point, the global minimum. The 2D graph shows the same idea on a curve.
Types of Gradient Descent
There are three types of gradient descent:
- Gradient Descent (Batch Gradient Descent)
- Stochastic Gradient Descent
- Mini Batch Gradient Descent
Stochastic Gradient Descent
The word ‘stochastic’ describes a system or process that involves random probability. Hence, in stochastic gradient descent, a single randomly selected sample (or, in the mini-batch variant, a few samples) is used for each iteration instead of the whole data set.
In gradient descent, the term “batch” denotes the number of samples from the dataset that are used to calculate the gradient in each iteration.
In typical gradient descent optimization, like batch gradient descent, the batch is taken to be the whole dataset. Although using the whole dataset is useful for getting to the minimum less noisily or randomly, the problem arises when our dataset gets huge.
When the data set is huge, computing the gradient over all of it for every update takes a long time. In this situation we use SGD, which goes through the data one sample at a time and updates the weights after each sample.
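The three variants differ only in how many samples are used per weight update, which the following sketch tries to show. It assumes a tiny noise-free toy data set generated from y = 2x + 1 and a one-weight linear model. The function name `train`, the `batch_size` parameter, and all the numeric settings are hypothetical choices for illustration.

```python
import random

def train(data, batch_size, learning_rate=0.3, epochs=1000, seed=0):
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)       # visit the samples in random order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over the current batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Toy data from y = 2x + 1 (an assumption made for this example).
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(10)]

print("batch     :", train(data, batch_size=len(data)))  # whole data set per update
print("mini-batch:", train(data, batch_size=5))          # a few samples per update
print("stochastic:", train(data, batch_size=1))          # one sample per update
```

On this small, noise-free example all three settings end up close to w = 2 and b = 1. The difference is that the batch version makes one update per pass over the data, while the stochastic version makes one update per sample.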
Graphical Example of SGD and Global Minimum
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable).
The left-hand graph in the above image shows a loss curve that always has exactly one minimum, the global minimum. The right-hand graph in the above image shows a loss curve that has more than one minimum.
A loss function with a single minimum, like the one on the left, is called a convex loss function. With a convex loss function, gradient descent ends up at the result we want, which is the global minimum.
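The difference between the two kinds of curve can be illustrated with plain gradient descent started from two different points. The sketch below uses two made-up functions: J(w) = w^2 as a convex example and J(w) = w^4 - 3w^2 + w as a non-convex example with two minima. The function names and settings are assumptions for this illustration only.

```python
def descend(grad, w_start, learning_rate=0.01, steps=1000):
    # Plain gradient descent on a single weight.
    w = w_start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

def convex_grad(w):
    # Derivative of the convex example J(w) = w**2.
    return 2.0 * w

def nonconvex_grad(w):
    # Derivative of the non-convex example J(w) = w**4 - 3*w**2 + w.
    return 4.0 * w ** 3 - 6.0 * w + 1.0

for start in (-2.0, 2.0):
    print("start", start,
          "convex ->", descend(convex_grad, start),
          "non-convex ->", descend(nonconvex_grad, start))
```

On the convex function both starting points lead to the same minimum. On the non-convex function the two runs settle in different minima, so the result depends on where the descent starts.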
The above image shows the working of stochastic gradient descent compared with the batch approach. Let’s assume that we have a data set, and the whole data set is called a batch. In batch gradient descent, the whole batch goes through the training stage at the same time; the cost function is computed, the error is backpropagated, the weights are adjusted, and then the next training pass begins.
Stochastic gradient descent trains on the samples one by one, not on the data set as a whole. Because the weights are adjusted after every sample, the loss tends to be reduced a little at each step.
When the loss stops improving, ideally at the global minimum, training stops and the final weights are the result. In this way, we reduce the loss function efficiently.
Batch and Stochastic
As we can see in the above image, batch gradient descent uses the data set as a whole, whereas stochastic gradient descent processes the data one sample at a time. We use stochastic gradient descent when the loss function is not convex.
This means the loss function does not have just one minimum; it has one or more local minima in addition to the global minimum. The random fluctuations of the stochastic method make it more likely to escape a local minimum and move toward the global minimum.