Here we are going to discuss the second and third types of gradient descent which are known as Stochastic Gradient Descent and mini-batch gradient descent.

**Stochastic gradient descent**

The word 'stochastic' refers to a system or process governed by random probability. Hence, in stochastic gradient descent, a few samples are selected randomly for each iteration instead of the whole dataset. In gradient descent there is a term called 'batch', which denotes the total number of samples from the dataset used to calculate the gradient in each iteration. In typical gradient descent optimization, such as batch gradient descent, the batch is the whole dataset. Using the whole dataset is helpful for reaching the minimum in a less noisy, less random manner, but it becomes a problem when the dataset gets really huge.
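As a minimal sketch of the batch variant described above, the following uses a least-squares linear-regression loss (an assumption; the article does not fix a specific loss). Note that every iteration computes the gradient over all samples before taking a single step:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_iters=500):
    """Batch gradient descent: the 'batch' is the entire dataset."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # Gradient of mean squared error, computed over all n samples at once
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

# Synthetic, noiseless data so the recovered weights match the true ones
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([3.0, -1.5])
y = X @ true_w
w = batch_gradient_descent(X, y)
```

Because the loss here is convex, repeated full-dataset steps converge smoothly to the single global minimum.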

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable).
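A sketch of the stochastic variant, under the same illustrative least-squares setup as before: instead of the full dataset, each update uses a single randomly chosen sample, so the parameters move after every row.

```python
import numpy as np

def sgd(X, y, lr=0.01, n_epochs=200, seed=0):
    """Stochastic gradient descent: one randomly ordered sample per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            # Gradient of the squared error on a single sample
            grad = 2.0 * X[i] * (X[i] @ w - y[i])
            w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([3.0, -1.5])
y = X @ true_w
w = sgd(X, y)
```

The per-sample updates are noisy, which is exactly the randomness the article refers to; on this noiseless toy data the noise averages out and the estimate still lands near the true weights.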

**Convex Loss Function**

Figure 1: a loss function with a single global minimum. Figure 2: a loss function with more than one minimum.

Figure 1 shows a loss function with exactly one global minimum, while figure 2 shows one with more than one minimum. As we discuss later, batch gradient descent uses the whole dataset at the same time, whereas stochastic gradient descent processes each row individually, one by one. We use stochastic gradient descent when the loss function is not convex, that is, when it has more than one minimum, because a convex process can only settle on a single value, while the noisy one-row-at-a-time updates of stochastic gradient descent suit a landscape with multiple minima.
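To make the multiple-minima situation concrete, here is a hypothetical one-dimensional non-convex loss, f(w) = (w² − 1)², with two global minima at w = −1 and w = +1 (the function and starting points are illustrative, not from the article). Plain gradient descent simply converges to whichever minimum is nearest its starting point:

```python
def descend(w, lr=0.05, n_iters=200):
    """Gradient descent on the non-convex loss f(w) = (w**2 - 1)**2."""
    for _ in range(n_iters):
        grad = 4.0 * w * (w**2 - 1.0)  # derivative of (w^2 - 1)^2
        w -= lr * grad
    return w

left = descend(-0.5)   # starts left of zero, ends near w = -1
right = descend(0.5)   # starts right of zero, ends near w = +1
```

With more than one minimum, which solution you reach depends on initialization; this is the kind of landscape where the randomness of stochastic updates is useful.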

**Mini Batch Gradient Descent**

It is the third type of gradient descent. Mini-batch gradient descent is a variation of the gradient descent algorithm. It splits the training dataset into small batches that are used to calculate model error and update model coefficients.

Implementations may choose to sum the gradient over the mini-batch, which further reduces the variance of the gradient. Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.
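A sketch of the mini-batch variant, again on an illustrative least-squares problem: the data is shuffled each epoch and consumed in small batches, with the gradient averaged over each batch before the update.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=16, n_epochs=100, seed=0):
    """Mini-batch gradient descent: shuffle, then update once per small batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Average the gradient over the mini-batch
            grad = (2.0 / len(batch)) * Xb.T @ (Xb @ w - yb)
            w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([3.0, -1.5])
y = X @ true_w
w = minibatch_gd(X, y)
```

Each update is noisier than a full-batch step but far less noisy than a single-sample step, which is the balance the text describes.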

Suppose we have a huge amount of data, millions or billions of samples, and we have to process it. If we use batch gradient descent, the whole dataset must be loaded at once during the training stage and the error reduced over all of it, so training takes more time and consumes more computational power.

When we try to solve such a huge dataset using SGD, we first check whether the problem is convex or not, that is, whether it has more than one global minimum. If it has more than one minimum, we use the SGD method; otherwise, we use the batch gradient descent method.

If we use mini-batch gradient descent to solve a huge dataset, we first divide the dataset into small chunks, which makes it easier to compute the error and update the coefficient values. Then each chunk is processed with the same working procedure that we used for batch gradient descent.