
What Are the Types of Gradient Descent? A Look at Batch, Stochastic, and Mini-Batch

Training modern machine learning models requires navigating an enormous landscape of parameters, data points, and computations.

 

While gradient descent gives us a powerful framework for minimizing loss, the straightforward version of the algorithm hides a massive practical problem: the cost of computing each update step. When datasets contain billions of examples and models include millions of features, even a single pass through all data becomes prohibitively expensive. To understand why—and how practitioners work around this limitation—we can look more closely at the update rule itself and explore tools like contour plots that help us visualize what gradient descent is really doing under the hood.

 

The Issue

We actually have a major issue with the update formula we came up with for gradient descent:

 

Gradient descent formula
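In its standard form (the notation here may differ slightly from the figure; θ_j is a model parameter, α is the learning rate, and the sum runs over all m data points), the update rule reads:

\[
\theta_j \leftarrow \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}
\]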

 

Recall that the summation runs over all m data points. When we try to actually carry out this computation, we run into a significant practical challenge. Let’s examine the computational cost to understand why this is a serious problem.

 

Our update rule requires us to sum over all data points m in our dataset. In modern machine learning applications, m can easily reach a billion data points. For each of these billion points, we need to do the following:

  1. Compute the prediction (involving all features).
  2. Compute the error between the prediction and the true target value.
  3. Multiply this error by each feature value.

With a million features (actually below average in modern machine learning), each prediction involves a million multiplications. Multiply this by our billion data points, and we’re looking at 10¹⁵ operations—just for a single update step. Because we carry out each update using the whole batch of data, we call this batch gradient descent. Let’s explain this in a little more detail and, in the process, also introduce an important mathematical and visualization tool—the contour plot.
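Before we get to that, here is a minimal NumPy sketch that makes the cost concrete (the variable names and toy sizes are illustrative assumptions, not code from the book):

```python
import numpy as np

m, n = 100_000, 1_000        # toy sizes; imagine m = 1e9 and n = 1e6 in practice
X = np.random.rand(m, n)     # feature matrix: one row per data point
y = np.random.rand(m)        # target values
theta = np.zeros(n)          # model parameters
alpha = 0.01                 # learning rate

# A single batch gradient descent step: every data point participates.
predictions = X @ theta              # m predictions, each needing n multiplications
errors = predictions - y             # m prediction errors
gradient = (X.T @ errors) / m        # sum of error * feature value over all m points
theta -= alpha * gradient            # one parameter update, costing O(m * n) work
```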

 

Contour Plots for Visualization of Gradient Descent

While the 3D mesh plot in the figure below effectively shows how loss varies with different parameter combinations, it has several practical drawbacks.

 

Mesh Plot with Two Model Parameters

 

The surface itself becomes an obstacle to visualization—peaks can hide valleys behind them, making it impossible to see certain combinations of parameter values. If we want to track a specific point’s trajectory during gradient descent, it might disappear behind a hill or beneath a valley. Even worse, when we want to compare multiple parameter combinations, some points might be completely hidden from view regardless of how we rotate the plot. This problem becomes particularly acute in regions of high curvature, precisely where we most need to understand the landscape’s behavior. We can solve this visualization challenge by transforming our 3D plot into a more readable 2D representation.

 

The key insight is that we can flatten our 3D landscape while preserving all of its critical information. Instead of showing height directly, we can project our surface onto the ground plane and use contour lines to represent points of equal loss value. This is called a contour plot—an example of which can be seen in the figure below. Each contour line connects all points that share the same loss value, similar to elevation lines on a topographical map. (Ignore the lines with arrows for now. We’ll come back to these in a minute.)

 

Contour Plot Mapping the 3D Landscape to the 2D Plane

 

This flattened representation offers several advantages:

  1. We can clearly see all parameter combinations without any points being hidden behind the surface.
  2. The spacing between contour lines tells us about the steepness of our loss landscape.
  3. We can easily trace paths between different parameter combinations and understand how the loss changes along these paths.

Each contour line effectively marks a level set—all points where our loss function equals a specific value. By labeling these lines with their corresponding loss values, we maintain all the essential information from our 3D plot while making it much easier to analyze specific regions of our parameter space.

 

Working with Contour Plots

It’s a great idea to find some contour plots and get used to what they look like. In addition, try creating them in the tool of your choice. Matplotlib has great examples of contour plots in its documentation. You can see some of them here: http://s-prs.co/v614201.
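If you want to create one yourself, a minimal Matplotlib sketch might look like the following (the quadratic loss surface here is a made-up example, not the one shown in this post’s figures):

```python
import numpy as np
import matplotlib.pyplot as plt

# A made-up quadratic loss surface over two parameters, theta1 and theta2.
theta1, theta2 = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
loss = 1.5 * theta1**2 + 0.5 * theta2**2 + 0.3 * theta1 * theta2

# Draw lines of equal loss and label each one with its loss value.
contours = plt.contour(theta1, theta2, loss, levels=12, cmap="viridis")
plt.clabel(contours, inline=True, fontsize=8)
plt.xlabel("theta 1")
plt.ylabel("theta 2")
plt.title("Contour plot of a toy loss surface")
plt.show()
```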

 

The contour plot shows us the steps of a batch gradient descent process in action (the lines with arrows). Starting from randomly initialized values of θ₁ = -3 and θ₂ = -3.5, we can trace how the algorithm navigates toward the minimum. At each step, the partial derivatives calculated from our formula guide us in the optimal direction. Over three iterations, we see the algorithm progressing directly toward the minimum point of the loss landscape.

 

This visualization illustrates both the strength and limitation of batch gradient descent. Its strength lies in taking the optimal path toward the minimum, as it uses information from all data points to determine each step. However, this thoroughness comes at a computational cost—each step requires processing the entire dataset, making the algorithm very slow when dealing with many features or data points.

 

To put batch gradient descent in perspective: even if we had a supercomputer capable of performing a trillion operations per second, each gradient descent step would still take more than 15 minutes. And remember, we typically need thousands or even millions of steps to reach convergence.
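The arithmetic behind that estimate is simple:

\[
\frac{10^{15}\ \text{operations}}{10^{12}\ \text{operations/second}} = 1{,}000\ \text{seconds} \approx 16.7\ \text{minutes}
\]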

 

This computational burden makes our current approach impractical for real-world applications, even with the most powerful computing resources available. We need a more efficient strategy.

 

Improving the Efficiency of Batch Gradient Descent

Let’s discuss a surprisingly effective solution to our computational challenge. Instead of calculating the loss over all data points at each step, we can choose a single, random data point and compute the loss for just that point. Although this approach might seem counterintuitive (after all, how can we make good decisions with such limited information?), it works remarkably well in practice.

 

In this modified approach, each update step works like this:

  1. Randomly select one data point from our dataset.
  2. Calculate the loss using only this point.
  3. Update our parameters based on this single-point calculation.
  4. For the next step, pick another point (without replacement) and repeat.

This dramatically reduces our computational burden. Instead of performing billions of calculations for each step (summing over all data points), we now only need to calculate the loss for a single point. Our million multiplications no longer get repeated for a billion data points. That’s a billion-fold speedup per update step!
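A minimal sketch of one epoch of this procedure (again with made-up sizes and a simple linear model, purely for illustration):

```python
import numpy as np

m, n = 10_000, 50
X = np.random.rand(m, n)     # features
y = np.random.rand(m)        # targets
theta = np.zeros(n)
alpha = 0.01

# One epoch of stochastic gradient descent: shuffle once, then visit every
# data point exactly once (sampling without replacement, as described above).
order = np.random.permutation(m)
for i in order:
    prediction = X[i] @ theta         # cost is O(n), not O(m * n)
    error = prediction - y[i]
    theta -= alpha * error * X[i]     # parameter update from a single data point
```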

 

What about the issue of limited information used in each step? While each individual step might not point us in exactly the right direction (because we’re only looking at one data point’s gradient), we compensate for this by taking many more steps. Because we pick each data point at random and arrive at the next location based on chance, this is called stochastic gradient descent (SGD). This meandering behavior can be seen on the contour plot in the figure below.

 

Gradient Descent Steps in Stochastic Gradient Descent

 

The randomness in our point selection helps ensure that we eventually consider all of our training data. This method essentially trades the precision of batch gradient descent for dramatically improved computational efficiency.

 

A Paradox in the Use of Stochastic Gradient Descent

Our discussion of SGD reveals an interesting paradox in practice. While it seems computationally efficient to process just one data point at a time, this approach actually underutilizes modern computing hardware. GPUs and specialized machine learning processors are designed for parallel computation—they excel at processing multiple data points simultaneously. When we process single data points sequentially, we’re essentially leaving most of our computational power idle. Moreover, to process our entire dataset once (called an epoch), we need to perform m separate update steps—one for each data point. This sequential processing can end up being slower than we anticipated.

 

The solution lies in finding a middle ground, called mini-batch gradient descent. Here’s how it works: We first shuffle our entire dataset randomly, and then divide it into small, equally sized chunks called mini-batches. Each mini-batch typically contains a small number of data points (often a few dozen to a few hundred) that modern GPUs can process efficiently in parallel.

 

For example, if we have 10,000 training examples and choose a batch size of 100, we’ll create 100 mini-batches of 100 examples each. We then process these mini-batches sequentially: compute the loss for all points in the current mini-batch, calculate the average gradient, and update our parameters. Once we’ve used a mini-batch, we don’t reuse it until we’ve processed all other mini-batches to ensure that each data point contributes equally to training.

 

When we’ve processed all mini-batches once (completing the full epoch), we typically reshuffle the entire dataset and create new mini-batches for the next epoch. This reshuffling helps prevent our model from learning any accidental patterns in the order of our data.
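A minimal sketch of this epoch and mini-batch loop (with placeholder sizes and a simple linear model, purely for illustration):

```python
import numpy as np

m, n, batch_size, epochs = 10_000, 50, 100, 3
X = np.random.rand(m, n)
y = np.random.rand(m)
theta = np.zeros(n)
alpha = 0.01

for epoch in range(epochs):
    # Reshuffle at the start of every epoch so the mini-batches differ each time.
    order = np.random.permutation(m)
    for start in range(0, m, batch_size):
        batch = order[start:start + batch_size]   # indices of one mini-batch
        predictions = X[batch] @ theta
        errors = predictions - y[batch]
        # Average the gradient over the mini-batch, then update the parameters.
        gradient = (X[batch].T @ errors) / len(batch)
        theta -= alpha * gradient
```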

 

This approach offers several advantages:

  • Uses hardware parallelization capabilities
  • Reduces the computational burden compared to full batch processing
  • Provides more stable updates than SGD
  • Allows for efficient memory usage and data throughput

The size of these mini-batches, known as the batch size, becomes a crucial hyperparameter in modern machine learning. A hyperparameter is a tunable setting that isn’t itself a model parameter but nonetheless affects how the model’s parameters are learned. The batch size hyperparameter helps us balance computational efficiency with the quality of our parameter updates.

 

When you start working with machine learning frameworks and libraries, you’ll notice something potentially confusing: many tools label their gradient descent implementations as “SGD” (stochastic gradient descent), even though they’re actually implementing mini-batch gradient descent. The key tell-tale sign is the presence of a batch_size parameter. This naming convention, while technically imprecise, has become standard in the field.
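For instance, here is a minimal Keras sketch (the synthetic data, layer sizes, and learning rate are arbitrary assumptions for illustration): the optimizer is named SGD, yet the batch_size argument to fit() means training actually proceeds in mini-batches.

```python
import numpy as np
import keras

# Arbitrary synthetic regression data, purely for illustration.
X = np.random.rand(10_000, 20).astype("float32")
y = np.random.rand(10_000, 1).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
              loss="mean_squared_error")

# The optimizer is called "SGD", but batch_size=64 makes this
# mini-batch gradient descent: 64 examples contribute to each update step.
model.fit(X, y, batch_size=64, epochs=2)
```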

 

In any case, choosing the optimal batch size is more art than science, heavily dependent on the number of features, the number of data points, and the specific hardware configuration, among other things. The decision involves balancing several factors. If your batch size is too small (e.g., 8 or 16), you might experience the following:

  • Updates becoming noisy and unstable
  • Training path meandering more than necessary
  • Underusing GPU parallel processing capabilities
  • Significantly increased overall training time

If your batch size is too large (e.g., 1,024 or 2,048), you might experience the following:

  • GPU memory becoming a bottleneck
  • More time spent on memory transfers
  • Significant overhead in data movement
  • Training slowing down despite the apparently greater efficiency

The sweet spot depends on your specific GPU:

  • Available memory
  • Number of CUDA cores
  • Memory bandwidth
  • Cache size

This is why you’ll often see recommendations to experiment with different batch sizes for your specific setup. Start with common values such as 32, 64, or 128, and then adjust based on both training performance and hardware utilization metrics.
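One rough way to run that experiment is sketched below (the model, data, and batch sizes are placeholders assumed for illustration; a real benchmark would also watch GPU utilization and validation loss):

```python
import time
import numpy as np
import keras

# Placeholder synthetic data.
X = np.random.rand(10_000, 20).astype("float32")
y = np.random.rand(10_000, 1).astype("float32")

def build_model():
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                  loss="mean_squared_error")
    return model

# Time a few epochs at each candidate batch size.
for batch_size in (32, 64, 128):
    model = build_model()
    start = time.perf_counter()
    model.fit(X, y, batch_size=batch_size, epochs=3, verbose=0)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed:.2f} s for 3 epochs")
```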

 

One issue still remains, though: Both mini-batch gradient descent and SGD face a particular challenge near the minimum point of our loss landscape. Unlike batch gradient descent, which uses the entire dataset to compute precise updates, these methods tend to oscillate around the minimum rather than settling precisely on it. This happens because each update is based on a subset of data, leading to slightly different gradient directions at each step.

 

Conclusion

Gradient descent may seem conceptually simple, but scaling it to real-world machine learning workloads introduces a host of computational challenges. Batch gradient descent gives us clean, stable updates, but at a cost that quickly becomes unsustainable. Stochastic and mini-batch approaches dramatically reduce that burden, trading perfect precision for speed, hardware efficiency, and real-world practicality. As we’ve seen, choosing the right batch size and update strategy isn’t just a mathematical detail—it’s a central design decision that affects convergence, performance, and how effectively we can use modern computing architectures. Understanding these trade-offs prepares us for the next step: refining optimization techniques to get the best results from today’s large-scale machine learning systems.

 

Editor’s note: This post has been adapted from a section of the book Keras 3: The Comprehensive Guide to Deep Learning with the Keras API and Python by Mohammad Nauman. Dr. Nauman is a seasoned machine learning expert with more than 20 years of teaching experience and a track record of educating 40,000+ students globally through his paid and free online courses on platforms like Udemy and YouTube.

 

This post was originally published 11/2025.

Recommendation

Keras 3

Harness the power of AI with this guide to using Keras! Start by reviewing the fundamentals of deep learning and installing the Keras API. Next, follow Python code examples to build your own models, and then train them using classification, gradient descent, and regularization. Design large-scale, multilayer models and improve their decision making with reinforcement learning. With tips for creating generative AI models, this is your cutting-edge resource for working with deep learning!

by Rheinwerk Computing

Rheinwerk Computing is an imprint of Rheinwerk Publishing and publishes books by leading experts in the fields of programming, administration, security, analytics, and more.
