Mathematical underpinnings

Machine learning and big data go hand-in-hand, and fitting a machine learning model to a large dataset can be expensive. For example, training OpenAI’s popular GPT model required a process that took “1 month on 8 GPUs” (source).

Of course, it’d be convenient to get these results more quickly. But performing more computation costs more: on Amazon EC2, using 32 GPUs for half a month would cost twice as much as using 8 GPUs for a month. That makes one question relevant: how can model training finish more quickly with the same number of floating point operations?


These machine learning models are fit by solving minimizations of the form

\[\widehat{w} = \arg\min_{w} \frac{1}{n}\sum_{i=1}^n f(w; z_i)\]

where \(z_i\) is the \(i\)th data example. For instance, if \(z_i = (x_i, y_i)\) for a scalar target variable \(y_i\) and a feature vector \(x_i\), then \(f(w; z_i)\) might measure the least squares loss, \(f(w; z_i) = (y_i - x_i^T w)^2\).

Typically, the number of examples \(n\) is very large, which makes the optimization underlying machine learning challenging. To help address this issue, commonly used machine learning optimization methods approximate the gradient with a small number of examples, aka the “batch size”. In the simplest form, these stochastic optimization methods run updates of the form

\[w_{k+1} = w_k - \frac{\gamma_k}{B} \sum_{s=1}^B \nabla f(w_k; z_{i_s})\]

where \(i_s\) is chosen uniformly at random and \(\nabla f(w_k; z_{i_s})\) is the gradient for model \(w_k\) with example \(i_s\).

This means the gradient is approximated with a constant number of examples \(B\) regardless of how different the gradients are for each example.
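For concreteness, here’s a minimal NumPy sketch of this update with the least squares loss from above. The function name `sgd_step` and all hyperparameter values are illustrative choices, not taken from any of the referenced works:

```python
import numpy as np

def sgd_step(w, X, y, batch_size, lr, rng):
    """One stochastic gradient update for the least squares loss
    f(w; z_i) = (y_i - x_i^T w)^2, averaged over a random mini-batch."""
    idx = rng.integers(0, len(y), size=batch_size)  # i_s chosen uniformly at random
    residual = y[idx] - X[idx] @ w
    grad = -2 * X[idx].T @ residual / batch_size  # average of per-example gradients
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true  # noise-free targets, so the optimum is w_true
w = np.zeros(5)
for _ in range(500):
    w = sgd_step(w, X, y, batch_size=32, lr=0.05, rng=rng)
```

With noise-free targets every per-example gradient vanishes at \(w_{\text{true}}\), so this toy run converges to the optimum despite the random sampling.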

Why should the batch size grow?

Why should the batch size be constant in the first place? With poor initialization, the gradients for each example might all point in the same direction. If the gradient for each example is exactly the same (i.e., \(\nabla f(w; z_i) = g\) for all \(i\) and some vector \(g\)), then a single example suffices to approximate the gradient.

Now let’s say the model \(w\) is close to the optimum. Then the gradients for each example will point in different directions, because each example’s gradient points toward the optimal model for that example (and presumably all the data are unique).

This is best illustrated with a simple 1D minimization:

\[\color{red}{\widehat{w}} = \arg\min_{\color{orange}{w}} \sum_{i=1}^{10} \frac{1}{2}(\color{blue}{y_i} - \color{orange}{w})^2\]

Then the gradient for model \(w_k\) for each example will be \(\color{green}{g_i = y_i - w_k}\).

[Figure: the example gradients \(\color{green}{g_i}\) become more diverse as the distance to the optimum shrinks]

These example gradients \(\color{green}{g_i}\) are quite similar with poor initialization: they’re all positive and differ only in magnitude (with values between 0.5 and 1.5). Near the optimum, they’re very different: some gradients are positive, some are negative, and some are approximately zero.
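This 1D picture is easy to check numerically. The targets \(y_i\) below are made-up uniform draws, not the data from the figure:

```python
import numpy as np

# Made-up targets for the 1D problem min_w sum_i (y_i - w)^2 / 2
rng = np.random.default_rng(42)
y = rng.uniform(-0.5, 0.5, size=10)

def example_gradients(w, y):
    # The per-example descent direction g_i = y_i - w from the text
    # (the negative gradient of (y_i - w)^2 / 2 with respect to w).
    return y - w

g_far = example_gradients(-1.0, y)       # poor initialization
g_near = example_gradients(y.mean(), y)  # at the optimum w* = mean(y)
```

Far from the optimum every \(g_i\) is positive; at the optimum the signs are mixed and the mean gradient is zero.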

This “gradient diversity” can be shown to increase in higher dimensions for various function classes. Gradient diversity measures the orthogonality between the gradients for each example [2]. Complete details of how this quantity grows can be found in Section 4.1 of Sievert et al. [1].
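As a sketch, here’s the orthogonality ratio that gradient diversity is based on (Yin et al. scale this quantity by \(n\) in their definition; the helper name is hypothetical):

```python
import numpy as np

def gradient_diversity(grads):
    """Ratio of the summed squared per-example gradient norms to the
    squared norm of the summed gradient. Equal per-example gradients
    give 1/n; mutually orthogonal gradients give 1."""
    grads = np.asarray(grads, dtype=float)
    return np.sum(grads**2) / np.linalg.norm(grads.sum(axis=0))**2

identical = np.ones((4, 3))  # every example has the same gradient
orthogonal = np.eye(4)       # gradients are mutually orthogonal
```

For four identical gradients this ratio is 1/4 (one example suffices), while for four orthogonal gradients it is 1 (every example adds information).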

How should the batch size grow?

For neural networks, a method that geometrically increases the batch size as a function of the number of epochs has shown strong performance [3].

For strongly convex problems, the batch size should grow geometrically with the number of model updates (not epochs) [4].
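A geometric schedule in the number of model updates might be sketched like this (`b0` and `growth` are illustrative hyperparameters, not values prescribed by these works):

```python
import math

def geometric_batch_sizes(b0, growth, n_updates):
    """Batch size B_k = ceil(b0 * growth**k): it grows geometrically
    with the number of model updates k."""
    return [math.ceil(b0 * growth ** k) for k in range(n_updates)]

schedule = geometric_batch_sizes(b0=8, growth=1.1, n_updates=10)
```

The schedule is nondecreasing, so early updates use cheap, noisy gradients and later updates (where the example gradients are more diverse) use larger batches.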


[1] Scott Sievert and Zachary Charles. “Improving the convergence of SGD with adaptive batch sizes”. 2019.

[2] D. Yin, A. Pananjady, M. Lam, D. Papailiopoulos, K. Ramchandran, and P. Bartlett. “Gradient Diversity: a Key Ingredient for Scalable Distributed Learning”. 2018.

[3] Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. “Don’t Decay the Learning Rate, Increase the Batch Size”. 2017.

[4] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. “Optimization methods for large-scale machine learning”, Chapter 5. 2018.