LAMB
LAMB Optimizer
(Mini-batch) Stochastic Gradient Descent (SGD) is the basic learning algorithm used to update model parameters:

x_{i+1} = x_i - \eta \nabla J(x_i)

where x_i represents a parameter of the model at the i-th iteration of training, \eta is the learning rate and J is the loss function.
In this formulation there is nothing preventing the update \eta \nabla J(x_i) from becoming very large, leading to exploding gradients and divergence of the loss curves.
This issue is made worse in the large-batch setting, where larger learning rates are used to compensate for the reduced number of training updates.
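To make the problem concrete, here is a minimal sketch of a plain SGD step (NumPy, with params and grads as hypothetical dicts of per-layer arrays); note that nothing bounds the size of the update \eta \nabla J(x_i):

```python
import numpy as np

def sgd_step(params, grads, eta):
    """Plain (mini-batch) SGD: x_{i+1} = x_i - eta * grad J(x_i).

    `params` and `grads` are dicts mapping layer names to NumPy arrays.
    Nothing here limits the magnitude of eta * grad, so a single large
    gradient can blow up the corresponding layer's weights.
    """
    return {name: params[name] - eta * grads[name] for name in params}
```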
The idea put forward by You et al. is that individual layers can act as a bottleneck: the weights in certain layers can explode first, and these layers then become the limiting factor when setting the learning rate.
Layerwise optimizers are designed with this observation in mind, and seek to normalize the learning rate per-layer in two steps:
- First, normalize the gradient by the L_2-norm of that layer's gradient, ||\nabla J(x^{(L)})||
- Then scale the gradient-normalized update by the L_2-norm of that layer's weights, ||x^{(L)}||
- More generally, we can scale by some function of the weights' norm, \phi(||x^{(L)}||)
This allows for layers with larger ||x^{(L)}|| to use higher learning rates.
Putting the above steps together we obtain the trust ratio, and with it the layerwise learning rate

\lambda_i^{(L)} = \eta \frac{\phi(||x_i^{(L)}||)}{||\nabla J(x_i^{(L)})||}

where \lambda_i^{(L)} is the learning rate in the L-th layer and \eta is the base learning rate.
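The following is a minimal sketch of this layerwise rescaling, assuming \phi defaults to the identity and treating params and grads as hypothetical dicts of per-layer NumPy arrays (the eps term is only there to avoid division by zero):

```python
import numpy as np

def layerwise_step(params, grads, eta, phi=lambda norm: norm, eps=1e-9):
    """Scale each layer's update by the trust ratio phi(||x||) / ||grad||."""
    new_params = {}
    for name, x in params.items():
        g = grads[name]
        weight_norm = np.linalg.norm(x)
        grad_norm = np.linalg.norm(g)
        # Layerwise learning rate: lambda^(L) = eta * phi(||x^(L)||) / ||grad^(L)||
        trust_ratio = phi(weight_norm) / (grad_norm + eps)
        new_params[name] = x - eta * trust_ratio * g
    return new_params
```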
This trust-ratio adaptation of the layerwise learning rate can be applied to various learning algorithms.
LARS
By applying these ideas to SGD with momentum we arrive at Layerwise Adaptive Rate Scaling (LARS), introduced by You et al.
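As a rough sketch (not the exact formulation from the paper), a LARS-style step might combine momentum with the trust ratio like this, again with hypothetical dicts of per-layer NumPy arrays and \phi taken as the identity:

```python
import numpy as np

def lars_step(params, grads, momenta, eta, mu=0.9, weight_decay=0.0, eps=1e-9):
    """One LARS-style step: SGD with momentum, rescaled by the layerwise trust ratio.

    Simplified sketch; `params`, `grads` and `momenta` are dicts of
    per-layer NumPy arrays.
    """
    new_params, new_momenta = {}, {}
    for name, x in params.items():
        g = grads[name] + weight_decay * x
        # Trust ratio: ||x^(L)|| / ||grad^(L)||
        trust_ratio = np.linalg.norm(x) / (np.linalg.norm(g) + eps)
        # Momentum accumulates the trust-ratio-scaled update.
        m = mu * momenta[name] + eta * trust_ratio * g
        new_momenta[name] = m
        new_params[name] = x - m
    return new_params, new_momenta
```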
LAMB
By applying these ideas to Adam (in the AdaGrad family) we arrive at Layerwise Adaptive Moments Based (LAMB) optimization.
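A simplified LAMB-style step might look like the sketch below: Adam-style first and second moments, with the resulting update rescaled per layer by the trust ratio (the state dicts m and v, the step counter, and the default hyperparameters are illustrative assumptions, not the paper's exact algorithm):

```python
import numpy as np

def lamb_step(params, grads, m, v, step, eta,
              beta1=0.9, beta2=0.999, weight_decay=0.0, eps=1e-6):
    """One LAMB-style step: Adam-style moments, with the per-layer update
    rescaled by the trust ratio ||x|| / ||update||.

    Simplified sketch; `m` (first moment) and `v` (second moment) are dicts
    of per-layer NumPy arrays, `step` is the 1-based iteration count.
    """
    new_params = {}
    for name, x in params.items():
        g = grads[name]
        # Exponential moving averages of the gradient and its square.
        m[name] = beta1 * m[name] + (1 - beta1) * g
        v[name] = beta2 * v[name] + (1 - beta2) * g * g
        # Bias-corrected moments.
        m_hat = m[name] / (1 - beta1 ** step)
        v_hat = v[name] / (1 - beta2 ** step)
        update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x
        # Trust ratio: scale the Adam-style update by the layer's weight norm.
        trust_ratio = np.linalg.norm(x) / (np.linalg.norm(update) + 1e-9)
        new_params[name] = x - eta * trust_ratio * update
    return new_params, m, v
```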