mkaic 3 hours ago

Damn, this is a strikingly simple modification. Basically, modern deep learning optimizers typically calculate each step's weight update using some kind of momentum and/or LR scaling based on the running variance of the gradients. This means that, in theory, the "instantaneous" gradients from a particular backward pass might point in a different direction than the update the optimizer actually applies. The change the authors propose is to simply ignore any components of the update proposed by the optimizer that have the opposite sign of the current gradient from the most recent backward pass. They're essentially saying "only apply the long-term stabilized update where it agrees with the current 'instantaneous' gradient." They show that this simple change significantly speeds up model training.
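
Here's a minimal sketch of the idea as I understand it, bolted onto a bare-bones Adam step (the function name and signature are my own, bias correction is omitted for brevity, and this is not the authors' code):

```python
import torch

def masked_adam_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    # Standard Adam-style accumulators (bias correction omitted for brevity)
    m, v = state["m"], state["v"]
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])            # momentum: EMA of gradients
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])  # EMA of squared gradients
    update = m / (v.sqrt() + eps)                              # optimizer's proposed update

    # The trick: zero out any component of the proposed update whose sign
    # disagrees with the instantaneous gradient from this backward pass.
    mask = (update.sign() == grad.sign()).to(update.dtype)
    param.data.add_(update * mask, alpha=-lr)

# Toy usage
w = torch.randn(10, requires_grad=True)
state = {"m": torch.zeros_like(w), "v": torch.zeros_like(w)}
loss = (w ** 2).sum()
loss.backward()
masked_adam_step(w, w.grad, state)
```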

I'm pretty intrigued by this, but will, as usual, wait for independent replications to come out before I fully believe it. That said, because of how simple this is, I'd expect such replications to happen within 24 hours. Exciting work!

  • shoubidouwah 2 hours ago

    I wonder if there might not be an opportunity for a warmup-based mask inversion: for the first few epochs, only apply the momentum where it agrees with the instantaneous gradient; after that, invert the mask, since the momentum would technically carry more information? (Rough sketch at the end of this comment.)

    In any case, good idea - reminds me of the "apply same gradient multiple times" trick from a few years ago. May have weird behaviours at low batch sizes though...
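
    To make the inversion idea concrete, here's roughly what I mean (entirely hypothetical; `warmup_steps` and the hard cutoff are made up, and the paper only describes the non-inverted mask):

    ```python
    import torch

    def sign_mask(update: torch.Tensor, grad: torch.Tensor, step: int,
                  warmup_steps: int = 1000) -> torch.Tensor:
        # Hypothetical schedule: during warmup, keep only the update
        # components that agree with the instantaneous gradient; after
        # warmup, flip the mask so the momentum term wins instead.
        agree = update.sign() == grad.sign()
        return agree if step < warmup_steps else ~agree
    ```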