On the Convergence of Adam and Beyond, Reddi, Kale, Kumar; 2019 - Summary
author: joshpapermaster
score: 7 / 10

This paper introduces scenarios in which the ADAM optimizer fails to converge to the optimal solution. As a fix, the paper proposes variants of ADAM that not only resolve these specific convergence issues but also often improve training in practice.

The main issue with ADAM and similar adaptive methods is that they use exponential moving averages of past squared gradients, which concentrates the estimate on only the most recent gradients. As a result, rare but large gradients that carry important information are quickly multiplied by the decay factor until they have very little influence.
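To make the forgetting effect concrete, here is a small Python sketch (the β2 value and the gradient sequence are made up for illustration) showing how a single large squared gradient in ADAM's exponential moving average decays geometrically and quickly loses its influence on the step size:

```python
import numpy as np

beta2 = 0.9
v = 0.0
# One rare, informative large gradient followed by many small ones.
gradients = [10.0] + [0.1] * 19

for t, g in enumerate(gradients, start=1):
    v = beta2 * v + (1 - beta2) * g**2          # ADAM's second-moment average
    if t in (1, 5, 10, 20):
        print(f"step {t:2d}: v = {v:7.3f}, effective step ~ alpha * {1 / np.sqrt(v):.2f}")
```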

There are three theorems presented in this paper on the non-convergence of ADAM; the third extends the result to the stochastic convex setting:

For any constants β1, β2 ∈ [0, 1) such that β1 < β2^(1/2), there is a stochastic convex optimization problem for which ADAM does not converge to the optimal solution.
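The flavor of these results is easy to reproduce. Below is a minimal simulation assuming the style of the counterexample in the paper: the gradient is C every third step and −1 otherwise on the domain [−1, 1], with C > 2. The hyperparameter choices here (β1 = 0, β2 = 1/(1 + C²), α_t = α/√t) are this sketch's, picked so the failure is easy to see. The optimum is x = −1 (the average gradient is (C − 2)/3 > 0), but ADAM drifts to x = +1 because the rare large gradient C is forgotten:

```python
import numpy as np

# Domain [-1, 1]; loss gradient is C every third step and -1 otherwise.
C = 4.0
alpha, beta1, beta2, eps = 0.1, 0.0, 1.0 / (1.0 + C**2), 1e-8

x, m, v = 0.0, 0.0, 0.0
for t in range(1, 10001):
    g = C if t % 3 == 1 else -1.0
    m = beta1 * m + (1 - beta1) * g             # with beta1 = 0 this is just g
    v = beta2 * v + (1 - beta2) * g**2
    x -= (alpha / np.sqrt(t)) * m / (np.sqrt(v) + eps)
    x = min(1.0, max(-1.0, x))                  # project back onto [-1, 1]

print(f"ADAM ends near x = {x:.3f}; the optimum is x = -1")
```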

Solution: AMSGrad

Here is the outline of the generic adaptive method: the m term refers to the momentum (first-moment) estimate and the v term refers to the averaging function applied to the squared gradients.

[Figure: generic adaptive method setup]
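Since the algorithm figure did not survive, here is a minimal Python sketch of one ADAM step written in that framework (parameter names and default values are the usual ones, not copied from the paper; the paper's analysis also omits the bias-correction terms):

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step as an instance of the generic adaptive framework:
    m_t (momentum term) and v_t (averaging function) are exponential moving
    averages of the gradients and squared gradients, respectively."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)                  # bias correction
    v_hat = v / (1 - beta2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```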

The key difference between AMSGrad and the generic adaptive method is that AMSGrad keeps the maximum of all second-moment estimates seen so far, i.e. v̂_t = max(v̂_(t-1), v_t), and normalizes the update by v̂_t instead of v_t. This recursion ensures that the effect of a large gradient is never discarded, and it makes the effective learning step size non-increasing.

[Figure: AMSGrad algorithm]
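A minimal sketch of the corresponding AMSGrad step (same caveats on names and defaults; the paper's version also decays the step size as α/√t and omits bias correction):

```python
import numpy as np

def amsgrad_step(x, grad, m, v, v_hat, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step: the same as ADAM except that v_hat keeps the running
    maximum of the second-moment estimates, so lr / sqrt(v_hat) never increases."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    v_hat = np.maximum(v_hat, v)                # the key change from ADAM
    x = x - lr * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat
```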

The training and test loss curves for AMSGrad clearly beat those of ADAM in all of the paper's experiments. Even if ADAM does manage to converge to the optimal solution, the issues brought to light in this paper likely still slow it down.

[Figure: training and test loss curves, AMSGrad vs. ADAM]

The paper also presents an alternative solution called ADAMNC, which replaces the constant β2 with an increasing schedule (e.g. β2_t = 1 − 1/t) so that the second-moment term averages over all past squared gradients rather than only the most recent ones.
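As a rough sketch of the idea (assuming the schedule β2_t = 1 − 1/t), unrolling the recursion shows the second-moment term becomes a plain average of all past squared gradients, so a large early gradient is never forgotten:

```python
def adamnc_v_update(v, grad, t):
    """Second-moment update with the increasing schedule beta2_t = 1 - 1/t.
    Unrolling the recursion shows that v_t is the plain average of all past
    squared gradients."""
    beta2_t = 1.0 - 1.0 / t
    return beta2_t * v + (1.0 - beta2_t) * grad**2

v = 0.0
for t, g in enumerate([10.0, 0.1, 0.1, 0.1], start=1):
    v = adamnc_v_update(v, g, t)
print(v)  # 25.0075 == mean of the squared gradients, (100 + 3 * 0.01) / 4
```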

TL;DR: ADAM's exponential moving average of squared gradients can forget rare but informative large gradients, which lets it converge to the wrong solution even on simple convex problems. AMSGrad fixes this by normalizing with the running maximum of the second-moment estimate, keeping the step size non-increasing, and in the paper's experiments it trains better than ADAM.