On the Convergence of Adam and Beyond, Reddi, Kale, Kumar; 2019 - Summary
author: aabayomi
score: 8 / 10

Summary 1: ON THE CONVERGENCE OF ADAM AND BEYOND

What is the core idea?

The problem this paper tries to solve is a limitation of optimizers that use exponential moving averages of past squared gradients, such as ADAM. These algorithms were shown to do poorly when large, informative gradients appear only in rare mini-batches: the exponential averaging makes their influence die out quickly, leading to poor convergence.
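To make the failure mode concrete, here is a small illustrative Python sketch (not the paper's formal counterexample; the constants `C`, `k`, and `beta2` are arbitrary choices) showing how an exponential moving average of squared gradients forgets a rare, large gradient almost immediately:

```python
# Toy sketch: a large, informative gradient of magnitude C arrives only once
# every k steps; all other steps see small gradients. ADAM's second-moment
# estimate v_t = beta2 * v_{t-1} + (1 - beta2) * g_t**2 spikes when the large
# gradient arrives but decays back within a few dozen steps, so its influence
# on the step size "dies out".
beta2 = 0.99
C, k, steps = 10.0, 100, 300

v = 0.0
for t in range(1, steps + 1):
    g = C if t % k == 0 else 0.1             # rare large gradient, otherwise small
    v = beta2 * v + (1 - beta2) * g ** 2     # exponential moving average of g^2
    if t % 20 == 0:
        print(f"t={t:3d}  v_t={v:.4f}")      # watch v_t spike at t=100, 200 and then decay
```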

How is it realized (technically)?

The authors propose a modified algorithm, AMSGrad, based on ADAM.

Key differences compared to ADAM

[1] It typically uses smaller learning rates than ADAM

[2] It maintains the maximum of all past $v_t$ (the second raw moment estimate) and uses that maximum, rather than $v_t$ itself, to normalize the update

During training, AMSGrad's effective learning rate is guaranteed never to increase, unlike ADAGRAD, whose learning rate continually decreases as squared gradients accumulate, and ADAM, whose exponential averaging can cause the effective learning rate to increase, which is the source of its convergence failure.
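A minimal NumPy sketch of the update just described (function and variable names are mine, bias correction is omitted, and the hyperparameter defaults are illustrative rather than the paper's experimental settings):

```python
import numpy as np

def amsgrad_step(x, grad, m, v, v_hat, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update on parameters x (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate, as in ADAM
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate, as in ADAM
    v_hat = np.maximum(v_hat, v)              # key difference: keep the max of all v_t
    x = x - lr * m / (np.sqrt(v_hat) + eps)   # normalize by the max -> step size never grows
    return x, m, v, v_hat

# Usage: carry m, v, v_hat across iterations, all initialized to np.zeros_like(x).
```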

How well does the paper perform?

The results figure compares AMSGrad and ADAM. Top row: logistic regression (left, center) and a 1-hidden-layer feedforward neural network (right), both on the MNIST dataset; bottom row: CIFARNET.

Logistic Regression: AMSGrad achieves better performance than ADAM on MNIST.

Neural Networks: the same holds for the 1-hidden-layer feedforward network on MNIST and for CIFARNET.

Overall, AMSGrad outperforms ADAM on all three experiments.

What interesting variants are explored?

AdamNC is an alternative approach that varies the values of $\beta_{1t}$ and $\beta_{2t}$ (the exponential decay rates for the moment estimates) over time, so that past gradients are not forgotten too quickly.
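For instance, with the schedule $\beta_{2t} = 1 - 1/t$ analyzed in the paper, $v_t$ becomes an equally weighted average of all past squared gradients, so large gradients never lose their influence. A small sketch (function names are mine):

```python
def beta2_schedule(t):
    """AdamNC-style decay rate: beta_{2t} = 1 - 1/t for t >= 1."""
    return 1.0 - 1.0 / t

def second_moment(grads):
    """With beta_{2t} = 1 - 1/t, v_t is the plain average of g_1^2 ... g_t^2,
    so past (large) gradients keep their influence on the step size."""
    v = 0.0
    for t, g in enumerate(grads, start=1):
        b2 = beta2_schedule(t)
        v = b2 * v + (1 - b2) * g ** 2
    return v

print(second_moment([0.1, 0.1, 10.0, 0.1]))   # equals the mean of the squared gradients
```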

TL;DR

ADAM and other optimizers based on exponential moving averages can fail to converge because rare but informative large gradients are forgotten too quickly, which can even increase the effective learning rate. AMSGrad fixes this by normalizing with the maximum of all past second-moment estimates $v_t$, guaranteeing a non-increasing step size, and it outperforms ADAM on logistic regression, a 1-hidden-layer feedforward network (MNIST), and CIFARNET.