Adam: A Method for Stochastic Optimization, Kingma, Ba; 2014 - Summary
author: jordi1215
score: 9 / 10

Adam (derived from adaptive moment estimation) is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data.

Algorithm

[Figure: Adam algorithm pseudocode]
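
For concreteness, here is a minimal NumPy sketch of that update rule (the toy quadratic objective, starting point, and step count are illustrative assumptions; the hyperparameters are the defaults quoted below):

```python
import numpy as np

# Toy objective f(theta) = ||theta||^2 / 2, so grad f(theta) = theta (illustrative assumption)
def grad(theta):
    return theta

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8  # defaults suggested in the paper

theta = np.array([1.0, -2.0, 3.0])  # arbitrary starting parameters
m = np.zeros_like(theta)            # 1st moment estimate (moving average of gradients)
v = np.zeros_like(theta)            # 2nd moment estimate (moving average of squared gradients)

for t in range(1, 10_001):
    g = grad(theta)                      # gradient at the current step
    m = beta1 * m + (1 - beta1) * g      # update biased first moment estimate
    v = beta2 * v + (1 - beta2) * g**2   # update biased second raw moment estimate
    m_hat = m / (1 - beta1**t)           # bias-corrected first moment
    v_hat = v / (1 - beta2**t)           # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update

print(theta)  # approaches the minimizer at zero
```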

Adam configuration parameters

“Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10^-8”
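
As an aside (not from the paper), these defaults map directly onto the arguments of common framework implementations; a minimal PyTorch sketch, where the tiny model and random batch are illustrative assumptions:

```python
import torch

# Placeholder model and data -- illustrative assumptions, not from the paper
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,            # alpha
    betas=(0.9, 0.999),  # beta1, beta2
    eps=1e-8,            # epsilon
)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()  # one Adam update of the model's weights
```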

How is Adam different from classical stochastic gradient descent?

Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates, and the learning rate does not change during training. In contrast, “[Adam] computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.”
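
Written as update rules (paper notation: $\theta$ are the parameters, $g_t$ the gradient at step $t$, and $\hat{m}_t$, $\hat{v}_t$ the bias-corrected moment estimates defined below), the contrast is:

$$
\text{SGD:}\;\; \theta_t = \theta_{t-1} - \alpha\, g_t
\qquad\qquad
\text{Adam:}\;\; \theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

Because $\hat{m}_t$ and $\hat{v}_t$ are computed elementwise, the effective step size differs from parameter to parameter, whereas SGD scales every coordinate by the same $\alpha$.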

The authors describe Adam as combining the advantages of two other extensions of stochastic gradient descent: AdaGrad, which works well with sparse gradients, and RMSProp, which works well with non-stationary objectives. Specifically:

Like RMSProp, Adam adapts the per-parameter learning rates based on an average of the squared gradients (the second raw moment, i.e. the uncentered variance); unlike RMSProp, it also makes use of an average of the gradients themselves (the first moment, the mean), similar to momentum.

Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.
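
In the paper's notation, with $g_t$ the gradient at step $t$:

$$
m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t,
\qquad
v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2
$$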

The moving averages are initialized at zero, which, together with beta1 and beta2 values close to 1.0 (as recommended), biases the moment estimates towards zero, especially during the first steps. This bias is overcome by first calculating the biased estimates and then computing bias-corrected estimates from them.
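
The bias-corrected estimates and the resulting parameter update are:

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t},
\qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t},
\qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$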

Result


“Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.”

“Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems.”

TL;DR