ADADELTA: An Adaptive Learning Rate Method, Zeiler, 2012 - Summary
author: sritank
score: 10 / 10
 Main idea
 Adaptively change the learning rate so that it damps oscillations, i.e. it slows down close to minima and takes larger steps elsewhere.
 Change the step size based on a windowed accumulation of gradients and an accumulation of past step sizes, both of which require only first-order computations.
 Windowed accumulation ensures that updates late in training are not drowned out by an ever-growing accumulated denominator, as happens in ADAGRAD.

Scaling the step size by the accumulated previous step sizes gives the update the correct units, as if it were a second-order adaptive method (the resulting update rule is restated below).
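For reference, the per-iteration update proposed in the paper (its Algorithm 1) can be written as:

```latex
E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2 ,\qquad
\mathrm{RMS}[g]_t = \sqrt{E[g^2]_t + \epsilon}

\Delta x_t = -\frac{\mathrm{RMS}[\Delta x]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t ,\qquad
E[\Delta x^2]_t = \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2 ,\qquad
x_{t+1} = x_t + \Delta x_t
```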
 Technical implementation
 Scaling the gradient per dimension helps ensure that comparable progress is made in all directions.
 Builds on ADAGRAD, which scales the learning rate by the square root of the accumulated sum of all past squared gradients.
 By accumulating the gradient with an exponentially decaying average instead of a sum over all time, they solve ADAGRAD's problem of the step size decaying to zero; a small constant epsilon conditions the denominator for numerical stability.
 By assuming locally smooth curvature, they scale the step size back up using the exponentially decaying average of previous step sizes (a sketch follows below).
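Below is a minimal NumPy sketch of this procedure, following Algorithm 1 of the paper. The toy quadratic objective and the iteration count are illustrative choices made here, and the hyperparameter values (rho = 0.95, eps = 1e-6) are in the range the paper considers rather than prescriptions.

```python
import numpy as np


def adadelta_step(grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA update for a parameter vector (the paper's Algorithm 1).

    state holds the two exponentially decaying accumulators:
      "Eg2"  - running average of squared gradients, E[g^2]
      "Edx2" - running average of squared updates,   E[dx^2]
    Unlike ADAGRAD, which sums g^2 over all time, these are windowed
    (decaying) averages, so the denominator cannot grow without bound.
    """
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad ** 2
    # The numerator uses the accumulator from previous steps only (it is
    # updated after the step is computed), so it lags the denominator by one step.
    update = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * update ** 2
    return update


# Illustrative use: minimize the toy quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 10.0])                      # ill-conditioned by design
x = np.array([5.0, 5.0])
state = {"Eg2": np.zeros_like(x), "Edx2": np.zeros_like(x)}
for _ in range(10000):
    g = A @ x                                 # gradient of the quadratic
    x += adadelta_step(g, state)
print(x)                                      # should end up near the minimizer [0, 0]
```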
 Algorithm performance
 The numerator RMS term lags the denominator by one time step, which makes the method robust to large sudden gradients: the denominator increases and slows progress before the numerator can blow up.
 The method approximates the Hessian and local curvature (see the derivation after this list), giving it second-order characteristics while still costing only one gradient computation per iteration.
 In the experiments, the method is used to train DNNs with tanh and rectified linear units for MNIST classification, and to train a DNN for a speech recognition task.
 ADADELTA is less sensitive to its hyperparameter settings than the other methods compared and converges quickly.
 The lower-layer gradients are larger than the top-layer gradients, indicating that ADADELTA does not suffer from the diminishing-gradient problem (it also tackles vanishing gradients in the tanh network).
 The step size converges to a constant near the end of training, and the parameter updates go to zero as the gradients become small, acting as if an annealing schedule were present.
 For audio signal classification, ADADELTA converged faster than the other methods, even in circumstances where the accumulated gradients contained significant noise.
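The second-order interpretation mentioned above follows the paper's own argument: rearranging a diagonal Newton step shows that the inverse curvature has the units of the update divided by the gradient, which ADADELTA approximates with the ratio of its two RMS accumulators:

```latex
\Delta x = \frac{\partial f / \partial x}{\partial^2 f / \partial x^2}
\quad\Longrightarrow\quad
\frac{1}{\partial^2 f / \partial x^2}
   = \frac{\Delta x}{\partial f / \partial x}
   \;\approx\; \frac{\mathrm{RMS}[\Delta x]_{t-1}}{\mathrm{RMS}[g]_t}
```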
TL;DR
 A new adaptive learning rate algorithm (ADADELTA) is introduced; it requires only first-order computations.
 ADADELTA is more robust to the choice of hyperparameters and converges more smoothly and quickly than the other learning rate methods compared.
 ADADELTA does not suffer from the diminishing-gradient problem in the cases tested in the paper and exhibits characteristics of second-order methods.