On the importance of initialization and momentum in deep learning, Sutskever, Martens, Dahl, Hinton; 2013 - Summary
author: specfazhou
score: 9 / 10

TODO: Summarize the paper:

For reference, the update rules for CM and NAG are as follows, where \(\epsilon > 0\) is the learning rate and \(\mu \in [0, 1]\) is the momentum coefficient.
Classical momentum (CM):
\(v_{t+1} = \mu v_{t} - \epsilon \nabla f(\theta_{t})\)
\(\theta_{t+1} = \theta_{t}+v_{t+1}\)
Nesterov’s accelerated gradient (NAG):
\(v_{t+1} = \mu v_{t} - \epsilon \nabla f(\theta_{t}+\mu v_{t})\)
\(\theta_{t+1} = \theta_{t}+v_{t+1}\)
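
The only difference between the two rules is where the gradient is evaluated, which is easy to see in code. Below is a minimal sketch, not code from the paper: the function names `cm_step` and `nag_step`, the toy objective \(f(\theta) = \theta^2 / 2\), and the numeric values are assumptions chosen for illustration. Both updates use the standard form in which the previous velocity is scaled by the momentum coefficient `mu`.

```python
def cm_step(theta, v, grad_f, eps, mu):
    """One classical momentum (CM) update: gradient taken at theta."""
    v_next = mu * v - eps * grad_f(theta)
    return theta + v_next, v_next


def nag_step(theta, v, grad_f, eps, mu):
    """One Nesterov (NAG) update: gradient taken at the look-ahead point."""
    v_next = mu * v - eps * grad_f(theta + mu * v)
    return theta + v_next, v_next


def grad_f(theta):
    """Gradient of the toy objective f(theta) = 0.5 * theta**2 (an assumption)."""
    return theta


# Same starting point and velocity for both methods (illustrative values).
theta_cm, v_cm = cm_step(1.0, 0.5, grad_f, eps=0.1, mu=0.9)
theta_nag, v_nag = nag_step(1.0, 0.5, grad_f, eps=0.1, mu=0.9)

print("CM :", round(theta_cm, 6), round(v_cm, 6))
print("NAG:", round(theta_nag, 6), round(v_nag, 6))
```

With these toy values, CM uses the gradient at \(\theta = 1\) and gets a new velocity of about 0.35, while NAG uses the gradient at the look-ahead point \(\theta + \mu v = 1.45\) and gets a smaller velocity of about 0.305. NAG corrects the velocity earlier because it sees the larger gradient ahead.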


  1. The paper uses momentum methods with a well-designed initialization to train DNNs and RNNs to levels of performance that were previously thought impossible for first-order methods. It turns out that first-order methods can be just as good as truncated Newton methods such as Hessian-free optimization (HF).
  2. The paper compares classical momentum (CM) with Nesterov’s accelerated gradient (NAG) and explains how they differ.
  3. A large momentum coefficient lets momentum methods achieve better performance. This holds especially for NAG, which tolerates large momentum values better than CM.
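
Point 3 can be illustrated on a one-dimensional quadratic. The sketch below is a toy experiment, not a reproduction of the paper's results: the helper names `run` and `tail_amplitude`, the objective \(f(\theta) = \lambda \theta^2 / 2\), and the values \(\lambda = 1\), \(\epsilon = 0.5\), \(\mu = 0.99\) are all assumptions chosen to make the effect visible.

```python
def run(method, curvature, eps, mu, steps=50, theta0=1.0):
    """Minimize f(theta) = 0.5 * curvature * theta**2 with CM or NAG.

    Returns the list of iterates theta_0, ..., theta_steps.
    """
    theta, v = theta0, 0.0
    trajectory = [theta]
    for _ in range(steps):
        # NAG evaluates the gradient at the look-ahead point theta + mu * v;
        # CM evaluates it at the current point theta.
        point = theta + mu * v if method == "nag" else theta
        v = mu * v - eps * curvature * point
        theta = theta + v
        trajectory.append(theta)
    return trajectory


def tail_amplitude(trajectory, window=10):
    """Largest |theta| over the last `window` iterates."""
    return max(abs(t) for t in trajectory[-window:])


# Same large momentum coefficient for both methods (illustrative values).
cm_tail = tail_amplitude(run("cm", curvature=1.0, eps=0.5, mu=0.99))
nag_tail = tail_amplitude(run("nag", curvature=1.0, eps=0.5, mu=0.99))

print(f"CM  tail amplitude: {cm_tail:.3e}")
print(f"NAG tail amplitude: {nag_tail:.3e}")
```

In this setting CM is still oscillating around the minimum after 50 steps, while the NAG iterate has shrunk by several orders of magnitude. This matches the paper's analysis of quadratics: along a direction with curvature \(\lambda\), NAG behaves like CM with the smaller effective momentum \(\mu(1 - \epsilon\lambda)\), so a large \(\mu\) causes less oscillation in high-curvature directions.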