SGDR: Stochastic Gradient Descent with Warm Restarts, Loshchilov, Hutter; 2016 - Summary
 author: biofizzatreya score: 6 / 10

Summary:

• What is the core idea?
• The paper discusses a strategy for performing stochastic gradient descent with warm restarts (SGDR) for better convergence toward a global minimum.
• How is it realized (technically)?
• Plain stochastic gradient descent updates as: $$x_{t+1} = x_t-\eta_t\cdot\nabla f_t(x_t)$$
• Stochastic gradient descent with momentum updates as: $$v_{t+1} = \mu_t\cdot v_t-\eta_t\cdot\nabla f_t(x_t)$$, and $$x_{t+1} = x_t + v_{t+1}$$
• Here $$\eta_t$$ is the learning rate. With warm restarts, $$\eta_t$$ is periodically reset to an initial value $$\eta_{max}$$ and then decayed to $$\eta_{min}$$ with a cosine annealing function; at each restart, the last solution $$x_t$$ is taken as the initial condition. Within the $$i$$-th run, where $$T_{cur}$$ counts the epochs elapsed since the last restart and $$T_i$$ is the length of the run, the annealing follows: $$\eta_t = \eta^i_{min} + \frac{1}{2}\left( \eta^i_{max}-\eta^i_{min} \right)\left( 1+\cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$
• How well does the paper perform?
• The authors evaluated SGDR on the CIFAR-10 and CIFAR-100 datasets. Baseline test errors of 4.03% on CIFAR-10 and 19.57% on CIFAR-100 improve to 3.51% and 17.75%, respectively.
• This supports the validity of the approach and shows that it can outperform learning with momentum alone.
• What interesting variants are explored?
• The authors also perform SGDR on EEG datasets.
• The authors show that SGDR can achieve smaller test errors than the original learning rate schedule used.
• The authors also demonstrate that SGDR is a prime candidate for training an ensemble of neural networks, where the network snapshots obtained at each of the $$M$$ restarts serve as ensemble members. This allowed them to further improve accuracy on the CIFAR-10 and CIFAR-100 datasets.
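The cosine-annealing-with-restarts schedule above can be sketched in a few lines of Python. The hyperparameter names `T_0` (initial period) and `T_mult` (period growth factor) follow the paper; the function names and default values here are my own illustrative choices, not the authors' implementation:

```python
import math

def sgdr_lr(t_cur, T_i, eta_min=0.0, eta_max=0.1):
    """Cosine-annealed learning rate within one restart period.

    t_cur: epochs elapsed since the last restart (may be fractional).
    T_i:   length of the current period in epochs.
    """
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T_i))

def lr_schedule(total_epochs, T_0=10, T_mult=2, eta_min=0.0, eta_max=0.1):
    """Per-epoch learning rates over several warm restarts.

    After each restart the period grows by T_mult, mirroring the
    paper's T_i = T_mult * T_{i-1} schedule.
    """
    lrs, t_cur, T_i = [], 0, T_0
    for _ in range(total_epochs):
        lrs.append(sgdr_lr(t_cur, T_i, eta_min, eta_max))
        t_cur += 1
        if t_cur >= T_i:  # warm restart: reset eta_t back to eta_max
            t_cur, T_i = 0, T_i * T_mult
    return lrs
```

At the start of each period the rate sits at `eta_max`, decays along the cosine to `eta_min`, and jumps back to `eta_max` at the restart; the last weights carry over unchanged, so only the learning rate is "restarted".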

## TL;DR

• Stochastic gradient descent with restarts
• Learning rate is reset to its maximum at each restart and cosine-decays until the next restart
• Performs as well as, and often better than, momentum-based learning