# Momentum

Stochastic gradient descent updates the network parameters using the gradient computed from a single sample. As a result, the optimization trajectory can zig-zag and the network can take a long time to converge. A likely reason is that individual training samples are noisy, which introduces a large amount of variance among the gradients computed at different update steps.
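As a rough illustration, consider the sketch below, which uses NumPy and a toy one-parameter squared-error loss (both introduced here only for the example). Plain SGD follows each noisy per-sample gradient directly:

```python
import numpy as np

def grad_fn(W, sample):
    # Hypothetical per-sample gradient: loss is (x * W - y)^2 for one example,
    # so the gradient with respect to W is 2 * x * (x * W - y).
    x, y = sample
    return 2 * x * (x * W - y)

W = np.zeros(1)
alpha = 0.1                           # learning rate
for sample in [(1.0, 2.0), (0.5, 0.3), (1.5, 4.0)]:
    g = grad_fn(W, sample)            # gradient from a single (noisy) sample
    W = W - alpha * g                 # the update follows each noisy gradient directly
```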

Momentum maintains an exponentially weighted average of past gradients and updates the weights using this average instead of the raw gradient. Averaging smooths out the noise and hence reduces the variance of the gradient estimate.

Consider a neural network whose parameters are represented by $W$, and let the loss on the $i^{th}$ sample, as a function of $W$, be $l_{i}(W)$. The update at the $t^{th}$ step of stochastic gradient descent with momentum is then given by:

$$v_{t} = \beta v_{t-1} + \nabla_{W} l_{i}(W_{t-1}), \qquad W_{t} = W_{t-1} - \alpha v_{t},$$

where $\alpha$ is the learning rate, $v_{t}$ is the momentum buffer (with $v_{0} = 0$), and $\beta$ is the weight assigned to previous updates. $\beta$ takes values between 0 and 1; generally $\beta \ge 0.9$.
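A minimal sketch of this update, reusing the same toy one-parameter loss as above (the momentum buffer `v` starts at zero):

```python
import numpy as np

def grad_fn(W, sample):
    # Hypothetical per-sample gradient of the squared error (x * W - y)^2.
    x, y = sample
    return 2 * x * (x * W - y)

W = np.zeros(1)
v = np.zeros(1)          # momentum buffer, v_0 = 0
alpha, beta = 0.1, 0.9   # learning rate and momentum coefficient
for sample in [(1.0, 2.0), (0.5, 0.3), (1.5, 4.0)]:
    g = grad_fn(W, sample)
    v = beta * v + g          # v_t = beta * v_{t-1} + gradient
    W = W - alpha * v         # W_t = W_{t-1} - alpha * v_t
```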

# PyTorch Usage

```python
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```
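
A full training step with this optimizer might look like the following sketch; the linear model, MSE loss, and random batch are placeholders added here only to keep the example self-contained:

```python
import torch

# Toy model and dummy batch, used only to make the example runnable.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

optimizer.zero_grad()                   # clear gradients from the previous step
loss = loss_fn(model(inputs), targets)  # forward pass
loss.backward()                         # backpropagation
optimizer.step()                        # SGD-with-momentum parameter update
```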