# Momentum

Stochastic gradient descent updates the network parameters using the gradient computed from a single sample. As a result, the optimization path can zig-zag and the network can take a long time to converge. One reason for this is that individual training samples are noisy, which introduces a large amount of variance among the gradients computed at different update steps.

Momentum maintains an exponentially weighted average of past gradients and updates the weights using this average. Averaging reduces the noise, and hence the variance, in the gradient estimate.
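The variance-reduction effect of averaging can be seen with a small numerical sketch (not from the source): an exponentially weighted average of noisy one-dimensional "gradients" has much lower variance than the raw samples. The weight `beta` and the noise model here are illustrative choices.

```python
import random
import statistics

random.seed(0)

# Noisy per-sample gradients around a true gradient of 1.0 (illustrative).
true_grad = 1.0
samples = [true_grad + random.gauss(0, 1.0) for _ in range(1000)]

# Exponentially weighted moving average with weight beta on the past.
beta = 0.9
avg = 0.0
smoothed = []
for g in samples:
    avg = beta * avg + (1 - beta) * g
    smoothed.append(avg)

raw_var = statistics.pvariance(samples)
# Skip the first 100 steps so the average has warmed up.
smooth_var = statistics.pvariance(smoothed[100:])
print(raw_var, smooth_var)  # the smoothed variance is far smaller
```

With `beta = 0.9`, the averaged estimate blends roughly the last ten gradients, which shrinks the sample-to-sample variance while still tracking the true gradient.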

Consider a neural network with parameters $\theta$ and a per-sample loss, viewed as a function of $\theta$, given by $L(\theta)$. The update step of stochastic gradient descent with momentum at step $t$ is given by:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta)$$

$$\theta = \theta - v_t$$

where $\eta$ is the learning rate and $\gamma$ is the weight assigned to previous updates in the momentum equation, taking values between 0 and 1. Generally $\gamma = 0.9$.
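A minimal sketch of the momentum update in plain Python, assuming the standard form $v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta)$ followed by $\theta \leftarrow \theta - v_t$; the quadratic objective and the step counts are illustrative choices, not from the source.

```python
def momentum_step(theta, velocity, grad, eta=0.1, gamma=0.9):
    """One SGD-with-momentum update: accumulate velocity, then step."""
    velocity = gamma * velocity + eta * grad   # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - velocity                   # theta = theta - v_t
    return theta, velocity

# Usage: minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=2 * theta)
print(theta)  # oscillates at first, then decays toward the minimum at 0
```

Because the velocity carries information from previous steps, the iterate can overshoot and oscillate briefly before settling, but the accumulated direction lets it make faster progress than plain SGD along consistent descent directions.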

# PyTorch Usage

```python
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
```

Refer to these two sources for more information on the use of momentum and other tricks in optimization.