Stochastic gradient descent updates the network parameters using the gradient computed from a single sample. As a result, the optimization path can follow a zig-zag pattern and the network can take a long time to converge. A likely reason is that individual training samples are noisy, which introduces a large amount of variance among the gradients computed at different update steps.

Momentum computes an exponentially weighted average of past gradients and updates the weights using this average. This smooths out the noise and hence reduces the variance of the gradient estimate.
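As a toy illustration of this variance reduction (not from the original text), one can apply an exponentially weighted average to synthetic noisy gradients and compare variances; the noise scale and the decay value 0.9 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0
# noisy per-sample gradient estimates around the true gradient
noisy_grads = true_grad + rng.normal(scale=2.0, size=10_000)

beta = 0.9  # assumed decay, analogous to the momentum coefficient
avg = 0.0
smoothed = []
for g in noisy_grads:
    avg = beta * avg + (1 - beta) * g  # exponentially weighted average
    smoothed.append(avg)
smoothed = np.array(smoothed)

print(np.var(noisy_grads))     # variance of the raw gradient estimates
print(np.var(smoothed[100:]))  # far smaller after smoothing (warm-up skipped)
```

The smoothed sequence tracks the true gradient while its step-to-step variance shrinks by roughly a factor of (1 - beta) / (1 + beta).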

Consider a neural network with its parameters represented by $\theta$, and let the loss for sample $i$, as a function of $\theta$, be given by $L_i(\theta)$. Then the update step of stochastic gradient descent with momentum at step $t$ is given by the equations:

$$v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta L_i(\theta_{t-1})$$
$$\theta_t = \theta_{t-1} - v_t$$

where $\eta$ is the learning rate and $\gamma$ is the weight assigned to previous updates in the momentum equation, taking values between 0 and 1. Generally $\gamma = 0.9$.
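The update equations above can be sketched in plain Python on a toy problem; the quadratic loss and the hyperparameter values here are illustrative assumptions, not part of the original text:

```python
def grad(theta):
    # gradient of the toy loss L(theta) = 0.5 * theta**2
    return theta

eta, gamma = 0.1, 0.9  # learning rate and momentum coefficient (assumed values)
theta = 5.0            # starting point
v = 0.0                # momentum buffer

for t in range(200):
    v = gamma * v + eta * grad(theta)  # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - v                  # theta_t = theta_{t-1} - v_t

print(theta)  # converges toward the minimum at 0
```

Setting `gamma = 0.0` recovers vanilla stochastic gradient descent, which makes the role of the momentum buffer easy to inspect.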

PyTorch Usage

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward() 
>>> optimizer.step()

Refer to these two sources for more information on the use of momentum and other tricks in optimization.

  1. An overview of gradient descent optimization algorithms
  2. Stochastic Gradient Descent with momentum