Stochastic gradient descent (SGD) updates the network parameters using the gradient computed from a single sample. Because a single-sample gradient is a noisy estimate of the full-dataset gradient, the gradients computed at successive update steps vary widely; the optimization trajectory therefore tends to zig-zag, and the network can take a long time to converge.
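To make the update rule concrete, here is a minimal, self-contained sketch of plain single-sample SGD in PyTorch. The toy linear model, MSE loss, and randomly generated samples are illustrative assumptions, not part of the original text:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)       # toy model, assumed for illustration
loss_fn = torch.nn.MSELoss()
lr = 0.1                            # learning rate

for step in range(5):
    x, y = torch.randn(1, 2), torch.randn(1, 1)   # a single (noisy) sample
    loss_fn(model(x), y).backward()               # single-sample gradient
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad        # step along this one gradient
            p.grad.zero_()          # clear the gradient for the next step
```

Each step follows a different noisy gradient, which is exactly the source of the zig-zag behavior described above.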
Momentum addresses this by maintaining a weighted average of past gradients and updating the weights with that average. Averaging smooths out sample noise and hence reduces the variance of the gradient estimate.
Consider a neural network with its parameters represented by $\theta$, and let the loss for a sample, as a function of $\theta$, be $J(\theta)$. Then the update step in stochastic gradient descent with momentum at step $t$ is given by:

$$v_t = \gamma \, v_{t-1} + \eta \, \nabla_{\theta} J(\theta)$$

$$\theta = \theta - v_t$$

where $\eta$ is the learning rate and $\gamma$ is the weight assigned to previous updates in the momentum equation, taking values between 0 and 1. Generally $\gamma = 0.9$.
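The two equations can be implemented directly with an explicit velocity buffer per parameter. The sketch below mirrors them on the same toy setup as before (again an illustrative assumption, not the original's code); note that PyTorch's built-in `torch.optim.SGD` uses a closely related formulation in which the learning rate scales the whole velocity rather than just the gradient:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)       # same toy setup, assumed for illustration
loss_fn = torch.nn.MSELoss()
lr, gamma = 0.1, 0.9                # eta and gamma from the equations above

# One velocity buffer per parameter, initialized to zero (v_0 = 0)
velocity = [torch.zeros_like(p) for p in model.parameters()]

for step in range(5):
    x, y = torch.randn(1, 2), torch.randn(1, 1)
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, v in zip(model.parameters(), velocity):
            v.mul_(gamma).add_(p.grad, alpha=lr)  # v_t = gamma * v_{t-1} + eta * grad
            p -= v                                # theta = theta - v_t
            p.grad.zero_()
```

Because each velocity is a running average of past gradients, individual noisy gradients are damped and the updates follow a smoother path.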
>>> # Assumes model, input, target, and loss_fn are defined elsewhere.
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()                      # clear any accumulated gradients
>>> loss_fn(model(input), target).backward()   # compute gradients for this batch
>>> optimizer.step()                           # apply the momentum update
Refer to these two sources for more information on the use of momentum and other tricks in optimization.