Stochastic Gradient Descent

Stochastic gradient descent is an incremental form of gradient descent for optimizing a differentiable objective function. In ordinary gradient descent, the gradient must be computed over all samples in the training set to perform a single update of the network parameters. Stochastic gradient descent (SGD) instead computes the gradient on a single randomly chosen sample, or on a small random subset (a mini-batch), hence the name "stochastic", and updates the parameters using that gradient alone.
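
To make the contrast concrete, here is a minimal sketch of both update styles. It is not tied to any particular model: it assumes a dataset X, y and a user-supplied grad_loss(params, x, y) that returns the gradient of the loss on one sample; those names are illustrative placeholders.

import numpy as np

def sgd_step(params, grad_loss, X, y, lr=0.1):
    """One SGD update: gradient on a single randomly chosen sample."""
    i = np.random.randint(len(X))           # pick one sample at random
    g = grad_loss(params, X[i], y[i])       # gradient on that sample only
    return params - lr * g                  # move against the gradient

def gd_step(params, grad_loss, X, y, lr=0.1):
    """One full-batch gradient descent update, for comparison."""
    g = np.mean([grad_loss(params, x_i, y_i) for x_i, y_i in zip(X, y)], axis=0)
    return params - lr * g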

Let $\theta$ denote the network parameters and let $L_i(\theta)$ be the loss on sample $i$ as a function of $\theta$. The update at step $t$ of stochastic gradient descent is then

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L_i(\theta_t)$$

where $\eta$ is the learning rate.
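
As a concrete, hypothetical instance of this rule, suppose the per-sample loss is a squared error $L_i(\theta) = \tfrac{1}{2}(\theta^\top x_i - y_i)^2$, whose gradient is $(\theta^\top x_i - y_i)\,x_i$. One update step could then look like the following sketch (all values are made up for illustration):

import numpy as np

theta = np.array([0.5, -0.2])            # current parameters theta_t
x_i, y_i = np.array([1.0, 2.0]), 1.0     # one randomly drawn sample
lr = 0.1                                 # learning rate eta

error = theta @ x_i - y_i                # prediction error on this sample
grad = error * x_i                       # gradient of 0.5 * error**2 w.r.t. theta
theta = theta - lr * grad                # theta_{t+1} = theta_t - eta * grad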

However, vanilla SGD is rarely used on its own these days; it is usually combined with momentum, and more sophisticated adaptive methods such as Adam, RMSProp, and Adagrad are often preferred in practice.
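
For illustration, here is a minimal sketch of the momentum variant in the heavy-ball form; the function and argument names are placeholders, not part of any library.

import numpy as np

def sgd_momentum_step(params, velocity, grad, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update (heavy-ball form)."""
    velocity = momentum * velocity + grad    # accumulate a running direction
    params = params - lr * velocity          # step along the smoothed direction
    return params, velocity

In PyTorch, the same behavior is obtained by passing momentum=0.9 to torch.optim.SGD.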

PyTorch Usage

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()                      # clear gradients from the previous step
>>> loss_fn(model(input), target).backward()   # forward pass, loss, and backpropagation
>>> optimizer.step()                           # apply the SGD update to the parameters
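
For context, those four calls can be embedded in a small training loop. The sketch below uses an illustrative linear model, random placeholder data, and a mean-squared-error loss; only the optimizer calls are meant to be taken literally.

import torch
import torch.nn as nn

# Illustrative setup: a tiny linear model on random placeholder data.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

inputs = torch.randn(64, 10)                     # 64 fake samples with 10 features
targets = torch.randn(64, 1)

for step in range(100):
    idx = torch.randint(0, 64, (16,))            # random mini-batch of 16 samples
    optimizer.zero_grad()                        # reset accumulated gradients
    loss = loss_fn(model(inputs[idx]), targets[idx])
    loss.backward()                              # backpropagate through the mini-batch
    optimizer.step()                             # SGD (with momentum) parameter update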

Refer to torch.optim.SGD for more details.