author: elampietti
score: 8 / 10

Datasets have grown to a point where machine learning is now constrained by computing time rather than a need for more training data. Bottou proposes that stochastic gradient descent is an effective machine learning algorithm for this excess of data. Since large scale datasets are not limited by the amount of data available, the optimization error can be increased to allow the computing reasources to be used more effectively.

A model’s performance can be measured either with empirical risk, En(f), which is the effectiveness of the model on the training set:

empirical risk

Or it can be measured with expected risk, E(f), which is the effectiveness of the model on future data:

expected risk

Gradient descent updates the weights using the entire training set with the following formula and has been previously cited as a good algorithm to minimize the empirical risk.

gradient descent

In comparison, stochastic gradient descent (SGD) randomly chooses one datapoint to calculate the gradients at each iteration rather than considering the whole dataset to reduce the computational cost.

SGD

The function, f, represents the best prediction function that we can obtain given our start to SGD. Therefore, we let F encompass the family of functions acquired during SGD from the chosen parameters where f is the function that minimizes E(f). Finding the empirical risk optimum can be costly, therefore the iterations are stopped once a specific accuracy is reached.

specific accuracy

The excess error generated from optimizing the empirical function is split into the approximation error, the estimation error, and the optimization error. With these three areas for error, a tradeoff exists between the optimization accuracy, the family of functions size, and size of the dataset. Small-scale learning problems are not constrained by computing time, therefore the optimization error can be made negligible to attain the best accuracy. On the other hand, large-scale learning problems have more training data, therefore the optimization error can be increased given the amount of computing power available.

The following table shows that the SGD and 2SGD algorithms perform the best due to having the fastest times to find the expected risk, which is the most important factor for big data when compute time is the main constraint.

realized that SGD is best for big data

The following diagram reveals how the stochastic gradient descent algorithm drastically outperforms the other algorithms in very little time and is only overtaken by TRON after the expected risk stops improving.

results 1

The next two diagrams reveal how the other SGD variants (ASGD & SGDQN) perform. In the following graphs, averaged stochastic gradient descent (ASGD) achieves optimal expected risk in only one iteration.

results 2

The following graphs reveal how single second order stochastic gradient pass (SGDQN) performs best for a CRF and the paper reports that all three SGD algorithms outperform the CRF L-BFGS optimizer which took 72 minutes instead of a couple minutes to achieve the same results.

results 3

TL;DR