Log likelihood

Given i.i.d. samples \(X=\{x_i\}\), the log likelihood of the samples under model parameters \(\theta\) is \[ \begin{aligned} ll(X|\theta) &= \log{\prod_{i}P_\theta(x_i)} \\ &= \sum_{i}\log{P_\theta(x_i)} \end{aligned} \] Note that this is simply the log of the original likelihood function. Because the log is monotonic and has a simple derivative, the log-likelihood is usually used in place of the likelihood when we are interested in finding its maximum. It is more convenient to work with not only because the gradient of a sum is easier to compute than the gradient of a product, but also because summing is more numerically stable than multiplying many small probabilities. The log-likelihood also has better gradient properties:
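As a quick numerical sketch (with made-up per-sample probabilities), multiplying many small probabilities underflows in floating point, while summing their logs stays well behaved:

```python
import torch

# Hypothetical per-sample probabilities P_theta(x_i) for 1000 i.i.d. samples.
torch.manual_seed(0)
probs = torch.rand(1000) * 0.09 + 0.01           # values in [0.01, 0.1)

log_of_product = torch.log(probs.prod())         # product underflows to 0, log gives -inf
sum_of_logs = torch.log(probs).sum()             # stays finite

print(log_of_product)  # tensor(-inf)
print(sum_of_logs)     # a finite negative number, roughly -3000
```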

Note that the gradient of the sigmoid goes to \(0\) for both large positive and large negative values. This means that the sigmoid gives you no gradient signal whether you're confidently right (large positive value) or confidently wrong (large negative value). If you end up in a bad solution, you'll be stuck (no gradient to get you out). The sigmoid cross entropy (log-sigmoid) does not suffer from this issue, as its gradient approaches 1 for large negative values. In practice, always use the log-likelihood if possible.
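A minimal sketch of this, using PyTorch autograd to compare the two gradients at a few hypothetical logit values:

```python
import torch
import torch.nn.functional as F

# Compare gradients of sigmoid vs. log-sigmoid at large negative, zero, and large positive logits.
x = torch.tensor([-20.0, 0.0, 20.0], requires_grad=True)

torch.sigmoid(x).sum().backward()
print(x.grad)          # ~[2e-9, 0.25, 2e-9]: vanishes at both extremes

x.grad = None
F.logsigmoid(x).sum().backward()
print(x.grad)          # ~[1.0, 0.5, 2e-9]: gradient ~1 when the logit is very wrong
```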

The negative log likelihood is a special case of the cross entropy where the target distribution is a one-hot encoding of the correct class. When using the log-likelihood as a loss in PyTorch, you'll almost always use the cross-entropy.

You’ll never have to use the log-likelihood loss directly; instead, you combine it with a softmax or sigmoid, as the sketch below shows.
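A small sketch of this equivalence, with made-up logits for a single 4-class example: the cross entropy against a one-hot target reduces to the negative log probability of the correct class, and `F.cross_entropy` computes the same thing by combining a log-softmax with the negative log likelihood.

```python
import torch
import torch.nn.functional as F

# Made-up logits for one sample over 4 classes; the true class is index 2.
logits = torch.tensor([[1.0, -0.5, 2.0, 0.3]])
target = torch.tensor([2])

log_q = F.log_softmax(logits, dim=-1)                  # model log-probabilities
one_hot = F.one_hot(target, num_classes=4).float()

ce_by_hand = -(one_hot * log_q).sum()                  # -sum_k p_k log q_k
nll = -log_q[0, target.item()]                         # -log q_correct

print(ce_by_hand, nll)                                 # same value
print(F.cross_entropy(logits, target))                 # log-softmax + NLL in one call, same value
```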

See wiki for more details.

PyTorch Usage

PyTorch has a negative log likelihood loss, torch.nn.NLLLoss, but you should never have to use it directly in this class. Use the softmax or sigmoid cross entropy losses instead.
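For example, a minimal sketch with random logits and labels (batch size and class count are made up here); both losses take raw logits, since they apply the log-softmax or log-sigmoid internally:

```python
import torch
import torch.nn as nn

# Multi-class: nn.CrossEntropyLoss applies log-softmax + NLL internally,
# so it takes raw logits and integer class labels.
logits = torch.randn(8, 10)                  # hypothetical batch of 8, 10 classes
labels = torch.randint(0, 10, (8,))
print(nn.CrossEntropyLoss()(logits, labels))

# Binary / multi-label: nn.BCEWithLogitsLoss applies the log-sigmoid internally,
# so it also takes raw logits (targets are 0/1 floats).
bin_logits = torch.randn(8)
bin_labels = torch.randint(0, 2, (8,)).float()
print(nn.BCEWithLogitsLoss()(bin_logits, bin_labels))
```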