author: samatharhay
score: 9 / 10

The paper introduces weight normalization, a reparameterization of the weight vectors of a neural network, to speed up convergence of stochastic gradient descent.

In a standard neural network, each neuron computes a weighted sum of its inputs followed by an elementwise nonlinearity:

$$y = \phi(\mathbf{w} \cdot \mathbf{x} + b)$$

where w (the weights) and x (the input) are both k-dimensional vectors, b is a scalar bias term, and φ denotes the nonlinearity.

In weight normalization, each weight vector w is reparameterized in terms of a parameter vector v of the same dimensionality k and a scalar parameter g:

$$\mathbf{w} = \frac{g}{\|\mathbf{v}\|}\,\mathbf{v}$$

where ‖v‖ denotes the Euclidean norm of v, so that ‖w‖ = g independently of the direction of v.

The gradients of the loss L are then taken with respect to v and g instead of w:

$$\nabla_g L = \frac{\nabla_{\mathbf{w}} L \cdot \mathbf{v}}{\|\mathbf{v}\|}, \qquad \nabla_{\mathbf{v}} L = \frac{g}{\|\mathbf{v}\|}\,\nabla_{\mathbf{w}} L - \frac{g\,\nabla_g L}{\|\mathbf{v}\|^{2}}\,\mathbf{v}$$

where $\nabla_{\mathbf{w}} L$ is the gradient with respect to the unnormalized weights w.

Weight normalization thus scales the gradient with respect to the weights by g/‖v‖ and projects it away from the current weight vector. The authors argue that, because the update to v is orthogonal to the current weight vector, the norm of v grows monotonically with the number of weight updates; if the learning rate is too large, ‖v‖ grows quickly, which shrinks the effective learning rate g/‖v‖ until an appropriate value is reached, making training robust to the choice of learning rate. Additionally, because of the projection, the covariance matrix of the gradient with respect to v is brought closer to the identity matrix, which may further speed up training.
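As a concrete illustration, here is a minimal NumPy sketch of a single weight-normalized neuron (the function and variable names are my own, not from the paper's code): the forward pass builds w from v and g, and the backward pass maps a gradient with respect to w into gradients with respect to v and g using the formulas above. The final check confirms that the resulting v-gradient is orthogonal to v, which is the projection property discussed above.

```python
import numpy as np

def forward(v, g, b, x):
    """Pre-activation of a weight-normalized neuron: w = (g / ||v||) * v, y = w.x + b."""
    w = (g / np.linalg.norm(v)) * v
    return np.dot(w, x) + b

def backward(v, g, grad_w):
    """Map a gradient w.r.t. the implicit weights w into gradients w.r.t. v and g:
    the v-gradient is the w-gradient scaled by g/||v|| and projected away from
    the current weight direction."""
    norm_v = np.linalg.norm(v)
    grad_g = np.dot(grad_w, v) / norm_v
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v
    return grad_v, grad_g

# Toy usage with made-up shapes.
rng = np.random.default_rng(0)
v, x = rng.normal(size=5), rng.normal(size=5)
g, b = 1.0, 0.0
y = forward(v, g, b, x)
grad_v, grad_g = backward(v, g, grad_w=x)  # for y = w.x + b, dL/dw = x when dL/dy = 1
print(np.dot(grad_v, v))                   # ~0: the v-gradient is orthogonal to v
```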

Batch normalization (BN) fixes the scale of the features generated by each layer of the network, making optimization robust against parameter initializations whose scales vary across layers. WN lacks this property, so proper initialization is needed. The authors do this by sampling the elements of v from a simple distribution with a fixed scale and then, before training, initializing b and g to normalize the minibatch statistics of all pre-activations on a single minibatch of data. This ensures that all features initially have zero mean and unit variance.
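A rough sketch of this data-dependent initialization for a single dense layer might look as follows; the function name, shapes, and use of NumPy are illustrative assumptions, while the fixed scale of 0.05 follows the normal distribution used in the paper's experiments.

```python
import numpy as np

def init_layer(X, k_out, scale=0.05, rng=None):
    """Data-dependent initialization of a weight-normalized dense layer,
    given one minibatch X of shape (batch_size, k_in)."""
    rng = np.random.default_rng() if rng is None else rng
    k_in = X.shape[1]
    # Sample the elements of v from a simple fixed-scale distribution.
    V = rng.normal(0.0, scale, size=(k_out, k_in))
    # Pre-activations with g = 1 and b = 0: t = (v . x) / ||v||.
    t = X @ (V / np.linalg.norm(V, axis=1, keepdims=True)).T
    mu, sigma = t.mean(axis=0), t.std(axis=0)
    # Choose g and b so that g * t + b has zero mean and unit variance
    # on this minibatch.
    g = 1.0 / sigma
    b = -mu / sigma
    return V, g, b

# Illustrative usage on random data (shapes are assumptions).
X = np.random.default_rng(0).normal(size=(64, 32))
V, g, b = init_layer(X, k_out=16)
```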

Since the mean of the neuron activations still depends on v, the authors also introduce a variant of batch normalization called mean-only batch normalization.

As in full batch normalization, the minibatch mean is subtracted out when computing the neuron activations, but the activations are not divided by the minibatch standard deviation. The activations are computed as follows:

$$t = \mathbf{w} \cdot \mathbf{x}, \qquad \tilde{t} = t - \mu[t] + b, \qquad y = \phi(\tilde{t})$$

where μ[t] is the minibatch mean of the pre-activation t.

The gradient of the loss with respect to the pre-activation t is:

$$\nabla_t L = \nabla_{\tilde{t}} L - \mu\left[\nabla_{\tilde{t}} L\right]$$

where μ[·] again denotes the minibatch mean, so mean-only batch normalization has the effect of centering the gradients that are backpropagated.
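Below is a minimal sketch of how mean-only batch normalization combines with weight normalization for a dense layer; the names, shapes, and the ReLU nonlinearity are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def wn_mean_only_bn_forward(X, V, g, b):
    """Forward pass: weight-normalized weights plus mean-only batch normalization."""
    # Build the weight matrix row by row as (g / ||v||) * v.
    W = (g[:, None] / np.linalg.norm(V, axis=1, keepdims=True)) * V
    t = X @ W.T                        # pre-activations, shape (batch, k_out)
    t_tilde = t - t.mean(axis=0) + b   # subtract the minibatch mean and add b;
                                       # no division by the minibatch std dev
    return np.maximum(t_tilde, 0.0)    # example nonlinearity (ReLU)

def mean_only_bn_backward(grad_t_tilde):
    """Backprop through the mean subtraction: the gradient w.r.t. t is the
    gradient w.r.t. t_tilde minus its minibatch mean (gradient centering)."""
    return grad_t_tilde - grad_t_tilde.mean(axis=0)

# Illustrative usage with random parameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 32))
V = rng.normal(0.0, 0.05, size=(16, 32))
g, b = np.ones(16), np.zeros(16)
out = wn_mean_only_bn_forward(X, V, g, b)
```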

To demonstrate the usefulness of their method, the authors conducted experiments with four different models:

Supervised Classification: CIFAR-10

(see Figure 1 of the paper)

Generative Modelling: Convolutional VAE

(see Figure 3 of the paper)

Generative Modelling: DRAW

(see Figure 4 of the paper)

Reinforcement Learning: DQN

(see Figure 5 of the paper)

The authors showed that weight normalization improves convergence speed across several different models, including settings where batch normalization cannot readily be applied.

They also clearly explained how to implement WN and proposed a variant of batch normalization to go along with it.

TL;DR: Weight normalization reparameterizes each weight vector as w = (g/‖v‖)v, decoupling its length from its direction. Combined with a data-dependent initialization and an optional mean-only batch normalization, it is cheap to compute, easy to implement, and speeds up convergence in supervised, generative, and reinforcement learning models, including cases where batch normalization does not apply.