author: WhyToFly
score: 7 / 10

What is the core idea?

Batch normalization is used in state-of-the-art ResNets to achieve high accuracy on image classification tasks.

It has multiple benefits that improve training: among others, it has a regularizing effect, it suppresses the mean-shift introduced by ReLU activations, and it keeps the signal well-behaved in very deep networks, allowing training with large batch sizes and learning rates.

However, normalization also has negative side effects: it adds computational and memory overhead, causes a discrepancy between training and inference behaviour, and breaks the independence between the examples in a minibatch.

That is why the paper introduces “Normalizer-Free ResNets” (“NF-ResNets”), which perform very well without normalization while also being more efficient to train.

Main ideas: keep the residual stream variance-preserving without normalization (via the scaled residual blocks and Scaled Weight Standardization described below) and stabilize training at large batch sizes with Adaptive Gradient Clipping.

How is it realized (technically)?

Residual blocks have the form

\[h_{i+1}=h_i+\alpha f_i (h_i/\beta_i)\\ \textrm{where } h_i=\textrm{inputs to the residual block i}\\ f_i=\textrm{function computed by residual block i, parameterized to be variance-preserving}\\ \alpha=\textrm{scalar specifying variance increase rate per block; usually small value like 0.2}\\ \beta_i=\textrm{scalar predicting the standard deviation of the inputs to block i}\]
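A minimal sketch of this block in PyTorch (my own illustration, not the authors' code; `branch`, `alpha`, and `expected_var` are placeholder names, and `branch` is assumed to be variance-preserving):

```python
import torch.nn as nn

class NFResidualBlock(nn.Module):
    """Normalizer-free residual block: h_{i+1} = h_i + alpha * f_i(h_i / beta_i)."""

    def __init__(self, branch: nn.Module, alpha: float = 0.2, expected_var: float = 1.0):
        super().__init__()
        self.branch = branch              # f_i, assumed variance-preserving
        self.alpha = alpha                # variance increase rate per block
        self.beta = expected_var ** 0.5   # beta_i, predicted std of the block's inputs

    def forward(self, h):
        return h + self.alpha * self.branch(h / self.beta)

# The expected variance of the next block's input is tracked analytically:
# var_{i+1} = var_i + alpha**2, so beta_{i+1} = sqrt(var_{i+1})
# (and it is reset after the downsampling/transition blocks in the paper).
```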

Scaled Weight Standardization

This prevents a mean-shift and, combined with scaled activation functions, leads to a variance-preserving function.

Reparameterization of convolutional layers:

\[\hat{W_{ij}}=\frac{W_{ij}-\mu_i}{\sqrt{N}\sigma_i}\\ \textrm{where } \mu_i=(1/N)\sum_j{W_{ij}}\\ \sigma_i^2=(1/N)\sum_j{(W_{ij}-\mu_i)^2}\\ N=\textrm{fan-in}\]
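A minimal PyTorch sketch of a convolution with this reparameterization (my own illustration; the small `eps` is added only for numerical stability and is not part of the formula above):

```python
import torch.nn as nn
import torch.nn.functional as F

class ScaledStdConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized on the fly:
    W_hat_ij = (W_ij - mu_i) / (sqrt(N) * sigma_i), with N = fan-in."""

    def forward(self, x):
        w = self.weight                                   # (out_ch, in_ch, kH, kW)
        fan_in = w[0].numel()                             # N = in_ch * kH * kW
        mean = w.mean(dim=(1, 2, 3), keepdim=True)        # mu_i over the fan-in
        var = ((w - mean) ** 2).mean(dim=(1, 2, 3), keepdim=True)  # sigma_i^2
        eps = 1e-8                                        # numerical safeguard (my addition)
        w_hat = (w - mean) / (fan_in * var + eps).sqrt()
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Usage is identical to nn.Conv2d, e.g. ScaledStdConv2d(64, 128, kernel_size=3, padding=1).
```

The paper additionally uses learnable gains on top of the standardized weights; they are omitted here for brevity.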

Scaling the activation function so that it is variance-preserving (for ReLU):

\[\textrm{with scaling factor }\gamma=\sqrt{2/(1-(1/\pi))}\]
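As a quick sanity check (my own derivation, not reproduced from the summary above), this value follows from the moments of a ReLU applied to a unit-Gaussian input:

\[\textrm{For } x\sim\mathcal{N}(0,1):\quad \mathbb{E}[\max(x,0)]=\frac{1}{\sqrt{2\pi}},\quad \mathbb{E}[\max(x,0)^2]=\frac{1}{2}\\ \Rightarrow \textrm{Var}(\max(x,0))=\frac{1}{2}-\frac{1}{2\pi}=\frac{1}{2}\left(1-\frac{1}{\pi}\right)\\ \Rightarrow \gamma=\frac{1}{\sqrt{\textrm{Var}(\max(x,0))}}=\sqrt{2/(1-(1/\pi))}\]

so multiplying the ReLU output by γ restores unit variance.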

Adaptive Gradient Clipping

The gradients are clipped unit-wise, based on the ratio of the norm of each unit's gradient to the norm of the corresponding weights, for every layer:

\[G_i\rightarrow\left\{\begin{array}{ll} \lambda\frac{\lVert W_i \rVert_F}{\lVert G_i \rVert_F}G_i, & \textrm{if } \frac{\lVert G_i \rVert_F}{\lVert W_i \rVert_F} > \lambda\\ G_i, & \textrm{otherwise}\end{array}\right.\\ \textrm{where } \lambda=\textrm{scalar hyperparameter}\\ G_i=\textrm{unit i of the gradient}\\ W_i=\textrm{unit i of the weight matrix}\\ \lVert \cdot \rVert_F=\textrm{Frobenius norm}\]

Clipping is performed on all but the last layer.
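A simplified PyTorch sketch of AGC (my own illustration; the paper additionally clamps the weight norm from below with a small ε ≈ 1e-3, mirrored by `eps` here, and the clipping threshold shown is only an example value):

```python
import torch

def adaptive_grad_clip(parameters, clip: float = 0.01, eps: float = 1e-3):
    """Unit-wise adaptive gradient clipping:
    rescale G_i whenever ||G_i||_F / ||W_i||_F exceeds `clip` (lambda)."""
    for p in parameters:
        if p.grad is None:
            continue
        # Treat each output unit (row of the flattened weight) as one W_i / G_i.
        w = p.detach().reshape(p.shape[0], -1)
        g = p.grad.detach().reshape(p.shape[0], -1)
        w_norm = w.norm(dim=1, keepdim=True).clamp_min(eps)   # ||W_i||_F (clamped from below)
        g_norm = g.norm(dim=1, keepdim=True)                  # ||G_i||_F
        max_norm = clip * w_norm
        # Rescale only the units whose gradient norm exceeds lambda * ||W_i||_F.
        scale = torch.where(g_norm > max_norm,
                            max_norm / g_norm.clamp_min(1e-12),
                            torch.ones_like(g_norm))
        p.grad.mul_(scale.reshape([-1] + [1] * (p.dim() - 1)))
```

In use, this would be called between `loss.backward()` and `optimizer.step()`, passing every parameter except those of the final linear layer, as noted above.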

The model architecture is a modified version of the SE-ResNeXt-D model:

Figure 1: Residual block

The modifications include increasing the number of output channels of the convolutions (since this leaves the training speed on modern hardware almost unchanged) and a simpler, more efficient depth-scaling rule for the larger model variants.

How well does the paper perform?

Smaller Normalizer-Free ResNets are able to match the accuracy of the batch-normalized EfficientNet-B7 on ImageNet while being significantly (up to 8.7 times) faster to train (see Figure 2). The larger architectures achieve a new state-of-the-art top-1 accuracy of 86.5%.

Figure 2: Accuracy compared to training speed

What interesting variants are explored?

Since this architecture cannot take advantage of the regularizing effect of normalization, it tends to overfit on datasets like ImageNet. The authors show, however, that Normalizer-Free networks perform better than their normalized counterparts when pre-trained on a large dataset and then fine-tuned on ImageNet.

TL;DR

With variance-preserving residual blocks, Scaled Weight Standardization and Adaptive Gradient Clipping, ResNets can reach state-of-the-art ImageNet accuracy without any normalization layers, while training significantly faster than comparable EfficientNets.