author: nilesh2797
score: 7 / 10

What is the core idea?

The authors propose and study Weight Standardization (WS) and Batch-Channel Normalization (BCN) for effectively training deep conv nets on vision tasks. Through extensive empirical results and theoretical analysis, they argue for the benefits of WS and BCN when training with small batch sizes.

Context

Bring the success factors of Batch Normalization (BN) to micro-batch training, i.e. without relying on large batch sizes during training.

Weight Standardization (WS)

Normalize (standardize) the weights of the convolution layers, used together with Group Normalization (GN) on the activations.
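
A minimal PyTorch sketch of a weight-standardized convolution, assuming the standard formulation of WS; the class name WSConv2d, the eps value, and the GN/ReLU usage block are my own illustrative choices, not the authors' exact code:

```python
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized on the fly in every forward pass."""
    def forward(self, x):
        w = self.weight                                   # shape: (C_out, C_in, kH, kW)
        # Mean and std are computed per output channel, over (C_in, kH, kW)
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5   # small eps for numerical stability
        w_hat = (w - mean) / std
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Typical usage: swap nn.Conv2d for WSConv2d and follow it with GroupNorm
block = nn.Sequential(
    WSConv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.GroupNorm(num_groups=32, num_channels=128),
    nn.ReLU(inplace=True),
)
```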

WS smooths the optimization landscape

The authors theoretically argue that WS smooths the optimization landscape by reducing the Lipschitz constants of the loss and of its gradients.

Empirically, they show that both operations in WS (i.e. making the mean 0 and making the variance 1) improve performance, with the bulk of the gains coming from the former.

(Figure: ablations when training ResNet-50 on ImageNet with Group Normalization (GN); Eq. 11 refers to the mean-0 step of WS and Eq. 12 to the variance-1 step.)
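
For reference, the two operations can be written as follows (a reconstruction of the standard WS definition in my own notation; the equation numbering matches the caption above). Here $W \in \mathbb{R}^{O \times I}$ is the conv weight reshaped so that $i$ indexes output channels, $I = C_{in} \cdot k_h \cdot k_w$, and $\epsilon$ is a small constant:

$$
\dot{W}_{i,j} = W_{i,j} - \frac{1}{I}\sum_{k=1}^{I} W_{i,k} \qquad \text{(mean 0, Eq. 11)}
$$

$$
\hat{W}_{i,j} = \frac{\dot{W}_{i,j}}{\sqrt{\frac{1}{I}\sum_{k=1}^{I} \dot{W}_{i,k}^{2} + \epsilon}} \qquad \text{(variance 1, Eq. 12)}
$$

The convolution then uses $\hat{W}$ in place of $W$.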

WS avoids elimination singularities

Intuitively, WS is able to pass similarities in the input channels on to the output channels. This is shown empirically in two steps (a toy illustration follows below):

1. Being closer to BN means being farther from elimination singularities.
2. WS brings channel-based normalization (e.g. GN) closer to BN.
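
As a toy illustration of the intuition above (purely illustrative, not from the paper): a channel whose incoming weights have collapsed towards zero, i.e. is close to an elimination singularity, recovers output statistics comparable to the other channels once its weights are standardized.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize_rows(W, eps=1e-5):
    # Zero mean and unit variance per output channel (row), as in WS
    mu = W.mean(axis=1, keepdims=True)
    sigma = W.std(axis=1, keepdims=True) + eps
    return (W - mu) / sigma

# 4 output channels with 256 incoming weights each; channel 0's weights
# are scaled towards zero, i.e. close to an elimination singularity.
W = rng.normal(size=(4, 256))
W[0] *= 1e-3

x = rng.normal(size=(256, 10_000))  # standardized inputs
for name, weights in [("raw", W), ("WS ", standardize_rows(W))]:
    per_channel_std = (weights @ x).std(axis=1)
    print(name, np.round(per_channel_std, 2))
# raw: channel 0 has near-zero output variance (effectively "eliminated")
# WS : all output channels have comparable output variance
```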

Batch-Channel Normalization (BCN)

Apply both batch normalization and channel normalization to the activations, one after the other.

Channel-based normalization makes estimate-based normalization possible in the micro-batch setting, and estimate-based normalization in turn helps channel-based normalization avoid elimination singularities.
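
A rough PyTorch sketch of what BCN could look like given the description above; the class name, hyper-parameters, and the running-estimate update rule are illustrative assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class BatchChannelNorm(nn.Module):
    """Estimate-based batch normalization (normalize with running statistics,
    not micro-batch statistics) followed by channel-based normalization (GroupNorm)."""

    def __init__(self, num_channels, num_groups=32, momentum=0.1, eps=1e-5):
        super().__init__()
        self.momentum, self.eps = momentum, eps
        self.register_buffer("running_mean", torch.zeros(1, num_channels, 1, 1))
        self.register_buffer("running_var", torch.ones(1, num_channels, 1, 1))
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.gn = nn.GroupNorm(num_groups, num_channels, eps=eps)

    def forward(self, x):
        if self.training:
            # Update the running estimates from the (micro-)batch ...
            with torch.no_grad():
                batch_mean = x.mean(dim=(0, 2, 3), keepdim=True)
                batch_var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * batch_var)
        # ... but normalize with the estimates, which are treated as constants
        x = (x - self.running_mean) / torch.sqrt(self.running_var + self.eps)
        x = self.gamma * x + self.beta
        # Channel-based normalization on top
        return self.gn(x)
```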

Results

(Result figures/tables omitted: results with WS, and results with WS + BCN.)

TL;DR