Rethinking "Batch" in BatchNorm (Wu & Johnson, 2021) - Summary
author: chengchunhsu
score: 7 / 10

A short review of BatchNorm.

In the paper, BatchNorm is described as a “necessary evil” in the design of CNNs. It normalizes each channel with a mean μ and variance σ^2 computed over a batch of samples:

μ_B = (1/m) Σ_i x_i,   σ_B^2 = (1/m) Σ_i (x_i − μ_B)^2

y = γ · (x − μ_B) / sqrt(σ_B^2 + ε) + β
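
A minimal sketch of these equations (my own PyTorch illustration, not code from the paper): for a (N, C, H, W) feature map, the statistics are computed per channel over the batch and spatial dimensions, followed by the learned affine transform.

```python
import torch

def batchnorm_train_step(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); mu and sigma^2 are computed per channel over N, H, W.
    mu = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var

x = torch.randn(4, 8, 16, 16)      # a mini-batch of feature maps
gamma = torch.ones(1, 8, 1, 1)     # learned scale
beta = torch.zeros(1, 8, 1, 1)     # learned shift
y, mu_B, var_B = batchnorm_train_step(x, gamma, beta)
```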

Sampling strategies:

The central design issue: over which set of samples should μ and σ^2 be computed?


Issue #1: computing μ and σ^2 during training

A comparison of two methods for computing the population statistics: EMA and PreciseBN.

Exponential moving average (EMA):

μ ← λ · μ + (1 − λ) · μ_B,   σ^2 ← λ · σ^2 + (1 − λ) · σ_B^2,   where λ is the EMA momentum.

(Figure in the paper: behavior of EMA statistics during training.)
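
A minimal sketch of the EMA update above (my own illustration); λ is the EMA momentum, and PyTorch's BatchNorm `momentum` argument corresponds to 1 − λ.

```python
import torch

def ema_update(running_mu, running_var, mu_B, var_B, lam=0.9):
    # Blend the previous running estimates with the current mini-batch statistics.
    running_mu = lam * running_mu + (1 - lam) * mu_B
    running_var = lam * running_var + (1 - lam) * var_B
    return running_mu, running_var

running_mu, running_var = torch.zeros(8), torch.ones(8)   # initial estimates
mu_B, var_B = torch.randn(8), torch.rand(8) + 0.5         # one mini-batch's statistics
running_mu, running_var = ema_update(running_mu, running_var, mu_B, var_B)
```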

PreciseBN: freeze the model and re-estimate μ and σ^2 by running forward passes over many samples (the paper finds on the order of 10^3 samples to be sufficient), aggregating the per-batch statistics.
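
A sketch of the PreciseBN idea (my own illustration, not the paper's implementation; it assumes all batches have the same size): keep the weights fixed, run forward passes over many batches, aggregate per-channel statistics at every BN layer, and write them back as the population estimates used at inference.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def precise_bn(model, batches):
    bns = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    mean_sum = [torch.zeros_like(bn.running_mean) for bn in bns]
    sqmean_sum = [torch.zeros_like(bn.running_var) for bn in bns]

    def make_hook(i):
        def hook(module, inputs, output):
            x = inputs[0]
            mean_sum[i] += x.mean(dim=(0, 2, 3))            # per-channel E[x] of this batch
            sqmean_sum[i] += (x ** 2).mean(dim=(0, 2, 3))    # per-channel E[x^2] of this batch
        return hook

    handles = [bn.register_forward_hook(make_hook(i)) for i, bn in enumerate(bns)]
    model.train()   # BN still normalizes with batch stats, as it did during training
    n = 0
    for x in batches:
        model(x)
        n += 1
    for h in handles:
        h.remove()

    # Aggregate: mu = mean of batch means, sigma^2 = mean of E[x^2] minus mu^2,
    # then overwrite the EMA buffers used at inference.
    for bn, ms, ss in zip(bns, mean_sum, sqmean_sum):
        mu = ms / n
        bn.running_mean.copy_(mu)
        bn.running_var.copy_(ss / n - mu ** 2)

# Hypothetical usage with a toy model and random batches:
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
precise_bn(model, [torch.randn(4, 3, 16, 16) for _ in range(10)])
```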

Result: PreciseBN gives more stable and accurate population statistics than EMA, especially in large-batch training and before the model has converged; after convergence with typical batch sizes the two behave similarly.


Issue #2: inconsistent behavior between training and inference

The gap between the mini-batch statistics used during training and the population statistics used during inference introduces a train/test inconsistency.

Strategy #1: use mini-batch statistics during inference

Strategy #2: use population statistics during training
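
A minimal sketch of the inconsistency itself (my own illustration): the same BatchNorm layer produces different outputs in train and eval mode, because train mode normalizes with the current mini-batch statistics while eval mode uses the EMA population estimates (keeping the layer in train mode at test time is essentially Strategy #1).

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8)
x = torch.randn(4, 8, 16, 16)

bn.train()
y_train = bn(x)   # normalized with this mini-batch's mu / sigma^2

bn.eval()
y_eval = bn(x)    # normalized with the running (population) estimates

print((y_train - y_eval).abs().max())   # non-zero: the train/inference gap
```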


Issue #3: batches from diverse domains/sources

BatchNorm training can be viewed as two separate phases:

  1. The features (and BN's affine parameters) are learned by SGD, using mini-batch statistics.
  2. The population statistics are then estimated on top of those features by EMA or PreciseBN.

What happens when multiple domains are involved during training and inference?

Domain gaps can occur between the inputs used in: (1) SGD training, (2) population-statistics estimation, and (3) testing.

Let’s discuss two scenarios.

Scenario #1: a model is trained on one domain but tested on others.

Scenario #2: a model is trained on multiple domains (in the paper's example, features from different layers/levels fed into a shared head act as different domains).
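
A minimal sketch of domain-specific statistics (my own illustration, not the paper's code): keep a separate BatchNorm layer, and hence separate population statistics, per domain, and route each batch through the layer of its own domain.

```python
import torch
import torch.nn as nn

class DomainSpecificBN2d(nn.Module):
    def __init__(self, num_features, num_domains):
        super().__init__()
        self.bns = nn.ModuleList([nn.BatchNorm2d(num_features) for _ in range(num_domains)])

    def forward(self, x, domain_id):
        # Each domain keeps its own mu / sigma^2 (and affine parameters).
        return self.bns[domain_id](x)

layer = DomainSpecificBN2d(num_features=8, num_domains=2)
y0 = layer(torch.randn(4, 8, 16, 16), domain_id=0)   # batch from domain 0
y1 = layer(torch.randn(4, 8, 16, 16), domain_id=1)   # batch from domain 1
```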


Issue #4: information leakage within a batch

Unwanted batch-wise information may be exploited, since the model's output for one sample depends on the other samples in its mini-batch rather than on that sample alone.

Some examples: contrastive learning (e.g., MoCo), where positive and negative pairs share the statistics of a batch; and the paper's R-CNN head experiment, where each per-GPU batch contains RoIs from the same image.

Solutions: SyncBN (compute the statistics over the whole batch across GPUs), or randomly shuffle samples across GPUs before the BN layers (as in MoCo's ShuffleBN); see the sketch below.
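
Two hedged sketches of these fixes (my own illustrations, not the paper's code):

```python
import torch
import torch.nn as nn

# (a) SyncBN: compute the statistics over the whole global batch across GPUs, so a
# biased per-GPU grouping of samples no longer shapes the statistics. Intended to be
# used together with DistributedDataParallel.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# (b) ShuffleBN-style shuffling (single-process simulation): randomly permute the
# global batch before splitting it into per-device chunks, so each chunk's statistics
# come from an uncorrelated subset of samples.
x = torch.randn(16, 3, 32, 32)     # global batch
perm = torch.randperm(x.size(0))
chunks = x[perm].chunk(4)          # 4 "devices", each now sees a random subset
```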

Result: both SyncBN and random shuffling can alleviate the leakage issue (Table 6 in the paper).

TL;DR: the "batch" in BatchNorm is a design choice rather than a fixed concept; how and over which samples the normalization statistics are computed (EMA vs. PreciseBN, mini-batch vs. population, per-domain, per-GPU) determines training stability, train/inference consistency, behavior under domain shift, and whether information leaks within a batch.