author: jordi1215
score: 9/10
1. Class Imbalance Problem
The paper tackles the problem of long-tailed data distribution, where a few classes account for most of the data while most classes are under-represented.

Head: classes with small indices, which have large numbers of samples.

Tail: classes with large indices, which have small numbers of samples.

Black Solid Line: models trained directly on these samples are biased toward dominant classes.

Red Dashed Line: reweighting the loss by inverse class frequency may yield poor performance on real-world data with high class imbalance.

Blue Dashed Line: a class-balanced term is designed to reweight the loss by the inverse effective number of samples.
2. Effective Number of Samples
2.1 Definition
Intuitively, the more data, the better. However, since there is information overlap among data, as the number of samples increases, the marginal benefit a model can extract from the data diminishes.
Left: given a class, denote the set of all possible data in the feature space of this class as S. Assume the volume of S is N, with N ≥ 1.
For a class, N can be viewed as the number of unique prototypes.

Middle: Each sample in a subset of S has the unit volume of 1 and may overlap with other samples.

Right: each subset is randomly sampled from S to cover the entire set S. The more data is sampled, the better the coverage of S. The expected total volume of sampled data increases with the number of samples and is bounded by N.
Therefore, the effective number of samples is defined as the expected volume of samples.
The idea is to capture the diminishing marginal benefit of using more data points of a class. Due to intrinsic similarities among real-world data, as the number of samples grows, it is highly possible that a newly added sample is a near-duplicate of an existing sample.
In addition, since CNNs are trained with heavy data augmentation, all augmented examples are also considered the same as the original example.
2.2 Mathematical Formulation

Denote the effective number (expected volume) of samples as \(E_n\), where \(n\) is the number of samples.

To simplify the problem, the case of partial overlap is not considered.

That is, a newly sampled data point can only interact with previously sampled data in two ways: either it lies entirely inside the set of previously sampled data, with probability p, or entirely outside, with probability 1 − p.
Proposition (Effective Number):
\[E_n = \frac{1-\beta^n}{1-\beta}, \quad \text{where} \quad \beta = \frac{N-1}{N}.\]
This proposition is proved by mathematical induction.
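As a quick sanity check, the formula above can be evaluated numerically. A minimal sketch (the prototype count N = 100 and the function name are my own choices for illustration):

```python
def effective_number(n, N):
    """E_n = (1 - beta^n) / (1 - beta), with beta = (N - 1) / N."""
    beta = (N - 1) / N
    return (1.0 - beta**n) / (1.0 - beta)

# With N = 100 prototypes, the marginal benefit of each extra sample
# shrinks, and E_n saturates at the upper bound N.
for n in (1, 10, 100, 1000, 10000):
    print(n, effective_number(n, N=100))
```

The printed values grow far more slowly than n and level off near N, which is exactly the diminishing-marginal-benefit behavior the definition is meant to capture.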
3. Class-Balanced Loss (CB Loss)

The class-balanced loss is designed to address the problem of training from imbalanced data by introducing a weighting factor that is inversely proportional to the effective number of samples.

The paper denotes the loss as \(\mathcal{L}(p,y)\), where \(y\in\{1,2,\dots,C\}\) is the class label, \(C\) is the total number of classes, and \(p\) is the estimated class probability.
The class-balanced (CB) loss can be written as:
\[\mathrm{CB}(p, y) = \frac{1}{E_{n_y}}\,\mathcal{L}(p, y) = \frac{1-\beta}{1-\beta^{n_y}}\,\mathcal{L}(p, y)\]
where \(n_y\) is the number of samples in the ground-truth class \(y\). The visualization is as follows, depicting the class-balanced term as a function of \(n_y\) for different β.
Note that β = 0 corresponds to no reweighting and β → 1 corresponds to reweighting by inverse class frequency. The proposed concept of the effective number of samples enables the hyperparameter β to smoothly adjust the class-balanced term between no reweighting and reweighting by inverse class frequency.
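To make that interpolation concrete, here is a small sketch (the class counts are invented for illustration) that computes the class-balanced term for a toy long-tailed distribution; following the paper, the weights are normalized so they sum to the number of classes C:

```python
import numpy as np

def cb_weights(samples_per_class, beta):
    """Per-class weight (1 - beta) / (1 - beta^n_y), normalized to sum to C."""
    n = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    w = 1.0 / effective_num
    return w / w.sum() * len(n)

counts = [5000, 500, 50]                 # toy long-tailed class counts
print(cb_weights(counts, beta=0.0))      # all ones: no reweighting
print(cb_weights(counts, beta=0.9999))   # close to inverse class frequency
```

At β = 0 every class gets weight 1; as β approaches 1 the weight ratios approach the inverse-frequency ratios 1 : 10 : 100.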
3.1 Class-Balanced Softmax Cross-Entropy Loss
Given a sample with class label \(y\) and model output \(z = [z_1, \dots, z_C]^\top\), the softmax cross-entropy (CE) loss for this sample is written as:
\[\mathrm{CE}_{\mathrm{softmax}}(z, y) = -\log\left(\frac{\exp(z_y)}{\sum_{j=1}^{C}\exp(z_j)}\right)\]
Suppose class \(y\) has \(n_y\) training samples; the class-balanced (CB) softmax cross-entropy loss is:
\[\mathrm{CB}_{\mathrm{softmax}}(z, y) = -\frac{1-\beta}{1-\beta^{n_y}}\log\left(\frac{\exp(z_y)}{\sum_{j=1}^{C}\exp(z_j)}\right)\]
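A minimal NumPy sketch of the class-balanced softmax cross-entropy for a single sample (the function name and toy inputs are my own, not from the paper):

```python
import numpy as np

def cb_softmax_ce(logits, label, samples_per_class, beta=0.9999):
    """Class-balanced softmax cross-entropy for one sample."""
    z = np.asarray(logits, dtype=np.float64)
    # numerically stable log-softmax
    log_probs = z - z.max() - np.log(np.exp(z - z.max()).sum())
    n_y = samples_per_class[label]
    weight = (1.0 - beta) / (1.0 - beta**n_y)
    return -weight * log_probs[label]
```

With uniform logits and a singleton class (n_y = 1) the weight is exactly 1 and the loss reduces to the plain cross-entropy log C; for the same logits, a rarer ground-truth class yields a larger loss.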
3.2 Class-Balanced Sigmoid Cross-Entropy Loss

When using the sigmoid function for a multi-class problem, each output node of the network performs a one-vs-all classification to predict the probability of the target class over the rest of the classes.

In this case, the sigmoid does not assume mutual exclusiveness among classes.

Since each class is considered independent and has its own predictor, the sigmoid unifies single-label classification with multi-label prediction. This is a nice property to have, since real-world data often has more than one semantic label.
The sigmoid cross-entropy (CE) loss can be written as:
\[\mathrm{CE}_{\mathrm{sigmoid}}(z, y) = -\sum_{i=1}^{C}\log\left(\frac{1}{1+\exp(-z_i^t)}\right), \quad \text{where } z_i^t = \begin{cases} z_i & \text{if } i = y,\\ -z_i & \text{otherwise.} \end{cases}\]
The class-balanced (CB) sigmoid cross-entropy loss is:
\[\mathrm{CB}_{\mathrm{sigmoid}}(z, y) = -\frac{1-\beta}{1-\beta^{n_y}}\sum_{i=1}^{C}\log\left(\frac{1}{1+\exp(-z_i^t)}\right)\]
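The same weighting applied to the one-vs-all sigmoid form, again as a sketch with invented inputs:

```python
import numpy as np

def cb_sigmoid_ce(logits, label, samples_per_class, beta=0.9999):
    """Class-balanced sigmoid cross-entropy: one-vs-all over all C outputs."""
    z = np.asarray(logits, dtype=np.float64)
    # z_i^t = z_i for the ground-truth class, -z_i otherwise
    zt = np.where(np.arange(len(z)) == label, z, -z)
    log_p = -np.logaddexp(0.0, -zt)          # stable log(sigmoid(z_i^t))
    weight = (1.0 - beta) / (1.0 - beta**samples_per_class[label])
    return -weight * log_p.sum()
```

Unlike the softmax version, the sum runs over every output node, since each node is its own one-vs-all classifier.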
3.3 Class-Balanced Focal Loss
The focal loss (FL), proposed in RetinaNet, reduces the relative loss for well-classified samples and focuses on difficult samples:
\[\mathrm{FL}(z, y) = -\sum_{i=1}^{C}(1-p_i^t)^{\gamma}\log(p_i^t), \quad \text{where } p_i^t = \mathrm{sigmoid}(z_i^t)\]
The class-balanced (CB) focal loss is:
\[\mathrm{CB}_{\mathrm{focal}}(z, y) = -\frac{1-\beta}{1-\beta^{n_y}}\sum_{i=1}^{C}(1-p_i^t)^{\gamma}\log(p_i^t)\]
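A sketch of the class-balanced focal loss in the same one-vs-all sigmoid form (the function name and inputs are illustrative; with γ = 0 it reduces to the class-balanced sigmoid cross-entropy):

```python
import numpy as np

def cb_focal_loss(logits, label, samples_per_class, beta=0.999, gamma=0.5):
    """Class-balanced focal loss for one sample (one-vs-all sigmoid form)."""
    z = np.asarray(logits, dtype=np.float64)
    zt = np.where(np.arange(len(z)) == label, z, -z)
    p_t = 1.0 / (1.0 + np.exp(-zt))                  # sigmoid(z_i^t)
    focal = -((1.0 - p_t) ** gamma) * np.log(p_t)    # down-weights easy outputs
    weight = (1.0 - beta) / (1.0 - beta**samples_per_class[label])
    return weight * focal.sum()
```

Increasing γ shrinks the contribution of already well-classified outputs (large p_t), while the class-balanced weight handles the imbalance across classes.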
4. Experimental Results
4.1 Datasets

Five long-tailed versions of both CIFAR-10 and CIFAR-100, with imbalance factors of 10, 20, 50, 100, and 200 respectively, are tried.

iNaturalist and ILSVRC are class-imbalanced in nature.
The figure above shows the number of images per class for different imbalance factors.
4.2 CIFAR Datasets

The search space of hyperparameters is {softmax, sigmoid, focal} for loss type, β ∈ {0.9, 0.99, 0.999, 0.9999}, and γ ∈ {0.5, 1.0, 2.0} for Focal Loss.

The best β is 0.9999 on CIFAR-10 across all imbalance factors.

But on CIFAR-100, datasets with different imbalance factors tend to have different, and smaller, optimal β.

On CIFAR-10, when reweighting with β = 0.9999, the effective number of samples is close to the number of samples. This means the best reweighting strategy on CIFAR-10 is similar to reweighting by inverse class frequency.

On CIFAR-100, the poor performance of larger β suggests that reweighting by inverse class frequency is not a wise choice. A smaller β is needed, which gives smoother weights across classes.

For example, the number of unique prototypes of a specific bird species should be smaller than the number of unique prototypes of a generic bird class. Since classes in CIFAR-100 are more fine-grained than in CIFAR-10, CIFAR-100 has a smaller N compared with CIFAR-10.
4.3 LargeScale Datasets

The class-balanced focal loss is used since it has more flexibility, and β = 0.999 with γ = 0.5 is found to yield reasonably good performance on all datasets.

Notably, ResNet-50 is able to achieve performance comparable to ResNet-152 on iNaturalist and to ResNet-101 on ILSVRC 2012 when the class-balanced focal loss replaces the softmax cross-entropy loss.
The figures above show that the class-balanced focal loss starts to show its advantage after 60 epochs of training.
TL;DR
The paper tackles the problem of long-tailed data distribution, where a few classes account for most of the data while most classes are under-represented.
The paper provides a theoretical framework to study the effective number of samples and shows how to design a class-balanced term to deal with long-tailed training data.
The paper shows that significant performance improvements can be achieved by adding the proposed class-balanced term to commonly used loss functions, including softmax cross-entropy, sigmoid cross-entropy, and focal loss.