Decoupling Representation and Classifier for Long-Tailed Recognition, Kang, Xie, Rohrbach, Yan, Gordo, Feng, Kalantidis; 2019 - Summary
author: aabayomi
score: 9 / 10

What is the core idea?

The paper introduces a two-stage approach to the long-tailed recognition problem in image classification: it decouples training into representation learning and classifier learning, rather than addressing the class imbalance jointly during end-to-end training.

How is it realized (technically)?

Long-tailed distributions are a common problem in real-world recognition tasks because of class imbalance in the data. The authors show that combining different sampling methods with different classifier learning approaches yields better performance and generalization than the current SOTA.

[1] Learning representations

The authors consider several sampling strategies during representation learning: instance-balanced, class-balanced, square-root, and progressively-balanced sampling. The first three sample class j with probability p_j ∝ n_j^q (q = 1, 0, and 1/2 respectively, where n_j is the number of images in class j), while progressively-balanced sampling interpolates from instance-balanced to class-balanced over the course of training; a sampler sketch is given after the figure below.

Sampling-1
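A minimal PyTorch sketch of this sampling rule, assuming integer class labels; the helper name make_sampler and the use of WeightedRandomSampler are illustrative, not the authors' implementation:

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def make_sampler(labels, q=0.0):
    """Sample class j with probability p_j proportional to n_j**q.

    q = 1   -> instance-balanced (standard) sampling
    q = 0   -> class-balanced sampling
    q = 1/2 -> square-root sampling
    """
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)             # n_j for every class j
    class_probs = class_counts.astype(float) ** q
    class_probs /= class_probs.sum()               # p_j
    # weight each sample so that its class is drawn with probability p_j overall
    sample_weights = class_probs[labels] / class_counts[labels]
    return WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
```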

[2] Learning classifiers

The authors study several approaches for obtaining the classifier on top of the fixed representations: Classifier Re-training (cRT), where the classifier weights are re-trained with class-balanced sampling; the Nearest Class Mean (NCM) classifier; τ-normalization, which rescales the classifier weight norms; and Learnable Weight Scaling (LWS), which learns a per-class scaling of the weights.
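As an illustration of one of these adjustments, here is a sketch of τ-normalization applied to a trained linear classifier's weight matrix (the function name and the τ value in the usage comment are illustrative):

```python
import torch

def tau_normalize(weight, tau=1.0):
    """tau-normalized classifier: rescale each class weight w_j by 1 / ||w_j||^tau.

    tau = 0 leaves the trained classifier unchanged; tau = 1 fully equalizes the norms.
    """
    norms = weight.norm(p=2, dim=1, keepdim=True)   # per-class weight norms
    return weight / norms.pow(tau)

# e.g. on a trained linear head (value of tau is illustrative):
# model.fc.weight.data = tau_normalize(model.fc.weight.data, tau=0.7)
```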

How well does the paper perform?

Datasets

Places-LT - 365 classes with 4980 to 5 images per class.

ImageNet-LT - 1000 classes with 1280 to 5 images per class.

iNaturalist - a real-world, naturally imbalanced dataset consisting of samples from 8,142 species.

Experimental Setup

On the Places-LT dataset, ResNet-152 is used as the backbone network, pretrained on the full ImageNet-2012 dataset.

On ImageNet-LT, ResNet-{10, 50, 101, 152} and ResNeXt-{50, 101, 152} (32x4d) are used, with ResNeXt-50 as the main backbone for analysis.

iNaturalist uses ResNet-{50, 101, 152} as backbone networks.

SGD optimizer with momentum 0.9, batch size 512, a cosine learning rate schedule decaying from 0.2 to 0, and 224 × 224 image resolution.

In the first (representation learning) stage, the backbone network is trained for 90 epochs. In the second stage, i.e., for retraining a classifier (cRT), the learning rate is restarted and the classifier is trained for 10 epochs while keeping the backbone network fixed. A rough sketch of this two-stage recipe is given below.
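A minimal PyTorch sketch under these settings (assuming model, instance_balanced_loader, class_balanced_loader, and num_classes are defined elsewhere; this is my paraphrase of the recipe, not the authors' code):

```python
from torch import nn, optim

def train_stage(model, loader, epochs, base_lr=0.2):
    """One training stage: SGD with momentum 0.9 and cosine LR decay from base_lr to 0."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.SGD(params, lr=base_lr, momentum=0.9)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0.0)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()

# Stage 1: representation learning (whole network, instance-balanced sampling, 90 epochs)
# train_stage(model, instance_balanced_loader, epochs=90)

# Stage 2: classifier re-training (cRT) -- freeze the backbone, restart the LR,
# re-initialize the classifier and train it for 10 epochs with class-balanced sampling
# for p in model.parameters():
#     p.requires_grad = False
# model.fc = nn.Linear(model.fc.in_features, num_classes)
# train_stage(model, class_balanced_loader, epochs=10)
```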

Sampling-1

Comparison among different sampling methods combined with different classification models

Sampling-2

Retraining results comparing retraining only the linear classifier (C), the full ResNeXt-50 backbone (B), and the last block (LB), where the last block is fine-tuned with a smaller (0.1×) learning rate.
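The 0.1× learning rate on the last block can be expressed with optimizer parameter groups; a hypothetical snippet assuming a torchvision-style ResNeXt with .layer4 (last block) and .fc (linear classifier) attributes and an existing model object:

```python
from torch import optim

base_lr = 0.2  # illustrative; matches the base LR reported above
optimizer = optim.SGD(
    [
        {"params": model.layer4.parameters(), "lr": 0.1 * base_lr},  # last block (LB), 0.1x LR
        {"params": model.fc.parameters(), "lr": base_lr},            # linear classifier (C)
    ],
    momentum=0.9,
)
```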

Sampling-3

Performance of the τ-normalized classifier as τ increases from 0: many-shot accuracy decays dramatically while few-shot accuracy increases.

Sampling-4

The authors compare recognition accuracy on ImageNet-LT across different backbone architectures; LWS performs consistently well across backbones.

Sampling-5

On both the iNaturalist and Places-LT datasets, the τ-normalized and LWS classifiers gained further improvements when trained for 300 epochs.

TL;DR

Decoupling representation learning from classifier learning, and then simply re-balancing the classifier (via cRT, τ-normalization, or LWS) on top of representations learned with plain instance-balanced sampling, matches or outperforms more complex long-tailed recognition methods.