Large-Scale Long-Tailed Recognition in an Open World, Liu, Miao, Zhan, Wang, Gong, Yu; 2019 - Summary
author: sritank
score: 9 / 10

OLTR (Open Long-Tailed Recognition) combines three tasks: imbalanced classification, few-shot learning, and open-set recognition. Direct feature embeddings in CNNs receive too few gradient updates from tail classes to be reliable. To deal with tail recognition, the authors introduce a “dynamic meta-embedding” that enriches the direct feature from the standard embedding with an induced memory feature. Following ideas from meta-learning, the memory feature is built from visual concepts of the training classes, stored as class centroids and retrieved with coefficients produced by a shallow network, so that tail classes get additional feature support.

They compute class centroids in two steps: neighborhood sampling and propagation. For an input image, the memory feature $v^{memory}$ enhances its direct feature, which is most useful for tail classes. $v^{memory}$ is calculated as:

$$v^{memory} = o^{\top} M = \sum_{i=1}^{K} o_i \, c_i$$

where $M = \{c_i\}_{i=1}^{K}$ is the memory of class centroids and $o = T_{hal}(v^{direct})$ is the coefficient vector hallucinated from the direct feature by a lightweight network.
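As a rough sketch in PyTorch (the function name, the softmax over the coefficients, and the idea of passing $T_{hal}$ in as a module are my assumptions, not the authors' code), the memory feature can be assembled like this:

```python
import torch
import torch.nn as nn

def memory_feature(v_direct: torch.Tensor,
                   centroids: torch.Tensor,
                   hallucinator: nn.Module) -> torch.Tensor:
    """v_memory = o^T M = sum_i o_i * c_i.

    `centroids` holds one centroid c_i per training class (the memory M);
    `hallucinator` stands in for T_hal, a lightweight network mapping the
    direct feature to one coefficient per centroid. The softmax over the
    coefficients is an assumption.
    """
    o = torch.softmax(hallucinator(v_direct), dim=1)   # coefficients o, (batch, K)
    return o @ centroids                               # v_memory, (batch, feat_dim)
```

Here `hallucinator` could be as simple as `nn.Linear(feat_dim, num_classes)`.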

To obtain the dynamic meta-embedding, they use the formulation:

$$v^{meta} = \frac{1}{\gamma} \otimes \left( v^{direct} + e \otimes v^{memory} \right)$$

where the reachability $\gamma$ is given by:

$$\gamma = \min_{i} \left\| v^{direct} - c_i \right\|_{2}$$

$\gamma$ is small if the input lies close to some training-class centroid (i.e. it likely belongs to a training class) and large otherwise, so the $1/\gamma$ scaling shrinks the embeddings of likely open-set samples. This helps in encoding open classes.
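A minimal sketch of the reachability computation under the definition above (the function name is mine):

```python
import torch

def reachability(v_direct: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """gamma = min_i ||v_direct - c_i||, one value per sample.

    Small gamma: the direct feature lies near some class centroid, so the
    sample is probably from a training (closed-set) class; large gamma
    suggests an open-set sample.
    """
    dists = torch.cdist(v_direct, centroids)   # (batch, num_classes) Euclidean distances
    return dists.min(dim=1).values             # (batch,)
```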

The concept selector $e$ adaptively mixes the memory feature into the direct feature in a soft, per-channel manner. It is given by:

$$e = \tanh\!\left( T_{sel}(v^{direct}) \right)$$

where $T_{sel}(\cdot)$ is another lightweight network.
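Putting the formulas together, a hedged sketch of the whole meta-embedding module (the single-linear-layer choices for $T_{hal}$ and $T_{sel}$, the softmax, and the zero-initialized centroid buffer are assumptions about the exact architecture):

```python
import torch
import torch.nn as nn

class DynamicMetaEmbedding(nn.Module):
    """Sketch of v_meta = (1/gamma) * (v_direct + e * v_memory)."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.register_buffer("centroids", torch.zeros(num_classes, feat_dim))
        self.hallucinator = nn.Linear(feat_dim, num_classes)  # T_hal -> coefficients o
        self.selector = nn.Linear(feat_dim, feat_dim)         # T_sel -> concept selector e

    def forward(self, v_direct: torch.Tensor) -> torch.Tensor:
        # Memory feature: v_memory = o^T M.
        o = torch.softmax(self.hallucinator(v_direct), dim=1)
        v_memory = o @ self.centroids
        # Concept selector: e = tanh(T_sel(v_direct)), a soft per-channel gate.
        e = torch.tanh(self.selector(v_direct))
        # Reachability: gamma = min_i ||v_direct - c_i||.
        gamma = torch.cdist(v_direct, self.centroids).min(dim=1, keepdim=True).values
        # Dynamic meta-embedding, scaled by 1/gamma.
        return (v_direct + e * v_memory) / gamma.clamp(min=1e-12)
```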

They use modulated attention to encourage samples from different classes to attend to different contexts:

$$f^{att} = f + MA(f) \otimes SA(f)$$

where $f$ is the backbone feature map, $SA(f)$ is a self-attention (contextual) feature, and $MA(f)$ is a conditional spatial attention map.
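A hedged sketch of this modulation: here $SA(f)$ is approximated with standard multi-head self-attention over flattened spatial positions and $MA(f)$ with a 1×1-conv spatial softmax, which are stand-ins for the paper's specific attention blocks, not their implementation:

```python
import torch
import torch.nn as nn

class ModulatedAttention(nn.Module):
    """Sketch of f_att = f + MA(f) * SA(f) on a conv feature map."""

    def __init__(self, channels: int, num_heads: int = 1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.spatial_gate = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)            # (b, h*w, c)
        sa, _ = self.self_attn(tokens, tokens, tokens)   # contextual feature SA(f)
        sa = sa.transpose(1, 2).reshape(b, c, h, w)
        # MA(f): conditional spatial attention, normalized over locations.
        ma = torch.softmax(self.spatial_gate(f).flatten(2), dim=2).reshape(b, 1, h, w)
        return f + ma * sa                               # modulated feature f_att
```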

A cosine classifier computes the logit for class $k$ from the meta-embedding and the classifier weight $w_k$:

$$\eta_k = \left\langle \frac{\left\| v^{meta} \right\|}{1 + \left\| v^{meta} \right\|} \cdot \frac{v^{meta}}{\left\| v^{meta} \right\|},\ \frac{w_k}{\left\| w_k \right\|} \right\rangle$$

The weights are fully normalized while the embedding norm is only squashed, so the magnitude information carried by the $1/\gamma$ scaling survives and likely open-set samples produce uniformly small logits.
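A minimal sketch of such a classifier (the weight initialization and the scale value are assumptions):

```python
import torch
import torch.nn as nn

class CosineClassifier(nn.Module):
    """Cosine-similarity classifier over the meta-embedding.

    Per-class weights are L2-normalized and the embedding norm is squashed
    to ||v|| / (1 + ||v||), preserving magnitude information from the
    1/gamma scaling; the scale factor is an assumption.
    """
    def __init__(self, feat_dim: int, num_classes: int, scale: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.scale = scale

    def forward(self, v_meta: torch.Tensor) -> torch.Tensor:
        norm = v_meta.norm(dim=1, keepdim=True).clamp(min=1e-12)
        v_n = (norm / (1.0 + norm)) * (v_meta / norm)    # squashed, direction-preserving
        w_n = self.weight / self.weight.norm(dim=1, keepdim=True).clamp(min=1e-12)
        return self.scale * v_n @ w_n.t()                # logits, (batch, num_classes)
```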

The loss function is a combination of the cross-entropy classification loss and a large-margin loss between the embeddings and the centroids:

$$L = L_{CE} + \lambda \, L_{LM}$$
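A sketch of this objective; the large-margin term below is a simplified triplet-style surrogate (pull toward the own-class centroid, push away from the nearest other centroid), and the `lam`/`margin` values are assumptions, not the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def oltr_loss(logits, v_meta, centroids, labels, lam=0.1, margin=10.0):
    """L = L_CE + lambda * L_LM (simplified large-margin surrogate)."""
    ce = F.cross_entropy(logits, labels)
    dists = torch.cdist(v_meta, centroids)                    # (batch, num_classes)
    pos = dists.gather(1, labels.unsqueeze(1)).squeeze(1)     # distance to own centroid
    neg = dists.scatter(1, labels.unsqueeze(1), float("inf")).min(dim=1).values
    lm = F.relu(pos - neg + margin).mean()                    # hinge on the margin
    return ce + lam * lm
```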

The final OLTR model plugs the modulated attention into the CNN backbone, feeds the resulting direct feature through the dynamic meta-embedding, and classifies with the cosine classifier; the whole pipeline is trained end-to-end with the combined loss.
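A toy usage example chaining the hypothetical sketches defined above (it reuses those classes and functions, with arbitrary sizes, so it is illustrative only):

```python
import torch

feat_dim, num_classes, batch = 128, 10, 4

embed = DynamicMetaEmbedding(feat_dim, num_classes)
classifier = CosineClassifier(feat_dim, num_classes)

v_direct = torch.randn(batch, feat_dim)      # stand-in for the backbone's direct feature
v_meta = embed(v_direct)                     # dynamic meta-embedding
logits = classifier(v_meta)                  # cosine-similarity logits
labels = torch.randint(0, num_classes, (batch,))
loss = oltr_loss(logits, v_meta, embed.centroids, labels)
print(logits.shape, loss.item())
```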

They test the model on curated long-tailed versions of three datasets: ImageNet-LT, Places-LT, and MS1M-LT (faces). Performance is evaluated as classification accuracy on many-shot, medium-shot, and few-shot classes, and on the ability to detect open classes. The model achieves SOTA results on ImageNet-LT, MegaFace, and SUN-LT, and does very well on Places-LT, as shown below:

[Figure: benchmark performance (1)]

[Figure: benchmark performance (2)]

[Figure: robustness analysis]

[Figure: learnt meta-embedding visualization]

TL;DR: OLTR tackles long-tailed, few-shot, and open-set recognition in a single model by enriching the direct CNN feature with a centroid-based memory feature (the dynamic meta-embedding), adding modulated attention, and classifying with a cosine classifier, reaching SOTA results on several long-tailed benchmarks.