Momentum Contrast for Unsupervised Visual Representation Learning, He, Fan, Wu, Xie, Girshick; 2019 - Summary
author: elampietti
score: 9 / 10

Traditionally, computer vision tasks are more effective with supervised pre-training, while tasks in natural language processing (NLP) are successful with unsupervised representation learning. This is because NLP operates on discrete signal spaces (words, sub-word units) that can be organized into tokenized dictionaries for unsupervised learning, whereas computer vision deals with a continuous, high-dimensional signal space that makes dictionary building more challenging. This paper focuses on how to build dynamic dictionaries for unsupervised visual representation learning, trained with Momentum Contrast (MoCo).

The keys in this dynamic dictionary are sampled from the data (images or patches) and are represented by an encoder network, so learning can be framed as dictionary look-up: an encoded query should match its corresponding key and differ from the others. The visual representation encoder is trained by minimizing a contrastive loss. This training process is visualized in the figure below.

train_encoder

The optimal dynamic dictionary should be large, so that it adequately samples the continuous, high-dimensional visual space, and it should be consistent, meaning its keys are represented by the same or a similar encoder. The dictionary is maintained as a queue of samples: the current mini-batch's encodings are enqueued while the oldest are dequeued. Consistency is maintained by a key encoder that progresses slowly as a momentum-based moving average of the query encoder.
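
As a toy illustration of that queue behavior (the sizes here are made up for the example, not the paper's settings), the dictionary can be held as a fixed-width tensor of encoded keys:

```python
import torch

K, dim, batch = 8, 4, 2                       # toy sizes, not the paper's values
queue = torch.randn(dim, K)                   # dictionary of K encoded keys

new_keys = torch.randn(dim, batch)            # encodings of the current mini-batch
queue = torch.cat([queue[:, batch:], new_keys], dim=1)  # dequeue oldest, enqueue newest
assert queue.shape == (dim, K)                # the dictionary size stays fixed at K
```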

Contrastive learning is used to train the encoder for this dictionary look-up task: the loss is low when the query is similar to its single positive key and dissimilar to all other keys in the dictionary. This similarity-based loss function is shown below.

contrastive_loss
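
For reference, the loss in the figure is the InfoNCE contrastive loss used in the paper, where $\tau$ is a temperature hyper-parameter, $k_+$ is the single positive key for the query $q$, and the sum runs over one positive and $K$ negative keys:

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}$$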

Momentum Contrast is what keeps the dynamic dictionary large and the key encoder consistent as it evolves. The dictionary is a queue of data samples, so its size can be much larger than the mini-batch size and is set as a hyper-parameter. The current mini-batch is enqueued and the oldest mini-batch is dequeued, which re-uses the encoded keys from the immediately preceding mini-batches; the dictionary therefore always represents a sampled subset of all data. Because the keys in the queue come from past mini-batches, their encoder cannot be updated by back-propagation directly; instead, a momentum update is applied to the key encoder so that it evolves slowly and stays consistent. This momentum update is shown below, where m ∈ [0, 1) is the momentum coefficient.

momentum_update
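
Written out, the update sets the key-encoder parameters $\theta_k$ to a moving average of themselves and the query-encoder parameters $\theta_q$:

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$$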

Since only $\theta_q$ is updated by back-propagation, the key encoder evolves slowly, and the keys in the queue are encoded by only slightly different encoders across mini-batches.

Two previous mechanisms are end-to-end updates and the memory bank approach. The end-to-end update uses back-propagation for both encoders but couples the dictionary size to the mini-batch size and is therefore challenged by large mini-batch optimization, while the memory bank approach suffers from inconsistency because its stored encodings come from many different past versions of the encoder.

Pseudocode for Momentum Contrast is shown below. The pretext task treats a query and a key as a positive pair if they are two augmented views of the same image, and as a negative pair otherwise.

moco_pseudocode
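
A minimal PyTorch-style sketch of that training loop follows; the tiny MLP encoders, toy data, and the queue/batch sizes are stand-ins chosen only to keep the example self-contained and runnable, not the paper's actual architecture or settings (shuffling BN is also omitted):

```python
import torch
import torch.nn.functional as F

# Hypothetical hyper-parameters (the paper uses K = 65536, m = 0.999, tau = 0.07).
K, m, tau, dim = 4096, 0.999, 0.07, 128

# f_q and f_k stand in for the query/key encoders; any backbone ending in a
# dim-dimensional output works. A tiny MLP keeps the sketch self-contained.
f_q = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, dim))
f_k = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, dim))
f_k.load_state_dict(f_q.state_dict())            # initialize key encoder from query encoder
for p in f_k.parameters():
    p.requires_grad = False                      # key encoder is not updated by back-propagation

queue = F.normalize(torch.randn(dim, K), dim=0)  # dictionary queue of encoded keys
optimizer = torch.optim.SGD(f_q.parameters(), lr=0.03, momentum=0.9, weight_decay=1e-4)

def augment(x):
    # Stand-in for the paper's random image augmentations.
    return x + 0.1 * torch.randn_like(x)

for step in range(10):                           # toy loop over random "images"
    x = torch.randn(32, 784)
    q = F.normalize(f_q(augment(x)), dim=1)      # queries: one augmented view
    with torch.no_grad():
        k = F.normalize(f_k(augment(x)), dim=1)  # keys: another view, no gradient

    # Positive logits (N x 1) and negative logits against the queue (N x K).
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)
    l_neg = torch.einsum("nc,ck->nk", q, queue)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau

    # InfoNCE loss: the positive key is class 0.
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()                              # gradients flow only through the query encoder
    optimizer.step()

    # Momentum update of the key encoder, then enqueue new keys / dequeue oldest.
    with torch.no_grad():
        for p_k, p_q in zip(f_k.parameters(), f_q.parameters()):
            p_k.mul_(m).add_((1.0 - m) * p_q)
        queue = torch.cat([queue[:, k.size(0):], k.t()], dim=1)
```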

The encoders are built with a ResNet whose last fully-connected layer outputs a fixed 128-D vector as the representation of the query or the key. The encoders also use shuffling Batch Normalization, which prevents them from cheating on the pretext task via intra-batch statistics and benefits training.
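
As a sketch of that encoder design (assuming a standard torchvision ResNet-50; shuffling BN, which needs multi-GPU sub-batches, is left out):

```python
import torch
import torch.nn.functional as F
import torchvision

# ResNet-50 backbone whose final fc layer is replaced by a 128-D projection head;
# the output is L2-normalized so dot products behave like cosine similarities.
encoder = torchvision.models.resnet50()
encoder.fc = torch.nn.Linear(encoder.fc.in_features, 128)

images = torch.randn(4, 3, 224, 224)             # dummy batch of images
features = F.normalize(encoder(images), dim=1)   # 4 x 128 unit-norm representations
```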

Momentum Contrast performance is studied on ImageNet-1M (IN-1M) and Instagram-1B (IG-1B) with an SGD optimizer. Three contrastive loss mechanisms (end-to-end, memory bank, and MoCo) are visualized below along with their results under the linear classification protocol.

loss_variations

contrastive_loss_variation_results

All three mechanisms show that a larger K is beneficial, so a larger dictionary is preferable. Additionally, the results of using a larger momentum value are shown below and reveal that a slowly evolving key encoder works best.

momentum_results

Comparison with previous unsupervised learning methods is done under accuracy vs. #parameters trade-offs. MoCo achieves 60.6% accuracy, better than competitors of similar model size, and also does best with large models, as shown below.

lcp_results

Unsupervised learning should learn features that are transferable to downstream tasks. Feature normalization and appropriate fine-tuning schedules are used so that MoCo can share the same hyper-parameters and schedule as the supervised ImageNet counterpart. PASCAL VOC object detection is used to compare MoCo against ImageNet supervised pre-training. The results below show that MoCo is able to beat the supervised counterpart for VOC object detection.

VOC_results

Results with a C4 backbone, shown below, again have MoCo beating the supervised counterpart.

VOC_results_2

Here are the results for the COCO dataset showing that MoCo is better than the supervised counterpart for both backbones.

COCO_results

Finally, downstream task evaluation reveals that MoCo is competitive with ImageNet supervised pre-training.

downstream_evaluation

The better results obtained when MoCo is pre-trained on the larger, less curated IG-1B dataset rather than IN-1M suggest that MoCo is well suited to real-world unsupervised learning.

Some ablations are also performed. Longer fine-tuning on COCO shows that MoCo pre-trained features keep an advantage over supervised ImageNet features when the fine-tuning schedule is longer. Another ablation compares shuffling versus not shuffling Batch Normalization: without shuffling, the model cheats on the pretext task, because intra-batch BN statistics leak information about which sub-batch the positive key is in, and the pretext task overfits.

shuffling_BN

TL;DR

MoCo builds a large and consistent dynamic dictionary for contrastive unsupervised learning by combining a queue of encoded keys with a momentum-updated key encoder. The resulting representations match or beat ImageNet supervised pre-training when transferred to downstream tasks such as detection and segmentation.