Momentum Contrast for Unsupervised Visual Representation Learning, He, Fan, Wu, Xie, Girshick; 2019 - Summary
author: zayne-sprague
score: 10 / 10

What’s the big idea

A large, dynamic dictionary of keys, maintained as a queue and encoded by a momentum-updated encoder, improves contrastive learning and makes unsupervised pretraining in computer vision viable, matching or even beating supervised pretraining on several downstream tasks.

What’s hard about unsupervised pre-training in vision

What are pretext tasks

How does contrastive learning work

\[L_q = - \log \frac{\exp(q \cdot k_+ / \tau)}{\sum^K_{i=0}\exp(q \cdot k_i / \tau)}\]
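As a rough illustration, here is a minimal PyTorch sketch of this InfoNCE loss, assuming the query `q` and the keys are already L2-normalized features and the queue holds the K negative keys (tensor and function names here are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """q: (N, C) queries, k_pos: (N, C) positive keys, queue: (C, K) negative keys."""
    # positive logits (N, 1): similarity of each query with its own key k_+
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)
    # negative logits (N, K): similarity of each query with every key in the queue
    l_neg = torch.einsum("nc,ck->nk", q, queue)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # the positive sits at index 0 for every query, so the target "class" is 0;
    # cross-entropy over K+1 classes is exactly the (K+1)-way softmax in L_q above
    labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```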

Why are a queue and a momentum encoder useful for encoding keys in contrastive learning (technical implementation)

MoCo Model

(Figure: MoCo model diagram; the queue is not shown in this diagram.)

\[\theta_k \leftarrow m\theta_k + (1-m)\theta_q\]
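A minimal sketch of this momentum update plus the queue bookkeeping, in the spirit of the paper's Algorithm 1 pseudocode (names such as `encoder_q`, `encoder_k`, `queue`, and `queue_ptr` are assumptions, not the authors' exact code):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q; only the query encoder
    # receives gradients, so the key encoder drifts slowly behind it
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    # queue: (C, K) buffer of past keys, queue_ptr: 1-element long tensor,
    # keys: (N, C) keys from the current batch (assumes K % N == 0)
    K = queue.shape[1]
    n = keys.shape[0]
    ptr = int(queue_ptr)
    queue[:, ptr:ptr + n] = keys.t()   # overwrite the oldest keys
    queue_ptr[0] = (ptr + n) % K       # advance the pointer circularly
```

With a large K and a momentum m close to 1, the keys in the queue stay consistent even though they come from slightly different encoder snapshots, which is the paper's argument for how the dictionary can be both large and consistent.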

How does it perform

Pretext Tasks

Downstream Tasks

VOC object detection; all models use the C4 backbone, comparing different pretraining datasets and pretext tasks against supervised pretraining.


MoCo performance on downstream tasks.

Does the pretext task or backbone matter (other variants)

MoCo compared with IG-1B vs. IN-1M pretraining data, and C4 vs. FPN backbones.

TL;DR