Learning Transferable Visual Models From Natural Language Supervision, Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, Sutskever; 2021 - Summary
author: DartingMelody
score: 10 / 10

What is the core idea?

The paper demonstrates that the simple pre-training task of predicting which caption best describes an image is a scalable and efficient way to learn state-of-the-art image representations from scratch. Training is done on WebImageText (a dataset introduced in this paper) of 400 million (image, text) pairs collected from the internet. Motivated by the idea of learning perception from the supervision contained in natural language, CLIP (Contrastive Language-Image Pre-training, where contrastive learning means distinguishing matching from non-matching (image, text) pairs) efficiently performs zero-shot transfer, i.e. it can be applied to unseen tasks and datasets without task-specific training.
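
As a concrete illustration of zero-shot transfer, here is a minimal sketch that scores an image against text prompts built from class names. It assumes the open-source openai/CLIP package (clip.load, clip.tokenize, encode_image, encode_text); the image path and class names are placeholders.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: any image file and any set of candidate class names
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
class_names = ["dog", "cat", "car"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each class prompt
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Predicted class:", class_names[probs.argmax().item()])
```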

How is it realized (technically)?
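
The summary does not spell this out, but in the paper an image encoder and a text encoder are trained jointly: for a batch of N (image, text) pairs, the model maximizes the cosine similarity of the N correct pairings and minimizes it for the N² − N incorrect ones, using a symmetric cross-entropy loss with a learned temperature. Below is a minimal PyTorch-style sketch of that loss, modeled on the paper's NumPy-like pseudocode (the encoders and projections that produce the embeddings are assumed to exist).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Symmetric contrastive loss over a batch of N aligned (image, text) pairs.

    image_emb, text_emb: [N, d] joint-space embeddings from the two encoders.
    logit_scale: exponentiated learned temperature, as in the paper.
    """
    # L2-normalize so that dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] similarity matrix; the diagonal holds the correct pairings
    logits = logit_scale * image_emb @ text_emb.t()
    labels = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy over rows (image -> text) and over columns (text -> image)
    loss_img = F.cross_entropy(logits, labels)
    loss_txt = F.cross_entropy(logits.t(), labels)
    return (loss_img + loss_txt) / 2
```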

How well does the paper perform?

Figure: zero-shot CLIP versus a linear probe on ResNet-50.

Figure: comparison of zero-shot CLIP versus few-shot linear probes.

Figure: linear probe performance of CLIP models in comparison with state-of-the-art computer vision models.

What interesting variants are explored?

The paper trains CLIP with different image-encoder backbones: ResNet-50, ResNet-101, RN50x4, RN50x16, RN50x64, ViT-B/32, ViT-B/16, ViT-L/14, and ViT-L/14@336px, with the last one (ViT-L/14@336px) performing best; see the sketch below for loading these variants.
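
For reference, these backbone names are the ones exposed by the open-source openai/CLIP package; a minimal sketch of listing and loading them (the exact list depends on the installed package version):

```python
import clip

# Released backbone names, e.g. 'RN50', 'RN50x64', 'ViT-B/32', 'ViT-L/14@336px', ...
print(clip.available_models())

# Load the best-performing variant reported in the paper (if the package provides it)
model, preprocess = clip.load("ViT-L/14@336px")
```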

TL;DR

CLIP learns transferable image representations by contrastively matching images with their captions over 400 million web (image, text) pairs, enabling strong zero-shot transfer to unseen vision tasks and datasets.