An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, Houlsby; 2020 - Summary
author: zayne-sprague
score: 7 / 10

The Big Idea

By splitting an image into fixed-size patches and feeding the resulting patch sequence to a standard transformer, we can use transformers for image classification without the prohibitive cost of attending over every pixel.

Why can’t we use transformers directly on images?

How can we avoid the large cost?
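A rough back-of-envelope calculation (assuming a 224x224 input and 16x16 patches, which are illustrative values rather than anything stated in these notes): self-attention scales quadratically with sequence length, so treating every pixel as a token gives $224 \times 224 = 50{,}176$ tokens and roughly $50{,}176^2 \approx 2.5 \times 10^9$ attention pairs per layer, while 16x16 patches give only $(224/16)^2 = 196$ tokens and about $196^2 \approx 3.8 \times 10^4$ pairs.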

How is it realized?
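A minimal sketch of how the patch-to-sequence idea can be realized in PyTorch. This is illustrative code, not the authors' implementation: the class name `TinyViT` and all hyperparameters are made up for brevity, and the paper's exact training details (initialization, MLP head, etc.) are omitted.

```python
# Minimal ViT-style classifier sketch (illustrative hyperparameters, not the paper's).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2           # 14 * 14 = 196 tokens
        # Patch embedding: a strided conv is equivalent to "split into patches,
        # flatten each patch, and apply a shared linear projection".
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, activation="gelu",
            batch_first=True, norm_first=True)                  # ViT-style pre-norm blocks
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                                  # (B, 3, H, W)
        x = self.patch_embed(images)                            # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)                        # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed         # prepend [class], add positions
        x = self.encoder(x)                                     # standard self-attention stack
        return self.head(x[:, 0])                               # classify from the [class] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))                 # -> shape (2, 1000)
```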

Basic Vision Transformer

Math on Image Patches

Image Definition

Patch Definition
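Restating the paper's notation (reconstructed from the original paper; the symbols below follow that paper rather than anything written in these notes): an image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ is reshaped into a sequence of flattened patches $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(P, P)$ is the patch resolution and $N = HW / P^2$ is the number of patches, which becomes the effective sequence length fed to the transformer.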

Does it work?

Results table

Note: JFT-300M is a dataset Google built with 300 million images. (It was, and as far as I know still is, privately held.)

Why does it do so well?

Visualization of ViT Features

So why not use ViT everywhere? Where's the catch?

Example of ViT vs Dataset Size

Visualization of ViT vs CNN learning

Why not combine CNNs & Transformers? (other variants)

TL;DR