An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, Houlsby; 2020 - Summary
author: elampietti
score: 8 / 10

Traditionally, computer vision problems have been tackled with CNN architectures. Motivated by the success of Transformers on NLP tasks, this paper applies Transformers to computer vision problems. One advantage is that Transformers require substantially fewer computational resources to train than comparable CNN architectures. Related work includes the iGPT model, which first reduces image resolution and color space before applying a Transformer to the image pixels; iGPT achieves 72% accuracy on ImageNet.

The technique of using Transformers for vision works by dividing each image into fixed-size patches, which are then flattened and mapped to a sequence of linear embeddings. Positional information is retained by adding position embeddings to the patch embeddings. This sequence of embeddings is then fed into a standard Transformer, and the model is trained with supervision for image classification. An overview of this Vision Transformer (ViT) model is pictured below in Figure 1.
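The patch-and-embed pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the projection matrix, position embeddings, and the prepended [class] token are random stand-ins for what would be learned parameters.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    p = patch_size
    # reshape into a grid of patches, then flatten each patch
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))

patches = patchify(image)                            # (196, 768): 14x14 patches of 16*16*3 values
embed_dim = 768                                      # hidden size of ViT-Base
W = rng.standard_normal((patches.shape[1], embed_dim))    # stand-in for the learned linear projection
cls_token = rng.standard_normal((1, embed_dim))           # stand-in for the learnable [class] token
pos = rng.standard_normal((patches.shape[0] + 1, embed_dim))  # stand-in position embeddings

# sequence handed to the standard Transformer encoder
tokens = np.concatenate([cls_token, patches @ W], axis=0) + pos  # (197, 768)
```

A 224x224 image with 16x16 patches yields 14x14 = 196 patches, i.e. the "16x16 words" of the title, plus one [class] token.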


The paper makes no significant modifications to the standard Transformer architecture, so that scalable NLP Transformer implementations can be reused for these computer vision tasks as well.

CNNs have much more image-specific inductive bias than the Vision Transformer: locality, 2D neighborhood structure, and translation equivariance are baked into every layer of a CNN, whereas in the Vision Transformer only the MLP layers are local and translationally equivariant; the self-attention layers are global.

The Vision Transformer is pre-trained on large datasets and then fine-tuned on smaller downstream tasks. The paper finds that training at large scale outweighs the weaker inductive bias. They also find that fine-tuning at a higher resolution than was used during pre-training is beneficial. For these higher-resolution images the patch size remains the same, yielding a larger effective sequence length.
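The effect of higher-resolution fine-tuning on sequence length is simple arithmetic: with the patch size held fixed, the number of patches grows with the square of the resolution. A quick sketch:

```python
def num_patches(resolution, patch_size=16):
    """Effective sequence length (excluding the [class] token)
    for a square image at the given resolution."""
    return (resolution // patch_size) ** 2

print(num_patches(224))  # 196 patches at pre-training resolution
print(num_patches(384))  # 576 patches at a higher fine-tuning resolution, same 16x16 patches
```

Because the sequence gets longer, the pre-trained position embeddings no longer match one-to-one; the paper handles this by 2D-interpolating them to the new grid.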

Evaluation shows that the Vision Transformer attains state-of-the-art results for representation learning at much lower pre-training cost. The 'Base' and 'Large' variants follow the BERT configurations, while the 'Huge' model is novel to the paper; the different size configurations are listed in Table 1 below. Scalability was evaluated by pre-training on datasets of increasing size, including ImageNet; when pre-trained on the smallest dataset, ImageNet, the Large Vision Transformer models performed worse than the Base ones.
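For reference, the Table 1 variants can be summarized as a small config table. The layer/width/head values below are as reported in the paper; the parameter counts in the comments are approximate.

```python
# ViT size variants (Table 1 of the paper)
VIT_CONFIGS = {
    "ViT-Base":  dict(layers=12, hidden=768,  mlp=3072, heads=12),  # ~86M parameters
    "ViT-Large": dict(layers=24, hidden=1024, mlp=4096, heads=16),  # ~307M parameters
    "ViT-Huge":  dict(layers=32, hidden=1280, mlp=5120, heads=16),  # ~632M parameters
}

for name, cfg in VIT_CONFIGS.items():
    print(name, cfg)
```

ViT-Base and ViT-Large mirror BERT-Base and BERT-Large in depth and hidden size; ViT-Huge scales both further.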


The following Table 2 displays results showing that all the Vision Transformer variants outperformed the Big Transfer and Noisy Student state-of-the-art CNNs on image classification benchmarks, with much less pre-training compute, when pre-trained on the large JFT-300M dataset.


The following chart in Figure 4 shows how the Vision Transformer performs better with larger datasets while the ResNets are better on smaller datasets.


Visualizing the attention of the model reveals that it is able to focus on regions of the image that are relevant for classification, as seen in the figure below.