LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference, Graham, El-Nouby, Touvron, Stock, Joulin, Jégou, Douze; 2021 - Summary
author: biofizzatreya
score: 8 / 10

The paper combines convolutional networks, and training principles from CNNs, with vision transformers. While transformers generalize well over large datasets, they have drawbacks on images: self-attention is permutation-invariant, so positional information must be learned explicitly, and since images are highly locally correlated, vision transformers need more data than convnets to pick up this local structure. Moreover, due to the quadratic complexity of the self-attention matrix, transformers have difficulty dealing with large images. LeViT's solution to these problems is that instead of feeding the raw image to the transformer, they feed the image after passing it through multiple convolutional layers, which shrinks the token grid and bakes local structure into the embeddings.
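A minimal sketch of this idea (not the authors' exact code; the layer widths loosely follow the LeViT-256 stem): a stack of stride-2 convolutions reduces a 224×224 image to a 14×14 grid, so the transformer attends over 196 tokens instead of raw pixels or thousands of small patches.

```python
# Hedged sketch: a convolutional stem feeding a plain transformer encoder.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Four stride-2 convs: (B, 3, 224, 224) -> (B, 196, dim) token sequence."""
    def __init__(self, dim=256):
        super().__init__()
        chans = [3, 32, 64, 128, dim]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.Hardswish()]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                     # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, dim)

stem = ConvStem(dim=256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2)
tokens = stem(torch.randn(1, 3, 224, 224))
print(encoder(tokens).shape)  # torch.Size([1, 196, 256])
```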


LeViT design principles:

- A small convolutional stem (four stride-2 3×3 convolutions) replaces the linear patch embedding, so the transformer starts from a 14×14 token grid.
- A pyramid structure like classical convnets: activation-map resolution shrinks across stages while the number of channels grows, with strided attention blocks used for downsampling.
- A learned per-head attention bias replaces explicit positional embeddings (see the sketch below).
- BatchNorm (which can be fused with the preceding linear layer at inference) instead of LayerNorm, and Hardswish activations.
- No class token; the prediction comes from average pooling over the last activation map.
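To make the attention-bias idea concrete, here is a hedged sketch of a translation-invariant per-head bias table, indexed by the relative offset (|Δx|, |Δy|) between two token positions; the module and its name are illustrative, not the authors' code.

```python
# Hedged sketch: learned attention bias added to attention logits in place
# of positional embeddings. One scalar per head per unique relative offset.
import torch
import torch.nn as nn

class AttentionBias(nn.Module):
    def __init__(self, num_heads, resolution):
        super().__init__()
        pts = [(i, j) for i in range(resolution) for j in range(resolution)]
        n = len(pts)
        offsets, idxs = {}, []
        for p in pts:
            for q in pts:
                off = (abs(p[0] - q[0]), abs(p[1] - q[1]))
                if off not in offsets:
                    offsets[off] = len(offsets)
                idxs.append(offsets[off])
        self.bias = nn.Parameter(torch.zeros(num_heads, len(offsets)))
        self.register_buffer('idx', torch.tensor(idxs).view(n, n))

    def forward(self):
        # (heads, N, N) tensor to add to the pre-softmax attention logits
        return self.bias[:, self.idx]

bias = AttentionBias(num_heads=4, resolution=14)
print(bias().shape)  # torch.Size([4, 196, 196])
```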

LeViT is trained with distillation from a teacher network. It has two classification heads: one performs classification with a cross-entropy loss against the ground-truth labels, and the second is trained to match a RegNetY-16GF teacher trained on ImageNet.
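A sketch of the resulting two-head objective, assuming DeiT-style hard distillation (the teacher's argmax is used as the target for the second head); the function name and the equal weighting of the two terms are illustrative assumptions.

```python
# Hedged sketch of the two-head training loss.
import torch
import torch.nn.functional as F

def distillation_loss(head_cls, head_dist, labels, teacher_logits):
    # head_cls / head_dist: (B, num_classes) logits from the two heads
    loss_cls = F.cross_entropy(head_cls, labels)          # true labels
    teacher_labels = teacher_logits.argmax(dim=1)         # hard teacher targets
    loss_dist = F.cross_entropy(head_dist, teacher_labels)
    return 0.5 * (loss_cls + loss_dist)

# toy usage with random tensors standing in for model/teacher outputs
B, C = 8, 1000
loss = distillation_loss(torch.randn(B, C), torch.randn(B, C),
                         torch.randint(0, C, (B,)), torch.randn(B, C))
print(loss.item())
```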

Ablations: To understand LeViT's performance, they performed multiple ablation studies by selectively changing different parts of the network.

TL;DR

LeViT speeds up vision-transformer inference by replacing the patch embedding with a small convnet, shrinking resolution across transformer stages, and training with distillation from a RegNetY-16GF teacher, yielding a better accuracy/speed trade-off than plain vision transformers.