Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, Katharopoulos, Vyas, Pappas, Fleuret; 2020 - Summary
author: kelseyball
score: 9 / 10

Core Idea

Main Contributions

Softmax Attention to Linear Attention via Kernel Feature Maps

The transformer from Vaswani et al. (2017) implements the following form of self-attention, where the similarity score is the exponential of the scaled dot product between a query and a key:
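Writing it out in the paper's notation, where $Q_i$, $K_j$, and $V_j$ are the $i$-th query, $j$-th key, and $j$-th value (rows of $Q$, $K$, $V$), $N$ is the sequence length, and $D$ is the query/key dimension:

$$V_i' = \frac{\sum_{j=1}^{N} \exp\!\left(Q_i^{\top} K_j / \sqrt{D}\right) V_j}{\sum_{j=1}^{N} \exp\!\left(Q_i^{\top} K_j / \sqrt{D}\right)} \qquad (2)$$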

Generalizing from the exponential of the dot product to any non-negative similarity function, we can rewrite (2) as:
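Here $\operatorname{sim}(\cdot,\cdot)$ stands for that arbitrary non-negative similarity:

$$V_i' = \frac{\sum_{j=1}^{N} \operatorname{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{N} \operatorname{sim}(Q_i, K_j)}$$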

Our similarity function can also be a kernel function. In this context, a kernel function takes two vectors, maps them to another vector space using a kernel feature map, and returns their inner product in that space: $\operatorname{sim}(x, y) = \phi(x)^{\top} \phi(y)$. Here is the self-attention equation using kernel feature map $\phi$:
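Substituting $\operatorname{sim}(Q_i, K_j) = \phi(Q_i)^{\top} \phi(K_j)$ into the equation above:

$$V_i' = \frac{\sum_{j=1}^{N} \phi(Q_i)^{\top} \phi(K_j)\, V_j}{\sum_{j=1}^{N} \phi(Q_i)^{\top} \phi(K_j)}$$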

The associative property of matrix multiplication allows us to make the following simplification:
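Since $\phi(Q_i)^{\top}$ does not depend on $j$, it factors out of both sums (with $\phi(K_j) V_j^{\top}$ an outer product):

$$V_i' = \frac{\phi(Q_i)^{\top} \sum_{j=1}^{N} \phi(K_j)\, V_j^{\top}}{\phi(Q_i)^{\top} \sum_{j=1}^{N} \phi(K_j)}$$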

The reduction from quadratic to linear complexity in sequence length comes from the fact that $\sum_{j} \phi(K_j) V_j^{\top}$ (and likewise $\sum_{j} \phi(K_j)$ in the denominator) can be computed once and reused for every query.
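A minimal PyTorch sketch of the non-causal case (single head, no batching; $\phi(x) = \operatorname{elu}(x) + 1$ is the feature map the paper proposes, while the shapes and helper names here are just illustrative):

```python
import torch

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: strictly positive, so the denominator stays positive
    return torch.nn.functional.elu(x) + 1

def linear_attention(Q, K, V, eps=1e-6):
    # Q, K: (N, D) queries and keys; V: (N, M) values. Single head, no batching.
    phi_Q = elu_feature_map(Q)                # (N, D)
    phi_K = elu_feature_map(K)                # (N, D)

    # Computed once and reused for every query -- this is the linear-time step:
    KV = phi_K.t() @ V                        # (D, M): sum_j phi(K_j) V_j^T
    Z = phi_K.sum(dim=0)                      # (D,):   sum_j phi(K_j)

    numerator = phi_Q @ KV                    # (N, M): phi(Q_i)^T sum_j phi(K_j) V_j^T
    denominator = (phi_Q @ Z).unsqueeze(-1)   # (N, 1): phi(Q_i)^T sum_j phi(K_j)
    return numerator / (denominator + eps)

# Example: 1024 tokens, 64-dim queries/keys and values
Q, K, V = (torch.randn(1024, 64) for _ in range(3))
out = linear_attention(Q, K, V)               # (1024, 64)
```

Note that `KV` and `Z` do not depend on the query index, so they are built in a single pass over the keys and values and then shared across all $N$ queries, giving $O(N)$ rather than $O(N^2)$ cost.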

Autoregressive Transformers

Linearizing Autoregressive Transformers

Autoregressive Transformers as RNNs

Experiments

Synthetic experiments

Image generation

Automatic Speech Recognition

TL;DR