Compressive Transformers for Long-Range Sequence Modelling, Rae, Potapenko, Jayakumar, Lillicrap; 2019 - Summary
author: zayne-sprague
score: 9 / 10

What’s the big idea

What’s Wrong With Attention

[Animation of the TransformerXL]

How Can We Improve Attention

How Does This Work (How Is It Realized)

1.) You pass a window of the sequence, \(S_i = x_t, \ldots, x_{t+n_s}\), through the network

2.) As the model moves on to the next window, the activations from \(S_i\) are pushed into a memory of past activations (the model can still attend to them)

3.) Eventually the memory gets too large, so the oldest activations are evicted

4.) Instead of deleting the oldest activations, we apply a learned compression function to them and store the result in a second, compressed memory (see the sketch below)
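A minimal sketch of this two-level memory update, assuming activations arrive in fixed-size blocks; `update_memories`, `compress`, and the parameter names here are mine, not the paper's:

```python
from collections import deque

def update_memories(memory, comp_memory, new_activations,
                    mem_size, comp_mem_size, compress, c=3):
    """Push the newest activations into memory; compress evicted ones
    instead of throwing them away."""
    memory.extend(new_activations)

    # Evict the oldest activations once the primary memory overflows.
    overflow = len(memory) - mem_size
    if overflow > 0:
        evicted = [memory.popleft() for _ in range(overflow)]
        # Each group of c evicted activations becomes one compressed slot
        # (any remainder is dropped here for simplicity).
        for i in range(0, len(evicted) - c + 1, c):
            comp_memory.append(compress(evicted[i:i + c]))
        # The compressed memory is bounded too; its oldest slots fall off first.
        while len(comp_memory) > comp_mem_size:
            comp_memory.popleft()
    return memory, comp_memory
```

Both buffers are plain FIFO queues here; at attention time the model attends over the current window plus the contents of both memories.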

How compression works

How do we compress an activation, aka a “memory”?

\[ f_c : \mathbb{R}^{n_s \times d} \to \mathbb{R}^{\lfloor n_s / c \rfloor \times d} \]

Here \(f_c\) is the compression function, \(n_s\) is the number of old activations being compressed, \(d\) is the hidden size, and \(c\) is the compression rate.
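As a concrete example of one candidate \(f_c\), the paper compares simple pooling baselines against learned compressors; here is a minimal sketch of mean pooling at rate \(c\) (shapes and the function name are my assumptions):

```python
import torch

def mean_pool_compress(old_mem: torch.Tensor, c: int = 3) -> torch.Tensor:
    """Compress [n, d] old memories into [n // c, d] by averaging every
    c consecutive activations."""
    n, d = old_mem.shape
    n = (n // c) * c                       # drop any remainder for simplicity
    return old_mem[:n].reshape(n // c, c, d).mean(dim=1)
```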

The best method was a convolution trained with an attention-reconstruction loss: a convolutional network compresses the old memories, and the loss measures how well attention over the compressed memories reproduces attention over the originals.

The convolutional networks were trained to weight the reconstruction so that the parts of the memory that received the most attention were preserved best.
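A hedged sketch of that idea, using a strided 1D convolution as the compressor and a simplified single-head attention as the reconstruction target (class and function names are mine, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvCompression(nn.Module):
    def __init__(self, d_model: int, c: int = 3):
        super().__init__()
        # Strided 1D conv: every c consecutive memories -> 1 compressed memory.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)

    def forward(self, old_mem: torch.Tensor) -> torch.Tensor:
        # old_mem: [n, d]; Conv1d expects [batch, channels, length].
        x = old_mem.t().unsqueeze(0)          # [1, d, n]
        return self.conv(x).squeeze(0).t()    # [n // c, d]


def simple_attention(q, kv):
    """Single-head dot-product attention (a simplification of the model's
    multi-head attention)."""
    scores = q @ kv.t() / kv.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ kv


def attention_reconstruction_loss(h, old_mem, compress_fn):
    """Compare attention over the original memories with attention over the
    compressed memories; gradients only flow into the compression network."""
    h, old_mem = h.detach(), old_mem.detach()   # don't train the transformer here
    target = simple_attention(h, old_mem)
    recon = simple_attention(h, compress_fn(old_mem))
    return F.mse_loss(recon, target)
```

The `detach` calls reflect the point that this is an auxiliary, local objective: it trains the compressor without backpropagating into the main language model.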

So… did it work?

Yes. Because the compressed memories are a more compact representation of the past, the Compressive Transformer can attend over a much longer temporal range of the sequence.

The maximum temporal range grows from \(l \times n_m\) for the TransformerXL to \(l \times (n_m + c \, n_{cm})\) for the Compressive Transformer, where \(l\) is the number of layers, \(n_m\) and \(n_{cm}\) are the sizes of the regular and compressed memories, and \(c\) is the compression rate, while the attention cost stays at \(\mathcal{O}(n_s^2 + n_s(n_m + n_{cm}))\).

Essentially, you can achieve a vastly larger range of memories for the same attention cost as the TransformerXL.
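For a rough sense of scale, a back-of-the-envelope comparison under assumed hyperparameters (these numbers are illustrative, not the paper's reported settings):

```python
n_layers = 24    # layers
n_m      = 512   # regular memory slots per layer
n_cm     = 512   # compressed memory slots per layer
c        = 3     # compression rate

# TransformerXL with an equally sized memory of n_m + n_cm raw activations:
txl_range = n_layers * (n_m + n_cm)          # 24,576 timesteps
# Compressive Transformer: each compressed slot summarizes c timesteps:
ct_range  = n_layers * (n_m + c * n_cm)      # 49,152 timesteps

print(txl_range, ct_range)
```

Both models attend over the same number of memory slots per step, but each compressed slot summarizes \(c\) timesteps, so the reachable history roughly doubles in this setup.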

What are the effects of longer temporal ranges

How does it perform?

[Table 4 and Table 6 from the paper]

Downsides?

TL;DR