author: biofizzatreya
score: 8 / 10

Summary: Transformers changed NLP because of their self-attention mechanism, in which an entire sequence is converted into an attention matrix. This lets the network see the whole sequence at once and infer long-range correlations, which is otherwise difficult for sequential models such as LSTMs or RNNs. However, creating the attention matrix is memory-intensive, since its size is quadratic in the input sequence length, which restricts the use of transformers on longer sequences. BigBird attempts to circumvent this challenge by using sparse attention.
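To see where the quadratic cost comes from, here is a minimal NumPy sketch of dense single-head self-attention (not from the paper; the sequence length and dimensions are made up for illustration). The score matrix has one entry per pair of positions, so its memory grows as the square of the sequence length.

```python
# Minimal dense self-attention sketch; seq_len and d are illustrative values.
import numpy as np

def dense_self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d). Returns the attended output, also (seq_len, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (seq_len, seq_len): quadratic in seq_len
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 4096, 64
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = dense_self_attention(X, Wq, Wk, Wv)
# The score matrix alone stores seq_len**2 floats: 4096**2 * 8 bytes ≈ 134 MB per head.
```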

Instead of a full attention matrix, BigBird uses a mixture of:

- sliding-window (local) attention, where each token attends to its immediate neighbours,
- random attention, where each token attends to a small number of randomly chosen tokens, and
- global attention, where a few tokens attend to, and are attended by, the entire sequence.
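As a rough illustration of how these patterns combine (this is not the authors' block-sparse implementation; the window size `w`, random count `r`, and number of global tokens `g` below are made-up defaults), the mixture can be written as a boolean mask over query-key pairs:

```python
# Hypothetical BigBird-style sparsity mask; w, r, g are illustrative parameters.
import numpy as np

def bigbird_style_mask(seq_len, w=3, r=2, g=1, seed=0):
    """Boolean (seq_len, seq_len) mask: True where query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    idx = np.arange(seq_len)
    # Sliding-window attention: each token attends to w neighbours on either side.
    mask |= np.abs(idx[:, None] - idx[None, :]) <= w
    # Global attention: the first g tokens attend to, and are attended by, everything.
    mask[:g, :] = True
    mask[:, :g] = True
    # Random attention: each query additionally attends to r randomly chosen keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=r, replace=False)] = True
    return mask

m = bigbird_style_mask(16)
print(int(m.sum()), "of", m.size, "query-key pairs kept")
```

Each row keeps only on the order of \(w + r + g\) entries, so the memory cost grows linearly with sequence length instead of quadratically.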

In BigBird, attention is modelled as a directed graph in which each token \(i\) attends only to its out-neighbourhood \(N(i)\):

\[\text{ATTN}_D(X)_i = x_i + \sum_{h=1}^H \sigma \left( Q_h(x_i)K_h(X_{N(i)})^T\right)\cdot V_h(X_{N(i)})\]

\(Q_h\) and \(K_h\) are query and key functions respectively, \(V_h\) is a value function, \(\sigma\) is a scoring function (e.g. softmax), and \(H\) is the number of heads; \(X_{N(i)}\) is the matrix built from only those \(x_j\) with \(j \in N(i)\). The BigBird attention mechanism is tested on a number of datasets.
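A minimal single-head (\(H = 1\)) reading of this formula, assuming \(\sigma\) is a softmax over each neighbourhood, \(N(i)\) is supplied as a boolean mask, and the usual \(1/\sqrt{d}\) scaling is added, might look like the sketch below (a plain Python loop for clarity, not the paper's efficient blocked kernels):

```python
# Sketch of the sparse attention formula above for a single head; softmax is assumed for sigma.
import numpy as np

def sparse_graph_attention(X, Wq, Wk, Wv, mask):
    """X: (n, d); mask[i, j] is True iff j is in N(i). Returns ATTN_D(X) of shape (n, d)."""
    n, d = X.shape
    out = X.copy()                                   # the x_i residual term
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    for i in range(n):
        nbrs = np.flatnonzero(mask[i])               # N(i)
        scores = Q[i] @ K[nbrs].T / np.sqrt(d)       # Q_h(x_i) K_h(X_{N(i)})^T
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                     # sigma = softmax over N(i)
        out[i] += weights @ V[nbrs]                  # weighted sum of V_h(X_{N(i)})
    return out

rng = np.random.default_rng(0)
n, d = 32, 16
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
window_mask = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= 2  # a simple N(i)
Y = sparse_graph_attention(X, Wq, Wk, Wv, window_mask)
```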

BigBird performs better than models such as RoBERTa and Longformer on QA-type tasks. It also performs better than BERT on classification tasks, especially for longer documents, presumably because BERT's 512-token limit forces it to truncate long inputs.

The authors tested BigBird on genomic sequences as well. Here BigBird outperforms previously available models, especially in promoter-region prediction.

TL;DR: BigBird replaces full quadratic self-attention with a sparse mixture of sliding-window, random, and global attention, which lets transformers handle much longer sequences and improves results on QA, long-document classification, and genomics tasks.