A Primer in BERTology: What we know about how BERT works, Rogers, Kovaleva, Rumshisky; 2020 - Summary
author: timchen0618
score: 8 / 10

Core Idea

This is a survey paper that tries to identify how BERT works and what knowledge it captures that lets it solve downstream tasks. The authors also investigate BERT's training process and its overparameterization problem.

Knowledge That BERT Has

Syntactic Knowledge

Semantic Knowledge

World Knowledge

Localized Linguistic Knowledge

Attention Heads

Layers

Training BERT

The authors also offer tips on how to train a BERT model, organized into the following areas:

Model Architecture Choices

Improvements to Training

Improvements to Pretraining Data

Useful directions include larger pretraining datasets, longer training, explicitly including linguistic information in the data, and incorporating structured knowledge such as knowledge bases or entities during training.
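
As one illustration of injecting structured knowledge, the sketch below masks whole entity spans instead of random subwords when building masked-language-model training inputs. The tokenizer, the hard-coded entity spans, and the masking rate are illustrative assumptions, not the setup of any specific paper discussed in the survey.

```python
# Minimal sketch: entity-level masking for MLM pretraining data,
# one way to bring entity knowledge into training.
# Entity spans and masking rate are illustrative assumptions.
import random
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def mask_entities(text, entity_spans, mask_prob=0.5):
    """Mask whole entity spans (character offsets) instead of random subwords."""
    enc = tokenizer(text, return_offsets_mapping=True)
    input_ids = enc["input_ids"]
    offsets = enc["offset_mapping"]
    mask_id = tokenizer.mask_token_id

    for start, end in entity_spans:
        if random.random() < mask_prob:
            # Mask every subword token whose character span lies inside the entity.
            for i, (tok_start, tok_end) in enumerate(offsets):
                if tok_start >= start and tok_end <= end and tok_end > tok_start:
                    input_ids[i] = mask_id
    return input_ids

# "Dante" spans chars 0-5, "Florence" spans chars 18-26 (hand-picked here;
# a real pipeline would get spans from an entity linker or knowledge base).
ids = mask_entities("Dante was born in Florence.", entity_spans=[(0, 5), (18, 26)])
print(tokenizer.convert_ids_to_tokens(ids))
```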

Improvements to Fine-tuning

Overparameterization Problem

The authors also point out that BERT is heavily overparameterized. For example, most attention heads can be pruned without a large loss in performance, and on some tasks adding more layers actually hurts performance.
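
As a concrete illustration of head pruning, here is a minimal sketch using the Hugging Face `transformers` library (not the survey's own code). The layer and head indices are arbitrary placeholders; in practice they would be chosen by some importance measure computed on a downstream task.

```python
# Minimal sketch: pruning attention heads from BERT with `transformers`.
# Layer/head indices below are arbitrary placeholders for illustration.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print("params before:", sum(p.numel() for p in model.parameters()))

# {layer_index: [head indices to remove]} -- bert-base has 12 layers x 12 heads.
heads_to_prune = {
    0: [0, 1, 2],
    5: [4, 7],
    11: [1, 3, 5],
}
model.prune_heads(heads_to_prune)

# The pruned model keeps the same interface, so it can still be fine-tuned
# or evaluated on downstream tasks, just with fewer parameters.
print("params after: ", sum(p.numel() for p in model.parameters()))
```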

Solution: Compression Techniques

Suggestions for Future Work

TL;DR