Perceiver IO: A General Architecture for Structured Inputs & Outputs, Jaegle, Borgeaud, Alayrac, Doersch, Ionescu, Ding, Koppula, Zoran, Brock, Shelhamer, Hénaff, Botvinick, Zisserman, Vinyals, Carreira; 2021 - Summary
author: aabayomi
score: 10 / 10


What is the core idea?

The authors propose Perceiver IO, a model whose compute scales linearly with the input and output sizes: inputs are encoded into a smaller latent array, processed there, and decoded with task-specific queries, achieving state-of-the-art results on several tasks.
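To get a rough sense of why this matters, here is a back-of-the-envelope comparison of pairwise attention interactions (the numbers are illustrative assumptions, not from the paper):

```python
# Illustrative comparison: full self-attention over the inputs scales
# quadratically in the input size M, while cross-attending into a small
# latent array of size N scales linearly in M.
M, N = 50_000, 256                           # raw input elements vs. latent size
print(f"self-attention pairs:  {M * M:,}")   # 2,500,000,000
print(f"cross-attention pairs: {M * N:,}")   # 12,800,000
```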

How is it realized (technically)?

Each attention module combines a query-key-value (QKV) attention operation with a multi-layer perceptron (MLP), similar to a Transformer block but more efficient in compute and memory. The model first cross-attends a latent array of size N to the inputs of size M, processes the latents with L self-attention blocks, and then decodes by cross-attending an output query array of size O (constructed per task) to the latents. QKV attention between arrays with index dimensions M and N and feature dimension F costs O(MNF), coming from the two matrix multiplications that produce the M × N attention map and the M × F output; the full architecture with L latent attention blocks therefore has complexity O([M + O + LN]NF), linear in both the input size M and the output size O. A minimal sketch of this encode-process-decode pattern is given below.
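The following is a minimal sketch of that structure, assuming PyTorch's nn.MultiheadAttention as a stand-in for the paper's attention modules; the sizes, single attention head, and omitted MLPs/LayerNorms are simplifications, not the authors' implementation:

```python
import torch
import torch.nn as nn

M, N, O, F = 1024, 256, 512, 64   # input, latent, output index dims; feature dim
L = 6                             # number of latent self-attention blocks

inputs  = torch.randn(1, M, F)    # input array (e.g. byte/pixel embeddings)
latents = torch.randn(1, N, F)    # latent array (a learned parameter in practice)
queries = torch.randn(1, O, F)    # output query array, built per task

cross_enc = nn.MultiheadAttention(F, num_heads=1, batch_first=True)
latent_sa = nn.MultiheadAttention(F, num_heads=1, batch_first=True)
cross_dec = nn.MultiheadAttention(F, num_heads=1, batch_first=True)

z, _ = cross_enc(latents, inputs, inputs)    # encode:  O(M N F)
for _ in range(L):                           # process: O(L N^2 F)
    z = z + latent_sa(z, z, z)[0]
outputs, _ = cross_dec(queries, z, z)        # decode:  O(O N F)
print(outputs.shape)                         # (1, O, F); total O([M + O + LN] N F)
```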

Perceiver IO produces outputs with different semantics by constructing the decoder queries from output-specific features, e.g. position encodings for dense outputs or task embeddings for classification, as sketched below.
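For a dense per-pixel output, for example, one query can be built per output position from Fourier position features. The helper below is a hypothetical illustration of that idea, not the paper's exact feature construction:

```python
import torch

def fourier_positions(h, w, num_bands=4):
    """Sin/cos features of normalized (y, x) coordinates, one row per output pixel."""
    ys = torch.linspace(-1.0, 1.0, h)
    xs = torch.linspace(-1.0, 1.0, w)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (h, w, 2)
    freqs = 2.0 ** torch.arange(num_bands)                             # (num_bands,)
    angles = grid.unsqueeze(-1) * freqs * torch.pi                     # (h, w, 2, num_bands)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)            # (h, w, 2, 2*num_bands)
    return feats.reshape(h * w, -1)                                    # (h*w, 4*num_bands)

queries = fourier_positions(64, 64)   # one decoder query per output position
print(queries.shape)                  # torch.Size([4096, 16])
```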

How well does the paper perform?

Perceiver IO performs on par with or better than Transformer baselines on the GLUE benchmark. Operating directly on raw UTF-8 bytes, it achieves strong results without tokenization, which is otherwise an important preprocessing step in NLP.

On ImageNet image classification, Perceiver IO is compared against standard baselines (the first two blocks of rows in the paper's results table) and performs competitively.

Perceiver IO also achieves high accuracy on multimodal autoencoding: an audio-video-label task whose goal is to reconstruct multimodal inputs through an architectural bottleneck.

Datasets

The model also shows strong performance on optical flow, a dense computer-vision task.

What interesting variants are explored?

TL;DR