Perceiver IO: A General Architecture for Structured Inputs & Outputs, Jaegle, Borgeaud, Alayrac, Doersch, Ionescu, Ding, Koppula, Zoran, Brock, Shelhamer, Hénaff, Botvinick, Zisserman, Vinyals, Carreira; 2021 - Summary
author: joshpapermaster
score: 9 / 10

Core Idea

The initial Perceiver model produced strong results, but the authors wanted to adapt it to handle more complex outputs. Perceiver IO can successfully be used on NLP and visual understanding problems with many different types of outputs, while the original Perceiver can only handle simpler outputs such as classification labels.

The idea of Perceiver is to create a model that can handle many different tasks with minimal changes while still performing well. It does this by cross-attending from a learned, fixed-size latent array to the input, producing a latent space with a constant size. Operations within the latent space are efficient with respect to the input because the latent dimensions are small and do not grow with input size.
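A minimal PyTorch sketch of this encoding step, assuming a learned latent array that cross-attends to the input (module names and sizes are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    def __init__(self, input_dim=256, latent_dim=512, num_latents=256, num_heads=8):
        super().__init__()
        # The latent array is a model parameter; its size never depends on the input.
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=num_heads,
            kdim=input_dim, vdim=input_dim, batch_first=True)

    def forward(self, x):                           # x: [batch, M, input_dim], M arbitrary
        q = self.latents.expand(x.size(0), -1, -1)  # [batch, N, latent_dim]
        latent, _ = self.cross_attn(q, x, x)        # attend from latents to input
        return latent                               # [batch, N, latent_dim], fixed size

x = torch.randn(2, 10_000, 256)    # e.g., 10k input elements
z = LatentEncoder()(x)             # -> [2, 256, 512]: constant-size latent
```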

How is it realized (technically)?

The structure of Perceiver IO is similar to that of Perceiver, but with a decoder at the end that can expand the latent space into whatever dimensions are necessary for the given task. It uses cross-attention with learned output queries: each query attends over the latent features and is mapped to one element of the output, so the same latent space can feed any output structure.
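A hedged sketch of this decoding step, where task-specific output queries cross-attend to the processed latent array (names and sizes here are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    def __init__(self, latent_dim=512, query_dim=512, out_dim=10, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=num_heads,
            kdim=latent_dim, vdim=latent_dim, batch_first=True)
        self.to_out = nn.Linear(query_dim, out_dim)

    def forward(self, queries, latent):
        # queries: [batch, O, query_dim] -- O is whatever the task requires
        # (1 for classification, H*W for dense prediction, etc.)
        attended, _ = self.cross_attn(queries, latent, latent)
        return self.to_out(attended)               # [batch, O, out_dim]

latent = torch.randn(2, 256, 512)      # processed latent from the encoder
queries = torch.randn(2, 100, 512)     # 100 output elements for this task
out = QueryDecoder()(queries, latent)  # -> [2, 100, 10]
```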


The key to Perceiver IO's advantage over standard Transformers is the latent space. It allows for greater scalability since all of the computationally expensive processing happens inside the fixed-size latent space. As a result, the model scales linearly with respect to both input and output size.
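To make the scaling claim concrete (the notation is mine, not a quote from the paper): with M input elements, O output queries, N latent vectors, and L latent self-attention blocks, the encoder cross-attention costs O(MN), each latent block costs O(N²), and the decoder cross-attention costs O(ON), for roughly O(MN + LN² + ON) overall. Because N is a fixed hyperparameter, compute is linear in both M and O, and the quadratic attention term is confined to the latent space.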

Related work and why this is more adaptable:


Queries make it possible to read out different types of outputs from the same latent-space information.

The query for outputs of a given task is built from features that describe each desired output element; the paper constructs queries from position encodings (e.g., Fourier features) and, for multi-task settings, learned task embeddings, as in the sketch below.
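An illustrative query construction, assuming the paper's recipe of combining positional features with learned task embeddings (the exact frequency ladder and dimensions below are my assumptions):

```python
import math
import torch

def fourier_features(positions, num_bands=16):
    # positions: [O] values in [0, 1]; returns [O, 2 * num_bands]
    freqs = 2.0 ** torch.arange(num_bands)              # geometric frequency ladder
    angles = 2 * math.pi * positions[:, None] * freqs   # [O, num_bands]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

O = 64                                        # one query per desired output element
pos = torch.linspace(0, 1, O)                 # normalized output positions
pos_feat = fourier_features(pos)              # [O, 32]
task_emb = torch.randn(1, 16).expand(O, -1)   # stand-in for a learned task embedding
queries = torch.cat([pos_feat, task_emb], dim=-1)  # [O, 48], fed to the decoder
```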


Experiments were done on language, visual understanding, video games, and multi-modal tasks.

On the GLUE benchmark, Perceiver IO can outperform BERT while also scaling to larger inputs. As seen in the paper's results, Perceiver IO operating on raw UTF-8 bytes performs comparably to BERT using SentencePiece tokens and is substantially better than BERT run directly on UTF-8 bytes. For these tests, the Transformer stack was replaced with the Perceiver IO architecture.


The multi-task architecture also performed best on the GLUE benchmark compared to the single-task, shared-input-token, and task-specific-input-token approaches.

Perceiver IO also performed very well on optical flow, multimodal autoencoding, and StarCraft II.

Additionally, the new model performed better than the original Perceiver architecture on ImageNet classification.



Other variants discussed include tuning the size of the latent space for the task at hand, allowing for generative models, and providing more robustness to adversarial attacks. However, the authors did not implement or test these within this paper.