Contrastive Learning of Medical Visual Representations from Paired Images and Text, Zhang, Jiang, Miura, Manning, Langlotz; 2020 - Summary
author: TongruiLi
score: 4/10

This paper presents ConVIRT, an unsupervised method for learning visual representations of medical images, requiring no expert input beyond the free-text report paired with each image. It essentially trains an image encoder backbone that can be transferred to a variety of downstream tasks.

The paper assumes the inputs come as pairs \((x_v, x_u)\) of an image and its associated text. The goal is to learn an encoder that maps the image to a latent-space representation. They propose the following: each image \(x_v\) and text \(x_u\) is passed through its own encoder and a nonlinear projection, yielding representations \(v\) and \(u\) in a shared space.
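To make the setup concrete, here is a minimal PyTorch sketch of the image side, assuming (per the paper) a ResNet-50 backbone and a nonlinear projection head; the class names and the 512-dimensional output are illustrative. The text side is analogous (the paper uses a BERT-style text encoder with its own projection), and the image backbone is the part that gets transferred downstream.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ProjectionHead(nn.Module):
    """Nonlinear projection into the shared image-text space."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ImageEncoder(nn.Module):
    """ResNet-50 backbone (the transferable part) plus projection head."""
    def __init__(self, out_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()  # keep the 2048-d pooled features
        self.backbone = backbone
        self.proj = ProjectionHead(2048, out_dim)

    def forward(self, x_v: torch.Tensor) -> torch.Tensor:
        # x_v: (N, 3, H, W) batch of images -> v: (N, out_dim)
        return self.proj(self.backbone(x_v))
```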

They then use two different losses.

Image-to-text contrastive loss, for the \(i\)-th of \(N\) pairs in a batch:

\[
\ell_i^{(v \to u)} = -\log \frac{\exp(\langle v_i, u_i \rangle / \tau)}{\sum_{k=1}^{N} \exp(\langle v_i, u_k \rangle / \tau)}
\]

where \(\langle v, u \rangle\) represents the cosine similarity and \(\tau\) represents the temperature.

Text-to-image contrastive loss, the symmetric counterpart:

\[
\ell_i^{(u \to v)} = -\log \frac{\exp(\langle u_i, v_i \rangle / \tau)}{\sum_{k=1}^{N} \exp(\langle u_i, v_k \rangle / \tau)}
\]

Final loss

A convex combination of the two directional losses, averaged over the batch:

\[
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \lambda\, \ell_i^{(v \to u)} + (1 - \lambda)\, \ell_i^{(u \to v)} \right)
\]

where \(\lambda \in [0, 1]\) weights the two directions.
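As a sanity check on the formulas, here is a minimal sketch of the combined loss in PyTorch; `v` and `u` are the projected batch representations (e.g. from the encoders above), and the defaults for `tau` and `lam` follow the paper's reported settings (\(\tau = 0.1\), \(\lambda = 0.75\)). This is an illustrative reimplementation, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def convirt_loss(v: torch.Tensor, u: torch.Tensor,
                 tau: float = 0.1, lam: float = 0.75) -> torch.Tensor:
    """Bidirectional contrastive loss over N (image, text) pairs.

    v, u: (N, d) projected image / text representations. Cosine
    similarity <v_i, u_k> is the dot product of L2-normalized rows,
    and row i's positive match is column i.
    """
    v = F.normalize(v, dim=1)
    u = F.normalize(u, dim=1)
    logits = v @ u.t() / tau  # (N, N) matrix of <v_i, u_k> / tau
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2u = F.cross_entropy(logits, targets)      # image-to-text
    loss_u2v = F.cross_entropy(logits.t(), targets)  # text-to-image
    # cross_entropy averages over the batch, giving the 1/N sum
    return lam * loss_v2u + (1 - lam) * loss_u2v
```

Note that every in-batch text (or image) other than the paired one serves as a negative, so larger batches provide more negatives per example.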

On both image and text retrieval, the paper beats all previous benchmarks. However, the baselines are at best modifications of general-domain state-of-the-art methods, so the comparison may suffer from inductive bias.

The authors also release some analysis of hyperparameter sensitivity, which is interesting to see.

TL;DR

ConVIRT learns transferable medical image representations by contrastively aligning images with their paired text reports via a bidirectional, temperature-scaled loss; it beats the retrieval baselines, though those baselines are only lightly adapted general-domain methods.