Deep Residual Learning for Image Recognition, He, Zhang, Ren, Sun; 2015 - Summary
author: philkr
score: 10 / 10

If AlexNet started the modern era of deep learning, ResNets made sure it would spread to all of computer vision and beyond. The core idea of residual networks is to parametrize each layer (or group of layers) as the identity plus a residual function: a function \(f_1(x)\) becomes \(f_2(x) = x + g(x)\).

Both representations have the same expressive power and computational requirements. However, they have different gradients. For regular networks, back-propagation multiplies the incoming gradient by \(\frac{\partial}{\partial x} f_1(x)\), while residual networks use \(\frac{\partial}{\partial x}\left(x + g(x)\right) = I + \frac{\partial}{\partial x} g(x)\). Hence, back-propagation passes the gradient back directly through the identity term, and adds a second component multiplied by \(\frac{\partial}{\partial x} g(x)\).
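A minimal PyTorch sketch (not from the paper) that makes the identity term visible: even if the residual branch \(g\) contributed nothing to the gradient, the skip connection alone would still pass the signal back unchanged.

```python
import torch

# Residual parametrization f2(x) = x + g(x), with g a single linear layer.
x = torch.randn(4, requires_grad=True)
g = torch.nn.Linear(4, 4)

y = x + g(x)          # skip connection plus residual branch
y.sum().backward()

# d f2 / dx = I + d g / dx, so x.grad is a vector of ones (identity path)
# plus the contribution of g. With a plain layer y = g(x), the ones would
# be missing and the gradient would depend entirely on g's weights.
print(x.grad)
```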

Fun fact: the backward pass of a ResNet block looks just like the forward pass, since the gradient also flows through an identity path plus a residual branch.

How do we build a network out of this idea? There are two variants of the basic residual block: post-activation blocks (below left) and pre-activation blocks (below right).
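Since the block diagrams are not reproduced here, a rough PyTorch sketch of the two basic (two 3x3 convolution) variants, assuming stride 1 and an unchanged channel count:

```python
import torch.nn as nn

class PostActBlock(nn.Module):
    """Post-activation (original) block: conv-BN-ReLU-conv-BN, add the skip, then ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))


class PreActBlock(nn.Module):
    """Pre-activation block: BN-ReLU-conv-BN-ReLU-conv, then add the skip."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)
```

The pre-activation variant keeps the skip path completely clean (no ReLU after the addition), which is what makes the backward pass mirror the forward pass so closely.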

ResNets then stack many of these blocks to form a network.

The first layer of each series of blocks is strided, and its residual connection uses a 1x1 convolution (to deal with the change in spatial resolution and number of channels).
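Continuing the sketch above (and reusing the hypothetical PostActBlock from it), the strided first block of a stage might look like this, with a 1x1 projection on the shortcut so the two paths can be added:

```python
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """First block of a stage: stride-2 convolutions plus a stride-2 1x1
    projection shortcut to match the new resolution and channel count."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.shortcut(x) + self.body(x))


def make_stage(in_channels, out_channels, num_blocks):
    """A stage: one strided projection block followed by identity-shortcut blocks
    (PostActBlock is the block from the previous sketch)."""
    blocks = [DownsampleBlock(in_channels, out_channels)]
    blocks += [PostActBlock(out_channels) for _ in range(num_blocks - 1)]
    return nn.Sequential(*blocks)
```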

ResNets really blew the competition out of the water across many computer vision tasks. They are still used today.

TL;DR

Reparametrize each block as \(x + g(x)\) so that back-propagation passes the gradient back directly. This simple change let ResNets blow the competition out of the water across computer vision, and they are still used today.