Deformable Convolutional Networks, Dai, Qi, Xiong, Li, Zhang, Hu, Wei; 2017 - Summary
author: ishanashah
score: 9 / 10

Background

Problem: Accommodate geometric variations in object scale, pose, and viewpoint in image recognition.

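Two previous solutions: build training datasets with sufficient geometric variation (typically through data augmentation), or use transformation-invariant features and algorithms (e.g. SIFT, sliding-window detection).

CNNs are limited in modeling geometric transformations because their modules have fixed geometric structures.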

Deformable Convolution

Deformable convolution adds 2D offsets to the regular grid sampling locations in standard convolutions.
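
A minimal sketch of this idea for a single output location, in plain Python. It assumes a 3x3 kernel, a single-channel feature map stored as a list of rows, bilinear interpolation for fractional sampling positions (as the paper uses), and zero values outside the map (an assumption made here for simplicity). The names `bilinear_sample` and `deformable_conv_at` are hypothetical and do not come from the paper's code.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional (y, x); positions outside count as zero."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Blend the four integer neighbours, weighted by closeness to (y, x).
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

def deformable_conv_at(feature, weights, offsets, p0, dilation=1):
    """Output at location p0 = (row, col) for a 3x3 kernel.

    weights: 9 floats; offsets: 9 (dy, dx) pairs, both in row-major grid order.
    """
    # Regular sampling grid of a standard 3x3 convolution.
    grid = [(dy * dilation, dx * dilation) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for w, (gy, gx), (oy, ox) in zip(weights, grid, offsets):
        # Each grid position is shifted by its own 2D offset before sampling.
        out += w * bilinear_sample(feature, p0[0] + gy + oy, p0[1] + gx + ox)
    return out
```

With every offset set to (0, 0), the function reduces to a standard 3x3 convolution at `p0`.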

The offsets are learned from preceding feature maps.

The offsets are obtained by applying a convolutional layer over the same input.

The offset-predicting kernel has the same spatial resolution and dilation as the deformable convolution kernel.

The output offset field has the same spatial resolution as the input feature map and 2N channels, one 2D offset for each of the N kernel sampling locations (N = 9 for a 3x3 kernel).
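
The channel layout can be illustrated with a short sketch. It assumes the offset field is stored as nested lists indexed `[channel][row][col]` and that the 2N channels are interleaved as (dy_0, dx_0, dy_1, dx_1, ...); the interleaving order is an assumption for illustration, and real implementations may order the channels differently. Both function names are hypothetical.

```python
def offset_field_shape(height, width, kernel_h=3, kernel_w=3):
    """Shape (channels, height, width) of the offset field for a given kernel size."""
    n = kernel_h * kernel_w  # N sampling locations in the kernel
    return (2 * n, height, width)

def offsets_at(offset_field, y, x):
    """Return the N (dy, dx) offsets predicted for output location (y, x).

    Assumes interleaved channels: (dy_0, dx_0, dy_1, dx_1, ...).
    """
    channels = [plane[y][x] for plane in offset_field]
    return list(zip(channels[0::2], channels[1::2]))
```

For a 3x3 kernel this gives 18 offset channels, which unpack into 9 two-dimensional offsets per spatial location.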

The convolutional kernels that produce the output features and the offsets are learned simultaneously during training.

Deformable ROI Pooling

ROI pooling converts an arbitrarily sized rectangular input region into fixed-size features.

Similar to deformable convolutions, offsets are added to spatial binning positions.

Offsets are obtained in three steps: first, ROI pooling generates the pooled feature maps; then a fully connected layer predicts normalized offsets from them; finally, the normalized offsets are converted into real offsets by scaling them with the ROI’s width and height.

Offset normalization allows offsets to be independent of ROI size.
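
The scaling step can be sketched as follows. The sketch assumes normalized offsets are given as one (dx, dy) pair per bin and includes a scalar `gamma` that damps the offset magnitude; the paper sets this scalar to 0.1, but the function name and argument layout here are hypothetical.

```python
def roi_bin_offsets(normalized_offsets, roi_width, roi_height, gamma=0.1):
    """Turn per-bin normalized offsets (dx_hat, dy_hat) into real offsets.

    Each normalized offset is scaled by the ROI's width and height, so the
    same normalized value moves a bin proportionally further in a larger ROI.
    """
    return [
        (gamma * dx_hat * roi_width, gamma * dy_hat * roi_height)
        for dx_hat, dy_hat in normalized_offsets
    ]
```

The same normalized offset of (0.5, -1.0) produces twice the real displacement in an 80x40 ROI as in a 40x20 ROI, which is what makes the learned quantity independent of ROI size.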

Results

Accuracy improves with more deformable convolutional layers.

Effective dilation, the mean distance between all adjacent pairs of sampling locations in a filter, is a rough measure of the filter's receptive field size.
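
A minimal sketch of how such a measure could be computed for one 3x3 filter, assuming effective dilation is the mean distance between adjacent pairs of sampling locations, and assuming "adjacent" means horizontal and vertical neighbours in the kernel grid (that reading is an assumption). The function name is hypothetical.

```python
import math

def effective_dilation(offsets, dilation=1):
    """Mean distance between neighbouring sampling locations of a 3x3 filter.

    offsets: 9 (dy, dx) pairs in row-major grid order.
    """
    grid = [(dy * dilation, dx * dilation) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    points = [(gy + oy, gx + ox) for (gy, gx), (oy, ox) in zip(grid, offsets)]
    distances = []
    for r in range(3):
        for c in range(3):
            i = r * 3 + c
            if c + 1 < 3:  # right-hand neighbour
                j = i + 1
                distances.append(math.hypot(points[i][0] - points[j][0],
                                            points[i][1] - points[j][1]))
            if r + 1 < 3:  # neighbour below
                j = i + 3
                distances.append(math.hypot(points[i][0] - points[j][0],
                                            points[i][1] - points[j][1]))
    return sum(distances) / len(distances)
```

With zero offsets the result equals the kernel's nominal dilation; offsets that spread the sampling points outward increase it.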

The receptive field sizes of deformable convolutions are correlated with object sizes.

TL;DR