Convolution (2D)

A 2D convolution layer is a special case of a “convolution”-like operation. In a convolution, each window of kernel size \(K\) is fed through a learned linear transformation (a fully connected layer), and all windows share the same linear transformation (the same weights). This weight sharing greatly reduces the overall number of parameters and makes convolution a very compact linear layer even for large input and output tensors.
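To make “compact” concrete: assuming PyTorch's nn.Conv2d (used in the usage section below), a 3×3 convolution mapping 16 input channels to 33 output channels has \(33 \cdot 16 \cdot 3 \cdot 3 + 33 = 4785\) parameters, independent of the spatial size of the input:

>>> import torch.nn as nn
>>> sum(p.numel() for p in nn.Conv2d(16, 33, 3).parameters())
4785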

The input of a convolution is a tensor of size \((N, C_{in}, H_{in}, W_{in})\) and the output is a tensor of size \((N, C_{out}, H_{out}, W_{out})\), where \(N\) is the batch size, \(C\) the number of channels, and \(H, W\) the height and width of the input. Convolution is the basic building block of the popular Convolutional Neural Network (CNN) architecture that is widely used in computer vision tasks.
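For reference, with padding \(P\), stride \(S\), and kernel size \(K\) along a spatial dimension (and no dilation), the output size follows the standard formula \[ H_{out} = \left\lfloor \frac{H_{in} + 2P - K}{S} \right\rfloor + 1, \] and analogously for \(W_{out}\).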

Mathematically, the convolution is defined as follows. Let \(W_{ij} \in \mathbb{R}^{K\times K}\) be the convolution filter connecting input channel \(i\) to output channel \(j\), let \(I_i\) be the input in the \(i\)-th channel, and let \(O_j\) be the output in the \(j\)-th channel. Then \[ O_j = \sum_{i=1}^{C_{in}} W_{ij} * I_{i} \] where \(*\) is the 2D convolution operation.
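As a minimal sketch of this formula (assuming no padding, stride 1, and the cross-correlation convention used by deep learning libraries, i.e. the filter is not flipped), the channel-wise sum can be written with plain loops; the function name and NumPy usage here are illustrative:

import numpy as np

def conv2d_naive(I, W):
    # I: (C_in, H, W) input; W: (C_out, C_in, K, K) filters.
    # Returns O of shape (C_out, H-K+1, W-K+1) with O_j = sum_i W_ij * I_i.
    C_out, C_in, K, _ = W.shape
    _, H, W_in = I.shape
    O = np.zeros((C_out, H - K + 1, W_in - K + 1))
    for j in range(C_out):                  # each output channel
        for i in range(C_in):               # sum over input channels
            for y in range(H - K + 1):      # slide the K x K window
                for x in range(W_in - K + 1):
                    # the same weights W[j, i] are applied at every location
                    O[j, y, x] += np.sum(W[j, i] * I[i, y:y+K, x:x+K])
    return O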

[Figure: 3D Convolution Animation. Credit: Michael Plotke]

Note that the convolution layer is a special case of the fully-connected layer in which a prior on the architecture is imposed. 2D convolution exploits the local spatial structure of its inputs, and is therefore powerful when processing inputs such as images or Gomoku game states.
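The “special case of a fully-connected layer” view can be checked directly: extracting every window with torch.nn.functional.unfold and applying one shared linear map reproduces the convolution exactly. A small sketch, with shapes chosen purely for illustration:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)           # (N, C_in, H, W)
w = torch.randn(5, 3, 3, 3)           # (C_out, C_in, K, K)

ref = F.conv2d(x, w)                  # (1, 5, 6, 6)

# Every 3x3 window becomes a column; the flattened filters act as
# one fully-connected layer applied to all windows at once.
patches = F.unfold(x, kernel_size=3)  # (1, C_in*K*K, 36)
out = (w.view(5, -1) @ patches).view(1, 5, 6, 6)

print(torch.allclose(ref, out, atol=1e-6))  # True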

PyTorch Usage

>>> import torch
>>> import torch.nn as nn
>>> # square 3x3 kernel with stride 2
>>> m = nn.Conv2d(16, 33, 3, stride=2)
>>> # non-square kernel with unequal stride and padding
>>> m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
>>> input = torch.randn(20, 16, 50, 100)
>>> output = m(input)
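Plugging the second layer's hyperparameters into the output-size formula above gives \(H_{out} = \lfloor (50 + 2\cdot4 - 3)/2 \rfloor + 1 = 28\) and \(W_{out} = \lfloor (100 + 2\cdot2 - 5)/1 \rfloor + 1 = 100\), which can be verified directly:

>>> output.shape
torch.Size([20, 33, 28, 100])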