Convolutional-like operations move a sliding window over an input tensor of size (N, Cin, Hin, Win), where \(N\) is the batch size, \(C\) the number of channels and \(H, W\) the height and width of the input. In each sliding window, we apply the same operation, for example a linear transformation (conv2d) or a channel-wise maximum (max pooling). The output of a convolutional-like operation is a tensor of size (N, Cout, Hout, Wout).
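As a minimal sketch of the sliding-window idea, the snippet below applies one operation to every window of a single-channel 2D input, ignoring the batch and channel dimensions. The helper name `sliding_window_2d` is hypothetical, and the averaging operation is only a stand-in for the learned linear map of conv2d.

```python
def sliding_window_2d(x, k, op):
    """Apply `op` to every k x k window of the 2D list `x`.

    Illustrative helper: stride 1, no padding, single channel.
    """
    h, w = len(x), len(x[0])
    out = []
    for i in range(h - k + 1):          # top row of the window
        row = []
        for j in range(w - k + 1):      # left column of the window
            window = [x[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append(op(window))
        out.append(row)
    return out


x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

# Max pooling: take the maximum of each window.
max_pooled = sliding_window_2d(x, 2, max)

# A linear map with fixed weights 1/4 (conv2d would learn its weights instead).
averaged = sliding_window_2d(x, 2, lambda win: sum(win) / len(win))

print(max_pooled)  # [[5, 6], [8, 9]]
print(averaged)    # [[3.0, 4.0], [6.0, 7.0]]
```

Only `op` changes between the two calls; the window traversal is shared, which is what makes these operations "convolutional-like".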
All convolutional-like operations share a set of common hyper-parameters:
- The size of the sliding window \(K\)
- The spacing between sliding windows, known as stride
- The position of the top-left window, as controlled by padding.
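Together, these three hyper-parameters determine the output size along each spatial dimension. The sketch below uses the standard formula \(\lfloor (H + 2p - K) / s \rfloor + 1\), assuming the same padding \(p\) on both sides and no dilation; the function name `output_size` and the symbols \(p\) and \(s\) are illustrative and not taken from the text.

```python
def output_size(in_size, kernel, stride=1, padding=0):
    """Number of window positions along one spatial dimension.

    Assumes symmetric padding (the same amount on both sides) and no dilation.
    """
    return (in_size + 2 * padding - kernel) // stride + 1


print(output_size(32, kernel=3))             # 30
print(output_size(32, kernel=3, padding=1))  # 32
print(output_size(32, kernel=2, stride=2))   # 16
print(output_size(7, kernel=3, stride=2))    # 3
```

The same function applies to height and width independently, so a 32 x 32 input with a 3 x 3 window and padding 1 keeps its 32 x 32 spatial size.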
Striding is a parameter used in both the conv2d layers and the pooling layers. It controls how far the sliding window jumps between consecutive outputs. Striding naturally down-samples the output tensor by skipping certain input locations. Increase the stride when you want a smaller overlap between neighboring windows and smaller output sizes. Typically, the strides along the height and width dimensions are set to be equal, \(sH = sW\).
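To make the effect of the stride concrete, the following sketch lists where each window starts along one dimension of an unpadded input. The helper `window_starts` is hypothetical and exists only for illustration.

```python
def window_starts(in_size, kernel, stride):
    """Start index of every window along one unpadded dimension."""
    return list(range(0, in_size - kernel + 1, stride))


in_size, kernel = 8, 3
for stride in (1, 2, 3):
    starts = window_starts(in_size, kernel, stride)
    overlap = max(kernel - stride, 0)  # input positions shared by neighbors
    print(stride, starts, overlap)
# 1 [0, 1, 2, 3, 4, 5] 2
# 2 [0, 2, 4] 1
# 3 [0, 3] 0
```

A larger stride gives fewer windows, hence a smaller output, and less overlap between neighbors. Once the stride reaches the window size, neighboring windows no longer share any input location.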
Padding adds rows or columns of a constant value c, usually zero, to any side of the input tensor. Padding the input tensor virtually shifts the position of the top-left sliding window. It can also increase the output dimension by allowing for more windows.
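A minimal sketch of constant padding on a single-channel 2D input is shown below. It assumes the same amount of padding `p` on all four sides, which is only one of the possible choices, and the helper name `pad_2d` is hypothetical.

```python
def pad_2d(x, p, c=0):
    """Surround the 2D list `x` with `p` rows and columns of the constant `c`."""
    width = len(x[0]) + 2 * p
    top = [[c] * width for _ in range(p)]
    middle = [[c] * p + list(row) + [c] * p for row in x]
    bottom = [[c] * width for _ in range(p)]
    return top + middle + bottom


x = [[1, 2],
     [3, 4]]
padded = pad_2d(x, 1)
for row in padded:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]

# Number of 2 x 2 window positions per dimension at stride 1.
k = 2
print(len(x) - k + 1, len(padded) - k + 1)  # 1 3
```

After padding, the top-left 2 x 2 window covers three padded zeros and the original value 1, so it has effectively moved up and to the left of the unpadded input. The number of window positions grows from 1 to 3 per dimension.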