author: zhaoyue-zephyrus
score: 10 / 10

The main idea of this paper is to increase network depth by stacking 3x3 convolution layers from beginning to end.
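To see why small filters help, here is a minimal PyTorch sketch (illustrative, not the paper's implementation, which used Caffe): three stacked 3x3 convolutions cover the same 7x7 effective receptive field as a single 7x7 convolution, but with 27C² instead of 49C² weights and three non-linearities instead of one.

```python
import torch
import torch.nn as nn

C = 256  # example channel width; any value works

# A single 7x7 convolution: 7*7*C*C = 49*C^2 weights (plus biases).
single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3)

# Three stacked 3x3 convolutions have the same 7x7 receptive field
# with 3*(3*3*C*C) = 27*C^2 weights and three ReLUs.
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

x = torch.randn(1, C, 56, 56)
assert single_7x7(x).shape == stacked_3x3(x).shape  # same spatial coverage

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(single_7x7), n_params(stacked_3x3))  # ~49C^2 vs ~27C^2
```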

The final VGG-Net comes mainly in two versions, VGG-16 and VGG-19, where 16 and 19 refer to the total number of weight layers (conv + fc).
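For reference, a sketch of the VGG-16 convolutional stack (configuration D in Table 1 of the paper), written in PyTorch in the style of the common torchvision builder:

```python
import torch.nn as nn

# VGG-16 = configuration D: 13 conv layers + 3 fc layers = 16 weight layers.
# Numbers are output channels; 'M' marks a 2x2 max-pooling layer.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

features = make_features(VGG16_CFG)  # followed by fc-4096, fc-4096, fc-1000
```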

The paper reveals the importance of weight initialization. For random initialization with a normal distribution, the authors first train a shallower version (8 conv layers + 3 fc layers, configuration A) and then train a deeper one by initializing its first 4 conv layers and last 3 fc layers from the pre-trained shallow model. This is not needed if Glorot's initialization is used. This observation raised research interest in improving network initialization, such as Kaiming's initialization in 2015.
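A hedged sketch of both options in PyTorch (the `deep_net` stand-in and helper names are illustrative, not from the paper):

```python
import torch.nn as nn

# Illustrative stand-in for a deep VGG-style net (not the full 16 layers).
deep_net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 1000),
)

def glorot_init(m):
    # Glorot/Xavier scales the weight variance by fan-in/fan-out, keeping
    # activations stable so the deep net trains without staged pre-training.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

deep_net.apply(glorot_init)

# The paper's original recipe instead copies the first four conv layers and
# all three fc layers from the pre-trained shallow net (config A) and samples
# the remaining weights from a zero-mean normal distribution, e.g.:
# deep_conv.weight.data.copy_(shallow_conv.weight.data)
```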

The paper describes multi-scale training and testing and verifies their effectiveness via extensive experiments.
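Training-time scale jittering amounts to sampling the shortest image side S uniformly from [S_min, S_max] = [256, 512] before cropping. A sketch with torchvision transforms (illustrative; the paper's pipeline was implemented in Caffe):

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as F

S_MIN, S_MAX = 256, 512  # training-scale range for the shortest image side

def random_rescale(img):
    # Sample the training scale S uniformly and resize isotropically.
    s = random.randint(S_MIN, S_MAX)
    return F.resize(img, s)

train_transform = transforms.Compose([
    transforms.Lambda(random_rescale),
    transforms.RandomCrop(224),            # fixed 224x224 crop fed to the net
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# At test time, the paper evaluates at several fixed scales Q and
# averages the resulting class posteriors.
```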

In the appendix, the authors also show that VGG features generalize to other datasets. To do so, the last classification layer is removed and the 4096-d feature from the penultimate fc layer is extracted, aggregated across multiple locations and scales, L2-normalized, and fed into a linear SVM classifier. This yields comparable or superior results on a variety of tasks, from recognition and object detection to semantic segmentation.
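A hedged sketch of this transfer recipe, assuming torchvision's pretrained VGG-16 and scikit-learn's LinearSVC (names and the simple mean-aggregation are illustrative):

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
# Drop the last classification layer; keep the 4096-d penultimate fc output.
head = nn.Sequential(*list(vgg.classifier.children())[:-1])

@torch.no_grad()
def image_descriptor(crops):
    """crops: (N, 3, 224, 224) tensor of crops at multiple scales/positions."""
    x = vgg.avgpool(vgg.features(crops)).flatten(1)
    feats = head(x)               # (N, 4096)
    return feats.mean(0).numpy()  # aggregate across locations and scales

# One 4096-d descriptor per image, L2-normalized row-wise, then a linear SVM:
# X = normalize(np.stack([image_descriptor(c) for c in all_crops]))
# clf = LinearSVC().fit(X, y)
```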

TL;DR: Stacking small 3x3 convolutions up to 16-19 weight layers, combined with careful initialization and multi-scale training/testing, brings large gains on ImageNet and produces features that transfer well to other tasks.