You Only Look Once: Unified, Real-Time Object Detection, Redmon, Divvala, Girshick, Farhadi; 2015 - Summary
author: WhyToFly
score: 7 / 10


What is the core idea?

While other object detection approaches repurpose classifiers and rely on multi-stage pipelines, YOLO uses a single CNN that looks at the entire image once and directly predicts bounding boxes and class probabilities.

Inference runs in real time (45 FPS) on a Titan X GPU. A smaller version of the network (Fast YOLO) even reaches 155 FPS, at lower accuracy.

How is it realized (technically)?


YOLO consists of 24 convolutional layers (9 for the small model) followed by 2 fully connected layers.

The inputs are images resized to a 448x448 resolution.

The image is divided into an SxS grid of cells (the paper uses S=7), and the output of the model contains one prediction per cell.

Each cell predicts B boxes (B=2 for the experiments in the paper), each consisting of the x and y position of the box center relative to the cell, the width and height relative to the whole image, and a confidence for the box.

Each cell also predicts one set of class probabilities, shared by its B boxes and conditioned on the cell containing an object (-> if there is an object here, is it of class x?).
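
As a rough illustration of this output layout, the sketch below decodes the prediction vector of one grid cell into image-relative boxes. It assumes the values of a cell are ordered as B groups of (x, y, w, h, confidence) followed by C class probabilities; that ordering and the function names are assumptions made for illustration, not taken from the authors' code. With S=7, B=2 and the C=20 PASCAL VOC classes, the full output has 7 x 7 x (2*5 + 20) = 1470 values.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)


def output_size(s=S, b=B, c=C):
    """Number of values in the S x S x (B*5 + C) output tensor."""
    return s * s * (b * 5 + c)


def decode_cell(cell, row, col, s=S, b=B, c=C):
    """Turn one cell's prediction vector into image-relative boxes.

    Assumed (hypothetical) layout: [x, y, w, h, conf] * B, then C class probs.
    Returns a list of (cx, cy, w, h, class_id, score) in [0, 1] image units.
    """
    assert len(cell) == b * 5 + c
    class_probs = cell[b * 5:]
    best_class = max(range(c), key=lambda k: class_probs[k])
    boxes = []
    for j in range(b):
        x, y, w, h, conf = cell[j * 5:(j + 1) * 5]
        cx = (col + x) / s  # x and y are offsets inside the cell
        cy = (row + y) / s
        # Class-specific score: box confidence times conditional class prob.
        score = conf * class_probs[best_class]
        boxes.append((cx, cy, w, h, best_class, score))
    return boxes
```

In a full detector, the boxes from all cells would then be thresholded by score and filtered for duplicates.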


To pretrain the network, only the first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer, are trained on ImageNet for classification.

Then the remaining layers (four convolutional and two fully connected layers) are added and the network is trained for object detection.

The loss is specifically optimized for this task:

loss function
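
Written out, the multi-part sum-squared-error loss from the paper looks as follows. The symbol for the total loss is added here for readability. The indicator 1_ij^obj is 1 when box predictor j in cell i is responsible for an object, 1_ij^noobj is its complement, and 1_i^obj is 1 when an object appears in cell i. The paper sets lambda_coord = 5 and lambda_noobj = 0.5, so localization errors are weighted up and confidence errors in empty cells are weighted down. The square roots on width and height make the same absolute deviation count more for small boxes than for large ones.

```latex
\begin{align*}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{align*}
```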

The network is trained on the training and validation sets of PASCAL VOC 2007 and 2012 using SGD with a scheduled learning rate, dropout, and extensive data augmentation.
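
The paper describes the schedule as a slow ramp from 1e-3 to 1e-2 over the first epochs, followed by 75 epochs at 1e-2, 30 epochs at 1e-3, and 30 epochs at 1e-4. A minimal sketch of such a schedule is shown below. The length of the warm-up, its linear shape, and counting the 75 epochs after the warm-up are assumptions, since the paper does not spell them out.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a 0-based epoch index.

    warmup_epochs and the linear ramp are assumptions; the paper only says
    the rate is raised slowly from 1e-3 to 1e-2 during the first epochs.
    """
    if epoch < warmup_epochs:
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    epoch -= warmup_epochs
    if epoch < 75:
        return 1e-2  # main training phase
    if epoch < 75 + 30:
        return 1e-3  # first decay step
    return 1e-4      # final phase
```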

How well does the paper perform?

YOLO is somewhat less accurate than the best detectors of the time (such as Fast R-CNN and Faster R-CNN) but is notably faster than all of them because of its simple single-network architecture.

It struggles with

The following table shows results on the PASCAL VOC 2007 dataset (higher is better) for both real-time and non real-time systems:

What interesting variants are explored?

An error comparison between Fast R-CNN and YOLO reveals that YOLO makes more localization errors while Fast R-CNN struggles with background errors (recognizing the background of an image as an object).
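
The error categories used in this comparison can be sketched as a small function. The thresholds follow the paper's error analysis (correct: right class and IOU above 0.5; localization: right class and IOU between 0.1 and 0.5; background: IOU below 0.1 with any object). The function name, its arguments, and the handling of values exactly on a threshold are assumptions made for illustration.

```python
def error_type(iou, same_class, similar_class=False):
    """Classify one detection by its best overlap with a ground-truth object.

    iou: overlap with the matched ground-truth box.
    same_class / similar_class: whether the predicted class is right or
    merely similar. Behavior exactly at 0.1 and 0.5 is an assumption.
    """
    if iou < 0.1:
        return "background"
    if same_class:
        return "correct" if iou > 0.5 else "localization"
    return "similar" if similar_class else "other"
```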

The authors therefore combine the two methods: both predict objects, and for each detection Fast R-CNN makes, they check whether YOLO predicts a similar box.

If it does, the score of that detection is boosted based on the overlap between the two boxes and the confidence of the YOLO prediction.
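
A minimal sketch of this rescoring idea is given below. The summary above only states that the boost depends on the overlap and the YOLO confidence, so the overlap threshold, the additive boost formula, and the detection format (box, class_id, score) are placeholders chosen for illustration, not the authors' rule.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rescore(rcnn_dets, yolo_dets, min_iou=0.5):
    """Boost Fast R-CNN scores that a YOLO detection agrees with.

    Each detection is (box, class_id, score). The threshold min_iou and the
    boost overlap * yolo_score are hypothetical choices.
    """
    out = []
    for box, cls, score in rcnn_dets:
        boost = 0.0
        for ybox, ycls, yscore in yolo_dets:
            overlap = iou(box, ybox)
            if ycls == cls and overlap >= min_iou:
                boost = max(boost, overlap * yscore)
        out.append((box, cls, score + boost))
    return out
```

Detections that YOLO does not confirm keep their original score, so they fall in the ranking relative to confirmed ones.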

This successfully reduces background errors.

However, the speed of the combined system is now limited by Fast R-CNN, since the input has to be passed through both Fast R-CNN and YOLO.