Cascade R-CNN: Delving into High Quality Object Detection, Cai, Vasconcelos; 2017  Summary
author:  mmcinnestaylor 
score:  8 / 10 
Core Idea
 In object detection, an intersection over union (IoU) threshold sets the minimum overlap a predicted bounding box must have with the ground truth bounding box in order to be considered a positive sample.
 Cascade R-CNN uses a sequence of detectors trained with increasing IoU thresholds, so that each stage is more selective against close false positives. The output of the lower quality detector at one stage is used as the input distribution for the higher quality detector at the next stage.
 Because each stage is trained on the resampled output of the previous one, every detector still sees enough positive samples, which helps reduce overfitting during training. Applying the same cascade at inference leads to closer matches between the quality of the hypotheses and the quality each detector was trained for.
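 A minimal sketch of the IoU test described above, in plain Python. The box format `(x1, y1, x2, y2)`, the helper names `iou` and `label_proposals`, the toy boxes, and the thresholds 0.5, 0.6, 0.7 are illustrative assumptions; the summary does not state the thresholds used at each stage.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_box, threshold):
    """Label a proposal 1 (positive) if its IoU with gt_box meets the threshold, else 0."""
    return [1 if iou(p, gt_box) >= threshold else 0 for p in proposals]


gt = (0.0, 0.0, 10.0, 10.0)
proposals = [
    (0.0, 0.0, 10.0, 5.5),  # IoU 0.55 with gt
    (0.0, 0.0, 10.0, 6.5),  # IoU 0.65 with gt
    (0.0, 0.0, 10.0, 8.0),  # IoU 0.80 with gt
]
for u in (0.5, 0.6, 0.7):  # assumed example thresholds
    print(u, label_proposals(proposals, gt, u))
```

 With these toy boxes, the same three proposals give three, two, and one positive as the threshold rises, which is the selectivity effect the cascade relies on.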
Technical Implementation
 The architecture is framed as a cascaded regression problem:

\[f(x,\textbf{b}) = f_T\circ f_{T-1}\circ\cdots\circ f_1(x,\textbf{b})\]
 Here, \(T\) is the total number of cascade stages.
 Each regressor \(f_t\) is optimized w.r.t. the sample distribution \(\textbf{b}^t\) that arrives at stage \(t\), instead of the initial distribution \(\textbf{b}^1\).
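 The composition of stage regressors can be sketched as a simple loop, where each stage receives the box produced by the previous one. `make_toy_regressor` is a hypothetical stand-in for a learned \(f_t\): it moves the box part of the way toward a known target, whereas a real regressor predicts the refinement from the features \(x\) and never sees the ground truth.

```python
def make_toy_regressor(target_box, step):
    """Hypothetical stand-in for a learned stage regressor f_t.

    Moves every coordinate a fraction `step` of the way toward target_box.
    The features x are accepted but unused in this toy version.
    """
    def f_t(x, box):
        return tuple(b + step * (g - b) for b, g in zip(box, target_box))
    return f_t


def cascade(regressors, x, box):
    """Apply f_1, then f_2, ..., then f_T, i.e. f_T o ... o f_1 (x, b).

    Returns the list of boxes [b^1, b^2, ..., b^(T+1)] so that the input
    seen by each stage can be inspected.
    """
    boxes = [box]
    for f_t in regressors:
        box = f_t(x, box)  # output of stage t becomes the input of stage t + 1
        boxes.append(box)
    return boxes


target = (0.0, 0.0, 10.0, 10.0)
stages = [make_toy_regressor(target, 0.5) for _ in range(3)]  # T = 3, assumed
for t, b in enumerate(cascade(stages, None, (2.0, 2.0, 12.0, 12.0)), start=1):
    print(t, b)
```

 The point of the sketch is only the data flow: stage \(t\) is applied to \(\textbf{b}^t\), the output of stage \(t-1\), not to the initial proposal.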

 Each stage \(t\in\{1,\dots,T\}\) includes a classifier \(h_t\) and a regressor \(f_t\), both optimized for an IoU threshold \(u^t\), where \(u^t>u^{t-1}\).
 The above is achieved by minimizing the following loss:

\[L(x^t, g) = L_{cls}(h_t(x^t),y^t) + \lambda[y^t\geq 1]L_{loc}(f_t(x^t,\textbf{b}^t),\textbf{g})\]
 \[\textbf{b}^t = f_{t-1}(x^{t-1},\textbf{b}^{t-1})\]
 Here \(\textbf{b}^t\) is the output of the previous stage's regressor, \(g\) is the ground truth object for \(x^t\), \(\lambda = 1\) is the tradeoff coefficient, \([\cdot]\) is the indicator function, and \(y^t\) is the label of \(x^t\) given \(u^t\).
 The loss is the sum of two terms related to separate tasks in object detection.
 The first term is related to the classification task, while the second term is related to the bounding box regression task. The indicator \([y^t\geq 1]\) switches the second term off for background samples.
 The tradeoff coefficient \(\lambda\) weights the contribution of the regression term relative to the classification term.
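 A minimal sketch of the per-stage loss for a single sample, assuming cross-entropy for \(L_{cls}\) and smooth L1 for \(L_{loc}\); these choices and all function names are assumptions for illustration, since the summary does not specify either term. Label 0 is taken to mean background, and the regression is applied to raw box coordinates here for simplicity.

```python
import math


def cross_entropy(class_probs, label):
    """L_cls: negative log-probability assigned to the true class."""
    return -math.log(class_probs[label])


def smooth_l1(pred_box, gt_box):
    """L_loc: smooth L1 distance summed over the four box coordinates."""
    total = 0.0
    for p, g in zip(pred_box, gt_box):
        d = abs(p - g)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def stage_loss(class_probs, label, pred_box, gt_box, lam=1.0):
    """L = L_cls + lam * [label >= 1] * L_loc for one sample at one stage."""
    cls = cross_entropy(class_probs, label)
    if label >= 1:  # indicator: only non-background samples are regressed
        return cls + lam * smooth_l1(pred_box, gt_box)
    return cls


# Foreground sample: both terms contribute.
print(stage_loss([0.5, 0.5], 1, (0.0, 0.0, 10.0, 10.5), (0.0, 0.0, 10.0, 10.0)))
# Background sample: only the classification term contributes.
print(stage_loss([0.5, 0.5], 0, (0.0, 0.0, 10.0, 10.5), (0.0, 0.0, 10.0, 10.0)))
```

 In the cascade, `label` would be assigned per stage by comparing the sample's IoU with \(u^t\), so the same box can be a positive at an early stage and a negative at a later one.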

Variants
 Experiments were performed implementing the architecture while varying the number of cascade stages. The results are summarized below.
 1~n denotes an ensemble of n classifiers applied to the proposals at a given stage.
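 One simple way to form such an ensemble is to average the class probabilities that the n classifiers assign to the same proposal. Averaging is an assumption made for this sketch, as the summary does not say how the classifier outputs are combined, and `ensemble_scores` is a hypothetical helper name.

```python
def ensemble_scores(per_classifier_probs):
    """Average class probabilities from n classifiers for a single proposal.

    per_classifier_probs: list of n lists, each holding one probability per class.
    """
    n = len(per_classifier_probs)
    num_classes = len(per_classifier_probs[0])
    return [
        sum(probs[c] for probs in per_classifier_probs) / n
        for c in range(num_classes)
    ]


# Two classifiers scoring one proposal over (background, object).
print(ensemble_scores([[0.2, 0.8], [0.4, 0.6]]))
```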
Results
 The first table compares results of multiple state-of-the-art single-model object detectors.
 The first block contains one-stage detectors, the second block two-stage detectors, and the third block multi-stage detectors.
 Cascade R-CNN outperformed all other detectors in this comparison.
 The second table summarizes the cascade procedure applied to different detection architectures. The consistent improvement in detection performance led the authors to suggest that Cascade R-CNN has strong generalization capacity.
 The authors note that since the computational cost of a detection head is usually small compared to the Region Proposal Network, the computational overhead of the cascade procedure is small during both training and testing.
TL;DR
 Cascade R-CNN uses a series of connected stages with increasing IoU thresholds to achieve state-of-the-art detection performance.
 Experiments demonstrate that the detection performance comes at a small increase in computational cost.
 Experiments also demonstrate that the architecture is effective across a number of different backbones.