Author: Michael Brenner

## Introduction

Object detection is an important and complex task in computer vision. It is commonly approached with multi-stage pipelines, which are slow and inelegant. The task is complex because detection requires accurate localization of objects, which creates two challenges. First, numerous candidate object locations ("proposals") must be processed. Second, the proposals provide only a rough localization, which must be refined to obtain a precise one.

Fast R-CNN is a single-stage training algorithm that jointly learns to classify object proposals and refine their localization. (2)

## Architecture

Figure 1. Architecture of Fast R-CNN. (1)

The input to Fast R-CNN is an image and a set of regions of interest (RoIs). The network processes the whole image with several convolutional and max pooling layers to produce a feature map.
Typically there are about 2,000 RoIs per image, which are determined by proposal methods such as Selective Search (4). The RoI pooling layer extracts a fixed-length feature vector from the feature map for each RoI. Each vector feeds into a sequence of fully connected layers (FCs), which produce two output vectors per RoI:

1. A vector of class probability estimates, produced by a softmax function.

2. A four-tuple of bounding-box regression values that refines the position of the RoI. Each RoI is itself defined by a four-tuple $(r, c, h, w)$, where $(r, c)$ specifies the top-left corner and $(h, w)$ the height and width of the window. (2)(1)
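The RoI pooling step described above can be illustrated with a minimal sketch in plain Python. It assumes a single-channel feature map stored as a list of lists and an RoI $(r, c, h, w)$ already given in feature-map coordinates. The function name `roi_max_pool` and the floor/ceil rule for bin boundaries are illustrative choices, not the reference implementation.

```python
import math

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2D feature map into a fixed out_h x out_w grid."""
    r, c, h, w = roi  # top-left corner (r, c), height h, width w
    pooled = []
    for ph in range(out_h):
        # Rows covered by output row ph; floor/ceil keeps each bin non-empty.
        r0 = r + math.floor(ph * h / out_h)
        r1 = r + math.ceil((ph + 1) * h / out_h)
        row = []
        for pw in range(out_w):
            c0 = c + math.floor(pw * w / out_w)
            c1 = c + math.ceil((pw + 1) * w / out_w)
            row.append(max(feature_map[i][j]
                           for i in range(r0, r1)
                           for j in range(c0, c1)))
        pooled.append(row)
    return pooled

feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
whole = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)  # RoI covering the full map
part = roi_max_pool(feature_map, (1, 1, 3, 3), 2, 2)   # smaller RoI, same output size
```

Both calls return a 2 x 2 grid even though the RoIs differ in size, which is what allows the pooled features to feed fully connected layers of fixed input length.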

## Training

A major improvement of Fast R-CNN is that it takes advantage of feature sharing during training. Stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. RoIs from the same image share computation and memory in the forward and backward passes, so choosing a small N decreases the computational cost of a mini-batch. Good results are achieved with N=2 and R=128, using fewer SGD iterations than R-CNN. Fast R-CNN also uses a streamlined training process that jointly optimizes the softmax classifier and the bounding-box regressors.
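The hierarchical sampling scheme can be sketched as follows. This is a simplified illustration: the function name `sample_minibatch`, the dictionary layout (image id mapped to a list of RoIs) and the uniform sampling of RoIs are assumptions made for the example, and every image is assumed to offer at least R/N RoIs.

```python
import random

def sample_minibatch(rois_per_image, n_images=2, batch_rois=128, seed=None):
    """Sample N images first, then R/N RoIs from each sampled image."""
    rng = random.Random(seed)
    images = rng.sample(sorted(rois_per_image), n_images)
    per_image = batch_rois // n_images  # R/N RoIs per image
    batch = []
    for img in images:
        chosen = rng.sample(rois_per_image[img], per_image)
        batch.extend((img, roi) for roi in chosen)
    return batch

# Hypothetical dataset: 10 images with 2000 placeholder RoIs each.
dataset = {f"img{i}": [(i, j) for j in range(2000)] for i in range(10)}
batch = sample_minibatch(dataset, n_images=2, batch_rois=128, seed=0)
images_in_batch = {img for img, _ in batch}
```

With N=2 and R=128, the 128 RoIs in the batch come from only two images, so the convolutional feature map has to be computed twice per mini-batch rather than once per RoI.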

The network has two outputs. The first is a probability distribution (for each RoI), $p = (p_0, \ldots, p_K)$, over $K+1$ classes, computed by a softmax classifier. The second output is the bounding-box regression, $t = (t_x, t_y, t_w, t_h)$, for each of the classes. Each training RoI is labelled with a class $u$ and a bounding-box regression target $v$. A multi-task loss $L$ is used to jointly train for classification and bounding-box regression:

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v),$$

in which $L_{cls}(p, u) = -\log(p_u)$ is the log loss for the true class $u$. $L_{loc}$ is defined over a tuple of the true bounding-box regression targets for class $u$, $v = (v_x, v_y, v_w, v_h)$, and a predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class $u$. The bounding-box regression loss is defined as

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(t_i^u - v_i),$$

in which $\mathrm{smooth}_{L1}(x)$ is $0.5x^2$ if $|x| < 1$, and $|x| - 0.5$ otherwise. The parameter $\lambda$ is used to balance the two task losses. (2)
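The multi-task loss can be written out for a single RoI as a short sketch. It assumes that $u = 0$ denotes the background class, so that the bracket $[u \geq 1]$ switches the localization term off for background RoIs, and it uses $\lambda = 1$ as a default value. The function names are illustrative.

```python
import math

def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multitask_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls(p, u) + lam * [u >= 1] * L_loc(t_u, v) for one RoI."""
    l_cls = -math.log(p[u])                      # log loss for the true class u
    l_loc = sum(smooth_l1(ti - vi) for ti, vi in zip(t_u, v))
    indicator = 1 if u >= 1 else 0               # assumed: class 0 is background
    return l_cls + lam * indicator * l_loc

# Foreground RoI (u = 1): both terms contribute.
fg = multitask_loss([0.2, 0.8], 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
# Background RoI (u = 0): only the classification term contributes.
bg = multitask_loss([0.5, 0.5], 0, (0.0, 0.0, 0.0, 0.0), (5.0, 5.0, 5.0, 5.0))
```

In the foreground case the localization term is `smooth_l1(0.5) + smooth_l1(2.0) = 0.125 + 1.5`. The second summand grows only linearly with the error, which makes the loss less sensitive to outliers than a purely quadratic loss.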

### Back-Propagation through RoI pooling layers

Back-propagation routes derivatives through the RoI pooling layer. The backwards function of the RoI pooling layer computes the partial derivative of the loss function with respect to each input variable $x_i$ by following the argmax function:

$$\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} [i = i^*(r, j)] \frac{\partial L}{\partial y_{rj}}$$

The argmax function is defined as $i^*(r, j) = \mathrm{argmax}_{i' \in R(r, j)} x_{i'}$, where $R(r, j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. For each mini-batch RoI $r$ and for each pooling output unit $y_{rj}$, the partial derivative $\partial L / \partial y_{rj}$ is accumulated if $i$ is the argmax selected for $y_{rj}$ by max pooling. (2)
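The gradient routing can be illustrated with a small sketch. It assumes the inputs $x$ are stored as a flat list and that the index sets $R(r, j)$ are given explicitly as lists of indices; both function names are hypothetical. The forward pass records the argmax $i^*(r, j)$, and the backward pass adds each $\partial L / \partial y_{rj}$ to the input that was selected.

```python
def roi_pool_forward(x, windows):
    """windows[r][j] is the index set R(r, j). Returns outputs y and argmax i*."""
    y, argmax = [], []
    for roi_windows in windows:
        y_r, a_r = [], []
        for index_set in roi_windows:
            best = max(index_set, key=lambda i: x[i])  # i*(r, j)
            a_r.append(best)
            y_r.append(x[best])
        y.append(y_r)
        argmax.append(a_r)
    return y, argmax

def roi_pool_backward(n_inputs, argmax, grad_y):
    """Accumulate dL/dy_rj into dL/dx_i for every (r, j) with i = i*(r, j)."""
    grad_x = [0.0] * n_inputs
    for r, a_r in enumerate(argmax):
        for j, i_star in enumerate(a_r):
            grad_x[i_star] += grad_y[r][j]
    return grad_x

x = [1.0, 5.0, 3.0, 2.0]
# Two overlapping RoIs: RoI 0 pools {0, 1} and {2, 3}, RoI 1 pools {1, 2}.
windows = [[[0, 1], [2, 3]], [[1, 2]]]
y, argmax = roi_pool_forward(x, windows)
grad_x = roi_pool_backward(len(x), argmax, [[1.0, 2.0], [3.0]])
```

Input 1 is the maximum in a sub-window of both RoIs, so it receives the sum of two upstream gradients. Inputs that were never selected receive zero.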

### SGD hyper-parameters

The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases, together with a global learning rate of 0.001. (2)
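A minimal sketch of these settings follows. It assumes that the per-layer rates act as multipliers on the global learning rate, which gives effective rates of 0.001 for weights and 0.002 for biases, and it uses plain SGD without momentum for simplicity. The standard deviation `std` is left as a parameter, and all names are illustrative.

```python
import random

GLOBAL_LR = 0.001
LR_MULT = {"weight": 1.0, "bias": 2.0}  # assumed to scale the global rate

def init_fc(n_in, n_out, std, seed=0):
    """Zero-mean Gaussian weights and zero biases for one fully connected layer."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def sgd_step(params, grads, kinds):
    """One plain SGD update using the effective rate GLOBAL_LR * LR_MULT[kind]."""
    return [p - GLOBAL_LR * LR_MULT[k] * g
            for p, g, k in zip(params, grads, kinds)]

weights, biases = init_fc(n_in=3, n_out=2, std=0.01)
updated = sgd_step([1.0, 1.0], [10.0, 10.0], ["weight", "bias"])
```

For the same gradient, the bias moves twice as far as the weight in a single step.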

## Conclusion

### Results

Fast R-CNN achieves top results on the PASCAL Visual Object Classes (VOC) challenges of 2007, 2010 and 2012. Table 1 compares three object detectors, all trained on a 16-layer deep network. It shows that Fast R-CNN is faster to train, faster to test and more accurate. These results are a big step towards real-time object detection.

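| | Fast R-CNN | R-CNN (1) | SPP-net (3) |
|---|---|---|---|
| Train time (h) | 9.5 | 84 | 25 |
| Train speedup | 8.8x | 1x | 3.4x |
| Test time/image | 0.32s | 47.0s | 2.3s |
| Test speedup | 146x | 1x | 20x |
| mAP | 66.9% | 66% | 63.1% |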

Table 1. Timings exclude object proposal time, which is equal for all methods. All methods use VGG16 from Simonyan and Zisserman.(2)(3)(1)

Fast R-CNN overcomes many disadvantages of earlier methods and improves on both speed and accuracy. The method has several advantages:

1. Higher detection quality (mAP) than R-CNN (1) and SPP-net (3)
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching (2)

## Literature

1. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.

2. Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.

3. He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." European Conference on Computer Vision (ECCV). 2014.

4. Uijlings, Jasper RR, et al. "Selective search for object recognition." International journal of computer vision 104.2 (2013).