
Author: Michael Brenner

 

Introduction

Object detection is an important and complex task in computer vision. It has commonly been approached with multi-stage pipelines, which are slow and inelegant. The task is complex because detection requires accurate localization of objects, which creates two challenges. First, numerous candidate object locations ("proposals") must be processed. Second, the rough localization of these proposals must be refined to obtain a precise localization.

Fast R-CNN is a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial localization. (2)

 

Architecture

Figure 1. Architecture of Fast R-CNN. (1)

The input to Fast R-CNN is an image and multiple regions of interest (RoIs). The network uses several convolutional and max pooling layers to produce a feature map of the image.
Typically there are about 2,000 RoIs per image, which are determined by proposal methods such as Selective Search (4). For each RoI, the RoI pooling layer extracts a fixed-length feature vector from the feature map. Each vector feeds into a sequence of fully connected layers (FCs). This produces two output vectors for each RoI:

1. A vector of class probability estimates, produced by a softmax function.

2. A four-tuple $(r, c, h, w)$ that defines the RoI: $(r, c)$ specifies the top-left corner and $(h, w)$ the height and width of the window. (2)(1)
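The RoI pooling step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `roi_max_pool`, the single-channel toy feature map, and the 2x2 output grid are all hypothetical choices made here, and the RoI is assumed to be given in feature-map cells as $(r, c, h, w)$.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a 2-D feature map into a fixed out_h x out_w grid.

    roi is (r, c, h, w): top-left row and column, then height and width,
    all measured in feature-map cells (an assumption of this sketch).
    """
    r, c, h, w = roi
    # Bin edges relative to the RoI. Assumes h >= out_h and w >= out_w,
    # so that no bin is empty.
    row_edges = [(i * h) // out_h for i in range(out_h + 1)]
    col_edges = [(j * w) // out_w for j in range(out_w + 1)]
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            cell = [
                feature_map[r + y][c + x]
                for y in range(row_edges[i], row_edges[i + 1])
                for x in range(col_edges[j], col_edges[j + 1])
            ]
            row.append(max(cell))
        pooled.append(row)
    return pooled


# Toy 6x6 feature map whose value at (y, x) is 6*y + x.
feature_map = [[6 * y + x for x in range(6)] for y in range(6)]
pooled = roi_max_pool(feature_map, roi=(1, 2, 4, 3))
# Flatten the grid into the fixed-length vector passed on to the FC layers.
feature_vector = [v for row in pooled for v in row]
```

Whatever the size of the RoI, the output grid has the same shape, which is what allows RoIs of different sizes to feed the same fully connected layers.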

 

                                         

Training

One big improvement provided by Fast R-CNN is that it takes advantage of feature sharing during training. In training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. RoIs from the same image share computation and memory in the forward and backward passes, so choosing a small N decreases the computational cost of a mini-batch. Good results are achieved with N=2 and R=128, using fewer SGD iterations than R-CNN. Fast R-CNN also uses a streamlined training process that jointly optimizes the softmax classifier and the bounding-box regressors.
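The hierarchical sampling can be illustrated with a short sketch. The function name `sample_minibatch`, the dictionary layout of the proposals, and the uniform random choice of RoIs are assumptions made for this example; the actual training procedure may select RoIs by additional criteria that are not modelled here.

```python
import random


def sample_minibatch(rois_per_image, n_images=2, batch_size=128, seed=0):
    """Sample N images first, then R/N RoIs from each sampled image.

    rois_per_image maps an image id to its list of RoIs (hypothetical layout).
    Returns a list of (image_id, roi) pairs of length batch_size.
    """
    rng = random.Random(seed)
    per_image = batch_size // n_images
    images = rng.sample(sorted(rois_per_image), n_images)
    batch = []
    for image_id in images:
        # Uniform sampling without replacement is a simplification of this sketch.
        for roi in rng.sample(rois_per_image[image_id], per_image):
            batch.append((image_id, roi))
    return batch


# Hypothetical data: 10 images with 2,000 placeholder proposals each.
proposals = {f"img{i}": [(i, k) for k in range(2000)] for i in range(10)}
batch = sample_minibatch(proposals)  # N=2, R=128 as in the text
```

With N=2 and R=128, only two feature maps have to be computed per mini-batch, and each one is reused by 64 RoIs.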

Multi-task loss

The network has two outputs. The first is a discrete probability distribution (for each RoI), $p = (p_0, \ldots, p_K)$, over $K+1$ classes, computed by a softmax classifier. The second output is the set of bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the $K$ object classes, indexed by $k$. Each training RoI is labelled with a class $u$ and a bounding-box regression target $v$. A multi-task loss $L$ is used to jointly train for classification and bounding-box regression:

$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)$,

in which $L_{cls}(p, u) = -\log(p_u)$ is the log loss for the true class $u$. The bracket $[u \geq 1]$ evaluates to 1 when $u \geq 1$ and to 0 otherwise, so the localization term is ignored for background RoIs ($u = 0$). The second task loss, $L_{loc}$, is defined over a tuple of true bounding-box regression targets for class $u$, $v = (v_x, v_y, v_w, v_h)$, and a predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class $u$. The bounding-box regression loss is defined as

$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(t^u_i - v_i)$,

in which $\mathrm{smooth}_{L1}(x)$ is $0.5x^2$ if $|x| < 1$ and $|x| - 0.5$ otherwise. The parameter $\lambda$ balances the two task losses. (2)
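As a worked example, the multi-task loss can be evaluated for a single RoI as follows. The function names, the three-class toy distribution, and the default `lam=1.0` are assumptions chosen for illustration; `t_u` stands for the predicted tuple for the true class $u$.

```python
import math


def smooth_l1(x):
    """0.5 * x^2 if |x| < 1, otherwise |x| - 0.5."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def multi_task_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v)."""
    l_cls = -math.log(p[u])
    l_loc = sum(smooth_l1(t_i - v_i) for t_i, v_i in zip(t_u, v))
    indicator = 1 if u >= 1 else 0  # class 0 is treated as background
    return l_cls + lam * indicator * l_loc


p = [0.1, 0.7, 0.2]        # toy softmax output for K + 1 = 3 classes
t_u = [0.5, 0.0, 2.0, 0.0]  # predicted (x, y, w, h) offsets
v = [0.0, 0.0, 0.0, 0.0]    # regression targets

loss_fg = multi_task_loss(p, u=1, t_u=t_u, v=v)  # object class: both terms
loss_bg = multi_task_loss(p, u=0, t_u=t_u, v=v)  # background: log loss only
```

For the object RoI, the localization term is $0.5 \cdot 0.5^2 + (2.0 - 0.5) = 1.625$, which is added to $-\log(0.7)$. For the background RoI, only $-\log(0.1)$ remains.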

Back-Propagation through RoI pooling layers

Back-propagation routes derivatives through the RoI pooling layer. The backward function of the RoI pooling layer computes the partial derivative of the loss function with respect to each input variable $x_i$ by following the argmax switches:

$\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} [i = i^*(r, j)] \frac{\partial L}{\partial y_{rj}}$

The argmax function is defined as $i^*(r, j) = \arg\max_{i' \in R(r, j)} x_{i'}$, where $R(r, j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. In words: for each mini-batch RoI $r$ and each pooling output unit $y_{rj}$, the derivative $\partial L / \partial y_{rj}$ is accumulated into $\partial L / \partial x_i$ if $i$ is the argmax selected for $y_{rj}$ by max pooling. (2)
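The accumulation rule can be sketched directly from the formula. The flat input list, the explicit index sets `regions[r][j]` standing in for $R(r, j)$, and the function name are assumptions of this example; a real layer would derive the index sets from the RoI coordinates.

```python
def roi_pool_backward(x, regions, grad_y):
    """Accumulate dL/dx_i over all RoIs r and output units j.

    x:        flat list of input activations x_i
    regions:  regions[r][j] is the index set R(r, j) of inputs pooled by y_rj
    grad_y:   grad_y[r][j] is dL/dy_rj
    """
    grad_x = [0.0] * len(x)
    for r, bins in enumerate(regions):
        for j, index_set in enumerate(bins):
            # i*(r, j): the input that won the max pooling for y_rj
            i_star = max(index_set, key=lambda i: x[i])
            grad_x[i_star] += grad_y[r][j]
    return grad_x


x = [1.0, 5.0, 3.0, 2.0]
regions = [
    [[0, 1], [2, 3]],  # RoI 0 with two output units
    [[1, 2], [3]],     # RoI 1, overlapping RoI 0 on inputs 1 and 2
]
grad_y = [[0.1, 0.2], [0.3, 0.4]]
grad_x = roi_pool_backward(x, regions, grad_y)
```

Input 1 is the argmax for one output unit in each RoI, so its gradient is the sum of two contributions (0.1 + 0.3). Input 0 is never selected and receives no gradient.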

SGD hyper-parameters

The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases, applied as multipliers on a global learning rate of 0.001. (2)
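To make the interplay of these settings concrete, the sketch below initializes one small fully connected layer and applies a single plain SGD update. The layer size, the standard deviation `std=0.01`, the helper names, and the omission of momentum and weight decay are all assumptions made for illustration only.

```python
import random

GLOBAL_LR = 0.001
LR_MULT = {"weights": 1.0, "biases": 2.0}  # per-layer learning rates from the text


def init_fc(n_in, n_out, std, seed=0):
    """Zero-mean Gaussian weights and zero biases for one FC layer."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def sgd_update(value, grad, kind):
    """One plain SGD step; the effective rate is GLOBAL_LR * LR_MULT[kind]."""
    return value - GLOBAL_LR * LR_MULT[kind] * grad


# std=0.01 is an illustrative value assumed for this sketch.
weights, biases = init_fc(n_in=4, n_out=3, std=0.01)
new_w = sgd_update(weights[0][0], grad=1.0, kind="weights")  # step of 0.001
new_b = sgd_update(biases[0], grad=1.0, kind="biases")       # step of 0.002
```

For the same gradient, a bias therefore moves twice as far per step as a weight.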

 

Conclusion

Results

Fast R-CNN achieves top results on the PASCAL Visual Object Classes (VOC) challenges of 2007, 2010 and 2012. Table 1 compares three object detectors, each trained on a 16-layer deep network (VGG16). It shows that Fast R-CNN is faster to train, faster to test and achieves higher accuracy. These results represent a big step toward real-time object detection.

                   Fast R-CNN   R-CNN (1)   SPP-net (3)
Train time (h)     9.5          84          25
Train speedup      8.8x         1x          3.4x
Test time/image    0.32s        47.0s       2.3s
Test speedup       146x         1x          20x
mAP                66.9%        66%         63.1%

Table 1. Timings exclude object proposal time, which is equal for all methods. All methods use VGG16 from Simonyan and Zisserman.(2)(3)(1)
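The speedup rows follow from the timing rows by dividing the R-CNN baseline by each method's time. The short check below recomputes them from the rounded values in Table 1; the dictionary layout and function name are assumptions of this sketch. Because the published times are rounded, the recomputed test-time ratio for Fast R-CNN comes out slightly above the 146x listed in the table.

```python
train_hours = {"Fast R-CNN": 9.5, "R-CNN": 84.0, "SPP-net": 25.0}
test_seconds = {"Fast R-CNN": 0.32, "R-CNN": 47.0, "SPP-net": 2.3}


def speedups(times, baseline="R-CNN"):
    """Speedup of every method relative to the baseline (baseline time / time)."""
    return {name: times[baseline] / t for name, t in times.items()}


train_speedup = speedups(train_hours)   # 84 / 9.5 and 84 / 25
test_speedup = speedups(test_seconds)   # 47 / 0.32 and 47 / 2.3
```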
 

Advantages

Fast R-CNN overcomes many disadvantages of earlier methods and improves on them in both speed and accuracy. The method has several advantages:

1. Higher detection quality (mAP) than R-CNN (1) and SPP-net (3)
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching (2) 

Literature

1. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. 

2. Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.

3. He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." European Conference on Computer Vision. 2014.

4. Uijlings, Jasper R. R., et al. "Selective search for object recognition." International Journal of Computer Vision 104.2 (2013).