Architecture
Figure 1. Architecture of Fast R-CNN. (1)
The input to Fast R-CNN is an image together with multiple regions of interest (RoIs). The network uses several convolutional and max pooling layers to produce a feature map of the image.
Normally there are about 2,000 regions of interest (RoIs) per image, which are determined by proposal methods like Selective Search (4). The RoI pooling layer extracts a fixed-length feature vector from the feature map for each region of interest. Each vector is fed into a sequence of fully connected layers (FCs). This produces two output vectors for each RoI:
1. A vector of softmax probability estimates over the object classes.
2. A four-tuple $(r, c, h, w)$, which defines the RoI: $(r, c)$ specifies the top-left corner and $(h, w)$ the height and width of the window. (2)(1)
Training
One big improvement provided by Fast R-CNN is that it takes advantage of feature sharing during training. In training, stochastic gradient descent (SGD) minibatches are sampled hierarchically: first N images are sampled, and then R/N RoIs are sampled from each image. Choosing a small N decreases the computational effort of a minibatch operation. Good results are achieved with N = 2 and R = 128, using fewer SGD iterations than R-CNN. Fast R-CNN uses a streamlined training process, which jointly optimizes the softmax classifier and the bounding-box regressor.
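The hierarchical sampling scheme can be illustrated with a short sketch; the dataset layout and names are assumptions, and the foreground/background balancing of the real pipeline is omitted:

```python
import random

def sample_minibatch(dataset, N=2, R=128):
    """Hierarchical sampling: first N images, then R/N RoIs per image.

    `dataset` maps an image id to its list of candidate RoIs
    (hypothetical structure for illustration).
    """
    images = random.sample(list(dataset), N)
    minibatch = []
    for img in images:
        rois = random.sample(dataset[img], R // N)
        minibatch.extend((img, roi) for roi in rois)
    # R RoIs that share convolutional features from only N images.
    return minibatch

data = {f"img{i}": list(range(200)) for i in range(10)}
batch = sample_minibatch(data, N=2, R=128)
```

Because all 64 RoIs from one image share that image's forward and backward pass, the minibatch is far cheaper than sampling 128 RoIs from 128 different images.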
Multi-task loss
The network has two outputs. The first is a probability distribution (per RoI) $p = (p_0, \ldots, p_K)$ over $K + 1$ classes, computed by a softmax classifier. The second output is the bounding-box regression offset $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$ for each of the $K$ object classes. Each training RoI is labelled with a ground-truth class $u$ and a bounding-box regression target $v$. A multi-task loss $L$ is used to jointly train for the classification and the bounding-box regression:

$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v),$
in which $L_{cls}(p, u) = -\log p_u$ is the log loss for the true class $u$. The second task loss, $L_{loc}$, is defined over a tuple of the true bounding-box regression targets $v = (v_x, v_y, v_w, v_h)$ for class $u$ and a predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class $u$. The bounding-box regression loss is defined as

$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i),$

in which $\mathrm{smooth}_{L_1}(x)$ is $0.5 x^2$ if $|x| < 1$, otherwise it is $|x| - 0.5$. The parameter $\lambda$ is used to balance the two task losses. (2)
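The multi-task loss can be written out directly. This NumPy sketch (function names and inputs are illustrative) combines the log loss with the smooth L1 regression loss and gates the latter with the indicator $[u \geq 1]$, so background RoIs ($u = 0$) contribute no localization loss:

```python
import numpy as np

def smooth_l1(x):
    # 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise
    # (less sensitive to outliers than a squared L2 loss).
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls + lam * [u >= 1] * L_loc (illustrative sketch).

    p: softmax probabilities over K+1 classes, u: true class
    (0 = background), t_u: predicted box tuple for class u,
    v: ground-truth regression target.
    """
    l_cls = -np.log(p[u])                               # log loss
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()
    return l_cls + lam * (u >= 1) * l_loc               # [u >= 1] gate

p = np.array([0.1, 0.7, 0.2])
loss = multitask_loss(p, u=1,
                      t_u=[0.5, 0.5, 1.0, 1.0],
                      v=[0.0, 0.0, 1.0, 1.0])
```

With $u = 0$ the same call would return only the classification term, since background RoIs have no ground-truth box.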
Back-propagation through RoI pooling layers
Back-propagation routes derivatives through the RoI pooling layer. The backwards function of the RoI pooling layer computes the partial derivative of the loss function with respect to each input variable $x_i$ by following the argmax function:

$\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} [i = i^*(r, j)] \frac{\partial L}{\partial y_{rj}}$

The argmax function is defined as

$i^*(r, j) = \mathrm{argmax}_{i' \in \mathcal{R}(r, j)}\, x_{i'}.$

$\mathcal{R}(r, j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. For each minibatch RoI $r$ and for each pooling output unit $y_{rj}$, the partial derivative $\partial L / \partial y_{rj}$ is accumulated if $i$ is the argmax selected for $y_{rj}$ by max pooling. (2)
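A minimal sketch of this backward pass, assuming the index sets $\mathcal{R}(r, j)$ are given explicitly as arrays (the real layer derives them from the RoI geometry): each output unit sends its gradient only to the input that won the max, and gradients from overlapping RoIs accumulate.

```python
import numpy as np

def roi_pool_backward(x, windows, dLdy):
    """Route gradients through max pooling via the argmax indices.

    x: flat input activations; windows[r][j] is the index set R(r, j)
    of inputs that output unit y_rj pools over (illustrative layout);
    dLdy[r][j] is the upstream gradient for y_rj.
    """
    dLdx = np.zeros_like(x, dtype=float)
    for r, roi in enumerate(windows):
        for j, idx in enumerate(roi):
            i_star = idx[np.argmax(x[idx])]  # i*(r, j)
            # [i = i*(r, j)]: only the argmax input receives gradient,
            # summed over all (r, j) pairs that selected it.
            dLdx[i_star] += dLdy[r][j]
    return dLdx

x = np.array([1.0, 5.0, 3.0, 2.0])
windows = [[np.array([0, 1]), np.array([2, 3])],  # RoI 0
           [np.array([1, 2])]]                    # RoI 1 (overlaps RoI 0)
dLdy = [[1.0, 2.0], [4.0]]
grad = roi_pool_backward(x, windows, dLdy)
```

In the example, input 1 is the argmax for two different output units (one in each RoI), so it receives the sum of both upstream gradients, exactly as the double sum over $r$ and $j$ prescribes.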
SGD hyperparameters
The layers for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. (2)
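A minimal initialization sketch of the scheme above. The standard deviations (0.01 for the classifier, 0.001 for the regressor) and the layer sizes (4096 features, 20 object classes plus background) are taken from the Fast R-CNN paper rather than from this text:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out, std):
    # Zero-mean Gaussian weights, zero biases, as described above.
    W = rng.normal(0.0, std, size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return W, b

# Softmax classifier: 21 classes (20 objects + background).
W_cls, b_cls = init_layer(4096, 21, std=0.01)
# Box regressor: 4 offsets per object class (21 * 4 = 84 outputs).
W_bbox, b_bbox = init_layer(4096, 84, std=0.001)
```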