Architecture
Figure 1. Architecture of Fast R-CNN. (1)
The input to Fast R-CNN is an image together with a set of regions of interest (RoIs). The network uses several convolutional and max-pooling layers to produce a feature map of the image.
Normally there are about 2,000 regions of interest per image, which are determined by proposal methods such as Selective Search (4). The RoI pooling layer extracts a fixed-length feature vector from the feature map for each region of interest (a sketch of this step follows the list below). Each vector is fed into a sequence of fully connected layers (FCs). This produces two output vectors for each RoI:
1. A vector of softmax probabilities that estimates the object class.
2. A four-tuple $(r, c, h, w)$, which defines the RoI: $(r, c)$ specifies the top-left corner and $(h, w)$ the height and width of the window. (2)(1)
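To make the RoI pooling step more concrete, here is a minimal NumPy sketch of max-pooling a single RoI into a fixed-size grid. The feature-map shape, the RoI coordinates, and the 7×7 output grid are illustrative assumptions, not values taken from the figure.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI of a conv feature map into a fixed-size grid.

    feature_map: array of shape (C, H, W)
    roi: (r, c, h, w) -- top-left corner and height/width, in feature-map cells
    output_size: (H_out, W_out) of the pooled grid (7x7 is an illustrative choice)
    """
    r, c, h, w = roi
    H_out, W_out = output_size
    C = feature_map.shape[0]
    pooled = np.zeros((C, H_out, W_out), dtype=feature_map.dtype)

    # Divide the h x w RoI window into an H_out x W_out grid of sub-windows
    # and take the channel-wise maximum over each sub-window.
    for i in range(H_out):
        for j in range(W_out):
            r0 = r + int(np.floor(i * h / H_out))
            r1 = r + int(np.ceil((i + 1) * h / H_out))
            c0 = c + int(np.floor(j * w / W_out))
            c1 = c + int(np.ceil((j + 1) * w / W_out))
            pooled[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))

    return pooled.reshape(-1)  # fixed-length feature vector for the FC layers

# Example with an arbitrary 512-channel feature map and one RoI.
fmap = np.random.randn(512, 38, 50).astype(np.float32)
vec = roi_max_pool(fmap, roi=(4, 10, 14, 21))  # vector of length 512 * 7 * 7
```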
Training
One big improvement provided by Fast R-CNN is that it takes advantage of feature sharing during training. In training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. RoIs from the same image share computation in the forward and backward passes, so choosing a small N decreases the cost of a mini-batch. Good results are achieved with N = 2 and R = 128, using fewer SGD iterations than R-CNN. Fast R-CNN uses a streamlined training process that jointly optimizes the softmax classifier and the bounding-box regressors.
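As a concrete illustration of this hierarchical sampling, the following sketch first draws N images and then R/N RoIs from each; the image ids and RoI lists are placeholder data.

```python
import random

def sample_minibatch(image_rois, N=2, R=128):
    """Hierarchically sample a mini-batch: N images, then R/N RoIs per image.

    image_rois: dict mapping an image id to the list of candidate RoIs
                produced for that image (placeholder data structure).
    Returns a list of (image_id, roi) pairs; all RoIs of one image can then
    share that image's forward and backward computation.
    """
    images = random.sample(list(image_rois), N)   # step 1: sample N images
    batch = []
    for img in images:                            # step 2: sample R/N RoIs each
        for roi in random.sample(image_rois[img], R // N):
            batch.append((img, roi))
    return batch

# Example with dummy proposals: 2 images x 64 RoIs = 128 RoIs per mini-batch.
proposals = {f"img_{k}": [(k, i, 10, 10) for i in range(2000)] for k in range(10)}
minibatch = sample_minibatch(proposals, N=2, R=128)
```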
Multi-task loss
The network has two outputs. The first is a probability distribution (per RoI), $p = (p_0, \ldots, p_K)$, over $K + 1$ classes, computed by a softmax classifier. The second output is the bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the $K$ object classes. Each training RoI is labelled with a ground-truth class $u$ and a ground-truth bounding-box regression target $v$. A multi-task loss $L$ is used to jointly train for classification and bounding-box regression:
$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \geq 1] \, L_{loc}(t^u, v),$$
in which
$$L_{cls}(p, u) = -\log p_u$$
is the log loss for the true class $u$. The second task loss, $L_{loc}$, is defined over a tuple of the true bounding-box regression targets $v = (v_x, v_y, v_w, v_h)$ for class $u$ and a predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class $u$. The bounding-box regression loss is defined as
$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i),$$
in which $\mathrm{smooth}_{L_1}(x)$ is $0.5 x^2$ if $|x| < 1$, otherwise it is $|x| - 0.5$. The parameter $\lambda$ is used to balance the two task losses. (2)
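Putting the pieces together, a small NumPy sketch of this multi-task loss could look as follows; it assumes $p$ is already a softmax output and that class 0 is the background class, for which the indicator $[u \geq 1]$ switches off the localization term.

```python
import numpy as np

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5 (element-wise)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v).

    p:   softmax probabilities over K+1 classes (index 0 = background)
    u:   true class label
    t_u: predicted offsets (t_x, t_y, t_w, t_h) for class u
    v:   true regression targets (v_x, v_y, v_w, v_h)
    """
    l_cls = -np.log(p[u])                                      # log loss for class u
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()   # smooth L1 over x, y, w, h
    return l_cls + lam * (u >= 1) * l_loc                      # [u >= 1] drops L_loc for background

# Example: a foreground RoI (u = 3) whose predicted box is slightly off.
p = np.array([0.05, 0.05, 0.10, 0.70, 0.10])
loss = multi_task_loss(p, u=3, t_u=(0.1, -0.2, 0.05, 0.0), v=(0.0, 0.0, 0.0, 0.0))
```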
Back-propagation through RoI pooling layers
Back-propagation passes derivatives through the RoI pooling layer. The backward function of the RoI pooling layer computes the partial derivative of the loss function with respect to each input variable $x_i$ by following the argmax function:
$$\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} \left[ i = i^*(r, j) \right] \frac{\partial L}{\partial y_{rj}}$$
The argmax function is defined as
$$i^*(r, j) = \operatorname*{argmax}_{i' \in R(r, j)} x_{i'},$$
where $R(r, j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. For each mini-batch RoI $r$ and for each pooling output unit $y_{rj}$, the derivative $\partial L / \partial y_{rj}$ is accumulated if $i$ is the argmax selected for $y_{rj}$ by max pooling. (2)
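Assuming a hypothetical forward pass that records the winning index $i^*(r, j)$ for every pooled output, the backward rule can be sketched as follows; the shapes and the flat indexing used here are assumptions for illustration.

```python
import numpy as np

def roi_pool_backward(dL_dy, argmax_idx, num_inputs):
    """Accumulate dL/dx_i = sum_r sum_j [i = i*(r, j)] * dL/dy_rj.

    dL_dy:      gradients w.r.t. the pooled outputs, shape (num_rois, num_bins)
    argmax_idx: i*(r, j), the flat index of the input that won the max for each
                output unit (recorded during the forward pass), same shape
    num_inputs: number of input activations x_i in the feature map
    """
    dL_dx = np.zeros(num_inputs, dtype=dL_dy.dtype)
    for r in range(dL_dy.shape[0]):              # over mini-batch RoIs
        for j in range(dL_dy.shape[1]):          # over pooling output units
            dL_dx[argmax_idx[r, j]] += dL_dy[r, j]   # route gradient to the argmax input
    return dL_dx
```

An input activation can be the argmax for several output units, for example when RoIs overlap, which is why the gradients are summed rather than assigned.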
SGD hyper-parameters
The layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions. Biases are initialized to 0. All layers use a per-layer learning-rate multiplier of 1 for weights and 2 for biases, together with a global learning rate of 0.001. (2)
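One way to express this learning-rate scheme is with per-parameter groups, for example in PyTorch; the model handle, the way biases are detected, and the momentum value are illustrative assumptions rather than details given above.

```python
import torch

def build_sgd(model, base_lr=0.001, momentum=0.9):
    """SGD with a 1x learning rate for weights and a 2x rate for biases."""
    weights, biases = [], []
    for name, param in model.named_parameters():
        (biases if name.endswith("bias") else weights).append(param)
    return torch.optim.SGD(
        [
            {"params": weights},                    # weights: 1 * global rate (0.001)
            {"params": biases, "lr": 2 * base_lr},  # biases:  2 * global rate
        ],
        lr=base_lr,
        momentum=momentum,
    )
```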