In recent years Deep Convolutional Neural Networks (CNN) demonstrated a high performance on image classification tasks. Experiments showed that the number of layers (depth) in a CNN is correlated to the performance in image recognition tasks. This led to the idea that deeper networks should perform better. Creating deep networks is not as simple as adding layers. One problem is the vanishing/exploding gradients, which hamper the convergence. This obstacle can be overcome by normalized initialization and intermediate normalization layers, so that networks start converging for stochastic gradient descend (SGD) using the backpropagation algorithm. Another problem is the degradation, if the depth of a network increases, the accuracy gets saturated and then degrades rapidly. (Figure 1) A way to counter the degradation problem is using residual learning. (1)
Deep Residual Learning
It is possible to fit an desired underlying mapping H(x)by a few stacked nonlinear layers, so they can also fit an another underlying mapping F(x)=H(x)-x. As a result, it is possible to reformulate it to H(x)=F(x)+x, which consists of the Residual Function F(x) and input x. The connection of the input to the output is called a skipt connection or identity mapping. The general idea is that if multiple nonlinear layers can approximate the complicated function H(x), then it is possible for them to approximate the residual function F(x). Therefore the stacked layers are not used to fit H(x), instead these layers approximate the residual function F(x). Both forms should be able to fit the underlying mapping.
One reason for the degradation problem could be the difficulties in approximating identity mappings by nonlinear layers. The reformulation used identity mapping as a reference and let the residual function represent the perturbations. The identity mapping can be generated by the solver through driving the weights of the residual function to zero if need be. (1)
Residual learning is implented to every few stacked layers. Figure 2 shows an example of 2 layers. As an example, formulation (1) can be defined as:
(1) F(x)=W_2\sigma(W_1x) + x
Where W_1 and W_2 are the weights for the convolutinoal layers and \sigma is the activation function, in this case a RELU function. The operation F + x is realized by a shortcut connection and element-wise addition. The addition is followed by an activation function \sigma.
The resulting formulation for a residual block is:
(2) y(x)=\sigma(W_2\sigma(W_1x) + x) .
After each convolution (weight) layer a batch normalization method (BN) is adopted. The training of the network is achiebed by stochastic gradient descent (SGD) with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus. The weight decay rate is 0.0001 and has a value of 0.9. (1)
The residual and plain networks are compared on the ImageNet 2012 classification dataset that consist of 1000 classes. All Networks are trained on 1.28 million training images and evaluated on the 50k validation images. The final result was obtained on the 100k test images.
The evaluation of the plain models showed that the 34 layer network has a higher training error than the 18 layer model. (Figure 4 left) The reason behind this result is the degradation problem.
The residual models show a contradictiory result. The deeper ResNet with 34 layers has a smaller training error than the 18 layer (Figure 4 right). This result proves that the degradation problem can be addressed with residual learning and that an increased network depth results in a gain accuracy.(1)
The original design of the residual block (Figure 1) can be represented in a more detailed way (Figure 5 a). A proposed optimization is shown in Figure 5 b. The proposed design consist of a direct path for the propagating information through the residual block as a result, through the entire network. This allows the signal to propagate from one block to any other block, during both forward and backward passes. The complexity of the training also becomes simpler with the new block design.
The original Residual block is described as y(x)=\sigma(x+F(x)). The new design change the representation to y(x)=x+F(x), because it is important to have a "clean" path from one block to the next one. The removal of the ReLU function on the main path and the different design of the Residual function F(x) is related (Figure 6).
Figure 6 shows that the activation function from the main path is moved to the the residual function of the next block. This means that the activation functions (ReLU) are now a "pre-activation" function of the weight layers. Experiments showed that the ReLU-only pre-activation performs similiarly to the original design. By adding BN to the pre-activation, the result can be improved by a healty margin. This "pre-activation" model shows consistently better results then original counterpart (Table 1) and the computational complexity is linear to the depth of the Network. (2)
1. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
2. He, Kaiming, et al. "Identity mappings in deep residual networks." European Conference on Computer Vision. Springer International Publishing, 2016.
Links to related/additional Content in the Web