Introduction
In recent years, deep convolutional neural networks (CNNs) have demonstrated high performance on image classification tasks. Experiments showed that the number of layers (depth) in a CNN is correlated with performance in image recognition tasks, which led to the idea that deeper networks should perform better. However, creating deep networks is not as simple as adding layers. One problem is vanishing/exploding gradients, which hamper convergence. This obstacle can be overcome by normalized initialization and intermediate normalization layers, so that networks start converging under stochastic gradient descent (SGD) with backpropagation. Another problem is degradation: as the depth of a network increases, the accuracy saturates and then degrades rapidly (Figure 1). One way to counter the degradation problem is residual learning. (1)
Deep Residual Learning
Residual Learning
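It is possible to fit a desired underlying mapping $H(x)$ with a few stacked nonlinear layers, where $x$ is the input to the first of these layers. Instead of fitting $H(x)$ directly, the layers can fit the residual function $F(x) := H(x) - x$. The original mapping is then reformulated as $H(x) = F(x) + x$, which consists of the residual function $F(x)$ and the input $x$.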
One reason for the degradation problem could be the difficulty of approximating identity mappings with nonlinear layers. The reformulation uses the identity mapping as a reference and lets the residual function represent the perturbations. If an identity mapping is needed, the solver can produce it by driving the weights of the residual function to zero. (1)
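The point about driving the weights to zero can be illustrated with a small sketch. It assumes a single linear layer in place of the stacked nonlinear layers, and the function names `plain_layer` and `residual_layer` are hypothetical, chosen only for this example.

```python
import numpy as np

def plain_layer(x, W):
    """A plain linear layer that has to fit the mapping H(x) directly."""
    return W @ x

def residual_layer(x, W):
    """Fits the residual F(x) = W x; the shortcut adds x back, so H(x) = F(x) + x."""
    return W @ x + x

x = np.array([1.0, -2.0, 3.0])

# Plain layer: the identity mapping requires W to be exactly the identity matrix.
h_plain = plain_layer(x, np.eye(3))

# Residual layer: the identity mapping only requires all weights to be zero.
h_residual = residual_layer(x, np.zeros((3, 3)))
```

In the residual form, small weights give a small perturbation of the identity, which is the sense in which the identity serves as the reference.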
Implementation
Residual learning is applied to every few stacked layers. Figure 2 shows an example with 2 layers. As an example, formulation (1) can be defined as:
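(1)
$F = W_2 \sigma(W_1 x)$
where $W_1$ and $W_2$ are the weights of the two convolutional layers and $\sigma$ is the activation function, in this case ReLU. The operation $F + x$ is realized by a shortcut connection and element-wise addition, followed by a second activation $\sigma$.
The resulting formulation for a residual block is:
(2)
$y = \sigma(W_2 \sigma(W_1 x) + x)$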
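A minimal sketch of a two-layer residual block, assuming it computes $y = \sigma(W_2 \sigma(W_1 x) + x)$ with ReLU as $\sigma$. For brevity it uses dense matrices instead of convolutions and omits biases and batch normalization; the name `residual_block` and the sizes are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Two weight layers with a shortcut connection around them."""
    f = W2 @ relu(W1 @ x)   # residual function F
    return relu(f + x)      # element-wise addition of the shortcut, then ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = 0.1 * rng.normal(size=(4, 4))
W2 = 0.1 * rng.normal(size=(4, 4))

y = residual_block(x, W1, W2)
```

The shortcut needs no extra parameters here because the input and output of the block have the same dimension.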
After each convolution (weight) layer, batch normalization (BN) is applied. The network is trained with stochastic gradient descent (SGD) with a mini-batch size of 256. The learning rate starts at 0.1 and is divided by 10 when the error plateaus. The weight decay is 0.0001 and the momentum is 0.9. (1)
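The training settings above can be sketched as a plain update rule. This is an illustration under assumptions, not the authors' training code: weight decay is applied as an L2 term added to the gradient, and the plateau criterion (no improvement over the last `patience` epochs) is a hypothetical choice, since the text only says the rate is divided by 10 when the error plateaus.

```python
import numpy as np

def sgd_step(w, grad, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay."""
    grad = grad + weight_decay * w              # weight decay as an L2 gradient term
    velocity = momentum * velocity - lr * grad  # accumulate the momentum
    return w + velocity, velocity

def lr_on_plateau(lr, errors, patience=3, factor=0.1):
    """Divide lr by 10 if the last `patience` errors did not beat the earlier best."""
    if len(errors) > patience and min(errors[-patience:]) >= min(errors[:-patience]):
        return lr * factor
    return lr

lr = 0.1                                  # initial learning rate
w = np.array([1.0, -1.0])
v = np.zeros_like(w)
w, v = sgd_step(w, np.array([0.5, 0.5]), v, lr)

lr_stalled = lr_on_plateau(lr, [0.5, 0.4, 0.4, 0.41, 0.4])    # error stopped improving
lr_improving = lr_on_plateau(lr, [0.5, 0.4, 0.3, 0.2, 0.1])   # error still falling
```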
Results
The residual and plain networks are compared on the ImageNet 2012 classification dataset, which consists of 1000 classes. All networks are trained on the 1.28 million training images and evaluated on the 50k validation images. The final result was obtained on the 100k test images.
The evaluation of the plain models showed that the 34-layer network has a higher training error than the 18-layer model (Figure 4 left). The reason behind this result is the degradation problem.
The residual models show the opposite result. The deeper ResNet with 34 layers has a lower training error than the 18-layer ResNet (Figure 4 right). This result indicates that the degradation problem can be addressed with residual learning and that increased network depth then yields a gain in accuracy. (1)
Optimization
The original design of the residual block (Figure 1) can be represented in a more detailed way (Figure 5 a). A proposed optimization is shown in Figure 5 b. The proposed design provides a direct path for propagating information through the residual block and, as a result, through the entire network. This allows the signal to propagate from one block to any other block during both forward and backward passes. Training also becomes easier with the new block design.
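The direct path can be made explicit with a short derivation following (2). It assumes an identity shortcut and no activation after the addition. The symbols are introduced here for illustration: $x_l$ is the input to block $l$, $F$ is the residual function with weights $W_l$, $L$ is any deeper block, and $E$ is the loss.

```latex
\begin{align}
x_{l+1} &= x_l + F(x_l, W_l) \\
x_L &= x_l + \sum_{i=l}^{L-1} F(x_i, W_i) \\
\frac{\partial E}{\partial x_l} &= \frac{\partial E}{\partial x_L}
  \left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i)\right)
\end{align}
```

The term $\partial E / \partial x_L$ reaches block $l$ without passing through any weight layer, so the gradient is unlikely to vanish even when the weights are small.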
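The original residual block is described as
$y_l = h(x_l) + F(x_l, W_l)$
$x_{l+1} = f(y_l)$
where $x_l$ is the input to the $l$-th block, $h(x_l) = x_l$ is the identity shortcut, $F$ is the residual function with weights $W_l$, and $f$ is the ReLU activation applied after the addition. (2)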
Figure 6 shows that the activation function from the main path is moved to the residual function of the next block. This means that the activation functions (ReLU) now act as "preactivation" functions of the weight layers. Experiments showed that ReLU-only preactivation performs similarly to the original design. Adding BN to the preactivation improves the result by a healthy margin. This "preactivation" model shows consistently better results than its original counterpart (Table 1), and the computational complexity scales linearly with the depth of the network. (2)
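The difference in layer ordering can be sketched as follows. The sketch makes several simplifying assumptions: dense matrices replace convolutions, `batch_norm` normalizes over the batch without learned scale or shift, and the function names are hypothetical. It contrasts a block with ReLU after the addition against a block with BN and ReLU placed before each weight layer.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def batch_norm(z, eps=1e-5):
    """Normalize each feature over the batch axis (no learned scale or shift)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def original_block(x, W1, W2):
    """Order: weight -> BN -> ReLU -> weight -> BN -> addition -> ReLU."""
    f = batch_norm(relu(batch_norm(x @ W1)) @ W2)
    return relu(x + f)

def preact_block(x, W1, W2):
    """Order: BN -> ReLU -> weight -> BN -> ReLU -> weight -> addition."""
    f = relu(batch_norm(relu(batch_norm(x)) @ W1)) @ W2
    return x + f

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))      # batch of 8 samples with 4 features each
Z = np.zeros((4, 4))             # residual branch switched off

out_original = original_block(x, Z, Z)   # the ReLU after the addition clips negative entries
out_preact = preact_block(x, Z, Z)       # the input passes through unchanged
```

With the residual branch switched off, only the preactivation block leaves the signal untouched, which is what keeps the path between blocks direct.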
Literature
1. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
2. He, Kaiming, et al. "Identity mappings in deep residual networks." European Conference on Computer Vision. Springer International Publishing, 2016.