Image 1: Behaviour of the spatial transformer over 10 training steps. The spatial transformer learns to remove redundant background data from the image. Image source (2).
Introduction
Current CNNs are only somewhat invariant to translation, which they achieve through max-pooling layers.^{(1)} A key requirement of a CNN is that it correctly classifies objects in real-world image data. Objects in images usually appear at arbitrary positions and are captured from arbitrary viewpoints at different scales. A CNN therefore has to be built in such a way that its output is invariant to the position, size and rotation of an object in the image. The Spatial Transformer Module achieves this, with remarkable results.
2D Geometric Transformations
2D geometric transformations are a set of transformations that alter properties of an image such as scale, rotation and position. A transformation is applied by multiplying the coordinate vector of each pixel with one of the transformation matrices shown in Table 1.
The effect of the respective transformations can be seen on the right.
Table 1: Hierarchy of 2D geometric transformations. Image source [2].
Image 2: Geometric effect of the transformations on an image. Image source [2].
The following shows a generic matrix for each of the 2D geometric transformations. To perform each transformation in a single matrix multiplication, homogeneous coordinates are used, which means that a 1 is appended as a third component, i.e. $(x, y)^T \mapsto (x, y, 1)^T$.


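Transformation  Matrix Entries
Translation  $\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}$
Rigid (Rotation + Translation)  $\begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}$
Similarity (Scaled Rotation)  $\begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}$
Affine Transformation  $\begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \\ 0 & 0 & 1 \end{bmatrix}$
Projective Transformation  $\begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix}$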
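As a minimal sketch of how homogeneous coordinates allow a scaled rotation and a translation to be applied in a single multiplication, the following Python snippet builds a 3x3 similarity matrix and maps the corners of a unit square through it. The function names `similarity_matrix` and `apply_transform` and the chosen parameter values are illustrative assumptions and do not come from the cited sources.

```python
import math


def similarity_matrix(scale, angle_rad, tx, ty):
    """Return a 3x3 homogeneous matrix: scaled rotation followed by translation."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        [scale * c, -scale * s, tx],
        [scale * s, scale * c, ty],
        [0.0, 0.0, 1.0],
    ]


def apply_transform(matrix, point):
    """Map a 2D point through a 3x3 homogeneous transformation matrix."""
    x, y = point
    homogeneous = (x, y, 1.0)  # append a 1 as the third component
    xh, yh, w = (sum(m * v for m, v in zip(row, homogeneous)) for row in matrix)
    # Dividing by w returns to 2D; w stays 1 for every transform up to affine.
    return (xh / w, yh / w)


# Hypothetical example: rotate by 90 degrees, double the size, shift by (1, 0).
M = similarity_matrix(scale=2.0, angle_rad=math.pi / 2, tx=1.0, ty=0.0)
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
transformed = [apply_transform(M, p) for p in corners]
```

The division by `w` only has an effect for projective matrices, where the last row differs from (0, 0, 1); it is kept here so that the same helper works for every row of the table.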

Translational Invariance Through the Max-Pooling Layer
Max-pooling is a form of nonlinear downsampling and an essential part of standard CNNs.^{(1)} The input to the pooling layer is divided into non-overlapping squares of equal size. Each square is then reduced to its maximum pixel value, while the other values are dropped. For further information on max-pooling, please refer to the basics pages. Besides reducing the computational complexity in the higher layers of a network, the max-pooling layer also provides a form of translation invariance, which is covered in the following.
Image 3: The 8 possible translations of an image by one pixel. The 4x4 matrix represents the 16 pixel values of an example image. When applying 2x2 max-pooling, the maximum of each colored rectangle becomes the new entry of the respective bold rectangle. The output stays identical in 3 out of 8 cases, which makes the input to the next layer somewhat translation invariant. With 3x3 max-pooling, 5 out of 8 translation directions give an identical result^{(1)}, which means that the translation invariance increases with the size of the pooling window.
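The translation experiment can be sketched in a few lines of Python. The snippet below is a toy illustration under stated assumptions: it uses a hypothetical 6x6 image with a single bright pixel (not the 4x4 matrix from Image 3), fills vacated pixels with zeros, and the helper names `max_pool`, `shift` and `unchanged_shifts` are made up for this example. The number of shifts that leave the pooled output unchanged depends on the pixel values and on the position of the maximum inside its pooling window.

```python
def max_pool(image, size):
    """Non-overlapping size x size max-pooling of a square 2D list."""
    n = len(image)
    return [
        [
            max(image[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, n, size)
        ]
        for r in range(0, n, size)
    ]


def shift(image, dy, dx, fill=0):
    """Translate the image by (dy, dx) pixels; vacated pixels get `fill`."""
    n = len(image)
    return [
        [
            image[r - dy][c - dx] if 0 <= r - dy < n and 0 <= c - dx < n else fill
            for c in range(n)
        ]
        for r in range(n)
    ]


# The 8 one-pixel translations (all neighbours except "no shift").
SHIFTS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]


def unchanged_shifts(image, size):
    """Return the shifts for which the pooled output equals the unshifted one."""
    reference = max_pool(image, size)
    return [s for s in SHIFTS if max_pool(shift(image, *s), size) == reference]


# Hypothetical test image: all zeros except one bright pixel at row 3, column 4.
image = [[0] * 6 for _ in range(6)]
image[3][4] = 9

same_2x2 = unchanged_shifts(image, 2)
same_3x3 = unchanged_shifts(image, 3)
```

For this particular image, 3 of the 8 shifts keep the 2x2 pooled output unchanged and 5 of the 8 keep the 3x3 pooled output unchanged. Moving the bright pixel to the centre of a 3x3 window would make all 8 shifts invisible to the 3x3 pooling, which shows how strongly the count depends on the input.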
As described before, translation is only the simplest case of a geometric transformation. The other transformations listed in the table above can only be handled by the spatial transformer module.
Performance Compared to Standard CNNs
While providing state-of-the-art results, the spatial transformer CNN introduced by Jaderberg et al. is only 6% slower than the corresponding standard CNN.^{[1]}
The table in Image 6 [1] compares the results of different neural networks on the MNIST data set. It distinguishes between fully connected networks (FCN) and convolutional neural networks (CNN). It further adds a spatial transformer module to each of the network types (ST-FCN and ST-CNN), using different transformation types such as affine and thin plate spline (TPS) transformations.


Image 6: The table compares the classification error of different network models on several distorted versions of the MNIST data set. Networks that include a spatial transformer module outperform the classic neural networks. The images on the right show examples of the input to the spatial transformer (a), a visualization of the transformation (b) and the output after the transformation (c). While the left column uses a thin plate spline (TPS), the transformation in the right column is affine. Image source [1].
Unsupervised Sub-Object Classification
Another important result of Jaderberg et al. is that they achieved a form of unsupervised, fine-grained classification. The presented experiments were carried out on the CUB-200-2011 bird data set, which contains images of 200 different bird species. The images not only show different species, but are also taken from different angles and points of view, at different scales and with individual background scenery, which gives an idea of how challenging this data set is.
Prior to the introduction of the spatial transformer module, Simon and Rodner also performed unsupervised fine-grained classification on this bird data set. By analyzing constellations of part detectors that fire at approximately the same location relative to each other, they achieved a state-of-the-art classification accuracy of 81.0%.^{[3]} The table in Image 7 shows that Jaderberg et al. improved this result by 1.3 percentage points using a CNN model with Inception architecture.^{[1]} With the latter network as a basis, they inserted either 2 or 4 spatial transformer modules into the network architecture, achieving even higher classification accuracy.
The images next to the table on the right show the region of interest on which each transformer module focused. This shows that when spatial transformer modules are placed in parallel, each can learn to attend to a different part of an object. In this case, one of the transformer modules focused on the head and another on the body of a bird. Other work on this data set with sub-object classification was done by Branson et al. While the latter explicitly defined parts of the bird and trained separate detectors on these parts, Jaderberg et al. achieved this separation in a completely unsupervised manner.
Image 7: The table shows the classification results of different network architectures on the CUB-200-2011 bird data set. The spatial transformer networks outperform the other networks. The images next to the table show the behaviour of the spatial transformers when 2 (upper row) or 4 (lower row) of them are placed in parallel inside the network. Interestingly, each of the spatial transformers learned to focus on either the head or the body of the bird, in a completely unsupervised manner. Image source [1].
Problems and Limitations
When using a spatial transformer, it is possible to downsample or oversample a feature map.^{[1]} Downsampling with a sampling kernel of fixed width, such as the bilinear kernel, can cause aliasing effects in the output feature map.
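The aliasing effect can be illustrated with a small sketch. The Python code below is a simplified illustration, not the sampler from [1]: it samples a feature map with a bilinear kernel on a fixed regular grid instead of a learned transformation, and the names `bilinear_sample` and `downsample` as well as the striped test pattern are assumptions made for this example.

```python
def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate a 2D list at real-valued coordinates (y, x >= 0)."""
    h, w = len(feature_map), len(feature_map[0])
    y0, x0 = int(y), int(x)  # top-left neighbour
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)  # clamp at the border
    wy, wx = y - y0, x - x0  # interpolation weights
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom


def downsample(feature_map, factor, offset=0.0):
    """Sample every `factor`-th position of a square map, starting at `offset`."""
    out_size = len(feature_map) // factor
    return [
        [
            bilinear_sample(feature_map, i * factor + offset, j * factor + offset)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# Hypothetical 8x8 input: vertical stripes alternating 0, 1, 0, 1, ...
stripes = [[float(c % 2) for c in range(8)] for _ in range(8)]

on_grid = downsample(stripes, factor=2)                # hits only the 0-columns
off_grid = downsample(stripes, factor=2, offset=0.5)   # halfway between columns
```

In both cases the stripes disappear from the 4x4 output: sampling on the grid yields a map of zeros, while sampling half a pixel off the grid yields a map of 0.5. The bilinear kernel only looks at the four nearest pixels, so it cannot average over the region that each output pixel is supposed to represent, and the result depends on where the sampling points happen to fall.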
As described in the previous section, spatial transformer modules can be used for fine-grained classification, which means that sub-parts of an object can also be detected by a neural network. While this is a very promising result, the number of objects an STN can model is limited to the number of parallel spatial transformers in the network.^{[1]}
Literature
[1] Spatial Transformer Networks (2015, M. Jaderberg et al.)
[2] Computer Vision: Algorithms and Applications (2011, R. Szeliski)
[3] Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks (2015, M. Simon and E. Rodner)