# Overview

Spatial Transformer Networks are Convolutional Neural Networks that contain one or several Spatial Transformer Modules. These modules attempt to make the network spatially invariant to its input data in a computationally efficient manner, which leads to more accurate object classification results. Furthermore, they allow the localization of objects in an image and the sub-classification of object parts, such as distinguishing between the body and the head of a bird, in an unsupervised manner.


Image 1: Behaviour of the spatial transformer during 10 training steps. As visible, the spatial transformer is able to remove redundant background data from the image. Image source [2].

## Introduction

Current CNNs are only somewhat translation invariant, through the use of max-pooling layers [1]. A key requirement of a CNN is that it is able to correctly classify objects in real-world image data. Objects in images usually appear at arbitrary positions, taken from different viewpoints and at different scales. A CNN therefore has to be implemented such that its output is invariant to the position, size and rotation of an object in the image. The Spatial Transformer Module manages to do that, with remarkable results.

## 2D Geometric Transformations

2D geometric transforms are a set of transformations that alter parameters of an image such as scale, rotation and position. A transformation is applied by multiplying each coordinate vector of the image with one of the transformation matrices shown in Table 1.

The effect of the respective transformations can be seen on the right.

Table 1: Hierarchy of 2D geometric transformations. Image source [2]. Image 2: Geometric effect of the transformations on an image. Image source [2].

The following shows a generic matrix for each of the 2D geometric transformations. In order to perform the transformation in one single matrix multiplication, homogeneous coordinates are used, which means that a 1 is added as a third dimension, i.e.

$$\bar{x} = \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}.$$

| Transformation | Matrix |
| --- | --- |
| Translation | $x' = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \end{bmatrix} \bar{x}$ |
| Rigid (rotation + translation) | $x' = \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \end{bmatrix} \bar{x}$ |
| Similarity (scaled rotation) | $x' = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \end{bmatrix} \bar{x}$ |
| Affine | $x' = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{bmatrix} \bar{x}$ |
| Projective | $\tilde{x}' = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} \bar{x}$ |

Note that the projective transform operates on full homogeneous coordinates: the resulting vector $\tilde{x}'$ has to be normalized by its third component to obtain the 2D point $x'$.
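As a quick sketch of how such a transform is applied in practice, the following NumPy snippet multiplies a homogeneous coordinate vector with a similarity matrix (the point and the parameter values for $s$, $\theta$, $t_x$, $t_y$ are made-up examples):

```python
import numpy as np

# made-up similarity transform: scale s = 2, rotation theta = 90 degrees,
# translation (t_x, t_y) = (1, 0)
s, theta, tx, ty = 2.0, np.pi / 2, 1.0, 0.0
M = np.array([[s * np.cos(theta), -s * np.sin(theta), tx],
              [s * np.sin(theta),  s * np.cos(theta), ty]])

# the point (1, 0) in homogeneous coordinates, i.e. with a 1 appended
x_bar = np.array([1.0, 0.0, 1.0])

# one single matrix multiplication performs scaling, rotation and translation
x_prime = M @ x_bar
print(x_prime)  # approximately (1, 2): scaled and rotated to (0, 2), then shifted by (1, 0)
```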

## Translational Invariance Through the Max-Pooling Layer

Max-pooling is a form of non-linear downsampling and an essential part of standard CNNs [1]. The input to the pooling layer is divided into non-overlapping squares of equal size, and each square is reduced to its maximum pixel value while the other values are dropped. For further information on max-pooling, please refer to the basics pages. While the max-pooling layer reduces computational complexity in the higher layers of a network, it also provides a form of translation invariance, which is covered in the following.

Image 3: The figure above shows the 8 possible translations of an image by one pixel. The 4x4 matrix represents the 16 pixel values of an example image. When applying 2x2 max-pooling, the maximum of each colored rectangle becomes the new entry of the respective bold rectangle. As one can see, the output stays identical for 3 out of 8 cases, making the output to the next layer somewhat translation invariant. When performing 3x3 max-pooling, 5 out of 8 translation directions give an identical result [1], which means that the translation invariance increases with the size of the pooling window.

As described before, translation is only the simplest case of a geometric transformation. The other transformations listed in the table above cannot be absorbed by max-pooling and can only be handled by the spatial transformer module.
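The pooling-window effect described above can be sketched in a few lines of NumPy (the 4x4 single-bright-pixel image is a made-up example):

```python
import numpy as np

def max_pool(img, k):
    """Reduce each non-overlapping k x k square to its maximum pixel value.
    Assumes the image dimensions are divisible by k."""
    h, w = img.shape
    return img.reshape(h // k, k, w // k, k).max(axis=(1, 3))

# example image: a single bright pixel on a dark 4x4 background
img = np.zeros((4, 4))
img[1, 1] = 1.0

# translating the bright pixel within the same 2x2 pooling window
# leaves the pooled output unchanged ...
shifted = np.zeros((4, 4))
shifted[0, 0] = 1.0
print(np.array_equal(max_pool(img, 2), max_pool(shifted, 2)))  # True

# ... while translating it across a window boundary changes the output
crossed = np.zeros((4, 4))
crossed[1, 2] = 1.0
print(np.array_equal(max_pool(img, 2), max_pool(crossed, 2)))  # False
```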

# Spatial Transformer Module

## Overview

Image 4: Architecture of a spatial transformer module [1]. U and V are the input and output feature map, respectively. The goal of the spatial transformer is to determine the parameters $\theta$ of the geometric transform. Image source [1].

Image 5: Illustration of a transformation showing the sampling [1]. While $\mathcal{T}_I(G)$ depicts the identity transformation (i.e. no change at all), $\mathcal{T}_\theta(G)$ shows an affine transform. One can see that the sampling grid produced by $\mathcal{T}_\theta(G)$ does not correspond to the pixel coordinates of $U$; therefore some form of interpolation has to be performed. Image 5 also shows that the spatial transformer module is able to crop the image, removing (possibly) redundant information and focusing on key elements of the image. Image source [1].

## Localization Network

The localization network transforms the input feature map $U$, which is shown above, into the output parameters $\theta$. The dimension of $\theta$ depends on the type of transformation the network should be able to perform. As, for example, a projective transform has 8 degrees of freedom, the dimension of $\theta$ would have to be 8 in that case. While the localization network can be either a CNN or a classic neural network, it should contain a final regression layer in order to output the parameters $\theta$. In general, any parameterizable transformation that is differentiable with respect to its parameters can be used in an STN.
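As a minimal sketch (not the paper's actual architecture), the final regression layer of a localization network predicting a 6-parameter affine transform could look as follows; the feature vector is hypothetical, and initializing the weights to zero and the bias to the identity transform is a common way to make the module start out as the identity mapping:

```python
import numpy as np

# hypothetical flattened feature vector produced by the earlier layers
# of the localization network
features = np.random.randn(128)

# final regression layer for an affine transform, so dim(theta) = 6;
# zero weights plus an identity bias mean the module initially applies
# the identity transform
W = np.zeros((6, 128))
b = np.array([1.0, 0.0, 0.0,   # first row of the 2x3 affine matrix
              0.0, 1.0, 0.0])  # second row

theta = (W @ features + b).reshape(2, 3)
print(theta)  # the 2x3 identity affine transform
```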

## Grid Generator

The grid generator uses the transformation parameters $\theta$ to create a sampling grid $\mathcal{T}_\theta(G)$, which is a set of points $(x_i^s, y_i^s)$. Each of these points defines where a sampling kernel has to be applied on $U$ in order to obtain a certain output pixel in $V$.
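For the affine case, the grid generator can be sketched as follows; coordinates are normalized to $[-1, 1]$, and the function maps every output pixel to its sampling point in the input (a sketch under these assumptions, not the paper's implementation):

```python
import numpy as np

def affine_grid(theta, H, W):
    """For each output pixel (x_t, y_t), in normalized coordinates in [-1, 1],
    compute the source sampling point (x_s, y_s) = theta @ (x_t, y_t, 1)."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    # homogeneous target coordinates (x_t, y_t, 1), one column per pixel
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    return (theta @ coords).T.reshape(H, W, 2)  # (x_s, y_s) per output pixel

# with the identity transform, every output pixel samples its own location
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
grid = affine_grid(identity, 4, 4)
print(grid[0, 0], grid[-1, -1])  # [-1. -1.] [1. 1.]
```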

## Sampler

The sampler combines the input feature map and the sampling grid, resulting in the output feature map $V$, by performing some form of interpolation. The interpolation step is necessary as the coordinates of the sampling points $(x_i^s, y_i^s)$ will in general not be existing pixel coordinates of the input $U$.

The sampling step can be performed with any kernel; the general formula is

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \phi_x) \, k(y_i^s - n; \phi_y) \quad \forall i \in [1 \dots H'W'], \; \forall c \in [1 \dots C].$$

Here $k(\cdot)$ can be any kind of sampling kernel, with $\phi_x$ and $\phi_y$ being the sampling parameters. The only constraint on the sampling kernel is that it is differentiable with respect to the sampling points $(x_i^s, y_i^s)$. This is necessary for the backpropagation algorithm that is used for training the spatial transformer network. In their paper, Jaderberg et al. use a bilinear sampling kernel. This changes the above equation to

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1-|x_i^s - m|) \, \max(0, 1-|y_i^s - n|).$$
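A direct (unvectorized, and therefore slow) NumPy sketch of this bilinear sampling, using pixel coordinates instead of the paper's normalized coordinates for simplicity:

```python
import numpy as np

def bilinear_sample(U, grid):
    """V[i, j] = sum_n sum_m U[n, m] * max(0, 1-|x_s - m|) * max(0, 1-|y_s - n|),
    where (x_s, y_s) = grid[i, j] is given in pixel coordinates of U."""
    H, W = U.shape
    n, m = np.arange(H), np.arange(W)
    V = np.zeros(grid.shape[:2])
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            x_s, y_s = grid[i, j]
            kx = np.maximum(0, 1 - np.abs(x_s - m))  # bilinear weight per column
            ky = np.maximum(0, 1 - np.abs(y_s - n))  # bilinear weight per row
            V[i, j] = ky @ U @ kx                    # the double sum above
    return V

U = np.arange(16, dtype=float).reshape(4, 4)

# an identity grid, where every output pixel samples exactly its input pixel,
# reproduces U
grid = np.stack(np.meshgrid(np.arange(4), np.arange(4), indexing="xy"), axis=-1)
print(np.allclose(bilinear_sample(U, grid), U))  # True

# a sampling point between two pixels interpolates their values
half = bilinear_sample(U, np.array([[[0.5, 0.0]]]))
print(half[0, 0])  # 0.5, the mean of U[0, 0] = 0 and U[0, 1] = 1
```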

For performing backpropagation, one has to find the gradients with respect to the input feature map $U$ and the sampling grid coordinates, which brings us to the equations

$$\frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_n^H \sum_m^W \max(0, 1-|x_i^s - m|) \, \max(0, 1-|y_i^s - n|),$$

$$\frac{\partial V_i^c}{\partial x_i^s} = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1-|y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases},$$

$$\frac{\partial V_i^c}{\partial y_i^s} = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1-|x_i^s - m|) \begin{cases} 0 & \text{if } |n - y_i^s| \ge 1 \\ 1 & \text{if } n \ge y_i^s \\ -1 & \text{if } n < y_i^s \end{cases}.$$

These equations allow the loss gradients to backpropagate to the input feature map $U$ of the transformer module on the one hand, and to the transformation parameters $\theta$ of the ST layer on the other, by using the further derivatives $\frac{\partial x_i^s}{\partial \theta}$ and $\frac{\partial y_i^s}{\partial \theta}$. [1]
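As a sanity check on these sub-gradients, the following sketch compares the analytic derivative $\frac{\partial V}{\partial x_i^s}$ of a single bilinear sample with a finite-difference approximation (the 2x2 input and the sampling point are made-up values):

```python
import numpy as np

def bilinear(U, x_s, y_s):
    """Single-channel bilinear sample at (x_s, y_s), as in the equation above."""
    n, m = np.arange(U.shape[0]), np.arange(U.shape[1])
    ky = np.maximum(0, 1 - np.abs(y_s - n))
    kx = np.maximum(0, 1 - np.abs(x_s - m))
    return ky @ U @ kx

def dV_dxs(U, x_s, y_s):
    """Analytic (sub-)gradient of the sample with respect to x_s."""
    n, m = np.arange(U.shape[0]), np.arange(U.shape[1])
    ky = np.maximum(0, 1 - np.abs(y_s - n))
    # the case distinction from the gradient equation: 0, +1 or -1
    gx = np.where(np.abs(x_s - m) >= 1, 0.0,
                  np.where(m >= x_s, 1.0, -1.0))
    return ky @ U @ gx

U = np.array([[0.0, 1.0],
              [2.0, 3.0]])
x_s, y_s = 0.3, 0.6          # a point strictly between pixel centres
eps = 1e-6
numeric = (bilinear(U, x_s + eps, y_s) - bilinear(U, x_s - eps, y_s)) / (2 * eps)
print(np.isclose(dV_dxs(U, x_s, y_s), numeric))  # True
```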

## Performance Compared to Standard CNNs

While providing state-of-the-art results, the Spatial-Transformer-CNN introduced by Jaderberg et al. is only 6% slower in computation time than the corresponding standard CNN. [1]

The table in Image 6 [1] shows the comparison of the results from different neural networks on the MNIST data set. The table distinguishes between fully convolutional networks (FCN) and convolutional neural networks (CNN). It further adds a spatial transformer module to each of the network types (ST-FCN and ST-CNN), with $\theta$ containing either 6 (Aff) or 8 (Proj) transformation parameters, or using a thin plate spline (TPS). The MNIST data set was distorted by rotation (R); rotation, translation and scaling (RTS); a projective transform (P); or an elastic distortion (E). The results show that each spatial transformer network outperforms its standard counterpart, as the classification error is smaller in all cases.

Image 6: The table shows the comparison of the classification error of different network models on several distorted versions of the MNIST data set. Networks which include a spatial transformer module outperform the classic neural networks. The images on the right show examples of the input to the spatial transformer (a), a visualization of the transformation (b) and the output after the transformation (c). While the left column uses a thin plate spline (TPS), the transformation on the right is affine. Image source [1].

## Unsupervised Sub-Object Classification

Another important discovery by Jaderberg et al. is that they achieved a form of unsupervised, fine-grained classification. The presented experiments were done on the CUB-200-2011 bird data set, which contains images of 200 different bird species. These bird images not only show different species, but are also taken from different angles and points of view, at different scales and with individual background scenery, which gives an idea of how challenging this data set is. Prior to the introduction of the spatial transformer module, Simon and Rodner also performed unsupervised fine-grained classification on this bird data set. By analyzing the constellation of part detectors that fire at approximately the same relative location from each other, they achieved a state-of-the-art classification accuracy of 81.0% [3]. The table in Image 7 shows that Jaderberg et al. were able to improve this result by 1.3 percentage points using a CNN model with Inception architecture. [1] Given the latter network as a basis, they inserted either 2 or 4 spatial transformer modules into the network architecture, achieving even higher classification results. The images next to the table on the right show the region of interest on which each of the transformer modules focused. This shows that when spatial transformer modules are put in parallel, each can learn a different part of an object. In this case, one of the transformer modules focused on the head and the other on the body of a bird. Other work with sub-object classification on this data set was done by Branson et al. While the latter explicitly defined parts of the bird and trained separate detectors on these parts, Jaderberg et al. achieved this separation in a completely unsupervised manner.

Image 7: The table shows the classification results of different network architectures on the CUB-200-2011 bird data set. The spatial transformer networks are able to outperform the other networks. The images on the left show the behaviour of the spatial transformers in the network when 2 (upper row) or 4 (lower row) of them are placed in parallel inside the network. Interestingly, each of the spatial transformers learned to focus on either the head or the body of the bird, in a completely unsupervised manner. Image source [1].

## Problems and Limitations

When using a spatial transformer, it is possible to downsample or oversample a feature map. [1] Using a sampling kernel of fixed width, such as the bilinear kernel, can cause aliasing effects in the output feature map when downsampling.

As described in the last chapter, spatial transformer modules can be used for fine-grained classification, which means that also sub-parts of a class can be detected by a neural network. While this is a very promising result, the number of objects an STN can model is limited to the number of parallel spatial transformers in the network. [1]

# Literature

[1] Spatial Transformer Networks (2015, M. Jaderberg et al.)

[2] Computer Vision: Algorithms and Applications (2011, R. Szeliski)

[3] Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks (2015, M. Simon and E. Rodner)