Dropout is a regularization method that is not limited to convolutional neural networks but applies to neural networks in general. The basic idea is to randomly remove units from the network during training, which is meant to prevent co-adaption. This has been shown to reduce overfitting and to improve the performance of neural networks.
Motivation
Overfitting
Overfitting is a problem that can occur whenever a model is fitted to existing data. This includes statistics as well as various disciplines of supervised and unsupervised machine learning, such as linear regression, k-nearest neighbor regression, logistic regression, support vector machines and neural networks. Hawkins explains overfitting with the principle of parsimony [3]. This principle states that a model should contain the smallest number of variables necessary to describe the data. Models of scientific relationships that violate the principle of parsimony are prone to overfitting. An example would be fitting a higher-order polynomial to a set of 2D points that have a linear relationship. Image 2 shows such an example.
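The polynomial example can be sketched in a few lines. This is a minimal illustration and not taken from [3]: the data points, the noise values and all function names are hypothetical. It fits a straight line and a sixth-order polynomial to seven noisy points of a linear relationship and compares the errors on the training points and on one new point.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial of degree len(xs) - 1 through every point."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: y = x plus small hand-picked noise (assumption).
noise = [0.2, -0.3, 0.1, 0.4, -0.2, 0.3, -0.4]
train_x = [float(i) for i in range(7)]
train_y = [x + e for x, e in zip(train_x, noise)]
new_x, new_y = [7.0], [7.0]  # one new point just beyond the training range

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)  # sixth order

train_err_line = mse(line, train_x, train_y)
train_err_poly = mse(poly, train_x, train_y)
new_err_line = mse(line, new_x, new_y)
new_err_poly = mse(poly, new_x, new_y)

print("training error:", train_err_line, train_err_poly)
print("error on new point:", new_err_line, new_err_poly)
```

The polynomial passes through every training point and therefore has the smaller training error, but on the new point the simpler line is far more accurate.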
Deep neural networks can be trained to model complex relationships between their input data and their output. Depending on the amount of training data, the network may develop a behavior that yields good results on the training data but fails as soon as unseen test data is fed into the network. A variety of methods exist to prevent overfitting in neural networks. The simplest approach is to feed more training data into the network. This prevents the network from being trained on features that are merely a random coincidence in the training data and not a general property of the data. This of course requires more training data and also increases the training time and computational cost, which is in general a limiting factor for neural networks. Another method that shows remarkable results is to train separate models on different subsets of the training data and to combine their predictions. This approach is called bootstrap aggregating or "bagging". It is not limited to neural networks, but can be applied to all forms of statistical classification and regression tasks [4]. A further method to prevent overfitting is called "early stopping", which means that the training is stopped, ideally just before the validation error starts to rise. Of course, finding the point in time at which the network has the best possible generalization and "price-performance ratio" is not trivial. For further reading, see Prechelt's work on finding an early stopping criterion for training neural networks [5]. The last approach is to create a network model that has the right capacity [2]. If the capacity is too small, the network is not able to represent all features and regularities that define the training data. If the capacity of the network is too large, it may learn spurious regularities from the training data.
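Early stopping can be illustrated with a simple patience rule. This is a sketch under assumptions: it is not one of the specific criteria discussed in [5], and the function name, the `patience` parameter and the validation errors are hypothetical. Training stops once the validation error has not improved for `patience` consecutive epochs, and the epoch with the lowest validation error is reported.

```python
def early_stopping(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training stops as soon as the validation error has not improved
    for `patience` consecutive epochs.
    """
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_errors) - 1


# Hypothetical validation errors, one value per epoch (assumption).
val_errors = [0.90, 0.70, 0.55, 0.48, 0.45, 0.47, 0.46, 0.50, 0.56, 0.61]
best_epoch, stop_epoch = early_stopping(val_errors, patience=3)
print(best_epoch, stop_epoch)
```

In practice the weights of the best epoch would be saved during training and restored after stopping.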
Methods to limit the capacity of a neural network are weight decay (large weights are penalized or constrained), limiting the number of hidden layers and units, or injecting noise into the network [2]. A successful way to prevent overfitting is dropout. Here, units are randomly removed from the neural network, which can also be seen as a form of adding noise to the network.
Co-adaption
Co-adaption is a term originating from biology and evolutionary theory [3]. Among other things, it describes the process by which different species develop an interdependence. An example is the relationship between the plant Acacia hindsii and the ant species Pseudomyrmex ferruginea. Both species developed a behavior that is unusual for closely related species: the ant is active 24 hours a day to protect the plant, while the plant grows leaves throughout the whole year to provide food. While co-adaption may be an evolutionary advantage in nature, it can cause problems in convolutional neural networks. Hinton et al. describe the co-adaption of feature detectors in neural networks [2]. It means that a single feature detector is not able to describe a meaningful image feature on its own, but only in combination with other feature detectors. They found that randomly dropping units from the neural network prevents co-adaption between the feature detectors, as the individual feature detectors start to detect specific, helpful features.
Image 2: [1] The graph shows a set of 2D data points with an approximately linear relationship. While the red curve, a sixth-order polynomial, has a smaller modelling error on the given points than the green line, the green line will be the more accurate model once more data is added.
Image 3: [5] The vertical axis shows the error, the horizontal axis the training time. The training error of the neural network decreases steadily. The validation error is the error measured on data that was not used for training. It rises after a certain point due to overfitting. Image source [5].
Method
Following the notation of Srivastava et al. [1], consider a neural network with $L$ hidden layers, indexed by $l \in \{1, \dots, L\}$. Let $z^{(l)}$ denote the vector of inputs into layer $l$ and $y^{(l)}$ the vector of outputs from layer $l$, where $y^{(0)} = x$ is the input of the network. $W^{(l)}$ and $b^{(l)}$ are the weights and biases of layer $l$, $w_i^{(l)}$ is the weight vector of unit $i$ in layer $l$, and $f$ is an activation function. For $l \in \{0, \dots, L-1\}$ and any unit $i$, the feed-forward operation is:

(a) Standard Neural Network | (b) Neural Network with Dropout |
---|---|
 | $r_j^{(l)} \sim \mathrm{Bernoulli}(p)$ |
 | $\tilde{y}^{(l)} = r^{(l)} \odot y^{(l)}$ |
$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}$ | $z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}$ |
$y_i^{(l+1)} = f(z_i^{(l+1)})$ | $y_i^{(l+1)} = f(z_i^{(l+1)})$ |

Here $r^{(l)}$ is a vector of independent Bernoulli random variables, each of which is 1 with probability $p$, and $\odot$ denotes the element-wise product. In the dropout stage above, each unit is therefore retained with probability $p$ and removed with probability $1 - p$.
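The dropout stage can be sketched for a single layer as follows. This is a minimal illustration that assumes the formulation of [1], in which each output of the previous layer is kept with probability `p`. The function names, the ReLU activation and the example weights are assumptions, and only the Python standard library is used to keep the sketch self-contained.

```python
import random


def relu(z):
    """Rectified linear activation, used here as an example for f."""
    return max(0.0, z)


def layer_forward(y_prev, weights, biases, p=None, rng=random):
    """One feed-forward step from layer l to layer l + 1.

    y_prev  : outputs of layer l
    weights : one weight vector per unit of layer l + 1
    biases  : one bias per unit of layer l + 1
    p       : retention probability; None means no dropout
    """
    if p is not None:
        # Draw one Bernoulli(p) variable per unit and thin the layer output.
        mask = [1.0 if rng.random() < p else 0.0 for _ in y_prev]
        y_prev = [r * y for r, y in zip(mask, y_prev)]
    z = [sum(w * y for w, y in zip(w_i, y_prev)) + b_i
         for w_i, b_i in zip(weights, biases)]
    return [relu(z_i) for z_i in z]


# Hypothetical layer with three inputs and two units (assumption).
y0 = [1.0, 2.0, 3.0]
W1 = [[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]]
b1 = [0.1, -0.5]

standard_out = layer_forward(y0, W1, b1)
dropout_out = layer_forward(y0, W1, b1, p=0.5, rng=random.Random(0))
print(standard_out, dropout_out)
```

With `p=1.0` every unit is kept and the result equals the standard network, while with `p=0.0` only the biases contribute to the output.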
A successful way to reduce the error on the test set is to perform model averaging [2]. Here, the outputs of many different networks are combined into the final result. To obtain the outputs of many different networks, all of these models have to be trained separately, which is very expensive. Hinton et al. describe their dropout method as an efficient way to train many different networks at once, since in each training step a randomly thinned subset of the network is trained [2]. For a network with $n$ units there exist $2^n$ possible thinned networks, all of which share their weights [1]. At test time, averaging the predictions of all these networks explicitly is not feasible. Instead, a single network without dropout is used, whose weights are multiplied by $p$ [1].
Image 4: [1] Comparison of a standard and a dropout NN. Image source [1].
Image 5: [1] During training time, each unit in the network is present with a probability of $p$.
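The connection between model averaging and dropout can be checked numerically for a single linear unit. The sketch below relies on the scheme described in [1], where a unit is present with probability `p` during training and the weights are multiplied by `p` at test time. All names and values are hypothetical. It enumerates every thinned version of the unit and compares the exact average output with the output of one unit with scaled weights.

```python
from itertools import product


def linear(weights, inputs):
    """Output of a linear unit without bias."""
    return sum(w * x for w, x in zip(weights, inputs))


def average_over_thinned_networks(weights, inputs, p):
    """Exact expected output over all 2^n dropout masks of the inputs."""
    expected = 0.0
    for mask in product([0, 1], repeat=len(inputs)):
        kept = sum(mask)
        # Probability of this particular mask when units are kept with p.
        prob = p ** kept * (1 - p) ** (len(mask) - kept)
        thinned_inputs = [m * x for m, x in zip(mask, inputs)]
        expected += prob * linear(weights, thinned_inputs)
    return expected


def scaled_weights_output(weights, inputs, p):
    """Output of a single unit whose weights are multiplied by p."""
    return linear([p * w for w in weights], inputs)


# Hypothetical unit with three inputs (assumption).
weights = [0.4, -1.2, 0.7]
inputs = [1.0, 2.0, -1.0]
p = 0.5

n_thinned = 2 ** len(inputs)
averaged = average_over_thinned_networks(weights, inputs, p)
scaled = scaled_weights_output(weights, inputs, p)
print(n_thinned, averaged, scaled)
```

For a linear unit both values agree up to rounding. For networks with non-linear activations, the scaled network is only an approximation of the average over all thinned networks.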
Effect on Filters
Image 6 helps to develop a good intuition for why dropout is useful when training a neural network. It shows 256 feature detectors that were trained on the MNIST data set in an autoencoder with one hidden layer [1]. In Image 6 (a) no dropout was applied, while in (b) dropout with $p = 0.5$ was used.
Image 6: [1] Effect of dropout on a set of feature detectors. While the features in (a) are mostly indistinguishable for humans and seem to contain a large portion of white noise, the features in (b) are the product of training with dropout. As can be seen, these detectors respond to meaningful features such as spots, corners and strokes in an image. Image source [1].
Effect on Performance
In their experiments, Srivastava et al. show that the best performance of the neural network, i.e. the lowest classification error, is typically achieved by removing units in hidden layers with a probability of $0.5$ and units in the input layer with a probability of $0.2$ [1].
Effect on Sparsity
Another important effect of dropout is that the activations (the output values of each neuron) tend to become sparse, even without any sparsity enforcement such as weight penalties [1]. A sparse model means that the number of units with high activations is very low and most of the activations are close to zero. Further, the mean activations should be low as well. Image 8 shows the activations of a random subset of units taken from a dropout network (b) and from the equivalent network without dropout (a). Image 8 (a) shows that most units in the network without dropout have a mean activation of about 2.0, and the actual activation values are widely distributed. When dropout is applied, the number of high activations decreases noticeably, and the average activation tends to be around 0.7, compared to 2.0 for the standard neural network.
Image 8: [1] The graphs show histograms of the mean activations and of the activation values of a set of randomly selected units from the network. While most units have a mean activation of about 2.0 in (a), there is a shift towards smaller mean activations when dropout is used. The right histograms of (a) and (b) show that with dropout most of the units have a low activation, while only a few neurons in the network retain a high activation value. Image source [1].
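The two quantities behind such histograms can be computed as sketched below. This is only an illustration: the function names, the threshold and the activation values are hypothetical and are not the data from [1]. The first function computes the mean activation of each unit over all examples, the second the fraction of individual activations that are close to zero.

```python
def mean_activation_per_unit(activations):
    """activations[k][j] is the activation of unit j for example k."""
    n_examples = len(activations)
    n_units = len(activations[0])
    return [sum(row[j] for row in activations) / n_examples
            for j in range(n_units)]


def fraction_near_zero(activations, threshold=0.1):
    """Share of all activation values whose magnitude is below threshold."""
    values = [a for row in activations for a in row]
    return sum(1 for a in values if abs(a) < threshold) / len(values)


# Hypothetical activations of four units for three examples (assumption).
activations = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

means = mean_activation_per_unit(activations)
sparsity = fraction_near_zero(activations)
print(means, sparsity)
```

A sparse network in the sense described above has low values in `means` and a high value of `sparsity`.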
Choosing the right Parameter
Srivastava et al. found that a retention probability of $p = 0.5$ for hidden units is close to optimal for a wide range of networks and tasks, while for input units the optimal retention probability is usually closer to 1 [1].
Literature
[1] Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014, N. Srivastava et al.)
[2] Improving Neural Networks by Preventing Co-adaptation of Feature Detectors (2012, G. E. Hinton, N. Srivastava et al.)
[3] The Problem of Overfitting (2004, D. M. Hawkins)
[4] Bagging Predictors (1996, L. Breiman)
[5] Early Stopping - But When? (2012, L. Prechelt)