# Supervised Learning

Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems [1]. The training data consists of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). For instance, a training example for a neural network could look like this:

$$(\vec{x}_i, \vec{y}_i) \tag{1}$$

where $\vec{x}_i$ is the $i$-th input vector and, correspondingly, $\vec{y}_i$ is the desired $i$-th output vector.
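As a small illustration of the pair structure described above, the following sketch stores a training set as a list of (input, target) tuples. The XOR-like values and the names `training_data`, `inputs` and `targets` are hypothetical and chosen only for this example.

```python
# Hypothetical toy dataset: each training example is an (input vector, desired output vector) pair.
training_data = [
    ([0.0, 0.0], [0.0]),
    ([0.0, 1.0], [1.0]),
    ([1.0, 0.0], [1.0]),
    ([1.0, 1.0], [0.0]),
]

# Walk over the examples; i plays the role of the index in (x_i, y_i).
for i, (x_i, y_i) in enumerate(training_data, start=1):
    print(f"example {i}: input={x_i}, desired output={y_i}")

# Split the pairs into separate lists of inputs and targets.
inputs = [x for x, _ in training_data]
targets = [y for _, y in training_data]
```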

## Error Measures

During learning, we would like an algorithm that finds weights and biases such that the output of the network $\vec{a}$ approximates $\vec{y}(\vec{x})$ for all training inputs $\vec{x}$. To quantify how well we are achieving this goal, we need a measure of the error, that is, of the difference between $\vec{a}$ and $\vec{y}(\vec{x})$ (also called the cost function). A widely used measure is the mean squared error (MSE):

$$E(w, b) = \frac{1}{2n} \sum_{\vec{x}} \lVert \vec{y}(\vec{x}) - \vec{a} \rVert^2 \tag{2}$$

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $\vec{a}$ is the vector of outputs from the network when $\vec{x}$ is the input, $\vec{y}(\vec{x})$ is the desired output, and the sum is over all training inputs $\vec{x}$.
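The MSE can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the common convention with a factor of 1/(2n); the function name `mean_squared_error` and the sample values are hypothetical.

```python
def mean_squared_error(outputs, targets):
    """Return 1/(2n) times the summed squared distance between targets and outputs."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        # Squared Euclidean distance between desired output y and network output a.
        total += sum((y_j - a_j) ** 2 for a_j, y_j in zip(a, y))
    return total / (2 * n)

# Two hypothetical training examples with two output neurons each.
outputs = [[0.8, 0.2], [0.4, 0.6]]
targets = [[1.0, 0.0], [0.0, 1.0]]
mse = mean_squared_error(outputs, targets)
```

The closer the outputs are to the targets, the smaller the returned value becomes. It is zero only if every output matches its target exactly.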

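Besides the MSE, the cross-entropy error measure is also used in neural networks:

$$E = -\frac{1}{n} \sum_{\vec{x}} \left[ y \ln a + (1 - y) \ln(1 - a) \right] \tag{3}$$

where $n$ is the total number of items of training data, the sum is over all training inputs $\vec{x}$, $y$ is the corresponding desired output, and $a$ is the output of the network.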
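The following sketch evaluates the cross-entropy for a single output neuron. The clamping constant `eps` is an assumption added here to avoid taking the logarithm of zero; it is not part of the definition.

```python
import math

def cross_entropy(outputs, targets, eps=1e-12):
    """Cross-entropy for scalar outputs a in (0, 1) and desired outputs y."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        a = min(max(a, eps), 1.0 - eps)  # keep a away from 0 and 1 (assumed safeguard)
        total += y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return -total / n

# Outputs close to the targets give a small error, outputs far away a large one.
ce_good = cross_entropy([0.9, 0.1], [1.0, 0.0])
ce_bad = cross_entropy([0.1, 0.9], [1.0, 0.0])
```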

## Training Protocols

Here, we mainly introduce four protocols:

• Batch training: Batch training computes the gradient of the cost function with respect to the parameters $\theta$ for the entire training dataset.
• Stochastic training: In contrast, stochastic training performs a parameter update for each training example $x^{(i)}$ and desired output $y^{(i)}$.
• Online training: In this protocol, each training example is used only once to update the weights and biases.
• Mini-batch training: Mini-batch training works by randomly picking a small number $m$ of training inputs. We refer to this set of random training inputs as a mini-batch. It takes the best of the other three protocols and performs an update for every mini-batch of $m$ training examples.
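The protocols listed above differ mainly in how many examples feed into one parameter update. A minimal sketch, assuming a hypothetical helper `minibatches` and a toy dataset, shows that batch and stochastic training can be seen as the two extreme mini-batch sizes:

```python
import random

def minibatches(training_data, m, seed=0):
    """Shuffle the data and yield consecutive mini-batches of at most m examples."""
    rng = random.Random(seed)
    shuffled = list(training_data)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), m):
        yield shuffled[start:start + m]

# Hypothetical (input, target) pairs.
training_data = [(x, 2 * x) for x in range(10)]

# Batch training: one update per pass over the whole dataset.
batch_updates = list(minibatches(training_data, m=len(training_data)))
# Stochastic training: one update per training example.
stochastic_updates = list(minibatches(training_data, m=1))
# Mini-batch training: one update per group of m examples.
mini_batch_updates = list(minibatches(training_data, m=4))
```

Online training is not shown separately: it would process each example once as it arrives, without repeated passes over a stored dataset.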

## Parameter Optimization

To recap, our goal in training a neural network is to find weights and biases that minimize the cost function $E(w, b)$. This task is also called parameter optimization.
Let us first consider the simplest situation: $E(\theta_1, \theta_2)$, where $\theta_1$ and $\theta_2$ are 1-dimensional variables. Assume that $E(\theta_1, \theta_2)$ has the form:

Fig. 1: Function $E(\theta_1, \theta_2)$. (Source: [2])

In Figure 1, we would like to find where $E(\theta_1, \theta_2)$ achieves its global minimum. For the plotted function, we can eyeball the graph and find the minimum. In general, however, we do not know the exact point at which $E(\theta_1, \theta_2)$ reaches its minimum.
Theoretically, when the minimum of $E(\theta_1, \theta_2)$ is reached, the following condition must be satisfied:

$$\nabla E(\theta_1, \theta_2) = \left( \frac{\partial E}{\partial \theta_1}, \frac{\partial E}{\partial \theta_2} \right)^T = 0 \tag{4}$$

In practice, the cost function of a neural network is so much more complicated that it can be impossible to calculate the zero of the derivative of $E$ analytically. We therefore start by thinking of $E$ as a kind of valley. We then imagine a ball rolling down the slope of the valley until it eventually reaches the bottom. How can we use this picture in our neural network?

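From calculus, we know that a small change $\Delta\theta = (\Delta\theta_1, \Delta\theta_2)^T$ of the parameters changes $E$ by approximately

$$\Delta E \approx \frac{\partial E}{\partial \theta_1} \Delta\theta_1 + \frac{\partial E}{\partial \theta_2} \Delta\theta_2 = \nabla E \cdot \Delta\theta \tag{5}$$

where $\nabla E = (\partial E / \partial \theta_1, \partial E / \partial \theta_2)^T$ is the gradient of $E$. Recall that our purpose is to minimize the cost function $E$. Suppose we choose

$$\Delta\theta = -\eta \nabla E \tag{6}$$

where $\eta$ is the learning rate (a small, positive parameter). Then we can infer from Equation 5 that

$$\Delta E \approx -\eta \lVert \nabla E \rVert^2 \tag{7}$$

which guarantees $\Delta E \leq 0$ as long as the approximation in Equation 5 holds, so that the "ball" rolls towards the bottom of the "valley". This method is called gradient descent.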
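The update rule can be tried out on a simple two-parameter function. The sketch below assumes the hypothetical cost $E(\theta_1, \theta_2) = \theta_1^2 + 2\theta_2^2$, whose gradient is known in closed form. The start point, learning rate and step count are arbitrary choices for illustration.

```python
def cost(theta):
    """Hypothetical cost E(theta1, theta2) = theta1^2 + 2 * theta2^2."""
    t1, t2 = theta
    return t1 ** 2 + 2.0 * t2 ** 2

def gradient(theta):
    """Analytic gradient of the hypothetical cost."""
    t1, t2 = theta
    return [2.0 * t1, 4.0 * t2]

def gradient_descent(theta, eta=0.1, steps=100):
    """Repeatedly apply theta <- theta - eta * grad E and record the cost."""
    history = [cost(theta)]
    for _ in range(steps):
        grad = gradient(theta)
        theta = [t - eta * g for t, g in zip(theta, grad)]
        history.append(cost(theta))
    return theta, history

theta_min, history = gradient_descent([3.0, -2.0])
```

With this small learning rate, the recorded cost never increases from one step to the next, which mirrors the statement $\Delta E \leq 0$. A learning rate that is too large can break this behaviour, because the linear approximation no longer holds.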

Besides gradient descent, Newton's method is also widely used for optimizing the parameters.
Because Newton's method uses second-order information about the cost function, it typically converges in fewer iterations:

$$\Delta\theta = -H^{-1} \nabla E \tag{8}$$

where $H$ is the Hessian matrix of $E$. The main drawback of this method is the high computational cost of evaluating and inverting the Hessian matrix.
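For comparison, a single Newton step can be sketched on the hypothetical quadratic cost $E(\theta_1, \theta_2) = \theta_1^2 + 2\theta_2^2$. The helper `solve_2x2` is an assumption of this sketch. It solves the linear system directly instead of forming the inverse, which only works this simply for two parameters.

```python
def gradient(theta):
    """Gradient of the hypothetical cost E = theta1^2 + 2 * theta2^2."""
    t1, t2 = theta
    return [2.0 * t1, 4.0 * t2]

def hessian(theta):
    """Hessian of the hypothetical cost (constant for a quadratic)."""
    return [[2.0, 0.0], [0.0, 4.0]]

def solve_2x2(H, g):
    """Solve H * step = g for a 2x2 matrix H using Cramer's rule."""
    (a, b), (c, d) = H
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

def newton_step(theta):
    """One update theta <- theta - H^{-1} grad E."""
    step = solve_2x2(hessian(theta), gradient(theta))
    return [t - s for t, s in zip(theta, step)]

theta_new = newton_step([3.0, -2.0])
```

For a quadratic cost, one step lands on the minimum. For a network with many parameters, the Hessian has as many rows and columns as there are parameters, which is where the cost mentioned above comes from.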

## Weights Initialization

Now we have learned how to optimize the parameters. In this subsection, we look at how to initialize them. But first we should clarify: can the parameters all be initialized to zero?
The answer is no. If every neuron in the network computes the same output, then all neurons also compute the same gradients during backpropagation and undergo exactly the same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.

• Small random numbers. As a solution, it is common to initialize the weights of the neurons to small random numbers, which is referred to as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they compute distinct updates and integrate themselves as diverse parts of the full network.
• Calibrating the variances with $1/\sqrt{n}$. One problem with the above suggestion is that the distribution of the outputs of a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that we can normalize the variance of each neuron's output to 1 by dividing its weight vector by the square root of its fan-in $n$ (i.e., its number of inputs).
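The effect of the calibration can be checked empirically. The sketch below assumes standard normal weights and, as a further assumption, standard normal inputs. It estimates the variance of a neuron's weighted sum with and without the $1/\sqrt{n}$ scaling. The function names, fan-in and trial count are hypothetical.

```python
import math
import random

def init_weights(fan_in, rng, calibrated=True):
    """Draw Gaussian weights, optionally scaled by 1 / sqrt(fan_in)."""
    scale = 1.0 / math.sqrt(fan_in) if calibrated else 1.0
    return [rng.gauss(0.0, 1.0) * scale for _ in range(fan_in)]

def output_variance(fan_in, calibrated, trials=2000, seed=0):
    """Estimate the variance of the weighted sum over many random neurons."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        w = init_weights(fan_in, rng, calibrated)
        x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]  # assumed input distribution
        outputs.append(sum(w_i * x_i for w_i, x_i in zip(w, x)))
    mean = sum(outputs) / trials
    return sum((z - mean) ** 2 for z in outputs) / trials

var_raw = output_variance(fan_in=100, calibrated=False)
var_calibrated = output_variance(fan_in=100, calibrated=True)
```

Under these assumptions, the uncalibrated variance is roughly equal to the fan-in, while the calibrated variance stays close to 1.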

## Error Backpropagation

In the previous sections, we discussed how to initialize and optimize the parameters. Now, let us think more about the optimization: how should each parameter in a multi-layered neural network be optimized?

### Principle Part

To solve this problem, the error backpropagation algorithm [2] is the most famous method. The algorithm consists of four equations.
First, assume the weighted input to the $j$-th neuron in layer $l$ is:

$$z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j \tag{9}$$

where $a^{l-1}_k$ is the output of the $k$-th neuron in layer $l-1$, $w^l_{jk}$ is the corresponding weight, and $b^l_j$ is the bias of the neuron.
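The weighted input of a whole layer can be sketched as follows. The weight matrix is stored row by row, one row per neuron $j$. All numeric values are hypothetical, and the sigmoid is assumed as the activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_inputs(weights, biases, prev_activations):
    """Compute z_j = sum_k w[j][k] * a_prev[k] + b[j] for every neuron j of the layer."""
    return [
        sum(w_jk * a_k for w_jk, a_k in zip(row, prev_activations)) + b_j
        for row, b_j in zip(weights, biases)
    ]

# Hypothetical layer with 3 inputs and 2 neurons.
weights = [[0.1, -0.2, 0.3],
           [0.4, 0.5, -0.6]]
biases = [0.05, -0.05]
a_prev = [1.0, 2.0, 3.0]

z = weighted_inputs(weights, biases, a_prev)
a = [sigmoid(z_j) for z_j in z]  # activations passed on to the next layer
```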
Next, we define the error $\delta^l_j$ of neuron $j$ in layer $l$ by

$$\delta^l_j \equiv \frac{\partial E}{\partial z^l_j} \tag{10}$$

• First equation, for the error in the output layer, $\delta^L$:

$$\delta^L_j = \frac{\partial E}{\partial a^L_j} \sigma'(z^L_j) \tag{11}$$

where $\sigma$ is the activation function.

• Second equation, for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$:

$$\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \tag{12}$$

where $\odot$ is the Hadamard product.

By combining these two equations, we can compute the error $\delta^l$ for any layer in the network. We start by using the first equation to compute $\delta^L$, then apply the second equation to compute $\delta^{L-1}$, then apply it again to compute $\delta^{L-2}$, and so on, all the way back through the network.
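The backward recursion described above can be sketched for one output layer and one hidden layer. The sketch assumes the sigmoid activation and the squared error $E = \frac{1}{2} \lVert y - a \rVert^2$ for a single example, so that $\partial E / \partial a_j = a_j - y_j$. Function names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(a_out, y, z_out):
    """First equation: delta_j = dE/da_j * sigma'(z_j), with dE/da_j = a_j - y_j."""
    return [(a_j - y_j) * sigmoid_prime(z_j) for a_j, y_j, z_j in zip(a_out, y, z_out)]

def hidden_error(w_next, delta_next, z):
    """Second equation: multiply by the transposed next-layer weights, then by sigma'(z)."""
    back = [
        sum(w_next[j][k] * delta_next[j] for j in range(len(delta_next)))
        for k in range(len(z))
    ]
    return [b_k * sigmoid_prime(z_k) for b_k, z_k in zip(back, z)]

# Hypothetical network state: 2 hidden neurons, 1 output neuron.
z_hidden = [0.0, 0.0]
w_out = [[1.0, 2.0]]          # w_out[j][k]: weight from hidden neuron k to output neuron j
z_out = [0.0]
a_out = [sigmoid(z_out[0])]   # equals 0.5
y = [1.0]

delta_out = output_error(a_out, y, z_out)
delta_hidden = hidden_error(w_out, delta_out, z_hidden)
```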

• Third equation, for the rate of change of the cost with respect to any bias in the network:

$$\frac{\partial E}{\partial b^l_j} = \delta^l_j \tag{13}$$

That is, the error $\delta^l_j$ is exactly equal to the rate of change $\partial E / \partial b^l_j$.

• Fourth equation, for the rate of change of the cost with respect to any weight in the network:

$$\frac{\partial E}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \tag{14}$$

This tells us how to compute the partial derivatives $\partial E / \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and $a^{l-1}$, which we already know how to compute.
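A common way to gain confidence in the four equations is to compare the backpropagated gradient with a finite-difference estimate. The sketch below does this for a hypothetical 2-2-1 sigmoid network with squared error on one example. All names and numbers are assumptions made for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Return hidden and output activations of a two-layer sigmoid network."""
    a1 = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    a2 = [sigmoid(sum(w * ai for w, ai in zip(row, a1)) + b) for row, b in zip(W2, b2)]
    return a1, a2

def cost(x, y, W1, b1, W2, b2):
    _, a2 = forward(x, W1, b1, W2, b2)
    return 0.5 * sum((yj - aj) ** 2 for yj, aj in zip(y, a2))

def backprop(x, y, W1, b1, W2, b2):
    a1, a2 = forward(x, W1, b1, W2, b2)
    # Output error; for the sigmoid, sigma'(z) = a * (1 - a).
    delta2 = [(aj - yj) * aj * (1.0 - aj) for aj, yj in zip(a2, y)]
    # Hidden error, propagated back through the output weights.
    delta1 = [
        sum(W2[j][k] * delta2[j] for j in range(len(delta2))) * a1[k] * (1.0 - a1[k])
        for k in range(len(a1))
    ]
    # Third equation: bias gradients equal the errors.
    grad_b1, grad_b2 = delta1, delta2
    # Fourth equation: weight gradient = incoming activation * error.
    grad_W2 = [[a1[k] * delta2[j] for k in range(len(a1))] for j in range(len(delta2))]
    grad_W1 = [[x[k] * delta1[j] for k in range(len(x))] for j in range(len(delta1))]
    return grad_W1, grad_b1, grad_W2, grad_b2

x, y = [0.5, -1.0], [1.0]
W1, b1 = [[0.1, 0.2], [-0.3, 0.4]], [0.0, 0.1]
W2, b2 = [[0.5, -0.6]], [0.2]

grad_W1, grad_b1, grad_W2, grad_b2 = backprop(x, y, W1, b1, W2, b2)

# Central finite difference for the single weight W1[0][1].
h = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += h
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= h
numeric = (cost(x, y, W1_plus, b1, W2, b2) - cost(x, y, W1_minus, b1, W2, b2)) / (2 * h)
analytic = grad_W1[0][1]
```

If the two values agree to several digits, the analytic gradient is very likely implemented correctly. The finite-difference version needs two forward passes per parameter, which is why it is only used as a check.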

With these four equations, we can "easily" backpropagate the error to each parameter. Here, we have only briefly introduced the equations. If you are interested, you can find more about the principle of error backpropagation in [2].

### Practical Part

Now, let us see how these four equations can be used in a neural network.

When a neural network is mentioned, the following picture is probably the first thing that comes to mind. This is a typical neural network with 3 layers:

• Layer 1: Input Layer
• Layer 2: Hidden Layer
• Layer 3: Output Layer

This is just a brief view of a neural network, which is not enough to see how backpropagation works. So let us go deeper and into more detail.

Fig. 2: A simple example of Neural Network.

In this concrete model, we can clearly identify the inputs, weights, biases and outputs.

The first layer is the input layer, which includes 2 neurons $i_1, i_2$ and the bias $b_1$. The second layer is the hidden layer with 2 neurons $h_1, h_2$ and the bias $b_2$. The last layer is the output layer with 2 output neurons $o_1, o_2$. The $w_i$ between the layers are the corresponding weights. Here, we assume the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ as the activation function.

To demonstrate how the backpropagation algorithm actually works, we assign each variable an initial value:

• Input Data:  $//$
• Output Data:  $//$
• Initial weights and biases:  $//$
$//$

Our target is: given the input data, the output of the network should approximate the output data as closely as possible.

Fig. 3: Neural Network with 2 inputs and 2 outputs. (Source: [3])

Fig. 4: Initial State of the Neural Network. (Source: [3])

#### Step 1 Forward Propagation

1. Input layer → hidden layer:
First, we calculate the weighted input to the $h_1$ neuron in the hidden layer:

The output of the $h_1$ neuron (with the sigmoid activation function) is:

Similarly, the output of the $h_2$ neuron is:

2. Hidden layer → output layer:
We repeat the same procedure for the output layer:

Here, the forward propagation is finished, because we have the output of the network for the given input and initial parameters. Evidently, the output value $//$ differs considerably from the desired output value $//$. So we need to apply the backpropagation algorithm to the network to update the parameters and re-calculate the output.
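The forward pass described in this step can be sketched as follows. Since the figures and the initial values are not reproduced here, every number in the sketch is an assumed, illustrative value. The wiring (`w1`, `w2` feeding `h1`; `w3`, `w4` feeding `h2`; `w5`, `w6` feeding `o1`; `w7`, `w8` feeding `o2`) and the names `net_*` and `out_*` are assumptions as well.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed illustrative values (not taken from the text).
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # input -> hidden (assumed wiring)
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55   # hidden -> output (assumed wiring)
b1, b2 = 0.35, 0.60                       # one shared bias per layer

# Input layer -> hidden layer: weighted input, then sigmoid.
net_h1 = w1 * i1 + w2 * i2 + b1
net_h2 = w3 * i1 + w4 * i2 + b1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

# Hidden layer -> output layer: same procedure with the hidden outputs as inputs.
net_o1 = w5 * out_h1 + w6 * out_h2 + b2
net_o2 = w7 * out_h1 + w8 * out_h2 + b2
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)
```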

#### Step 2 Back propagation

1. Calculate the error (MSE):
From the section on error measures, we know the definition of the MSE (Equation 2):

According to this equation, we can easily calculate the value of the MSE:
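For a single training example, the MSE reduces to half the summed squared difference between desired and actual outputs. The sketch below evaluates it for assumed network outputs and assumed targets. Both lists are illustrative values, not values taken from the text.

```python
def total_error(outputs, targets):
    """Half the summed squared difference for one training example."""
    return sum(0.5 * (t - o) ** 2 for o, t in zip(outputs, targets))

outputs = [0.75136507, 0.772928465]   # assumed outputs of a forward pass
targets = [0.01, 0.99]                # assumed desired outputs
e_total = total_error(outputs, targets)
```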

2. Update the parameters between the hidden layer and the output layer:

Combining the first equation and the fourth equation (chain rule), we obtain the derivative of the MSE with respect to $w_5$:

Fig. 5: Back propagate the error for the output layer. (Source: [3])

Let us calculate the value of each factor in the equation above one by one:

• $//$：

• $//$

• $//$

At last,  $//$.

You may notice that we did not use the error variable $\delta$. Why define it, then?

Actually, we did use it. Relying on the previous equation for $\delta$ and on the calculation above, we can easily derive that:

Then,  $//$ can be written as:

So, the derivative of the MSE with respect to $w_5$ is:

In this part, the last step is to update the value of $w_5$ to make our network "better":

where $\eta$ is the learning rate, here chosen to be 0.5.

Similarly, $w_6$, $w_7$ and $w_8$ can be updated to:
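The update of the weights between the hidden layer and the output layer can be sketched as follows. The numeric values, the wiring of `w5` to `w8`, and all variable names are assumptions for illustration. Only the learning rate of 0.5 comes from the text. The sketch uses the sigmoid identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed illustrative values (not taken from the text).
i1, i2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
eta = 0.5

# Forward pass.
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)

# First equation: output errors delta = (out - target) * out * (1 - out).
delta_o1 = (out_o1 - t1) * out_o1 * (1.0 - out_o1)
delta_o2 = (out_o2 - t2) * out_o2 * (1.0 - out_o2)

# Fourth equation: gradient = incoming activation * error.
grad_w5 = out_h1 * delta_o1
grad_w6 = out_h2 * delta_o1
grad_w7 = out_h1 * delta_o2
grad_w8 = out_h2 * delta_o2

# Gradient descent step for each weight.
w5_new = w5 - eta * grad_w5
w6_new = w6 - eta * grad_w6
w7_new = w7 - eta * grad_w7
w8_new = w8 - eta * grad_w8
```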

3. Update the parameters between the hidden layer and the input layer:

The method differs little from the last part. The only change is that, for instance, when we calculate the derivative of the total MSE with respect to $w_1$, the errors in both outputs $o_1, o_2$ must be considered, which means:

Based on this, and using the fourth equation, $\frac{\partial E}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$, and the second equation, $\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l)$:

In the same way:

Fig. 6: Back propagate the error for the hidden layer. (Source: [3])
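The hidden-layer update can be sketched in the same style. As before, the numeric values, the wiring and the variable names are assumptions. Note that the hidden errors are computed with the original output weights, not with already updated ones.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed illustrative values (not taken from the text).
i1, i2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
eta = 0.5

# Forward pass.
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)

# Output errors (first equation).
delta_o1 = (out_o1 - t1) * out_o1 * (1.0 - out_o1)
delta_o2 = (out_o2 - t2) * out_o2 * (1.0 - out_o2)

# Hidden errors (second equation): both output errors contribute to each hidden neuron.
delta_h1 = (w5 * delta_o1 + w7 * delta_o2) * out_h1 * (1.0 - out_h1)
delta_h2 = (w6 * delta_o1 + w8 * delta_o2) * out_h2 * (1.0 - out_h2)

# Fourth equation, with the network inputs as incoming activations.
grad_w1 = i1 * delta_h1
grad_w2 = i2 * delta_h1
grad_w3 = i1 * delta_h2
grad_w4 = i2 * delta_h2

w1_new = w1 - eta * grad_w1
w2_new = w2 - eta * grad_w2
w3_new = w3 - eta * grad_w3
w4_new = w4 - eta * grad_w4
```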

So far, one pass of the backpropagation algorithm has been completed. Next, we need to apply it iteratively until the error converges or the number of iterations reaches a set limit. In fact, after 10000 iterations, the total error is 0.000035085 while the output is $//$, which is a satisfactory performance.
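Repeating the forward pass and the weight updates gives a simple training loop. The sketch below is a minimal version under several assumptions: the initial values are illustrative, only the weights are updated while the biases stay fixed, and a single training example is used. It is not meant to reproduce the exact numbers quoted above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_output, b1, b2):
    """Return hidden and output activations; each layer shares one bias."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b1) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b2) for row in w_output]
    return hidden, output

def total_error(output, target):
    return sum(0.5 * (t - o) ** 2 for o, t in zip(output, target))

def train_step(x, target, w_hidden, w_output, b1, b2, eta):
    """One forward pass followed by one gradient descent update of all weights."""
    hidden, output = forward(x, w_hidden, w_output, b1, b2)
    delta_out = [(o - t) * o * (1.0 - o) for o, t in zip(output, target)]
    # Hidden errors use the output weights from before the update.
    delta_hid = [
        sum(w_output[j][k] * delta_out[j] for j in range(len(delta_out))) * h * (1.0 - h)
        for k, h in enumerate(hidden)
    ]
    new_w_output = [[w - eta * delta_out[j] * hidden[k] for k, w in enumerate(row)]
                    for j, row in enumerate(w_output)]
    new_w_hidden = [[w - eta * delta_hid[j] * x[k] for k, w in enumerate(row)]
                    for j, row in enumerate(w_hidden)]
    return new_w_hidden, new_w_output

# Assumed illustrative values (not taken from the text).
x, target = [0.05, 0.10], [0.01, 0.99]
w_hidden = [[0.15, 0.20], [0.25, 0.30]]
w_output = [[0.40, 0.45], [0.50, 0.55]]
b1, b2, eta = 0.35, 0.60, 0.5

error_before = total_error(forward(x, w_hidden, w_output, b1, b2)[1], target)
for _ in range(10000):
    w_hidden, w_output = train_step(x, target, w_hidden, w_output, b1, b2, eta)
error_after = total_error(forward(x, w_hidden, w_output, b1, b2)[1], target)
```

In practice, the loop would usually stop early once the error falls below a chosen threshold, and the biases would be updated with the third equation as well.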

## Literature

1. Bishop, Christopher M. "Pattern Recognition and Machine Learning." Springer, 2006.
2. Nielsen, Michael. "Neural Networks and Deep Learning." http://neuralnetworksanddeeplearning.com/index.html
3. http://www.tuicool.com/articles/YnQRjaq


1. All the comments are SUGGESTIONS and are obviously highly subjective!

• Try to incorporate links to other pages in the wiki
• Link to Wikipedia for concepts, which are not described in the wiki (e.g., Gradient Descent)
• Check for redundant and missing spaces
• Give numbers to all figures and formulas and refer to them by number (e.g., Equation 1). Avoid "below" etc.
• Maybe provide a reference to the CIFAR-10 database
• Introduce abbreviations when you first use them (e.g., MLP). I would then always use them to save space (FCL, MLP, ...)
• Maybe use bullet points for the training protocols and decide, if you want to continue with capital letters after the colon or not
• Reduce the amount of commas!!! English uses very little of them...

• "square root of its fan-in" (I am not sure, if you can actually use fan-in in that context

Corrections:

• trainingdata consist

• training data consists

• Widely used method is Mean Squared Errors(MSE)

• A widely used method is the Mean Squared Error(MSE)

• when x⃗   is input,

• when x⃗   is the input,

• Besides MSE, cross-entropy error measure is also used in neural networks:

• Besides the MSE, the cross-entropy error measure is also used in neural networks:

• Batch trainin:

• Batch training:

• Recapping, our goal

• To recap, our goal

• Now, of course, for the function plotted above, we can eyeball the graph

• Now, for the function plotted above, we can examine the graph

• But actually we do not know where is the exact point at which E(θ1,θ2) reaches its minimum

• But actually, we do not know at which point exactly E(θ1,θ2) reaches its minimum

• where η is the learning rate which is a small, positive parameter.

• whereη  is the learning rate (a small, positive parameter).

• Then from cost equation we are clear that

• Thereby, we can infer from the cost equation that

• In Newton's Method, due to the use of second-order information of the cost function, the algorithm performs faster convergence.

• Due to the use of second-order information of the cost function in Newton's Method, the algorithm shows a faster convergence.

• The main drawback of this method is computationally expensive for evaluation and inversion

• The main drawback of this method is its computational expensiveness for evaluation and inversion

• But firstly we should make it clear: could the parameters all be initialized to zero?

• But first we should clarify if the parameters can all be initialized to zero?

• In above, we discussed

• In the previous section, we discussed

• How to optimize each parameter in a multi-layered neural networks

• How should each parameter in a multi-layered neural network be optimized?

• Firstly, assume the weighted input to thejth neuron in layer l:

• First, assume the weighted input to the jth neuron in layer l equals:

• output of k-th neuron

• output of the k-th neuron

• Now, let's see in a neural network how these 4 equations could  be used.

• Now, let us see how these 4 equations could be used in a neural network.

• This is just a brief view of neural network which means, it is not enough

• This is just a brief view of neural network, which means it is not enough

• In this concret model on the right

• In this concrete model on the right

•  Here, we assume sigmoid function is the activation function.

•  Here, we assume the sigmoid function as the activation function.

• For the convenience of demonstration how the backpropagation algorithm actually works,

• For the convenience of the demonstration on how the backpropagation algorithm actually works,

• First we calculate the weighted input to the h1 neuron in hidden layer:

• First, we calculate the weighted input to the h1 neuron in the hidden layer:

• output of the h2 neuron is:

• output of the h2 neuron which is

• and re-calculate output.

• and re-calculate the output.

• we know that:

• we knew that:

• output of the MES:

• output of the MSE:

• You may found that, we didn't use

• You may find that we did not use

• Rely on the previous equation of δ, and

• Relying on the previous equation of δ and

• learning rate, we take its value as 0.5 here.

• learning rate (here arbitrarily chosen to be 0.5)

• much from last part,

• much from the last part,

• The next, we need to do it recursively until the error converges or the number of iteration reached the limitation.

• As a next step, we need to apply it recursively until the error converges or the number of iterations reaches a set limit.

• a satisfied performance.

• a satisfactory performance.

Confusion:

•  Individual fully connected layers function identically (???) to the layers of the multilayer perceptron with the only exception being the input layer.      (a verb is missing here)

Final remark:

• Maybe a little too much in detail but very hands-on
2. Hello,

1. I suggest creating a table of contents at the beginning of this wiki, so that people get a clear view of its structure and can jump directly to the parts they are interested in.
2. Maybe add a small title to the first subsection, like introduction or definition of supervised learning.
3. I think it is better to number each figure, so readers can easily find them.
4. In weight initialization, I suggest using "weights", because multiple weights need to be initialized. I also suggest using a small bullet before "Small Random Numbers" and "Calibrating the variances with".
5. "Here, we assume sigmoid function is the activation function." I think the formula of the sigmoid function should be given here.

Confusion:

When you talk about optimization methods, there are gradient descent, stochastic gradient descent, momentum, Adam and so on. These are optimization methods.

But backpropagation is the way you update your weights and biases based on a given optimization method. Backpropagation is not an optimization method itself; this is my understanding.

You directly mention the optimization in the subsection Error Backpropagation and do not mention gradient descent. That is my confusion.

Suggested change and some errors:

• Here, w denotes the collection of all weights in the network, b  b all the biases.
• There is a space at the beginning.
•  In Newton's Method, due to the use of second-order information of the cost function, the algorithm performs faster convergence.
•  Because of the second-order information of the cost function in Newton's Method, the algorithm converges faster.
• The main drawback of this method is computationally expensive for evaluation and inversion
• The main drawback of this method is its computational expensiveness for evaluation and inversion
•  In above, we discussed how to initialize and optimize the parameters. Now, let's think more about the optimization of the parameters  (did not talk optimization before)
• As we have already discussed how to initialize the parameters of a neural network, now let's talk more about how to optimize the parameters
• To solve this problem, Error Backpropagation Algorithm is the most famous method that is used.
• To solve this problem, Error Backpropagation  is the most famous algorithm.
• First Equation for Error in the Output Layer
• In the first equation, the error of the output layer is calculated:
• same as:
• Second Function for Error δ  l   δl in terms of the error in the next layer
• Third Equation for the rate of change of the cost with respect to any bias in the network
• Fourth equation for the rate of change of the cost with respect to any weight in the network
• This is just a brief view of neural network which means, it is not enough to see how the backpropagation works.
• From the illustrated figure we still don't know how backpropagation works.
• So let's go deep and more detailed.
• So let's go deeper and see more details.
• In this concret model on the right,
• In the concrete model on the right
• So far, the back propagation algorithm was accomplished. The next, we need to do it recursively until the error converges or the number of iteration reached the limitation.
• So far, the first iteration of back propagation was accomplished, the next thing we need to do is compute it recursively until the error converges or the number of iteration reaches the limitation.

Conclusion
This wiki explains supervised learning in neural networks, and the BP algorithm is introduced in detail. This wiki is very friendly to beginners.
Best greetings