Learning and neural networks

Single Layer Perceptrons
A perceptron is a type of feedforward neural network that is commonly used in artificial intelligence for a wide range of classification and prediction problems. Here, however, we will look only at how to use perceptrons to solve classification problems. Consider the following problem. Suppose you wanted to predict someone's profession based on how much they like Star Trek and how good they are at math. You gather several people into a room, measure how much they like Star Trek, and give them a math test to see how good they are at math. You then ask what they do for a living. After that, you create a plot, placing each person on it based on their Star Trek and math scores.



To summarize
The neural network starts out kind of dumb, but we can tell how wrong it is, and based on how far off its answers are, we adjust the weights a little to make it more correct the next time. We do this over and over until its answers are good enough for us. It is important to note that we control the rate of learning, in this case via a constant learning rate $$\eta $$. This is because if we learn too quickly we can overshoot the answer we want to get. If we learn too slowly, the drawback is that it takes longer to train the neural network.


 * Note: The difference between $$t_k$$ and $$y_k$$ is that $$t_k$$ is what you want the network to produce while $$y_k$$ is what it actually outputs. If the network is well trained $$y_k \approx t_k$$
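The adjustment described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a linear output node and the simple delta rule (each weight moves by $$-\eta (y_k - t_k) x_i$$). The function names and the sample values are hypothetical and chosen only for this example.

```python
def perceptron_output(weights, x):
    """Linear output y_k = sum_i w_ki * x_i for each output node k."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def delta_rule_step(weights, x, t, eta):
    """Return new weights after one small correction toward the targets t."""
    y = perceptron_output(weights, x)
    return [
        [w - eta * (y_k - t_k) * xi for w, xi in zip(row, x)]
        for row, y_k, t_k in zip(weights, y, t)
    ]


# One output node; the first input is a constant bias input of 1.0.
x, t = [1.0, 0.5, -0.5], [1.0]

# A small learning rate: the output creeps toward the target.
weights = [[0.0, 0.0, 0.0]]
for _ in range(50):
    weights = delta_rule_step(weights, x, t, eta=0.1)
error = abs(perceptron_output(weights, x)[0] - t[0])

# A learning rate that is too large: a single step overshoots the target.
overshoot = delta_rule_step([[0.0, 0.0, 0.0]], x, t, eta=2.0)
overshoot_error = abs(perceptron_output(overshoot, x)[0] - t[0])
```

With the small learning rate the error shrinks on every step, while the large one leaves the output farther from the target than where it started.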

Steps in training and running a Perceptron:
These steps can also be applied to the multi layer Perceptron.
 * 1) Get samples of training and testing sets. These should include
 * 2) What the inputs $$x_i$$ (observations) are
 * 3) Inputs and outputs generally should be normalized so that the largest number is 1 and the smallest is either -1 or 0. This can be done with the basic formula [[Image:Basic_Normalization_Formula.gif]] Here normalizeConst would be 2 and offsetConst would be 1 if we normalized from -1 to 1, so we would have: $$ x_i ^\prime  = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \cdot 2 - 1 $$ where $$x_{\min}$$ and $$x_{\max}$$ are the smallest and largest values of that input.
 * 4) What outputs $$t_k$$ (decisions) you expect it to make
 * 5) Set up the network
 * 6) Create input and output nodes
 * 7) Create weighted edges $$w_{ki}$$ between each node. We usually set initial weights randomly from 0 to 1 or -1 to 1.
 * 8) Run the training set over and over again and adjust the weights a little bit each time.
 * 9) When the error converges, run the testing set to make sure that the neural network generalizes a good answer.
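The steps above can be sketched end to end as follows. This is a rough illustration under several assumptions: a single linear output node, a constant bias input, a made-up one-feature data set, a fixed number of epochs, and a 0.5 decision threshold at test time. The names `normalize_const` and `offset_const` mirror normalizeConst and offsetConst; everything else is hypothetical.

```python
import random


def normalize(values, normalize_const=2.0, offset_const=1.0):
    """Rescale values to [-1, 1] (use constants 1 and 0 for [0, 1])."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * normalize_const - offset_const for v in values]


def output(weights, x):
    """Linear output of a single node."""
    return sum(w * xi for w, xi in zip(weights, x))


def train(samples, eta=0.02, epochs=1000, seed=0):
    """Run the training set over and over, nudging the weights each time."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(len(samples[0][0]))]
    history = []  # summed squared error per epoch, to check convergence
    for _ in range(epochs):
        total_error = 0.0
        for x, t in samples:
            y = output(weights, x)
            total_error += 0.5 * (y - t) ** 2
            weights = [w - eta * (y - t) * xi for w, xi in zip(weights, x)]
        history.append(total_error)
    return weights, history


# Made-up data: a raw score and the decision we expect for it.
raw_scores = [0, 10, 20, 30, 40, 60, 70, 80, 90, 100]
targets = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
inputs = [[1.0, s] for s in normalize(raw_scores)]  # 1.0 is the bias input
samples = list(zip(inputs, targets))
training_set, testing_set = samples[0::2], samples[1::2]

weights, history = train(training_set)
test_hits = sum((output(weights, x) > 0.5) == bool(t) for x, t in testing_set)
```

Once `history` stops changing from one epoch to the next, the testing set (which was never used for training) shows whether the network generalizes.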

Multi Layer Perceptrons

 * Note: The sigmoid activation function at the output is optional. Only the activation function following $$w_{ij}$$ must be used. The reason for using a sigmoid at the output $$y_k$$ is to force the output values to normalize between 0 and 1. It should be noted that other output functions can be used such as stepper functions and soft-max functions at the final output layer. For more on transfer functions see: Transfer Function
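As a rough sketch of the note above, the forward pass of a two layer perceptron can be written with a switch for the optional output sigmoid. Bias units are left out for brevity, and the function and parameter names (`forward`, `sigmoid_output`) are assumptions made for this example only.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w1, w2, sigmoid_output=True):
    """w1[j][i] connects input i to hidden unit j; w2[k][j] connects hidden j to output k."""
    # The activation after the first layer of weights is always applied.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    a_out = [sum(w * z for w, z in zip(row, hidden)) for row in w2]
    # The activation at the output is optional.
    return [sigmoid(a) for a in a_out] if sigmoid_output else a_out


x = [1.0, -1.0]
w1 = [[0.0, 0.0], [0.0, 0.0]]  # both hidden units output sigmoid(0) = 0.5
w2 = [[1.0, 1.0]]
linear_y = forward(x, w1, w2, sigmoid_output=False)
squashed_y = forward(x, w1, w2)
```

Without the output sigmoid the result is the raw weighted sum; with it, the same sum is forced between 0 and 1.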

Training and Back Propagation
The standard way to train a multi layer perceptron is using a method called back propagation. This is used to solve a basic problem called assignment of credit, which comes up when we try to figure out how to adjust the weights of edges coming from the input layer. Recall that in the single layer perceptron, we could easily know which weights were producing the error because we could directly observe the weights and output from those weighted edges. Now, however, the output of the first layer of weights passes through a second layer of weights before we can observe it. As such, the contribution of the first layer's weights to the error is obscured by the second set of weights.

To give a better idea about this problem and its solution, consider this toy problem:
A mad scientist wants to make billions of dollars by controlling the stock market. He will do this by controlling the stock purchases of several wealthy people. The scientist controls information that can be given by Wall Street insiders and has a device to control how much different people trust each other. Using his ability to input insider information and control trust between people, he will control the purchases made by wealthy individuals. If purchases can be made that are ideal to the mad scientist, he can gain capital by controlling the market.



As a mad scientist, you will need to adjust this social network in order to create optimal actions in the marketplace. You do this using your secret Trust 'o' Vac 2000. With it you can increase or decrease each trust weight as you see fit. You then observe the trades that are made by the rich dudes. If the trades are not to your liking, then we consider this error. The more to your liking the trades are, the less error they contain. Ideally, you want to slowly adjust the network so that it gets closer and closer to what you want and contains less error. In general terms this is referred to as gradient descent.



There are many ways in which we can adjust the trust weights, but we will use a very simple method here. Each time we place some insider information, we watch the trades that come from our rich dudes. If there is a large error coming from one rich dude, then they are getting bad information from someone they trust too much or are not getting good information from someone they should trust more. When the mad scientist sees this, he uses the Trust 'o' Vac 2000 to weaken a strong trust by a little and strengthen a weak trust by a little. Thus, we try to slowly cut off the source of bad information and increase the source of good information going to the rich dudes.



We can take the ideas above and make them more mathematically formal
One should notice that while the feedforward network uses sigmoid activation functions for the non-linearity, when we propagate the error backwards, we use the derivative of the activation function. This way we adjust the weights at a rate that reflects the slope of the sigmoid. In the center of the sigmoid, more information is passed through the layer. As a result, we can assign more credit reliably from values that pass through the center. First recall the activation function used for each neuron: $$ g\left( a \right) \equiv \frac{1} {{1 + e^{ - a} }} $$ which has the very nice property that the derivative can be expressed simply as: $$ g'\left( a \right) = g\left( a \right)\left( {1 - g\left( a \right)} \right) $$

Next we need to know the error. While we can compute it in many different ways, it is most common to simply use the sum of squared error function: $$ E^n = \frac{1} {2}\sum\limits_{k = 1}^c {\left( {y_k - t_k } \right)^2 } $$ where $$y_k$$ is the output we got, while $$t_k$$ is what we wanted to get. Thus, we compute how different the network output was from what we wanted it to be. Additionally, in this example, we use a simple update based on the general error. This leads to a straightforward computation for $$\delta _k$$, which is simply: $$ \delta _k = g'\left( {a_k } \right)\left( {y_k  - t_k } \right) $$ where: $$ a_k = \sum\limits_{j = 0}^M {w_{kj}^{\left( 2 \right)} g\left( {\sum\limits_{i = 0}^d {w_{ij}^{\left( 1 \right)} x_i } } \right)} $$ If we decide to omit the sigmoid activation on the output layer, it is even simpler: $$ \delta _k = \left(y_k  - t_k \right) $$ In its general form, the adjustment of weights can be described as: $$ \Delta {\mathbf{w}}^{\left( \tau \right)}  =  - \eta \nabla E\left| {_{{\mathbf{w}}^{\left( \tau  \right)} } } \right. $$
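The following is a minimal sketch of one back propagation step for a two layer network, written to match the formulas above. The text only gives $$\delta _k$$ for the output layer. The hidden layer delta used here, $$\delta _j = g'\left( a_j \right)\sum\nolimits_k w_{kj} \delta _k$$, is the standard back propagation rule and is an addition for this example. Bias units are omitted, and all names and sample values are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def backprop_step(x, t, w1, w2, eta):
    """One gradient descent step; returns new weights and the error before the step."""
    # Forward pass: z are hidden outputs, y are network outputs.
    z = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = [sigmoid(sum(w * zj for w, zj in zip(row, z))) for row in w2]
    # Output deltas: g'(a_k) (y_k - t_k), using g'(a) = g(a) (1 - g(a)).
    delta_k = [yk * (1.0 - yk) * (yk - tk) for yk, tk in zip(y, t)]
    # Hidden deltas: the output error passed back through the second weights.
    delta_j = [
        zj * (1.0 - zj) * sum(w2[k][j] * delta_k[k] for k in range(len(w2)))
        for j, zj in enumerate(z)
    ]
    # Weight change is -eta times the gradient (delta times the incoming signal).
    new_w2 = [[w - eta * dk * zj for w, zj in zip(row, z)]
              for row, dk in zip(w2, delta_k)]
    new_w1 = [[w - eta * dj * xi for w, xi in zip(row, x)]
              for row, dj in zip(w1, delta_j)]
    error = 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))
    return new_w1, new_w2, error


# Repeatedly train on one made-up sample and record the error.
w1, w2 = [[0.1, -0.2], [0.3, 0.4]], [[0.2, -0.1]]
errors = []
for _ in range(200):
    w1, w2, err = backprop_step([1.0, 0.5], [1.0], w1, w2, eta=0.5)
    errors.append(err)
```

The recorded errors fall as the steps accumulate, which is the gradient descent behaviour described by the general weight adjustment formula.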

Momentum learning
We can speed this up greatly by introducing a momentum term, which effectively raises the learning rate by adding in a fraction of the previous step's weight adjustment. $$ \Delta {\mathbf{w}}^{\left( \tau \right)}  =  - \eta \nabla E\left| {_{{\mathbf{w}}^{\left( \tau  \right)} } } \right. + \mu \Delta {\mathbf{w}}^{\left( {\tau - 1} \right)} $$ Notice that when the gradient stays the same from step to step, the momentum term has its maximum effect, and we can write the update as: $$ \Delta {\mathbf{w}} = - \eta \nabla E\left\{ {1 + \mu  + \mu ^2  + ...} \right\} $$ which leads to: $$ \Delta {\mathbf{w}} = - \frac{\eta } {{1 - \mu }}\nabla E $$ Here $$\mu$$ is a momentum controlling constant between 0 and 1 (it must be less than 1 for the series to converge). This causes the effective learning rate to range as: $$ \eta \le learningRate \le \frac{\eta } {{1 - \mu }} $$ Additionally, there are many other advanced learning rules that can be used, such as conjugate gradient and quasi-Newton methods. These and other methods like them use heuristics to speed up learning, but have the downside of using more memory. Additionally, they make more assumptions about the topology of your sample space, which can be a drawback if your space has an odd shape.
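The geometric series argument can be checked numerically. The sketch below assumes a single weight and a gradient that stays constant from step to step, which is the case where momentum has its largest effect. The function name and the chosen values of $$\eta$$ and $$\mu$$ are illustrative only.

```python
def momentum_steps(gradient, eta, mu, n_steps):
    """Return the successive weight changes for a constant gradient."""
    delta_w, steps = 0.0, []
    for _ in range(n_steps):
        # New change = plain gradient step + a fraction of the previous change.
        delta_w = -eta * gradient + mu * delta_w
        steps.append(delta_w)
    return steps


eta, mu = 0.1, 0.9
steps = momentum_steps(gradient=1.0, eta=eta, mu=mu, n_steps=200)
first_step = steps[0]            # the plain step, -eta * gradient
limit_step = steps[-1]           # approaches -eta / (1 - mu) * gradient
predicted_limit = -eta / (1.0 - mu)
```

The first step equals the plain gradient step, and the later steps grow toward the predicted limit, so with $$\mu = 0.9$$ the effective learning rate ends up about ten times $$\eta$$.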

From this example observe that we can keep adding more and more layers. We need not stop with only two layers; we can add three, four or however many we want. $$ y_l = g''\left( {\sum\limits_{k = 0}^N {w_{lk}^{(3)} } g'\left( {\sum\limits_{j = 0}^M {w_{kj}^{(2)} } g\left( {\sum\limits_{i = 0}^d {w_{ji}^{(1)} } x_i } \right)} \right)} \right) $$ Here $$g'$$ and $$g''$$ denote the activation functions of the second and third layers, which may differ from $$g$$; they are not derivatives. In general, two layers are usually sufficient. Adding extra layers is only helpful if the topology of our sample space becomes more complex. Adding extra layers allows us to fold space more times.
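The nesting in the formula above is just the same operation applied once per layer, as the following sketch shows. It assumes each layer is given as a weight matrix together with its own activation function. Bias units are omitted, and the names and weights are made up for the example.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def identity(a):
    return a


def forward_layers(x, layers, activations):
    """Feed x through any number of layers; layers[n][j][i] connects unit i to unit j."""
    signal = x
    for weights, g in zip(layers, activations):
        # Weighted sum into each unit of this layer, then its activation.
        signal = [g(sum(w * s for w, s in zip(row, signal))) for row in weights]
    return signal


three_layer_output = forward_layers(
    [1.0, -1.0],
    layers=[
        [[0.0, 0.0], [0.0, 0.0]],  # layer 1: two hidden units, each outputs 0.5
        [[1.0, 1.0]],              # layer 2: sums them to 1.0
        [[2.0]],                   # layer 3: doubles the result
    ],
    activations=[sigmoid, identity, identity],
)
```

Adding a fourth layer only means appending one more weight matrix and activation to the two lists.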

Another way of looking at this
One can also look at a feedforward neural network trained with back propagation as a Simulink circuit diagram.

The Well Behaved Robot
This project has been used at the University of Southern California for teaching core concepts on Back Propagation Neural Network Training.

Your great uncle Otto recently passed away, leaving you his mansion in Transylvania. When you go to move in, the locals warn you about the werewolves and vampires that lurk in the area. They also mention that both vampires and werewolves like to play pool, which is alarming to you since your new mansion has a billiard room. Being a savvy computer scientist, you come up with a creative solution. You will buy a robot from Acme Robotics (of Walla Walla, Washington). You’re going to use the robot to guard your billiard room and make sure nothing supernatural finds its way there.

To train your robot you need to select a set of features which the robot can detect and which can also be used to tell the difference between humans, vampires and werewolves. Further, after having your nephew Scotty ruin one of your priceless antique hair dryers, which you keep in the billiard room, you decide that the robot should also detect children entering the room. After reading up on the nature of the undead and after taking careful measurements, you realize that the two best features for detection are how tall the person entering the room is and how hairy they are. This works because vampires are tall and completely bald, while werewolves are either short and totally covered in fur, or they are the mutant type that is extremely tall but no more hairy than a human. An adult human is taller than a child and slightly hairier. The chart below shows samples that you took to validate your hypothesis.

The next thing your robot will need to do, in addition to detecting what creatures enter your billiard room, is take an action that is appropriate for the situation. Since you want your robot to be polite, it will greet every human that enters the room. Additionally, if a child enters the room, when it greets the child it will scream so that you know to look at your closed circuit television and see what is happening in the room. When the robot detects a vampire, it will scream and impale it with a stake. Since robots are no match for werewolves, if the robot detects a werewolf, it will scream and then run away. Thus, your robot can take any of four actions: it can impale something entering the room, it can scream, it can run away and it can greet people. Any of these actions can be performed following the detection of anything entering the room. Your job is to train the robot so that it performs the correct actions whenever it detects something entering the room.

Part 1
Your first task is to train your robot. Take the training data marked train1.dat and plug it into bpt1.nsls located in 2layer. Compile the model and run it. After you train the model, test it with bpr.nsls. The output from testing can then be found in out.bin.dat or out.dec.dat. These are tab-delimited files with the test results. You will want to take this data and make a scatter plot which shows a map of how the robot will react when it observes different heights and different amounts of hairiness.

The way to interpret the output is as follows. There are four actions the robot can take. If the robot will take an action, the output for that action is a 1; if the robot will not take that action, the output is a 0. The four actions in order are Impale, Scream, Run Away and Greet. Thus, if the output is 0,1,0,1 then that means the robot will scream and greet.
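As a small illustration of this encoding, the sketch below turns one row of four outputs into action names. The helper name `decode_actions` and the 0.5 threshold (useful if the outputs are not exactly 0 or 1) are assumptions made for this example and are not part of the project files.

```python
ACTIONS = ("Impale", "Scream", "Run Away", "Greet")


def decode_actions(outputs, threshold=0.5):
    """Return the names of the actions whose output is switched on."""
    return [name for name, value in zip(ACTIONS, outputs) if value > threshold]


decoded = decode_actions([0, 1, 0, 1])
```

Here `decoded` holds Scream and Greet, matching the example output 0,1,0,1.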


 * 1) Take and plot the actions the robot takes over the space of possible inputs. Out.bin.dat and out.dec.dat contain the same information. However, out.bin.dat contains the binary coarse code for the output, while out.dec.dat contains the decimal equivalent. You can use either for creating the plot. The decimal version may be easier to use. It’s up to you. For the plot, make the x-axis the height of the visitor and the y-axis the amount of hair measured. Each point on the plot should show the robot’s action for that input. Note: you may create the plot with any method you choose, as long as it is neat and clear.
 * 2) Compare the plot of the robot’s test actions against the training data. Does the network do a good job of generalizing over the training data? Why or why not?
 * 3) Does the robot always behave as programmed or does it commit actions that do not fit the patterns for people, children, werewolves or vampires? Explain.
 * 4) Notice that it reacts to things entering the room as if they were vampires in two regions of space not visibly connected to the vampire training data. Why is that?

Part 2
Being an inquisitive lad or lass you decide you would like to find out how your robot would perform if you added a third layer to your back prop. Recall the equations from the NSL back prop lecture. Derive equations for a 3 layer back prop. This can be done by extending a two layer perceptron to three layers the same way a one-layer perceptron is extended to two layers. The figure below shows the schematic of the three-layer perceptron. Define: $$D_{wij}$$, $$D_{wjk}$$, $$D_{wkl}$$, $$d_j$$, $$d_k$$, $$d_l$$ and $$y_l$$ using the same notation from the NSL slides.
 * 1) Using figure 2 as a guide and your results from 3.a, extend the 2 layer back prop model to a three layer back prop model. Do this on the model in the folder 3layer. Some parts have been filled in already to help guide the process. When you are finished, run the model on the same testing and training data as question 2. Create a scatter plot in the same manner and compare the two.
 * 2) How good a job does the three layer network do on generalizing on the problem?
 * 3) How does it compare to the results from the two layer network?
 * 4) Does it do a better job? Why or why not?
 * 5) As it turns out, the evil Dr. Moriarty has created vampires that are similar in height and hairiness to adult humans. You have discovered his fiendish plan and must now train a new network. The figure below shows a scatter plot of the new training data. How fortunate for you that almost no one in Transylvania is the same height and hairiness as the mutant vampires, but which neural network should you use? Train both the two layer network and the three layer network on the new data. Create scatter plots for both results in the same manner as before.
 * 6) How well do the two networks perform on the new data?
 * 7) Which of the two networks performs better on generalization?
 * 8) Specifically, why does the one that performs better do so?
 * 9) Having analyzed the outcome of two networks on two different sets of data, list several pros and cons of using either network, and explain for each one why it is a pro or a con.

Part 3
The Royal Society of Devious Werewolves has figured out what method you use to train your robot. So before coming over to play pool, they all put on disguises. However, not being as bright as they think they are, they all wear the same disguise. Your faithful servant Igor, having infiltrated the Royal Society, can phone ahead to tell you what features the werewolves' disguises have so you can retrain your robot before they come over. The only problem is that werewolves drive Italian sports cars, and since they are not known to drive with caution, time is of the essence in training. You decide to augment the training error in your network to use momentum.
 * 1) Plug the extra momentum term into the training error. Train the two layer network on the first data set WolfData1.txt both with and without the momentum term. Give a printout of the network's error.
 * 2) Does the network train faster with the new momentum term? What is your intuition as to why this is or is not the case?

[[Image:Learning_and_Neural_Networks_Material_Chart3.png|center]]

Project Materials and Data

 * 1) Full Werewolf Data Files
 * 2) Augmented Back Propagation Model
 * 3) Neural Simulation Language Package

Additional Links

 * 1) Other Projects in Artificial Intelligence at Cool-ai.org
 * 2) The NSL (Neural Simulation Language) Home Page
 * 3) Old but Relevant CSCI564 Course Web Page - This is the course in which this project was first used

Mundhenk 07:21, 1 February 2007 (UTC)