for backpropagation is very easy to implement in practice, and only moderately more difficult to understand. The main goal is to gradually adjust the network to produce the correct result when given a particular input. If we are training a neural network
to perform optical character recognition
, for instance, we want to train the network to recognize an H when presented with one. This is accomplished through an iterative process by which we alter the weights
between the various nodes until the network
as a whole performs correctly.
As jean-yves rightly pointed out in his writeup about neural networks, the actual theory behind how to train the networks was a difficult task. The problem is that once we know the amount of error in the network's output, how do we divide that error amongst all the nodes? Every node is responsible for a bit of that error, but it isn't clear exactly how much.* As it turns out, the answer is to be found through the magic of a little bit of calculus and a little bit of cleverness.
The essential framework of the algorithm in broad strokes is:
- Choose random weights for the network
- Feed in an example and obtain a result
- Calculate the error for each node (starting from the last node and propagating the error backwards)
- Update the weights
- Lather, rinse, repeat with other examples until the network converges on the target output
First let's look at the process of calculating each node's error (Step 3). I am assuming you already have constructed a neural network and have chosen your training data. Run it with your first example and obtain the result.
For the purposes of this writeup,
- Δi denotes the value of a node's error, where i is the particular node
- g(z) denotes the activation function of the network
- g'(z) is the derivative of the activation function
- ai is the activation of node i, in other words, the value of its output when it fires
- ini is the sum of the weighted input coming into node i. Specifically, ini= Σjajwji where wji is the weight from node j to i
The first thing we must do is calculate the error for the output node, the final node in the network which ultimately gives us the network's output
. The formula for this is:
Δi = g'(ini)(T-ai)
Basically, the error for the output node is the difference of its output ai from the target output T, multiplied by the derivative of the activation function. We know what the target output is because the examples used to train network are labelled -- this allows us to determine how far off it is from the correct answer. The derivative is used here in order to solve the problem mentioned earlier of assigning the correct amount of error to this particular node. As it turns out, the derivative is an accurate way to determine how much of a role each node plays in the network.
Now that we know the error for the final node in the network, we can now propagate that error backwards to determine all of the other nodes' error. For reference, these nodes are also called the Hidden Units. The formula for these nodes is:
Δi = g'(ini)ΣkwikΔk
It's not as scary as it sounds! Let's look at that scary sigma
part first. K ranges over the nodes connected to the outpu
t of node i, so these are all of the nodes that come after our node. wik
signifies the weight from node i to node k, and Δk
we already know -- it's the amount of error for the nodes that are one layer ahead of node i. Remember, we know this because we are working backwards through the network. So basically we take the weight from node i to k and multiply it by k's delta
. Then we do the same with each node connected out from i, and sum the total. Now we calculate g' for node i, based on all of the weighted inputs from the nodes that feed into i. Voila! Now you can repeat this step with all of the nodes one layer in front of i and so on until you have gone through the whole network.
Now that we have calculated the amount of error (Δ) for each node, we can update the weights between the nodes (Step 4). This is our ultimate goal because it is the weights which determine how the network performs. The formula for this is now trivial:
wji = wji + ηajΔi
The only part of this equation that we haven't talked about is η (eta) which is the learning rate for the network. When we initially construct the network, we choose a learning rate
constant to control the amount of change that occurs in each iteration
of the backpropagation. The learning rate is chosen from the range 0 < η < 1 -- a higher learning rate means greater change each cycle. In any case, the learning rate is just a coefficient
, so the equation above says that the weight from j to i equals the weight's current value plus j's activation multiplied by Δi
(calculated earlier) and η (constant).
So again, the basic steps of error propagation are to determine the error for the output node using the first formula, then work backwards through the network, caculating each node's error using formula 2. Finally we update the weights based on the error, using the third formula.
This is technically known as one epoch of the training process. Once this has occurred we simply supply another example to the network, calulate the error, and update the weights and repeat until the network has converged on the target output, meaning that it responds correctly to given input.
*A poor example of this might be the citizens of a country
. If the country is experiencing a recession
, every single person is somehow a cause of the downturn, simply by being a part (however small) of the economy
. Certain individuals may play larger roles (a bank going bankrupt
) than others, but everyone is ultimately implicated. In the same way, each node contributes to the overall success of the network