Reinforcement learning refers to learning by receiving feedback, or reinforcement. It is a concept from psychology that has also been applied to computer learning, for example in the backgammon program Neurogammon.

Recent Advances in Reinforcement Learning in Neural Networks

As processing power continues to increase, neural network researchers are making progress in the mathematical modeling of reinforcement learning. Recent advances tackle problems such as meta-parameters, independent critics, task distribution, and the importance of memory storage in learning. The researchers behind these advances have also begun to draw theoretical links between the reinforcement learning that neural networks have been shown capable of and the biological learning systems found in living organisms.

In an attempt to verify a theory proposed in 1973 and cited in their study, Yamakawa and Okabe (1995) built a neural network trained by a neural critic. The critic was designed to recursively strengthen connections within the network in response to reinforcement signals: during the solution routine, it evaluated mistakes and modified the network’s parameters accordingly. Yamakawa and Okabe compared three types of neural critic. A nonadaptive critic served as a control; its criteria were static and did not change during learning, so it could only avoid past mistakes without evaluating them beyond a Boolean judgment (Yamakawa & Okabe, 1995, p. 368). The second type, a first-stage adaptive critic, applied one iteration of mistake evaluation while critiquing the system’s progress. Finally, a recursive adaptive critic used recursion and more elaborate summation to modify the system’s parameters. As Yamakawa and Okabe predicted, the recursive adaptive critic solved the maze task most efficiently.
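
To make the distinction between the three critic styles concrete, here is a minimal Python sketch on a toy chain-walk “maze.” The task, the update rules, and all names and constants are invented for illustration; this is not Yamakawa and Okabe’s actual architecture.

    import random

    # Toy "maze": a chain of states 0..N-1; action 0 steps left, action 1 steps
    # right, and reaching state N-1 is the goal. Each critic reacts to a failed
    # episode by weakening the "connections" (action preferences) it blames.
    N = 8

    def run_episode(prefs, critic, max_steps=50):
        state, trajectory = 0, []
        for _ in range(max_steps):
            # Pick the preferred action, breaking ties with a little noise.
            a = max((0, 1), key=lambda act: prefs[(state, act)] + 1e-3 * random.random())
            trajectory.append((state, a))
            state = max(0, state - 1) if a == 0 else state + 1
            if state == N - 1:
                return len(trajectory)        # success: number of steps taken
        critic(prefs, trajectory)             # failure: let the critic respond
        return None

    def nonadaptive_critic(prefs, trajectory):
        # Boolean judgment only: punish the final move, nothing more.
        s, a = trajectory[-1]
        prefs[(s, a)] -= 1.0

    def first_stage_critic(prefs, trajectory):
        # One iteration of evaluation: every move on the failed path shares blame.
        for s, a in trajectory:
            prefs[(s, a)] -= 0.5

    def recursive_critic(prefs, trajectory, decay=0.8):
        # Recursive evaluation: blame decays with distance from the failure.
        blame = 1.0
        for s, a in reversed(trajectory):
            prefs[(s, a)] -= blame
            blame *= decay

    for critic in (nonadaptive_critic, first_stage_critic, recursive_critic):
        random.seed(0)
        prefs = {(s, a): 0.0 for s in range(N) for a in (0, 1)}
        for episode in range(1, 501):
            if run_episode(prefs, critic) is not None:
                print(critic.__name__, "first solved the maze on episode", episode)
                break
        else:
            print(critic.__name__, "did not solve the maze in 500 episodes")

The recursive critic spreads blame along the failed path with decaying weight rather than reacting to the last move alone, which is the intuition behind its greater efficiency.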

Extending this critic model, the researchers proposed that the human brain uses a similar critic in reinforcement learning: “The adaptive recursive critic can be converted into a conventional neural network model, so we can compare this critic with the part of the brain that controls the sense of values (maybe around the limbic system, especially the amygdala)” (Yamakawa & Okabe, 1995, p. 373). In this way, Yamakawa and Okabe used computerized neural networks to test the mathematical power of a theory of how the brain learns via reinforcement.

Similar to the concept of a critic, meta-learning involves using meta-parameters to shape the parameters of a neural network. Schweighofer and Doya (2003) suggested a method of reinforcement learning designed to solve a Markov decision problem, a mathematical task in which an agent must deduce an optimal course of action from state-dependent rewards. The learning system in their study involves three levels: the neural network itself, the parameters of the network, and the meta-parameters that govern those parameters. Meta-learning involves evolving these higher-level parameters over the course of solving a problem, and the researchers proposed an equation to govern how the meta-parameters evolved during the task. In every case, “the algorithm did not only find appropriate values of the meta-parameters, but also controlled the time course of these meta-parameters in a dynamic, adaptive manner” (Schweighofer & Doya, 2003, p. 7). The “time course” was set by one of the static meta-meta-parameters, T, which Schweighofer and Doya suggest is a genetic constant within an organism, although it differs between organisms: “As the algorithm is extremely robust regarding the T meta-meta-parameter, we can assume that its value is genetically determined, and perhaps related to the wake-sleep cycle of the animal” (Schweighofer & Doya, 2003, p. 8).
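
For readers unfamiliar with the formalism, the following sketch shows an ordinary tabular learner on an invented two-state Markov decision problem, with the usual meta-parameters (learning rate alpha, exploration sharpness beta, and discount factor gamma) exposed as explicit knobs. It illustrates only what such meta-parameters govern, not Schweighofer and Doya’s model.

    import math, random

    # A tiny, invented Markov decision problem: two states, two actions, with
    # fixed (but initially unknown) rewards and transitions.
    REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}
    NEXT = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}

    def choose(q, state, beta):
        # Boltzmann exploration: beta (a meta-parameter) sets how greedy we are.
        weights = [math.exp(beta * q[(state, a)]) for a in (0, 1)]
        return random.choices((0, 1), weights=weights)[0]

    def q_learning(alpha=0.1, beta=2.0, gamma=0.9, steps=5000):
        q = {sa: 0.0 for sa in REWARD}
        state, total = 0, 0.0
        for _ in range(steps):
            a = choose(q, state, beta)
            r, nxt = REWARD[(state, a)], NEXT[(state, a)]
            # Temporal-difference update; alpha and gamma are meta-parameters too.
            td = r + gamma * max(q[(nxt, 0)], q[(nxt, 1)]) - q[(state, a)]
            q[(state, a)] += alpha * td
            state, total = nxt, total + r
        return total / steps

    random.seed(0)
    print("average reward per step:", round(q_learning(), 3))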

The system as a whole, however, proved more sensitive to variation in the other meta-meta-parameter, A. Schweighofer and Doya leave open the possibility that a meta-meta-algorithm would be needed to control this variable: “As the value of the A parameter is more sensitive, it is not impossible that a meta-meta-learning algorithm operates to tune it” (Schweighofer & Doya, 2003, p. 8). The study used various randomly seeded values for A and T in order to show that the algorithm would work under the kind of variation genetic differences might produce.
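
One plausible realization of these dynamics, loosely in the spirit of the running-average scheme the paper describes, is sketched below: the exploration meta-parameter beta is nudged in whichever direction correlates with rising reward, with A scaling the nudge and T setting the time scale of the reward averages. The bandit task, constants, and update details here are all assumptions made for illustration, not the authors’ exact algorithm.

    import math, random

    # Invented two-armed bandit: arm 1 pays off more often, so a larger beta
    # (more exploitation) eventually earns more reward.
    def pull(arm):
        return 1.0 if random.random() < (0.4, 0.8)[arm] else 0.0

    random.seed(1)
    q = [0.0, 0.0]                 # ordinary action values
    beta, beta_avg = 1.0, 1.0      # exploration meta-parameter and its slow average
    r_short, r_long = 0.0, 0.0     # reward averages on a fast and a slow time scale
    A, T = 0.05, 200.0             # meta-meta-parameters: step size and time scale

    for step in range(20000):
        # Perturb beta slightly so its correlation with reward can be observed.
        noisy_beta = beta + random.gauss(0.0, 0.1)
        weights = [math.exp(noisy_beta * v) for v in q]
        arm = random.choices((0, 1), weights=weights)[0]
        r = pull(arm)
        q[arm] += 0.1 * (r - q[arm])          # ordinary value learning

        # Running reward averages; T sets the time scale of the comparison.
        r_short += (r - r_short) / T
        r_long += (r_short - r_long) / (10 * T)
        beta_avg += (noisy_beta - beta_avg) / (10 * T)

        # Meta-learning step: if recent reward beats the long-term baseline while
        # beta sits above its own average, push beta further in that direction.
        beta += A * (r_short - r_long) * (noisy_beta - beta_avg)
        beta = min(max(beta, 0.1), 10.0)      # keep beta in a sensible range

    print("learned beta:", round(beta, 2), " recent reward rate:", round(r_short, 2))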


Beyond these learning techniques, neural networks face a memory problem during reinforcement learning. Networks designed to simulate reinforcement learning behavior often suffer from path interference: the inability of a network to remember, or store, previously learned relationships between its inputs and outputs. In their study, Bosman, van Leeuwen, and Wemmenhove (2003) incorporated a memory function into a neural network in order to take advantage of previously learned input-output relationships. Their combination of a memory system with a reinforcement learning algorithm picks up where an earlier study left off. Bak and Chialvo (2001) used a neural network to solve a reinforcement task; however, their model fails to work with a small number of neurons because of the path interference problem. “As the number of neurons in the hidden layer decreases, learning, at a certain moment, becomes impossible: path interference is the phenomenon which causes this effect” (Bosman et al., 2003, p. 3). Bosman and colleagues overcome this problem with the addition of a memory mechanism, which “has a positive influence on the learning time of the neural net” (Bosman et al., 2003, p. 3).
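
The following sketch illustrates the flavor of such a model: activity follows the strongest synapses, mistakes depress the synapses that just fired (in the spirit of Bak and Chialvo’s extremal dynamics), and an optional memory simply caches and protects each input’s last successful path. The memory mechanism, network sizes, and task here are invented stand-ins for the papers’ actual models.

    import random

    N_IN, N_HID, N_OUT = 4, 4, 4          # small net; paths must share few hidden units
    TARGET = {i: i for i in range(N_IN)}  # input i should activate output i

    def train(use_memory, max_rounds=5000):
        w1 = [[random.random() for _ in range(N_HID)] for _ in range(N_IN)]
        w2 = [[random.random() for _ in range(N_OUT)] for _ in range(N_HID)]

        def forward(i):
            # Extremal dynamics: activity follows the single strongest synapse.
            h = max(range(N_HID), key=lambda j: w1[i][j])
            o = max(range(N_OUT), key=lambda k: w2[h][k])
            return h, o

        memory = {}                        # input -> last known successful path
        for rounds in range(1, max_rounds + 1):
            if all(forward(i)[1] == TARGET[i] for i in range(N_IN)):
                return rounds
            for i in range(N_IN):
                if use_memory and i in memory:
                    h, o = memory[i]       # protect the stored path from interference
                    w1[i][h] = max(w1[i][h], 1.0)
                    w2[h][o] = max(w2[h][o], 1.0)
                h, o = forward(i)
                if o != TARGET[i]:
                    w1[i][h] -= random.random()   # learn from mistakes only:
                    w2[h][o] -= random.random()   # depress the synapses that fired
                elif use_memory:
                    memory[i] = (h, o)
        return None

    for use_memory in (False, True):
        random.seed(3)
        rounds = train(use_memory)
        label = "with memory" if use_memory else "without memory"
        if rounds:
            print(label + ": all associations learned in", rounds, "rounds")
        else:
            print(label + ": unsolved after 5000 rounds")

Without the memory, fixing one input’s path can punish synapses another input relies on; the cached paths keep earlier associations from being unlearned while new ones are sought.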

Proposing that memory feedback is crucial for learning, the researchers suggest this combination of memory and learning “might be biologically realizable. Without the addition of any feedback-signal, learning of prescribed input-output relations—whether in reality or in a model—is, of course, impossible” (Bosman et al., 2003, p. 1). They predict that biological systems would use a memory store to enhance reinforcement learning.

Although processing speed continues to improve, it remains advantageous to distribute a neural network task across several modules. A problem arises, however, in dividing a task into modules responsible for their own sub-tasks and then recombining those sub-tasks into “the composite policy for the entire task” (Samejima et al., 2003, p. 1). Samejima, Doya, and Kawato (2003) proposed a technique for distributing the reinforcement reward to the various modules using inter-module credit assignment. A neural network that is not subdivided into modules receives a single reinforcement reward for each iteration. When a task is subdivided into sub-tasks, the sub-tasks must also respond to the reinforcement reward, but the reward must be summarized and tailored for each sub-task. As the study points out, “it is necessary to design appropriate ‘pseudo rewards’ for sub-tasks” (Samejima et al., 2003, p. 1).
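
As a concrete illustration of the pseudo-reward idea, the sketch below splits one global reward between two modules in proportion to a responsibility signal measuring how well each module’s predictor explains the latest state transition. The task, the Gaussian responsibility rule, and all constants are invented for this example; the MMRL architecture in the paper is considerably richer.

    import math, random

    # Invented two-regime task: the environment drifts downward in regime 0 and
    # upward in regime 1, and only regime 1 pays any reward. Each module owns a
    # drift predictor and a running estimate of the reward it is responsible for.
    random.seed(4)
    predictors = [-1.0, 1.0]       # module i expects a drift of predictors[i]
    values = [0.0, 0.0]            # per-module reward estimates
    alpha, sigma = 0.1, 0.5

    state = 0.0
    for step in range(2000):
        regime = (step // 100) % 2             # the regime switches every 100 steps
        drift = -1.0 if regime == 0 else 1.0
        next_state = state + drift + random.gauss(0.0, 0.1)
        reward = 1.0 if regime == 1 else 0.0

        # Responsibility: normalized likelihood of each module's prediction.
        errors = [next_state - (state + p) for p in predictors]
        likes = [math.exp(-e * e / (2 * sigma ** 2)) for e in errors]
        total = sum(likes)
        resp = [l / total for l in likes]

        # Credit assignment: the global reward is split into per-module
        # "pseudo rewards" in proportion to responsibility.
        for i in range(2):
            values[i] += alpha * resp[i] * (reward - values[i])
        state = next_state

    print("per-module reward estimates:", [round(v, 2) for v in values])

Run as written, the upward-drift module ends up owning nearly all of the reward while the other module’s estimate stays near zero: each module learns only from the share of reinforcement it was responsible for.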

The study demonstrated the efficiency of this task distribution technique on two reinforcement learning tasks. First, the researchers designed a target-pursuit task divided into four sub-tasks, each subject to its own pseudo reward during the reinforcement stage. The modular reward system outperformed a simpler weighted reward system: “The MMRL (multiple-model based reinforcement learning) with modular reward achieved near-optimal policy faster than the MMRL with weighted total TD (temporal difference) error” (Samejima et al., 2003, p. 6). The researchers also devised a pendulum swing-up task, in which a neural network had to find the best way to swing a pendulum given an arbitrarily assigned torque value. Results showed similar success for the inter-module credit system: “We can see that the value was more effectively propagated with the backing-up modular reward equation, which enabled faster learning” (Samejima et al., 2003, p. 8).

The researchers thus implemented a credit-based system for distributing reinforcement across sub-tasks, one that allows a neural network task to be divided into modules without sacrificing the potency and accuracy of the reinforcement reward: “We introduced a new concept of modular reward, which enables the learning of modular policies directed toward the optimization of an entire task” (Samejima et al., 2003, p. 8). Together, these advances in network distribution, meta-learning, higher-level critiquing, and memory mechanisms for reinforcement learning represent major steps toward understanding how animal brains might learn via reinforcement.

References


Bak, P., & Chialvo, D. R. (2001). Adaptive learning by extremal dynamics and negative feedback. Physical Review E, 63, 031912.

Bosman, R., van Leeuwen, W., & Wemmenhove, B. (2003, November). Combining Hebbian and reinforcement learning in a minibrain model. Neural Networks, xx, 1-8 (article in press).

Samejima, K., Doya, K., & Kawato, M. (2003, November). Inter-module credit assignment in modular reinforcement learning. Neural Networks, xx, 1-10 (article in press).

Schweighofer, N., & Doya, K. (2003). Meta-learning in reinforcement learning. Neural Networks, 16, 5-9.

Yamakawa, H., & Okabe, Y. (1995). A neural network-like critic for reinforcement learning. Neural Networks, 8, 363-373.
