Initial Programming Exercises On Simple Neural Networks
This page gives some programming exercises to start learning to program neural networks. Like the rest of these pages on "learning paths", this page heavily relies on other pages of the web and is mainly meant as guidance: the student should be able to study and learn by him/herself on the basis of the indications given here, internet pages, and, when needed, some additional instructions on things not covered by these web pages.
The steps to follow are these, expanded in the exercises below:
- Read the following web-page of this website to learn the programming concepts that you need to learn to work with LOCEN (and in general to be able to build and work with computational models of brain and behaviour, or artificial intelligence/robotic systems): web-page on programming concepts to learn
- Depending on the programming language that you have to learn (agreed with your LOCEN supervisor), refer to suitable websites teaching such programming language. For the different programming languages, such websites are indicated in the web pages linked from this web page under ''Programming and simulations --> Specific languages".
- Install an IDE (Integrated Development Environment) on your portable computer (or the computer at LOCEN) to program in the selected language; indications on this are given in the web pages dedicated to the specific languages and mentioned in the previous point
- Understand that there are various learning paradigms related to neural networks (see below for more details): this page gives some exercises related to supervised learning and reinforcement learning
- Do the exercises illustrated in detail below, in particular involving:
- A first simple perceptron neural network able to learn the AND and OR functions through a supervised learning algorithm
- The exercise of the previous point, but using linear algebra data structures and operations of the language you use (or libraries expanding the language)
- Implementation, based on linear algebra, of a multilayer perceptron able to learn the difficult X-OR function
- Implementation of two leaky neurons
- Implementation of an echo-state network
- Implementation of a reinforcement learning model based on Q-Learning
- Implementation of a reinforcement learning model based on an actor-critic (still under construction)
Understand that there are various learning paradigms related to neural networks
Understand that there are various learning paradigms (classes of algorithms) through which neural networks can learn. The major ones are these:
- Supervised learning. Two very important sub-classes:
- Linear regression
- Unsupervised learning
- Reinforcement learning
Search the internet for further information and understand the differences between these learning paradigms.
Also, ask your LOCEN supervisor for the book by Floreano on neural networks: an old book, but one of the best for learning the basics of neural networks.
Firing-rate neural-network models
Exercise to program a first simple perceptron neural network able to learn the AND and OR functions through a supervised learning algorithm
This section gives a programming exercise on a simple neural network and some indications on how to accomplish it.
The goal of the exercise is to program a one-layer perceptron that is able to learn, in a supervised fashion, to produce the logical AND and OR functions. In particular the network should respond with the following 'desired output' if given the following 'input' examples:
OR function:
Input1 Input2 Bias Output
0 0 1 0
0 1 1 1
1 0 1 1
1 1 1 1
AND function:
Input1 Input2 Bias Output
0 0 1 0
0 1 1 0
1 0 1 0
1 1 1 1
To better understand what the neural network should do, see introductory web pages on the perceptron.
There are various versions of such a simple perceptron. You should implement one where the output unit uses a sigmoid transfer function, and the network is considered to output 1 if the activation of the output unit is > 0.9, and to output 0 if the activation of the output unit is < 0.1. This leads you to create a more general neural network, e.g. useful when you pass to multilayer perceptrons (exercise below); plus, it is more correct/general from a machine learning point of view.
Your network should have 3 input units (the first being a 'bias') and one output unit.
The network should learn online (i.e., after the presentation of each example of the training set) not batch (i.e., for all examples together).
Note that neural networks (usually) learn progressively, rather than by 'one shot', by being exposed to the same training examples several times. This affects the main program structure indicated below.
Your program should be structured as follows:
- Creation and initialisation of data structures
- Creation of data structures and data for training the neural network
- Loop for multiple epochs (one epoch is one learning pass of the whole data-set):
-- Loop for the examples of the data-set:
--- Network spreading
--- Network learning
--- Collection of data for the following data analysis
- Data analysis: plot of the learning curve of the neural network
- Data analysis: numerical output of network performance (e.g., desired and actual net output, errors, etc.)
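As a reference, the structure above can be sketched in Python (the language here is only an illustration; use the language agreed with your supervisor). The learning rate, number of epochs, and variable names are illustrative assumptions:

```python
import math
import random

random.seed(0)

# Training set for the OR function: [bias, input1, input2] -> desired output
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]

weights = [random.uniform(-0.1, 0.1) for _ in range(3)]
learning_rate = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for epoch in range(10000):                # loop over epochs
    for inputs, desired in data:          # loop over the examples (online learning)
        # Network spreading: weighted sum of the inputs through the sigmoid
        activation = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        # Network learning: delta rule, including the derivative of the sigmoid
        error = desired - activation
        for i in range(3):
            weights[i] += learning_rate * error * activation * (1 - activation) * inputs[i]

# Data analysis: the net answers 1 if activation > 0.9, 0 if activation < 0.1
for inputs, desired in data:
    activation = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    print(inputs, desired, round(activation, 3))
```

For the AND function, only the `data` list changes; the data-collection and plotting parts of the exercise are left out of the sketch.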
Same exercise as the previous one, but using linear algebra data structures and operations of the chosen language
Do the exercise above but:
- Use the programming language-dependent data structures to implement vectors (e.g., for input and output) and matrices (e.g., for connection weights)
- Use the programming language-dependent linear algebra operators; e.g., to implement the whole spreading of activation of the network, you should have something like:
vOutp = mConnWeig * vInpu
- Implement a neural network that performs the AND and the OR functions at the same time. The network will hence have 3 input units and 2 output units.
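With linear algebra structures the exercise becomes very compact. A sketch assuming Python with NumPy (the library choice, the hyper-parameters, and the two-output AND+OR network layout are illustrative), using the naming convention of this page:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows: [bias, input1, input2]; desired outputs: [OR, AND]
mInpu = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
mDesi = np.array([[0., 0.], [1., 0.], [1., 0.], [1., 1.]])

mConnWeig = rng.uniform(-0.1, 0.1, size=(2, 3))  # 2 output units x 3 input units
sfLearRate = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(20000):
    for vInpu, vDesi in zip(mInpu, mDesi):
        vOutp = sigmoid(mConnWeig @ vInpu)   # whole spreading in one operation
        vErro = vDesi - vOutp
        # Delta rule as an outer product: one matrix update per example
        mConnWeig += sfLearRate * np.outer(vErro * vOutp * (1 - vOutp), vInpu)

print(np.round(sigmoid(mInpu @ mConnWeig.T), 2))
```

Note how the whole spreading is the single operation `mConnWeig @ vInpu`, the matrix counterpart of `vOutp = mConnWeig * vInpu` above.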
Exercise to implement, using linear algebra programming, a multilayer perceptron learning the X-OR function
Implement a neural network that performs the X-OR function below. For this purpose, you will need a multilayer perceptron with 3 input units, 1 output unit, and a variable number of hidden units (initially, implement a network with 1 hidden layer only).
Input1 Input2 Bias Output
0 0 1 1
0 1 1 0
1 0 1 0
1 1 1 1
Parity check function
This is a generalisation of the X-OR function, where the output is 0 if the input pattern has an odd number of digits equal to 1, and 1 if such number is even. So, for example, for 3 inputs you have:
Input1 Input2 Input3 Bias Output
0 0 0 1 1
0 0 1 1 0
0 1 0 1 0
0 1 1 1 1
1 0 0 1 0
1 0 1 1 1
1 1 0 1 1
1 1 1 1 0
The new network to solve either task has a different architecture with respect to the perceptron. In particular, it has a second layer of two units, called "hidden units", between the input layer and the output layer. So the input units and the bias fire on the hidden units, and then the hidden units (and the bias) fire on the output units.
The learning algorithm for the weights between the hidden and the output units is the same as for the simple perceptron. A second algorithm, called the "error backpropagation" algorithm, is needed to modify the weights between the input and the hidden units. Search for the formulas of the error backpropagation algorithm on the internet. Make sure that you understand the algorithm very well, as it is not very simple.
Change your program so that after learning it plots:
- The curve of the Square Error of the output with respect to the desired output during learning (put "epochs" at the x-axis, i.e. plot the error averaged over all the input patterns)
- The two 2D weights matrices (from input to hidden; from hidden to output) of your multilayer perceptron: this plot represents every value of a connection weight as a coloured square, with each colour representing the size of the weight, from most negative values to largest positive values.
- Also, plot a similar graph where each row shows: the activation of one input pattern, the related activation of the hidden units, and the activation of the output unit(s); the different rows correspond to the different patterns.
Further, change the program so that at the end of the simulation all the data needed to make the previous graphs are saved into separate .txt files, and the graphs are then plotted by a second program.
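The multilayer perceptron with error backpropagation can be sketched in Python with NumPy (an assumption; the hidden-layer size, learning rate, and restart-on-failure strategy are illustrative choices). It uses the X-OR table given above; note that plain gradient descent on this task can occasionally get stuck in a local minimum, hence the reinitialisation loop:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# X-OR table of this page: [bias, input1, input2] -> desired output
mInpu = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
vDesi = np.array([1., 0., 0., 1.])

def forward(mWeigInpuHidd, vWeigHiddOutp, vInpu):
    vHidd = sigmoid(mWeigInpuHidd @ vInpu)
    vHiddBias = np.append(1.0, vHidd)     # the bias also fires on the output unit
    return vHidd, vHiddBias, sigmoid(vWeigHiddOutp @ vHiddBias)

def train(seed, nHidd=2, sfLearRate=0.5, nEpochs=10000):
    rng = np.random.default_rng(seed)
    mWeigInpuHidd = rng.uniform(-1, 1, size=(nHidd, 3))
    vWeigHiddOutp = rng.uniform(-1, 1, size=nHidd + 1)
    for epoch in range(nEpochs):
        for vInpu, sfDesi in zip(mInpu, vDesi):
            vHidd, vHiddBias, sfOutp = forward(mWeigInpuHidd, vWeigHiddOutp, vInpu)
            # Output-layer delta: same rule as the simple perceptron
            sfDeltaOutp = (sfDesi - sfOutp) * sfOutp * (1 - sfOutp)
            # Backpropagated deltas of the hidden units
            vDeltaHidd = sfDeltaOutp * vWeigHiddOutp[1:] * vHidd * (1 - vHidd)
            vWeigHiddOutp += sfLearRate * sfDeltaOutp * vHiddBias
            mWeigInpuHidd += sfLearRate * np.outer(vDeltaHidd, vInpu)
    return mWeigInpuHidd, vWeigHiddOutp

# Reinitialise with a new seed if training got stuck in a local minimum
for seed in range(10):
    mWeigInpuHidd, vWeigHiddOutp = train(seed)
    vOutp = np.array([forward(mWeigInpuHidd, vWeigHiddOutp, v)[2] for v in mInpu])
    solved = bool(np.all(np.where(vDesi == 1, vOutp > 0.9, vOutp < 0.1)))
    if solved:
        break
```

The two weight matrices returned by `train` are the ones to visualise in the coloured-square plots described above.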
Leaky-neuron neural-network models
What are leaky neurons and how to simulate them
"Leaky neurons" are different from the "firing rate" neurons used so far. In particular, these are neurons that store information in time, e.g. they have an activation that accumulates, with a fixed input, or progressively fades away, with a zero input. They indeed behave like a "container of water with a leak": they slowly fill in when you pour water in, and progressively get empty when you stop.
First, see the part of the slides related to the leaky neuron and the way to simulate it based on Euler approximation.
Use those slides to understand how leaky units work:
--> leaky element for computing the activation potential of the neural unit
--> tanh function, truncated to positive values, for computing the activation of the neural units
Also, understand from the slides (and internet) how to implement the leaky units (or any other dynamical system) in a computer program through the "Euler approximation". Euler approximation is necessary as now we will simulate time more in detail as a continuous flow. So, a neural network will be a dynamic system that works in continuous time and the Euler approximation allows the approximated simulation of its dynamics in the discrete time-steps of a computer program.
To suitably take continuous time into account, establish these critical variables in the simulations:
- sfSimuTimeDura: duration of the whole simulation, in seconds, e.g. set to 10 seconds.
- sfDeltaTime: the duration of the time step used to approximate continuous time, for example, set to 0.01 (seconds). For a certain duration of the simulation, the smaller the time step, the more accurate the approximation of the dynamics of the system. Each cycle of the simulation will now correspond to one sfDeltaTime.
- sfSimuStepDura: duration of the simulation in time steps, set equal to: sfSimuTimeDura/sfDeltaTime
Another very important thing to do in the implementations. With continuous time, causation between phenomena can happen only in different time steps (from t-1 to t). So, for any variable that affects another variable, e.g. a neuron that affects another neuron (or itself in the future), you have to do this:
- Store the "old value of the variable" and the "current value of the variable"
- At the beginning of the simulation cycle representing the elapsing of time (in sfDeltaTime steps), you have to save the old value of all the variables affecting other variables at the next time step, as in this example of three units:
sfUnit1Old = sfUnit1
sfUnit2Old = sfUnit2
sfUnit3Old = sfUnit3
- Then, for any causal dependency between the variables you have to consider the old values of the variables, for example:
sfUnit1 = sfUnit1Old + (sfDeltaTime/Tau) * (-sfUnit1Old + sfWeig2to1 * sfUnit2Old + sfWeig3to1 * sfUnit3Old)
sfUnit2 = sfUnit2Old + (sfDeltaTime/Tau) * (-sfUnit2Old + sfWeig1to2 * sfUnit1Old + sfWeig3to2 * sfUnit3Old)
sfUnit3 = sfUnit3Old + (sfDeltaTime/Tau) * (-sfUnit3Old + sfWeig1to3 * sfUnit1Old + sfWeig2to3 * sfUnit2Old)
Notes on good practices to organise the code of the simulations:
- Note that it is a good practice to represent the state of the system (e.g., the neurons' activation, activation potential, old activation, old activation potential, etc.) with variables that do not take time into consideration. For example, it is useful to represent the activation of a neuron with a variable sfUnit rather than with a vector vfUnit that encodes the neuron activation in time. The advantage of doing this is that the core of the simulation remains very clean and compact. Instead, use other vectors or matrices to collect the data of interest during the simulation (e.g., how a neuron activates during the simulation), to perform data analysis and plotting of results. An exception to this is for simulations where one wants to do batch learning, which is rare in this context but might happen (e.g., for the echo-state network discussed below).
- The example just reported uses single variables to represent the different neurons forming the model; a more compact and better way is to use a vector to represent them (still not considering time, otherwise one would have a matrix).
Based on these indications, do the exercises below.
How one leaky neuron responds to an input
Make a simulation that lasts for 20 seconds in total and involves a network formed by 1 leaky unit.
The input to the network during the simulation is as follows:
0 for 1 second
1 for 1 second
0 for 2 seconds
1 for 2 seconds
0 for 3 seconds
1 for 3 seconds
0 for 4 seconds
1 for 4 seconds
Simulate how the leaky neuron responds to that input. Plot a graph in time showing the input and the neuron activation. Try to change the tau of the neuron to see what happens. Try to change sfDeltaTime and see what changes.
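A possible sketch of this simulation in Python with NumPy (an assumption; the tau value is illustrative, and matplotlib can be used for the final graph):

```python
import numpy as np

sfSimuTimeDura = 20.0
sfDeltaTime = 0.01
sfSimuStepDura = int(sfSimuTimeDura / sfDeltaTime)
sfTau = 0.5   # illustrative tau, in seconds

# Input signal: alternating 0/1 blocks lasting 1, 1, 2, 2, 3, 3, 4, 4 seconds
vfInpu = np.concatenate([np.full(int(sfDura / sfDeltaTime), sfValu)
                         for sfDura, sfValu in
                         [(1, 0), (1, 1), (2, 0), (2, 1),
                          (3, 0), (3, 1), (4, 0), (4, 1)]])

sfUnit = 0.0
vfHistActi = np.zeros(sfSimuStepDura)   # data collection for later plotting
for t in range(sfSimuStepDura):
    sfUnitOld = sfUnit                  # causation goes from t-1 to t
    # Euler integration of the leaky equation: tau * da/dt = -a + input
    sfUnit = sfUnitOld + (sfDeltaTime / sfTau) * (-sfUnitOld + vfInpu[t])
    vfHistActi[t] = sfUnit

# e.g., plot vfInpu and vfHistActi against np.arange(sfSimuStepDura) * sfDeltaTime
```

Changing sfTau stretches or compresses the charge/discharge curves; changing sfDeltaTime only changes the accuracy of the Euler approximation.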
A neural network of two leaky units representing an onset circuit
Make a simulation that lasts for 25 seconds in total and involves a network formed by 2 leaky units.
The input to the network during the simulation is as follows:
0 for 5 seconds
0.5 for 5 seconds
1 for 5 seconds
0.5 for 5 seconds
0 for 5 seconds
The architecture of the network is as follows:
- An input unit (sUnit1) is formed by a first leaky neuron, and an input/output unit (sUnit2) is formed by a second leaky neuron.
- Both sUnit1 and sUnit2 take the (external) input
- sUnit1 is connected to sUnit2 with a connection weight of -1
- sUnit2 activation gives the output of the network
Plot both the input and the activation of sUnit1 and sUnit2 in time.
Try to set the tau values of the two units (initially set them to 1) with this objective: sUnit2 should show a behaviour detecting the "derivative" of the signal in time.
Set sfDeltaTime initially to 0.01. Then set it to different values. You will see that these values only affect the accuracy of the approximation of the model's dynamics, not the overall dynamics themselves.
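A sketch of the onset circuit in Python with NumPy (an assumption); the tau values shown (slow sUnit1, fast sUnit2) are an illustrative choice that produces the derivative-like behaviour:

```python
import numpy as np

sfDeltaTime = 0.01
sfTau1, sfTau2 = 1.0, 0.1   # illustrative taus: slow inhibitory unit, fast output unit
nSteps = int(25.0 / sfDeltaTime)

# Input: 0, 0.5, 1, 0.5, 0 for 5 seconds each
vfInpu = np.repeat([0.0, 0.5, 1.0, 0.5, 0.0], int(5.0 / sfDeltaTime))

sfUnit1 = sfUnit2 = 0.0
mfHist = np.zeros((nSteps, 2))   # data collection for plotting
for t in range(nSteps):
    sfUnit1Old, sfUnit2Old = sfUnit1, sfUnit2   # old values: causation from t-1 to t
    sfUnit1 = sfUnit1Old + (sfDeltaTime / sfTau1) * (-sfUnit1Old + vfInpu[t])
    # sUnit2 receives the input plus the inhibition (weight -1) from sUnit1
    sfUnit2 = sfUnit2Old + (sfDeltaTime / sfTau2) * (-sfUnit2Old + vfInpu[t]
                                                     - 1.0 * sfUnit1Old)
    mfHist[t] = sfUnit1, sfUnit2
```

Because sUnit1 tracks the input slowly, sUnit2 receives roughly "input minus a delayed copy of the input": it bursts transiently whenever the input changes and then relaxes back, the onset/derivative-like behaviour requested above.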
A model of Amygdala
This is a simulation of the basic functionalities of the amygdala, important for affective regulation.
The model can:
- produce an innate behavioural response to a biologically salient event (ingestion of food)
- learn to anticipate such response if a cue is paired with the biologically salient event (conditioning)
- make those two responses dependent on the internal hunger/satiety of the organism.
The overall simulation lasts 16 seconds.
Set sfDeltaTime = 0.01.
The neural network undergoes a phase of Pavlovian conditioning (8 seconds) and then a phase of recall showing the effects of the conditioning.
In particular, the network receives three inputs during the 16 seconds of the simulation:
1) Cue (use vector vfInpuCue to encode the values of the Cue in the steps of the simulation):
= 1 in the time intervals 2-4 and 10-12
= 0 otherwise
2) Food (use vector vfInpuFood to encode the values of the Food in the steps of the simulation)
= 1 in the time intervals 4-6
= 0 otherwise
3) Satiety (use vector vfInpuSati to encode the values of Satiation in the steps of the simulation).
You will run two tests where this unit is activated in these ways:
Test 1: = 0 always (indicating the animal is hungry)
Test 2: = 1 always (indicating the animal is satiated)
The architecture of the network is as follows.
Units representing different neural populations, encoding the input stimuli and the motor response:
- A unit (vfActi) formed by a leaky unit encoding the stimulus Cue, with tau = 0.3
- A unit (vfActi) formed by a leaky unit encoding the stimulus Food, with tau = 0.1
- A unit (vfActi) formed by a leaky unit encoding the internal states of Satiety, with tau = 0.1
- A unit (vfActi) formed by a leaky unit encoding neuromodulatory neurons of Dopamine, with tau = 0.1
- A unit (vfActi) is formed by a leaky unit encoding a Motor action, with tau = 0.
For such units create these data structures:
Connections from the external and internal stimuli to the model units:
- Cue-->Cue Unit (the input is simply injected into the unit)
- Food-->Food Unit (the input is simply injected into the unit)
- Food-->Dopamine Unit (the input is simply injected into the unit)
- Satiety-->Satiety Unit (the input is simply injected into the unit)
Connections between the model units (connections internal to the model):
- CueUnit-->FoodUnit (call the variable: sfWeigCueFood , initialise it with 0): represents the association between the cue and the food, undergoing a Hebbian learning process
- FoodUnit-->MotorUnit (call the variable: sfWeigFoodMoto, initialise it with +1): this innate excitatory connection implies that the activation of the Food Unit causes the production of the behaviour
- SatietyUnit-->FoodUnit (call the variable: sfWeigSatiFood, initialise it with -1): this innate inhibitory connection implies that the presence of satiety prevents the Food Unit from activating even in the presence of the Food
- Dopamine Unit--> affects the Hebbian learning (call the variable: sfWeigDA, initialise it with +1) of the plastic connection
The Hebbian learning rule is as follows:
sfWeigCueFood = sfWeigCueFood + sfLearRate * (sfWeigDA * vfActi) * vfActi * max(0.0, (vfActi - sfLearThrePost))
where the three vfActi terms are, respectively, the activation of the Dopamine unit (modulating learning), of the presynaptic Cue unit, and of the postsynaptic Food unit.
sfWeigCueFood = min(sfWeigMax,sfWeigCueFood)
sfLearRate = 2.0
sfLearThrePost = 0.1
sfWeigMax = 2.0
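The rule can be sketched as a small Python function. Note that the reading of the three activation factors (Dopamine unit as modulator, Cue unit as presynaptic factor, Food unit as postsynaptic factor) is our interpretation of the formula above, and the scalar activation argument names are hypothetical:

```python
sfLearRate = 2.0
sfLearThrePost = 0.1
sfWeigMax = 2.0
sfWeigDA = 1.0

def update_cue_food_weight(sfWeigCueFood, sfActiDopa, sfActiCue, sfActiFood):
    # Three-factor Hebbian rule: dopamine-gated product of the presynaptic (Cue)
    # activation and the thresholded postsynaptic (Food) activation
    sfWeigCueFood += sfLearRate * (sfWeigDA * sfActiDopa) * sfActiCue * \
        max(0.0, sfActiFood - sfLearThrePost)
    return min(sfWeigMax, sfWeigCueFood)
```

With this rule the Cue-to-Food weight grows only when the Cue, the Food unit (above threshold), and Dopamine are active together, and is clipped at sfWeigMax.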
Collect the history of activation of the units in matrix mfHistActi
Collect the history of the connection weights in matrix mfHistWeig
At the end of the simulation, plot these figures:
Figure 1: the graph of the three input stimuli in time
Figure 2: the graph of the units Cue, Food, Satiety in time
Figure 3: the graph of Dopamine and Motor action in time
Figure 4: the graph of the plastic connection weight
A neural network formed by two leaky units that implement a decision-making circuit
Make a simulation that lasts for 60 seconds in total and involves a network formed by 2 leaky units.
The input to the two units is as follows:
(0, 0) for 5 seconds
(0, 0) for 5 seconds
(0, 1) for 5 seconds
(0, 0) for 5 seconds
(0.1, 0.9) for 5 seconds
(0, 0) for 5 seconds
(0.2, 0.8) for 5 seconds
(0, 0) for 5 seconds
(0.3, 0.7) for 5 seconds
(0, 0) for 5 seconds
(0.4, 0.6) for 5 seconds
(0, 0) for 5 seconds
The architecture of the network:
- The network is formed by two leaky input/output units: sUnit1 and sUnit2
- sUnit1 and sUnit2 each take the respective input and each gives an output, so the network returns two output values
- sUnit1 is connected to sUnit2 with a connection weight of -1
- sUnit2 is connected to sUnit1 with a connection weight of -1
Plot both the two input values and the activation of sUnit1 and sUnit2 in time in one graph (so 4 curves).
Try to set to different values: the tau of the two units (initially equal to 1), the lateral connection weights between the two units, and the intensity of the leak, with the objective of making sUnit2 tend to go to a high value (this is the "selected option", the "decision") while sUnit1 tends to go to zero.
Echo State Network
The Echo State Network (ESN) consists of a network of leaky neurons (see above) forming an all-to-all connected recurrent neural network.
Simulate the units of the ESN models below by using the same technique used to simulate the leaky neurons (see above), i.e. use the variables:
sfSimuStepDura = sfSimuTimeDura / sfDeltaTime
At each step of the simulation, each unit receives the weighted sum of inputs from each other unit of the network and from itself:
ap[j,t] = SUM_i [ W[j,i] * a[i,t-1]]
a[j,t] = f[ap[j,t]]
where ap[j,t] is the activation potential of unit j at time t,
SUM_i[.] computes the sum over the elements of the argument running the index i,
W[j,i] is the connection weight from unit i to unit j belonging to the matrix W, and
a[i,t-1] is the activation of unit i at time t-1, computed through the transfer function as a[i,t-1] = f[ap[i,t-1]] (f is the Tanh[.] function, truncated to positive values, explained in the above exercises).
If you use the matrix and vector notation, and functions in your program, you can express an ESN activation very compactly as:
ap[t] = W * a[t-1]
ap[t] = W * Tanh[ap[t-1]]
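In Python with NumPy (an assumption), one ESN update step, including the external input coming from the input layer and the positively-truncated tanh transfer function, can look like this (the network size and weight range are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                       # number of ESN units (illustrative)
W = rng.uniform(-0.1, 0.1, size=(n, n))      # all-to-all recurrent weights

def esn_step(vApOld, vInpu):
    # a[t-1] = f[ap[t-1]]: tanh truncated to positive values
    vActiOld = np.maximum(0.0, np.tanh(vApOld))
    # ap[t] = W * a[t-1] + external input
    return W @ vActiOld + vInpu

# Small demo: constant random input for 100 steps, then zero input
vInpu = rng.uniform(0, 1, size=n)
vAp = np.zeros(n)
for t in range(400):
    vAp = esn_step(vAp, vInpu if t < 100 else np.zeros(n))
```

With such small weights the activation shows a transient, settles, and then fades to zero after the input is switched off, as described in the exercise below.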
First run of the network
As a first exercise, you have to try to build a neural network formed by two components.
The first component is an input layer of m units: each input unit has to be connected to all leaky units of the second ESN component through connection weights drawn from a uniform distribution ranging between 0 and 1.
Run a certain number of steps of the simulation, e.g. 400.
Initially, the input units are each activated with a random value drawn from a uniform distribution ranging between 0 and 1.
Keep the units at such activation values for a quarter of the simulation time, and then set them to 0 for the rest of the time.
The second component is an ESN, where all inner weights W have been randomly drawn from a uniform distribution ranging between - 0.1 and 0.1.
Record the activation of the units of the ESN in time during the simulation, and plot them at the end of it to see their dynamics.
Repeat the simulation trying a different range within which the W weights are randomly drawn, e.g. [-1, +1] or [-10, +10].
As you can see, high values of W make the activation of the network units chaotic.
Instead, low values of W make the activation of the network units:
- first show a transient dynamics
- then settle to a steady-state value
- then fade to zero after the input is switched off
Eigenvalues / Eigenvectors
To avoid the chaotic behaviour of the ESN, you should modify the connection weights W to give them the "echo-state property", implying that:
- with any input, the network will not explode
- if the input units are turned to zero, the network will progressively relax to zero
- for a given sequence of input patterns, the network will go through a specific transient dynamics: this is the most important property for which the network tends to respond to the current input, and also to the recent ones (with a decaying memory), in a specific deterministic way
- moreover, if the input changes a bit the dynamics will change a bit, rather than a lot as in chaotic systems
Using an existing programming function (search it from the math libraries of your programming language), you can extract from W its eigenvalues (you do not need to know how this is done or what eigenvalues precisely are: roughly speaking, they indicate the size of some important dimensions, the eigenvectors, of the space covered by W column vectors).
The function you found usually returns the eigenvalues in the form of a vector of n elements. Extract the highest eigenvalue from them (in absolute value): this is called the spectral radius of the matrix. The function usually also returns a matrix of the eigenvectors, but you do not need it.
Now you should divide W by the spectral radius, and use the resulting matrix W' as the weight matrix of your ESN. In this way, the ESN will have the echo-state property.
After you do this, repeat the exercise above: you should see a richer dynamics of the network with respect to the W set by hand. Indeed, now the weights W are the largest ones for which the network does not yet behave chaotically: the system is said to exhibit an interesting behaviour as it works "at the edge of order and chaos".
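In Python with NumPy (an assumption), the scaling can be done as follows; the network size and initial weight range are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
W = rng.uniform(-1.0, 1.0, size=(n, n))   # recurrent weights before scaling

# The eigenvalues come back as a vector; the spectral radius is the
# largest eigenvalue in absolute value
vEigenvalues = np.linalg.eigvals(W)
sfSpectralRadius = np.max(np.abs(vEigenvalues))

W_prime = W / sfSpectralRadius            # W' now has spectral radius 1
```

W_prime is the matrix to use as the ESN weight matrix in place of the hand-set W.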
Echo State with Perceptron as Output
In this exercise, add a third component to the model of the previous exercise: a perceptron network (see the exercises above on this) that takes as input the n units of the ESN component, and gives its output with 1 linear unit. To do this use the following architecture of the network.
Neural network architecture:
- Use as an input layer to the ESN the same layer and random input you used in the previous exercise
- Connect the input layer to an ESN as the ESN of the previous exercise
- All the ESN units are connected to the linear output unit
The goal of the exercise is to train the perceptron to "read out" the dynamic states of the ESN (i.e. the values assumed by the ESN units in time) by reproducing a value that corresponds to a sinusoid function in time. To this purpose, generate a sequence of desired output of the perceptron through a sin(time) function.
Train the perceptron with the perceptron delta rule for 50 epochs. To this purpose, train the system as follows:
- Run 50 epochs of training
-- For each epoch, you have a run of 40 seconds during which:
--- The input is random as in the above exercises, and constant for all 40 seconds and all 50 epochs
--- The desired output is the sinusoid one
Then try to test the network giving the random input for 40 seconds.
You should see that:
- The ESN units dynamically respond to the input in time as usual
- The perceptron reads out the inner activations of the network and reproduces the sinusoid function output during the initial transient-dynamic phase
- If you try to use another desired output (e.g., the sum of different sinusoids with different phases and periods), the system should be quite good at reproducing it during the initial transient phase
Reinforcement-learning neural network models
Q-Learning
Reinforcement learning is an important family of algorithms that allow the simulation of intelligent agents (e.g., simulated animals or intelligent robots) that learn to accomplish a task by trial-and-error while interacting with the world. Q-Learning is one of the two most important and basic reinforcement learning algorithms, the other being the actor-critic model. Q-Learning is based on a set of states s (s ∈ S, S = all states), a set of actions a (a ∈ A, A = all actions), and a set of "valuations" q (q ∈ Q, Q = all valuations) for each state-action pair leading to a new state.
There are two different ways to implement Q-learning:
- Tabular Q-Learning. This is based on a 2-entry table, not on a neural-network implementation, and so it is not useful as a biological model. However, we suggest taking a look at a page on tabular Q-Learning because it helps you start to understand which kind of problems Q-Learning can solve and how.
- Neural-network Q-Learning. This is the target of this tutorial. The model is based on a simple "perceptron" neural-network architecture, i.e., a 2-layer feedforward linear network formed by an input and an output layer. The system takes as input the current state of the world and gives as output the q values, one for each action: the actions are encoded in a "localistic way" (not a "distributed way"), i.e. the system encodes a finite number of discrete actions where 1 action corresponds to 1 output unit.
We now explain how to implement neural-network Q-learning in detail.
Simulation of the “world”
At first, you have to program the simulation of the world with which the agent interacts. Here we assume a simple "grid world", i.e. a world formed by "positions", or "cells", where the agent can be at each step, and where the agent can navigate with its actions. You can consider a 1D arena (or "world": this is formed by a number of contiguous cells, or positions, forming a 1D sequence, e.g. imagine a corridor divided into contiguous squares), with two possible actions for moving leftward or rightward along the positions, or a 2D world (a number of contiguous positions organised in terms of a 2D grid), with four possible actions to move north, east, west, south. In the case of the 1D world, you represent the state of the world, i.e. the position where the agent is, with a vector of 0s and a 1 in correspondence to the position occupied by the agent. In the case of the 2D world, you represent the state of the world with a matrix of zeros and a 1 in correspondence to the position occupied by the agent. We recommend implementing both the 1D and 2D world in parallel, and parameterising your program to use one or the other at will, because this will be very instructive and will help you debug the program.
Choose a starting position in which you will position the agent at the beginning of each trial: this is represented as the initial “1” in your vector or matrix of zeros representing the state of the world.
Moreover, decide where to put the reward (r) in your world model. To this purpose, use another vector/matrix with the same dimensions of the world vector/matrix, with a 1 in the element corresponding to the rewarded position.
Importantly, some of the actions that the agent might choose would bring it to hit the "walls" of the arena where it navigates. For example, if the agent is at the top-left corner of the 2D arena and selects the action "west" or the action "north", these actions lead it to hit, respectively, the west and north walls of the arena, and so it remains where it is. In general, the agent is always free to select any one of the actions of its repertoire (so at each time step all actions are available for selection), but if it selects an action that leads it against a wall, the effect of the action execution is simply that the agent remains in the same position from which it performed the action. This situation is similar to that of a robot that tries to move against a wall and, as an effect, remains where it is.
Activation of the neural network
The agent is based on a simple neural network formed by 2 layers of units:
- The input layer is activated with the state of the world representing the position of the agent in the world (the input units are as many as the possible positions in the world). In the case of the 2D world, you should "unroll" the world matrix into the vector representing the input units.
- The connection weights represent the Q values.
- At the beginning, you can set them to zero values or small random values.
- The output units represent the q values of the possible actions (2 or 4 actions).
The output values do not have to pass through a step or sigmoid transfer function, as usual in neural networks, but through a SoftMax function (also known as "normalized exponential function"). The SoftMax function allows the normalisation of the output values so that they sum up to 1 while keeping a difference between them. This allows considering the outputs of the SoftMax as probability values, used to extract the action to execute. It means that, with zero initial connection weights, at the beginning you will have two outputs with two 0.5 values; if you have four outputs, you will have four 0.25 values. The SoftMax function SM[.], allowing the computation of the probabilities p[a] of selecting the different actions a, is defined as follows:
For q_a ∈ Q and a ∈ A: p(a) = SM(q_a) = (e^(q_a/T)) / (Σ_j e^(q_j/T))
T is called the "temperature parameter", and e is the Euler number. For high temperatures, all output values tend to assume the same probability, whereas for low temperatures the values tend to have great differences (at the extreme: one value is close to 1 and all the other values are close to 0). So the temperature allows setting how much the probabilities of selecting the different actions differ between them in correspondence to the q values.
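A SoftMax with temperature can be sketched in Python with NumPy (an assumption; subtracting the maximum q before exponentiating is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(vQ, sfTemp=1.0):
    # Probabilities p[a] = e^(q_a/T) / sum_j e^(q_j/T)
    vExp = np.exp((vQ - np.max(vQ)) / sfTemp)
    return vExp / np.sum(vExp)
```

With all-zero q values this returns 0.5 for each of two actions, or 0.25 for each of four, as stated above; raising sfTemp pushes the probabilities toward uniformity, lowering it toward a near-deterministic choice.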
We now explain the core equation of the algorithm, allowing the update of the network connection weights corresponding to the q values, and then the whole program of the simulation. As we are facing reinforcement learning, the neural network cannot be trained in a supervised learning fashion as we lack a vector of desired output values. Thus, we use a different learning rule to update step by step the Q values, depending on the quality of the chosen action in terms of reward. At each time step t, the learning rule updates only the connection weights that correspond to the action a[t-1] selected and executed at time t-1 (this action corresponded to the q value: q[s[t-1], a[t-1]]).
The update of the connection weight w[j,i], connecting input unit s_i to the output unit corresponding to the q value q_j of the selected action a[t-1], is as follows:
Δw[j,i] = η * ( (r[t] + γ * Max_a[q[s[t], a]]) – q[s[t-1], a[t-1]] ) * s_i[t-1]
Let us examine this equation closely:
- As mentioned, w[j,i] connects input unit s_i to the output unit of the network corresponding to the q value q_j of the previously selected action a[t-1]
- η is the learning rate, usually set equal to 0.1
- r[t] is the reward achieved in the current state s[t]
- γ is a "discount factor", usually set to 0.9: this has the function of decreasing the value now attributed to future q values and rewards
- Max_a[q[s[t], a]] represents the maximum value within the set of 2 or 4 q values corresponding to the 2 or 4 actions
- s[t-1] is the previous state vector, and s_i[t-1] is an element of it
This is how the simulation will broadly work. In the first trial, you will set your agent in the world and it will start exploring the environment randomly until it finds the reward. When it reaches the reward, the q value of the last action performed increases according to the learning equation. The first trial ends and the second one starts with the agent set again in the initial position. The agent will move again and will learn nothing until it reaches the position to which it assigned a value in the first trial (the one visited just before the reward). Thus, the q-values/weights are progressively updated moving backward from the rewarded position. Therefore, the larger the world, the larger the number of trials and steps the agent will need to learn what to do at each position.
Here is the pseudo-code of how the algorithm works:
- Initialise the relevant variables
- Loop over trials
-- Reset the relevant variables
-- Set the world to the initial state (agent at the starting position)
-- Loop over the steps of the trial
--- Store the values (which become the "values at t-1") of the critical variables before updating them: input, q values, action
--- Update the state of the world based on the action a[t-1]
--- Compute the reward r[t] returned to the agent by the world
--- Activate the input units of the network based on s[t] (this represents the activation of the sensors)
--- Implement the spreading of the net to compute the q[t] values
--- Compute the probabilities of the actions, p[a], with the SoftMax function
--- Select action a based on p[a]: this action will be executed at the next time step
--- If this is not the first or the last step of the trial:
---- update the network weights based on the equation above
--- If this is the last step of the trial (r = 1):
---- update the network weights based on the equation above, but considering zero instead of the Max term as "there is no future" in the trial
As you can see, regarding the learning rule:
- In the first step of each trial, the agent does not learn: the reason is that the algorithm needs two successive time steps to learn, as its learning is based on the consequences (in terms of reward) that the action selected at t-1 causes at t
- In the last step of the trial, the algorithm does not consider the maximum current q, corresponding to an estimate of the future rewards, but rather a zero value as the trial ends and so “there is no future” for the agent, hence no future rewards
To see if your program works well, try this simulation:
- Set a 1D world of nine positions (two actions)
- Set the reward for the last position on the right
- Set the start in the first position on the left
- Set a maximum of 200 Trials and a maximum of 50 steps per trial
- Set LearningRate = 0.1 and DiscountFactor = 0.9
- Set a random selection of actions (so not following the SoftMax probabilities, but rather a 0.5 probability for each action)
- Let it run!
If the program is correct, you should obtain q values (i.e., the network connection weights) similar to the following ones (horizontal 0-8: positions; vertical 0-1: the two actions, East and West; note how the values of the East action approximate γ^0 = 0.9^0 = 1, γ^1 = 0.9^1 = 0.9, γ^2 = 0.9^2 = 0.81, etc., moving leftward from the rewarded position, while the values of the West action remain smaller).
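The test simulation above can be sketched in Python with NumPy (an assumption), with random action selection as indicated; that the trial ends as soon as the reward is reached is our assumption for when the "last step" occurs:

```python
import numpy as np

rng = np.random.default_rng(0)

nPos, nAct = 9, 2               # 1D world; actions: 0 = East, 1 = West
sfLearRate, sfDiscFact = 0.1, 0.9
mWeig = np.zeros((nAct, nPos))  # connection weights = q values

for trial in range(200):
    pos = 0                     # start at the leftmost position
    prev = None                 # (input vector, action) at t-1; None at the first step
    for step in range(50):
        if prev is not None:
            # Update the state of the world based on the action a[t-1]
            pos = min(nPos - 1, pos + 1) if prev[1] == 0 else max(0, pos - 1)
        sfRewa = 1.0 if pos == nPos - 1 else 0.0
        vInpu = np.zeros(nPos)
        vInpu[pos] = 1.0                    # localistic input encoding s[t]
        vQ = mWeig @ vInpu                  # network spreading
        if prev is not None:                # no learning at the first step
            vInpuOld, actOld = prev
            # Zero future at the last step of the trial ("there is no future")
            sfFutu = 0.0 if sfRewa == 1.0 else sfDiscFact * np.max(vQ)
            sfErro = sfRewa + sfFutu - mWeig[actOld] @ vInpuOld
            mWeig[actOld] += sfLearRate * sfErro * vInpuOld
        if sfRewa == 1.0:
            break                           # the trial ends at the rewarded position
        act = rng.integers(nAct)            # random selection (0.5 probability each)
        prev = (vInpu, act)

print(np.round(mWeig, 2))
```

After training, the East q values closest to the reward approach the powers of the discount factor, decaying when moving away from the rewarded position.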