Initial Programming Exercises On Simple Neural Networks

Introduction

This page gives some programming exercises to start learning to program neural networks. Like the rest of these "learning path" pages, this page relies heavily on other pages on the web and is mainly meant as guidance: the student should be able to study and learn by him/herself on the basis of the indications given here, internet pages, and some additional instructions, when needed, on things not covered by these web pages.
The steps to follow are these, expanded in the exercises below:
  • Read the following web-page of this web-site to learn the programming concepts that you need to work with LOCEN (and in general to build and work with computational models of brain and behaviour, or artificial intelligence/robotic systems): web-page on programming concepts to learn
  • Depending on the programming language that you have to learn (agreed with your LOCEN supervisor), refer to suitable web-sites teaching that programming language. For the different programming languages, such web-sites are indicated in the web-pages linked from this web-page under "Programming and simulations --> Specific languages".
  • Install an IDE (Integrated Development Environment) on your portable computer (or the computer at LOCEN) to program in the selected language; indications on this are given in the web-pages dedicated to the specific languages mentioned in the previous point
  • Understand that there are various learning paradigms related to neural networks (see below for more details): this page gives some exercises related to supervised learning and reinforcement learning
  • Do the exercises illustrated in detail below, in particular involving:
    • A first simple perceptron neural network able to learn the AND and OR functions through a supervised learning algorithm
    • The exercise of the previous point, but using linear algebra data structures and operations of the language you use (or libraries expanding the language)
    • Implementation, based on linear algebra, of a multilayer perceptron able to learn the difficult X-OR function
    • Implementation of two leaky neurons
    • Implementation of an echo-state network
    • Implementation of a reinforcement learning model based on Q-Learning
    • Implementation of a reinforcement learning model based on an actor-critic (still under construction)

Understand (study) that there are various learning paradigms related to neural networks: see below for more details

Understand that there are various learning paradigms (classes of algorithms) through which neural networks can learn. The major ones are these:

  • Supervised learning. Two very important sub-classes:
    • Linear regression
    • Classification
  • Unsupervised learning
  • Reinforcement learning

Search the internet for further information and understand the differences between these different learning paradigms, e.g.:
https://en.wikipedia.org/wiki/Artificial_neural_network#Learning_paradigms

https://en.wikipedia.org/wiki/Machine_learning

Also, ask your LOCEN supervisor for the book by Floreano on neural networks: an old book, but one of the best for learning the basics of neural networks.

Exercise to program a first simple perceptron neural network able to learn the AND and OR functions through a supervised learning algorithm

This section gives a programming exercise on a simple neural network and some indications on how to accomplish it.

The goal of the exercise is to program a one-layer perceptron that is able to learn, in a supervised fashion, to produce the logical AND and OR functions. In particular, the network should respond with the following 'desired output' when given the following 'input' examples:

OR function
Input1  Input2  Bias  Output
0       0       1     0
0       1       1     1
1       0       1     1
1       1       1     1

AND function
Input1  Input2  Bias  Output
0       0       1     0
0       1       1     0
1       0       1     0
1       1       1     1

To better understand what the neural network should do, see these web-pages:

https://en.wikipedia.org/wiki/Perceptron
http://computing.dcu.ie/~humphrys/Notes/Neural/single.neural.html
http://francesco-mannella.github.io/neunet-basics/perceptron-simple-simu...
http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/

There are various versions of such a simple perceptron. You should implement one where the output unit uses a sigmoid transfer function, and the network is considered to output 1 if the activation of the output unit is > 0.9, and to output 0 if the activation of the output unit is < 0.1. This leads you to create a more general neural network, e.g. useful when you pass to multi-layer perceptrons (exercise below); moreover, it is more correct/general from a machine learning point of view.

Your network should have 3 input units (the first being a 'bias') and one output unit.

The network should learn on-line (i.e., after the presentation of each example of the training set), not in batch mode (i.e., for all examples together).

Note that neural networks (usually) learn progressively, rather than by 'one shot', by being exposed to the same training examples several times. This affects the main program structure indicated below.

Your program should be structured as follows:

  • Creation and initialisation of data structures
  • Creation of data structures and data for training the neural network
  • Loop for multiple epochs (one epoch is one learning pass of the net):
    • Loop for examples of data-set
      • Network spreading
      • Network learning
      • Collection of data for following data analysis
  • Data analysis: plots on learning curve of neural network
  • Data analysis: numerical output of network performance (e.g., desired and actual net output, errors, etc.)
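
As a reference, below is a minimal sketch of this program structure in Python with NumPy (assuming this is one of the languages agreed with your supervisor; the variable names are hypothetical). Treat it as a guide, not as the required implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Training set for the OR function: each input is [bias, input1, input2]
inputs  = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
desired = np.array([0, 1, 1, 1], dtype=float)

weights = np.random.uniform(-0.1, 0.1, 3)   # one weight per input unit (bias included)
eta = 0.5                                   # learning rate
errors_per_epoch = []

for epoch in range(5000):                   # loop over epochs
    squared_error = 0.0
    for x, d in zip(inputs, desired):       # loop over the examples (on-line learning)
        y = sigmoid(np.dot(weights, x))     # network spreading
        weights += eta * (d - y) * y * (1.0 - y) * x   # delta rule with sigmoid derivative
        squared_error += (d - y) ** 2
    errors_per_epoch.append(squared_error / len(inputs))   # data for the learning curve

# Numerical output of network performance
for x, d in zip(inputs, desired):
    y = sigmoid(np.dot(weights, x))
    print(x[1:], "desired:", d, "output:", round(float(y), 3), "class:", int(y > 0.9))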

 

Same exercise as the previous point, but using linear algebra data structures and operations of the language

Do the exercise above but:

  • Use the programming language-dependent data structures to implement vectors (e.g., for input and output) and matrices (e.g., for connection weights)
  • Use the programming language-dependent linear algebra operators (e.g., to implement the whole network spreading of activation, you should have something like the following):
    vOutp = mConnWeig * vInpu
  • Implement a neural network that performs the AND and the OR functions at the same time. The network will hence have 3 input units and 2 output units.
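
A minimal sketch of this linear-algebra version, again assuming Python with NumPy and using vector/matrix names in the style of the example above (vInpu, vOutp, mConnWeig); the network learns the AND and OR functions at the same time.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vInpuSet = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
vDesiSet = np.array([[0, 0], [1, 0], [1, 0], [1, 1]], dtype=float)   # column 0 = OR, column 1 = AND

mConnWeig = np.random.uniform(-0.1, 0.1, (2, 3))   # 2 output units x 3 input units
eta = 0.5

for epoch in range(5000):
    for vInpu, vDesi in zip(vInpuSet, vDesiSet):
        vOutp = sigmoid(mConnWeig @ vInpu)          # whole-network spreading in one operation
        vDelt = (vDesi - vOutp) * vOutp * (1 - vOutp)
        mConnWeig += eta * np.outer(vDelt, vInpu)   # weight update as an outer product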

Exercise to implement, using linear algebra programming, a multilayer perceptron learning the X-OR function

Implement a neural network that performs the X-OR function below. To this purpose you will need a multilayer perceptron with 3 input units, 1 output unit, and a variable number of hidden units (initially, implement a network with only 1 hidden layer).

X-OR function
Input1  Input2  Bias  Output
0       0       1     1
0       1       1     0
1       0       1     0
1       1       1     1

Parity check function

This is a generalisation of the X-OR function above, where the output is 0 if the input pattern contains an odd number of digits equal to 1, and 1 if that number is even. So, for example, for 3 inputs you have:

Input1  Input2  Input3  Bias  Output
0       0       0       1     1
0       0       1       1     0
0       1       0       1     0
0       1       1       1     1
1       0       0       1     0
1       0       1       1     1
1       1       0       1     1
1       1       1       1     0

The new network needed to solve either task has a different architecture with respect to the perceptron. In particular, it has a second layer of two units, called "hidden units", between the input layer and the output layer. So the inputs and the bias project to the hidden units, and then the hidden units (and the bias) project to the output units.

The learning algorithm for the weights between the hidden and output units is the same as for the simple perceptron. A second algorithm, called the "error backpropagation" algorithm, is needed to modify the weights between the input and the hidden units. Search the internet for the formulas of the error backpropagation algorithm. Make sure that you understand the algorithm very well, as it is not very simple.
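
As an orientation, here is a minimal sketch (Python/NumPy assumed, hypothetical names) of a single training step of a multilayer perceptron with one hidden layer: the delta rule for the hidden-to-output weights, and error backpropagation for the input-to-hidden weights.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_inpu, n_hidd, n_outp = 3, 2, 1
mWeigInpuHidd = np.random.uniform(-0.5, 0.5, (n_hidd, n_inpu))
mWeigHiddOutp = np.random.uniform(-0.5, 0.5, (n_outp, n_hidd + 1))  # +1 for the hidden-layer bias
eta = 0.5

def train_step(vInpu, vDesi):
    global mWeigInpuHidd, mWeigHiddOutp
    # Forward spreading
    vHidd = sigmoid(mWeigInpuHidd @ vInpu)
    vHiddBias = np.append(vHidd, 1.0)                 # add the bias unit to the hidden layer
    vOutp = sigmoid(mWeigHiddOutp @ vHiddBias)
    # Output-layer delta: same rule as the simple perceptron
    vDeltOutp = (vDesi - vOutp) * vOutp * (1 - vOutp)
    # Hidden-layer delta: the error backpropagated through the hidden-to-output weights
    vDeltHidd = (mWeigHiddOutp[:, :n_hidd].T @ vDeltOutp) * vHidd * (1 - vHidd)
    # Weight updates
    mWeigHiddOutp += eta * np.outer(vDeltOutp, vHiddBias)
    mWeigInpuHidd += eta * np.outer(vDeltHidd, vInpu)
    return vOutp

You would call train_step for each input pattern of each epoch, exactly as in the perceptron program above.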

Change your program so that after learning it plots:

  • The curve of the square error of the output with respect to the desired output during learning (with "epochs" on the x-axis, i.e. plot the error averaged over all the input patterns)
  • The two 2D weight matrices (from input to hidden; from hidden to output) of your multilayer perceptron: this plot represents every value of a connection weight as a coloured square, with each colour representing the size of the weight, from the most negative values to the largest positive values.
  • A similar graph with, on each row: the activation of one input pattern, the related activation of the hidden units, and the activation of the output unit(s); the different rows correspond to the different patterns.

Further change the program so that at the end of the simulation all the data needed to make the previous graphs are saved into separate .txt files, and the graphs are then plotted by a second program.

Leaky neurons

"Leaky neurons" are different form the "firing rate" neurons used so far. In particular, these are neurons that store information in time, e.g. they have an activation that accumulates, with a fixed input, or progressively fades away, with a zero activation. They are indeed behave like a ''container of water with a leak when you put water in it": they slowly fill in when you put water, the progressively get empty when you stop.

First, see the part of these slides related to the leaky neuron and the way to simulate it based on the Euler approximation here:
http://www.im-clever.eu/documents/courses/computational-embodied-neurosc...

Use those slides to understand how a leaky unit works:
--> leaky element for computing the activation potential of the neural unit
--> tanh function, truncated to positive values, for computing the activation of the neural unit

Also understand from the slides (and the internet) how to implement the leaky units (or any other dynamical system) in a computer program through the "Euler approximation". The Euler approximation is necessary because we will now simulate time in more detail, as a continuous flow. So, the neural network will be a dynamical system that works in continuous time, and the Euler approximation allows the approximated simulation of its dynamics in the discrete time steps of a computer program.

To suitably take account of continuous time, establish these critical variables in the simulations:

  • sSimuTimeDura: duration of the whole simulation, in seconds, e.g. set to 10 seconds.
  • sDeltaTime: the duration of the time step used to approximate continuous time, for example set to 0.01 (seconds). For a given total duration of the simulation, the smaller the time step, the more accurate the approximation of the dynamics of the system. Each cycle of the simulation will now correspond to one sDeltaTime.
  • sSimuStepDura: duration of the simulation in time steps, set equal to: sSimuTimeDura/sDeltaTime

Another very important thing to do in the implementation: with continuous time, causation between phenomena can happen only across different time steps (from t-1 to t). So, for any variable that affects another variable, e.g. a neuron that affects another neuron (or itself in the future), you have to do this:

  • Store the "old value of the variable" and the "current value of the variable"
  • At the beginning of the simulation cycle representing the elapsing of time (in sDeltaTime steps), save the old value of all the variables that affect other variables at the next time step, as in this example with three units:
    sUnit1Old = sUnit1
    sUnit2Old = sUnit2

    sUnit3Old = sUnit3
  • Then, for any causal dependency between the variables, you have to use the old values of the variables, for example:
    sUnit1 = sUnit1Old + (sDeltaTime/Tau) * (-sUnit1Old + sWeig2to1 * sUnit2Old + sWeig3to1 * sUnit3Old)
    sUnit2 = sUnit2Old + (sDeltaTime/Tau) * (-sUnit2Old + sWeig1to2 * sUnit1Old + sWeig3to2 * sUnit3Old)
    sUnit3 = sUnit3Old + (sDeltaTime/Tau) * (-sUnit3Old + sWeig1to3 * sUnit1Old + sWeig2to3 * sUnit2Old)

Based on these general rules, do the exercises below.

How a leaky neuron responds to an input

Make a simulation that lasts 20 seconds in total and involves a network formed by 1 leaky unit.

Input I:
0 for 1 second
1 for 1 second
0 for 2 seconds
1 for 2 seconds
0 for 3 seconds
1 for 3 seconds
0 for 4 seconds
1 for 4 seconds

Simulate how the leaky neuron responds to that input. Plot a graph in time showing the input and the neuron activation. Try to change the tau of the neuron to see what happens. Try to change sDeltaTime and see what changes.
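
A minimal sketch of this simulation (Python with NumPy and matplotlib assumed; the tau value is only a hypothetical starting point):

import numpy as np
import matplotlib.pyplot as plt

sSimuTimeDura = 20.0          # total simulation time in seconds
sDeltaTime = 0.01             # integration time step in seconds
sSimuStepDura = int(sSimuTimeDura / sDeltaTime)
sTau = 0.5                    # time constant of the leaky unit: try different values

# Build the input signal: alternating 0/1 blocks lasting 1, 1, 2, 2, 3, 3, 4, 4 seconds
blocks = [(0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3), (0, 4), (1, 4)]
vInpu = np.concatenate([np.full(int(dura / sDeltaTime), float(valu)) for valu, dura in blocks])

sUnit = 0.0
vActi = np.zeros(sSimuStepDura)
for t in range(sSimuStepDura):
    sUnitOld = sUnit
    # Euler approximation of the leaky-integrator dynamics (activation potential)
    sUnit = sUnitOld + (sDeltaTime / sTau) * (-sUnitOld + vInpu[t])
    vActi[t] = max(0.0, np.tanh(sUnit))   # activation: tanh truncated to positive values

vTime = np.arange(sSimuStepDura) * sDeltaTime
plt.plot(vTime, vInpu, label="input")
plt.plot(vTime, vActi, label="leaky unit activation")
plt.xlabel("time (s)"); plt.legend(); plt.show()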


A neural network formed by two leaky units representing an onset circuit

Make a simulation that lasts 25 seconds in total and involves a network formed by 2 leaky units.

The input to the network during the simulation is as follows:
0 for 5 seconds
1 for 5 seconds
2 for 5 seconds
1 for 5 seconds
0 for 5 seconds
 

Architecture of the network:

  • An input unit (sUnit1) formed by a first leaky neuron, and an input/output unit, formed by a second leaky neuron (sUnit2).
  • Both sUnit1 and sUnit2 take the (external) input
  • sUnit1 is connected to sUnit2 with a connection weight of -1
  • sUnit2 activation gives the output of the network

Plot both the input and the activation of sUnit1 and sUnit2 in time.

Try to set the tau values of the two units (initially set them to 1) with this objective: sUnit2 should show a behaviour that detects the ''derivative'' of the signal in time.

Set sDeltaTime initially to 0.01, then to different values. You will see that these values only set the accuracy of the simulation, not its overall dynamics.
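
A minimal sketch of the onset circuit (Python/NumPy assumed; the tau values are hypothetical starting points to be tuned as required by the exercise):

import numpy as np

sDeltaTime, sTau1, sTau2 = 0.01, 1.0, 1.0
# Input: 0, 1, 2, 1, 0, each lasting 5 seconds (25 seconds in total)
vInpu = np.repeat([0.0, 1.0, 2.0, 1.0, 0.0], int(5.0 / sDeltaTime))

sUnit1 = sUnit2 = 0.0
vActi1 = np.zeros(vInpu.size)
vActi2 = np.zeros(vInpu.size)
for t in range(vInpu.size):
    sUnit1Old, sUnit2Old = sUnit1, sUnit2       # save the old values first
    sUnit1 = sUnit1Old + (sDeltaTime / sTau1) * (-sUnit1Old + vInpu[t])
    # sUnit2 takes the external input plus the -1.0 connection from sUnit1
    sUnit2 = sUnit2Old + (sDeltaTime / sTau2) * (-sUnit2Old + vInpu[t] - 1.0 * sUnit1Old)
    vActi1[t], vActi2[t] = sUnit1, sUnit2
# plot vInpu, vActi1 and vActi2 against time as in the previous exercise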

A neural network formed by two leaky units that perform a decision making circuit

Make a simulation that lasts 60 seconds in total and involves a network formed by 2 leaky units.

The input to the two units is as follows:
(0, 0) for 5 seconds
(0, 0) for 5 seconds
(0, 1) for 5 seconds
(0, 0) for 5 seconds
(0.1, 0.9) for 5 seconds
(0, 0) for 5 seconds
(0.2, 0.8) for 5 seconds
(0, 0) for 5 seconds
(0.3, 0.7) for 5 seconds
(0, 0) for 5 seconds
(0.4, 0.6) for 5 seconds
(0, 0) for 5 seconds

Architecture of the network:

  • The network is formed by two leaky input/output units: sUnit1 and sUnit2
  • sUnit1 and sUnit2 each take their respective input and each gives an output, so the network returns two output values
  • sUnit1 is connected to sUnit2 with a connection weight of -1
  • sUnit2 is connected to sUnit1 with a connection weight of -1

Plot both the two input values and the activation of sUnit1 and sUnit2 in time in one graph (so 4 curves).

Try to set the tau values of the two units (initially equal to 1) with these objectives:
- see how in seconds 1-6 sUnit2 fires and then relaxes (notwithstanding the constant input)
- see how in seconds 7-12 sUnit2 sums the input in time
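
A minimal sketch of the decision-making circuit (same assumptions and conventions as the previous sketch; the only structural difference is the mutual -1 connection between the two units):

import numpy as np

sDeltaTime, sTau1, sTau2 = 0.01, 1.0, 1.0
# Input pairs, each kept constant for 5 seconds (60 seconds in total)
lInpu = [(0, 0), (0, 0), (0, 1), (0, 0), (0.1, 0.9), (0, 0),
         (0.2, 0.8), (0, 0), (0.3, 0.7), (0, 0), (0.4, 0.6), (0, 0)]
mInpu = np.repeat(np.array(lInpu, dtype=float), int(5.0 / sDeltaTime), axis=0)

sUnit1 = sUnit2 = 0.0
for t in range(mInpu.shape[0]):
    sUnit1Old, sUnit2Old = sUnit1, sUnit2
    # Each unit receives its own input and a -1.0 connection from the other unit
    sUnit1 = sUnit1Old + (sDeltaTime / sTau1) * (-sUnit1Old + mInpu[t, 0] - 1.0 * sUnit2Old)
    sUnit2 = sUnit2Old + (sDeltaTime / sTau2) * (-sUnit2Old + mInpu[t, 1] - 1.0 * sUnit1Old)
    # store mInpu[t], sUnit1 and sUnit2 here to plot the four curves at the end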

Echo State Network

The Echo State Network (ESN) consists of a network of leaky neurons (see above) forming an all-to-all connected recurrent neural network.

Simulate the units of the ESN models below by using the same technique used to simulate the leaky neurons (see above), i.e. use the variables:
sSimuTimeDura

sDeltaTime
sSimuStepDura = sSimuTimeDura / sDeltaTime
 
At each step of the simulation, each unit receives the weighted sum of inputs from each other unit of the network and from itself:
 
ap[j,t] = SUM_i [ W[j,i] * a[i,t-1]]
a[j,t] = f[ap[j,t]]
 
where ap[j,t] is the activation potential of unit j at time t,
SUM_i[.] computes the sum over the elements of the argument running the index i,
W[j,i] is the connection weight from unit i to unit j belonging to the matrix W, and
a[i,t-1] is the activation of unit i at time t-1, computed on the basis of the transfer function: a[i,t-1] = f[ap[i,t-1]] (f is the Tanh[.] function explained in the above exercises).
 
If you use matrix and vector notation, and the related functions, in your program, you can express the ESN activation very compactly as:
ap[t] = W * a[t-1]
i.e.:
ap[t] = W * Tanh[ap[t-1]]
 
First run of the network
 
As first exercise you have to try to build a neural network formed by two components.
 
The first component is an input layer of m units: each input unit has to be connected to all leaky units of the second ESN component through connection weights drawn from a uniform distribution ranging between 0 and 1.
 
Run a certain number of steps of the simulation, e.g. 400.
 
Initially, the input units are each activated with a random value drawn from a uniform distribution ranging between 0 and 1.
 
Keep the input units set to those values for a quarter of the simulation time, and then set them to 0 for the rest of the time.
 
The second component is the ESN itself, where all inner weights W have been randomly drawn from a uniform distribution ranging between -0.1 and 0.1.
 
Record the activation of the units of the ESN in time during the simulation, and plot them at the end of it to see their dynamics.
 
Repeat the simulation trying a different range within which the W weights are randomly drawn, e.g. [-1, +1] or [-10, +10].
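
A minimal sketch of this first run (Python with NumPy and matplotlib assumed; the layer sizes are arbitrary choices, and the input is simply added to the activation potentials through the input weights, which is one possible way to implement the input-to-ESN connections):

import numpy as np
import matplotlib.pyplot as plt

m, n, steps = 2, 20, 400
mWInpu = np.random.uniform(0.0, 1.0, (n, m))    # input-to-ESN connection weights
mW = np.random.uniform(-0.1, 0.1, (n, n))       # inner ESN weights: try also [-1, 1], [-10, 10]
vInpuValu = np.random.uniform(0.0, 1.0, m)      # random constant input values

vActPot = np.zeros(n)
mActHistory = np.zeros((steps, n))
for t in range(steps):
    vInpu = vInpuValu if t < steps // 4 else np.zeros(m)   # input on for a quarter of the run
    vActPot = mW @ np.tanh(vActPot) + mWInpu @ vInpu       # ap[t] = W * Tanh[ap[t-1]] + input
    mActHistory[t] = np.tanh(vActPot)                      # record the unit activations

plt.plot(mActHistory)
plt.xlabel("time step"); plt.ylabel("ESN unit activation")
plt.show()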
 
Expected results
 
As you can see, high values of W make the network units' activations chaotic.
Instead, low values of W make the network units' activations:
- first follow a transient dynamics
- then settle to a steady-state value
- then fade to zero after the input is switched off
 
Eigenvalues / Eigenvectors
 
To avoid the chaotic behaviour of the ESN, you should modify the connection weights W to give them the "echo-state property", implying that:
  • with any input, the network activation will not explode
  • if the input units are set to zero, the network will progressively relax to zero
  • for a given sequence of input patterns, the network will go through a specific transient dynamics: this is the most important property, for which the network tends to respond to the current input, and also to the recent ones (with a decaying memory), in a specific deterministic way
  • moreover, if the input changes a bit, the dynamics will change a bit, rather than a lot as in chaotic systems
Using an existing programming function (search for it in the math libraries of your programming language), you can extract from W its eigenvalues (you do not need to know how this is done or what eigenvalues precisely are: roughly speaking, they indicate the size of some important dimensions, the eigenvectors, of the space covered by the column vectors of W).
 
The function you find usually returns the eigenvalues as a vector of n elements. Extract the eigenvalue with the highest absolute value: this is called the spectral radius of the matrix. The function usually also returns a matrix of the eigenvectors, but you do not need it.
 
Now divide W by this spectral radius and use the resulting matrix W' as the weight matrix of your ESN. In this way the ESN will have the echo-state property.
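
A minimal sketch of this scaling step, assuming NumPy and the matrix mW of the previous sketch:

eigenvalues = np.linalg.eigvals(mW)            # eigenvalues of the ESN weight matrix
spectral_radius = np.max(np.abs(eigenvalues))  # largest eigenvalue in absolute value
mW = mW / spectral_radius                      # W' with the echo-state property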
 
After you do this, repeat the exercise above: you should see a richer dynamics of the network with respect to the W set by hand. Indeed, now the weights are the largest ones just before the network shows a chaotic behaviour: the system is said to exhibit an interesting behaviour as it works "at the edge of order and chaos".
 
Echo State with Perceptron as Output
 
In this exercise, add a third component to the model of the previous exercise: a perceptron network (see the exercises above on this) that takes as input the n units of the ESN component and gives its output with 1 linear unit. To do this, use the following network architecture.
 
Neural network architecture:
- Use as input layer to the ESN the same layer and random input you used in the previous exercise
- Connect the input layer to an ESN as the ESN of the previous exercise
- All the ESN units are connected to the linear output unit
 
The goal of the exercise is to train the perceptron to "read out" the dynamic states of the ESN (i.e. the values assumed by the ESN units in time) by reproducing a value that corresponds to a sinusoid function in time. To this purpose, generate a sequence of desired output of the perceptron through a sin(time) function.
 
Train the perceptron read-out with the perceptron delta rule. To this purpose, train the system as follows:
- Run 50 epochs of training
-- For each epoch, you have a run of 40 seconds during which:
--- The input is random as in the above exercise, and constant for all 40 seconds and all 50 epochs
--- The desired output is the sinusoid one
 
 
Then try to test the network by giving it the random input for 40 seconds.
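
A minimal sketch of the read-out training, building on the first-run sketch above (so it assumes the variables steps, n and mActHistory from that sketch; the frequency of the desired sinusoid is an arbitrary choice):

vDesi = np.sin(np.arange(steps) * 0.1)      # desired output: a sinusoid over the run
vWOut = np.zeros(n)                         # read-out weights towards the 1 linear output unit
eta = 0.01

for epoch in range(50):                     # 50 training epochs over the same run
    for t in range(steps):
        sOutp = vWOut @ mActHistory[t]      # linear read-out of the ESN state at time t
        vWOut += eta * (vDesi[t] - sOutp) * mActHistory[t]   # perceptron delta rule

vRead = mActHistory @ vWOut                 # read-out over the whole run, to compare with vDesi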
 
Expected results

You should see that:

  • The ESN units dynamically respond to the input in time, as usual
  • The perceptron reads out the inner activations of the network and reproduces the sinusoid function output during the initial transient-dynamics phase
  • If you try another desired output (e.g., the sum of different sinusoids with different phases and periods), the system should be quite good at reproducing it during the initial transient phase

 

Reinforcement learning: Q-Learning

Reinforcement learning is an important family of algorithms that allow the simulation of intelligent agents (e.g., simulated animals or intelligent robots) that learn to accomplish a task by trial-and-error while interacting with the world. Q-Learning is one of the two most important and basic reinforcement learning algorithms, the other being the actor-critic model. Q-learning is based on a set of states s (s ∈ S, S = all states), a set of actions a (a ∈ A, A = all actions), and a set of “valuations” q (q ∈ Q, Q = all valuations), one for each state-action pair leading to a new state.

There are two different ways to implement Q-learning:

  • Tabular Q-Learning: This is based on a 2-entry table, not on a neural-network implementation, and so it is not useful as a biological model. However, we suggest taking a look at this page related to tabular Q-learning, because it helps you start to understand which kinds of problems Q-learning can solve and how:
    http://mnemstudio.org/path-finding-q-learning-tutorial.htm
  • Neural-network Q-Learning. This is the target of this tutorial. The model is based on a simple “perceptron” neural-network architecture, i.e., a 2-layer feedforward linear network formed by an input and an output layer. The system takes as input the current state of the world and gives as output the q values, one for each action: the actions are encoded in a “localistic” (not “distributed”) way, i.e. the system encodes a finite number of discrete actions where 1 action corresponds to 1 output unit.

We now explain how to implement neural-network Q-learning in detail.

Simulation of the “world”

At first you have to program the simulation of the world with which the agent interacts. Here we assume a simple “grid world”, i.e. a world formed by "positions", or "cells", where the agent can be at each step, and where the agent can navigate with its actions. You can consider a 1D arena (or "world": this is formed by a number of contiguous cells, or positions, forming a 1D sequence, e.g. imagine a corridor divided into contiguous squares), with two possible actions for moving leftward or rightward along the positions, or a 2D world (a number of contiguous positions organised as a 2D grid), with four possible actions to move north, east, west, south.

In the case of the 1D world, you represent the state of the world, i.e. the position where the agent is, with a vector of 0s and a 1 in correspondence to the position occupied by the agent. In the case of the 2D world, you represent the state of the world with a matrix of zeros and a 1 in correspondence to the position occupied by the agent. We recommend implementing both the 1D and the 2D world in parallel, and parameterising your program to use one or the other at will, because this will be very instructive and will help you debug the program.

Choose a starting position in which you will position the agent at the beginning of each trial: this is represented as the initial “1” in your vector or matrix of zeros representing the state of the world. 

Moreover, decide where to put the reward (r) in your world model. To this purpose, use another vector/matrix with the same dimensions as the world vector/matrix, with a 1 in the element corresponding to the rewarded position.

Importantly, some of the actions that the agent might choose would make it hit the ''walls'' of the arena where it navigates. For example, if it is at the top-left corner of the 2D arena and it selects the action "west" or the action "north", these actions lead the agent to hit respectively the west and north walls of the arena, and so it remains where it is. In general, the agent is always free to select any one of the actions of its repertoire (so at each time step all the actions of its repertoire are available for selection), but if it selects an action that leads it against a wall, the effect of the action execution is simply that the agent remains in the same position from which it selected and performed the action. This situation is similar to that of a robot that tries to move against a wall: the effect is that it remains where it is.

Activation of the neural network

The agent is based on a simple neural network formed by 2 layers of units:

The input layer is activated with the state of the world representing the position of the agent (there are as many input units as possible positions in the world). In the case of the 2D world, you should “unroll” the world matrix into the vector representing the input units.

  • The connection weights represent the Q values.
  • At the beginning you can set them to zero or to small random values.
  • The output units represent the q values of the possible actions (2 or 4 actions).

The output values do not have to pass through a step or sigmoid transfer function, as usual in neural networks, but through a SoftMax function (also known as the “normalised exponential function”). The SoftMax function normalises the output values so that they sum up to 1 while keeping a difference between them. This allows the outputs of the SoftMax to be treated as probability values, used to select the action to execute. It means that, with zero initial connection weights, at the beginning you will have two outputs with two 0.5 values; if you have four outputs, you will have four 0.25 values. The SoftMax function SM[.], which computes the probabilities p[a] of selecting the different actions a, is defined as follows:

For each action a_k with q value q_k:   p(a_k) = SM(q)_k = e^(q_k/T) / (Σ_j e^(q_j/T))

T is called the “temperature” parameter, and e is Euler's number. For high temperatures, all the output values tend to assume the same probability; instead, for low temperatures the values tend to have large differences (in the extreme: one value close to 1 and all the other values close to 0). So the temperature allows setting how much the probabilities of selecting the different actions differ from each other in correspondence to the q values.
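
A minimal sketch of the SoftMax and of the action selection based on it (Python/NumPy assumed; the example q values are arbitrary):

import numpy as np

def softmax(vQ, temperature=1.0):
    vExp = np.exp(vQ / temperature)
    return vExp / np.sum(vExp)

vQ = np.array([0.0, 0.5])                      # example q values for two actions
vProb = softmax(vQ)                            # about [0.38, 0.62] with T = 1
action = np.random.choice(len(vQ), p=vProb)    # sample the action to execute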

We now explain the core equation of the algorithm, which updates the network connection weights corresponding to the q values, and then the whole program of the simulation. Since we are facing reinforcement learning, the neural network cannot be trained in a supervised learning fashion, as we lack a vector of desired output values. Thus, we use a different learning rule to update the Q values step by step, depending on the quality of the chosen action in terms of reward. At each time step t, the learning rule updates only the connection weights that correspond to the action a[t-1] selected and executed at time t-1 (this action corresponds to the q value q[s[t-1], a[t-1]]).

The update of the connection weight w[j,i], connecting input unit s_i to the output unit q_j corresponding to the selected a[t-1], is as follows:

Delta w_ji = η * ( (r[t] + γ * Max_q[q[s[t], a[t]]]) – q[s[t-1], a[t-1]] ) * s_i[t-1]

Let us examine this equation closely:

  • As mentioned, w[j,i] connects input unit s_i to the output unit of the network whose q-value q_j corresponds to the previously selected a[t-1]
  • η is the learning rate, usually set equal to 0.1
  • r[t] is the reward achieved in the current state s[t]
  • γ is a “discount factor”, usually set to 0.9: it has the function of discounting the value attributed now to future q values and rewards
  • Max_q[q[s[t], a[t]]] represents the maximum value within the set of 2 or 4 q values corresponding to the 2 or 4 actions
  • s[t-1] is the previous state vector, and s_i[t-1] is an element of it

Q-Learning algorithm

This is how the simulation will broadly work. In the first trial, you set your agent in the world and it starts exploring the environment randomly until it finds the reward. When it reaches the reward, the q value of the last action performed increases according to the learning equation. The first trial ends and the second one starts with the agent set again in the initial position. The agent will move again, and will learn nothing until it reaches the position to which it assigned a value in the first trial (the one visited just before the reward). Thus, the q-values/weights matrix is progressively updated, moving backward from the rewarded position. Therefore, the larger the world, the larger the number of trials and steps the agent will need to learn what to do at each position.

Here is the pseudo-code of how the algorithm works:

  1. Initialise the relevant variables
  2. Loop over trials
  3.     Reset relevant variables    
  4.     Set the world to the initial state (agent at the starting position)
  5.          Loop over the steps of the trial
  6.             Store values (which become the "values at t-1") of critical variables before updating them: input, q-values, action
  7.             Update the state of the world based on the action a[t-1]
  8.             Compute the reward r[t] returned to the agent by the world
  9.             Activate the input units of the network based on s[t] (this represents the activation of sensors)
  10.             Implement the spreading of the net to compute the q[t] values
  11.             Compute the probabilities of action, p[a], with the SoftMax function
  12.             Select action a based on p[a]: this action will be executed at the next time step 
  13.             If this is not the first or the last step of the trial:
  14.                 update the network weights based on the equation above
  15.             If this is the last step of the trial (r=1)
  16.                 update the network weights based on the equation above but considering zero instead of Max_q as "there is not future" in the trial

As you can see, regarding the learning rule:

  • In the first step of each trial the agent does not learn: the reason is that the algorithm needs two succeeding time steps to learn, as its learning is based on the consequences (in terms of reward) that the action selected at t-1 causes at t
  • In the last step of the trial, the algorithm does not consider the maximum current q, corresponding to an estimate of future rewards, but rather a zero value, as the trial ends and so “there is no future” for the agent, hence no future rewards

To see if your program works well, try this simulation:

  • Set a 1D world of nine positions (two actions)
  • Set the reward on the last position on the right
  • Set the start at the first position on the left
  • Set a maximum of 200 Trials and a maximum of 50 steps per trial
  • Set LearningRate = 0.1 and DiscountFactor = 0.9
  • Set a random selection of actions (so not following the SoftMax probabilities, but rather a 0.5 probability for each action)
  • Let it run!
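
Below is a minimal sketch of this test simulation (Python/NumPy assumed). The loop is organised slightly differently from the pseudo-code above (the previous values are stored, the action is selected and applied, and the weights are updated all within the same iteration), but it implements the same learning rule; names and details are hypothetical.

import numpy as np

n_pos, n_act = 9, 2                     # nine positions, two actions (0 = West, 1 = East)
mW = np.zeros((n_act, n_pos))           # connection weights = q values
eta, gamma = 0.1, 0.9                   # learning rate and discount factor

for trial in range(200):
    pos = 0                             # start at the leftmost position
    for step in range(50):
        pos_old = pos
        vInpuOld = np.zeros(n_pos); vInpuOld[pos_old] = 1.0     # s[t-1]
        action = np.random.randint(n_act)                       # random selection (0.5 each)
        # Update the world: hitting a wall leaves the agent where it is
        pos = min(max(pos + (1 if action == 1 else -1), 0), n_pos - 1)
        r = 1.0 if pos == n_pos - 1 else 0.0                    # reward at the rightmost position
        vInpu = np.zeros(n_pos); vInpu[pos] = 1.0               # s[t]
        vQ = mW @ vInpu                                         # network spreading: q values of s[t]
        qOld = mW[action, :] @ vInpuOld                         # q[s[t-1], a[t-1]]
        future = 0.0 if r == 1.0 else gamma * np.max(vQ)        # "no future" on the last step
        mW[action, :] += eta * (r + future - qOld) * vInpuOld   # learning rule
        if r == 1.0:
            break                                               # end of the trial

print(np.round(mW, 2))      # rows: the two actions; columns: the nine positions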

If the program is correct, you should obtain q-values (i.e., the network connection weights) similar to the following ones (horizontal 0-8: positions; vertical 0-1: the two actions, East and West; note how the values of the West action remain close to 0, while the values of the East action approach γ^0=0.9^0=1, γ^1=0.9^1=0.9, γ^2=0.9^2=0.81, etc., moving away from the rewarded position):