Exercises Reinforcement Learning
In this exercise you will apply various basic reinforcement learning methods to a toy example. Exercises 1-5 can be done based on the lecture slides; for the other questions, either use the resources provided on Blackboard or literature you found on your own. Cite all the references you employed.
Exercise 1 (10 points) Prove that the return $R_t = \sum_{h=0}^{\infty} \gamma^h r_{t+h+1}$ is bounded for $0 \leq \gamma < 1$ and for bounded rewards $-10 \leq r_{t+h+1} \leq 10$ for all $h \in \{0, 1, \ldots, \infty\}$. Note: $0^0 = 1$.
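Hint: the triangle inequality combined with the geometric series already gives the shape of the bound; the following standard estimate only sketches the direction to take:

\[
|R_t| \;=\; \Bigl|\sum_{h=0}^{\infty} \gamma^h r_{t+h+1}\Bigr|
\;\leq\; \sum_{h=0}^{\infty} \gamma^h \, |r_{t+h+1}|
\;\leq\; 10 \sum_{h=0}^{\infty} \gamma^h
\;=\; \frac{10}{1-\gamma}.
\]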
Scenario: Our robot has found a new job as a cleaning robot. The robot only has the actions "left" and "right". It is working in a corridor with 6 states; the two end-states are terminal, i.e., the episode ends immediately once the robot reaches them. It gets a reward of 1 when reaching the left state (charger) and a reward of 5 when reaching the right state (trash bin).
[Figure: the corridor with states s = 1, ..., 6; the actions are a = -1 (left) and a = 1 (right).]
The state space is
\[
S = \{1, 2, 3, 4, 5, 6\}.
\]
The transitions are
\[
s_{t+1} =
\begin{cases}
s_t & \text{if } s_t \text{ is terminal } (s_t = 1 \text{ or } s_t = 6), \\
s_t + a_t & \text{otherwise, where } a_t \in A,
\end{cases}
\]
with actions
\[
A = \{-1, 1\} = \{\text{left}, \text{right}\},
\]
and rewards
\[
r_{t+1} =
\begin{cases}
5 & \text{if } s_{t+1} = 6 \text{ and } s_t \neq 6, \\
1 & \text{if } s_{t+1} = 1 \text{ and } s_t \neq 1, \\
0 & \text{otherwise.}
\end{cases}
\]
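For reference in the implementation exercises below, these dynamics can be written down in a few lines of Python; the names step, TERMINAL, and ACTIONS are only one possible choice and are reused in the later sketches:

TERMINAL = (1, 6)
ACTIONS = (-1, 1)  # left, right

def step(s, a):
    """One transition of the corridor MDP: returns (next state, reward, episode done)."""
    if s in TERMINAL:                 # terminal states are absorbing, no further reward
        return s, 0.0, True
    s_next = s + a
    if s_next == 6:                   # trash bin
        reward = 5.0
    elif s_next == 1:                 # charger
        reward = 1.0
    else:
        reward = 0.0
    return s_next, reward, s_next in TERMINAL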
Exercise 2 (10 points) Implement Q-iteration. For $\gamma = 0.5$ the optimal Q-function is given in the table below. Show the values after each iteration. What is the optimal policy $\pi^*$? (A minimal implementation sketch follows the table.)
action \ state |  1  |   2   |  3   |   4   |  5   |  6
left           | 0.0 |  1.0  | 0.5  | 0.625 | 1.25 | 0.0
right          | 0.0 | 0.625 | 1.25 |  2.5  | 5.0  | 0.0
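The following Q-iteration sketch is self-contained apart from NumPy and should reproduce the table above for $\gamma = 0.5$; the array layout (states 1-6 as rows, actions left/right as columns) is an illustrative choice:

import numpy as np

GAMMA = 0.5
N_STATES, ACTIONS = 6, (-1, 1)

def q_iteration(n_iters=20):
    Q = np.zeros((N_STATES + 1, len(ACTIONS)))   # states indexed 1..6, row 0 unused
    for _ in range(n_iters):
        Q_new = np.zeros_like(Q)
        for s in range(2, N_STATES):             # non-terminal states 2..5
            for i, a in enumerate(ACTIONS):
                s_next = s + a
                r = 5.0 if s_next == 6 else (1.0 if s_next == 1 else 0.0)
                done = s_next in (1, N_STATES)
                Q_new[s, i] = r + (0.0 if done else GAMMA * Q[s_next].max())
        Q = Q_new
    return Q

The greedy policy can then be read off via, e.g., np.argmax(q_iteration()[2:6], axis=1), with index 0 meaning left and index 1 meaning right.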
Exercise 3 (10 points) Show the optimal value functions $Q^*$ for $\gamma = 0$, $\gamma = 0.1$, $\gamma = 0.9$, and $\gamma = 1$. Discuss the influence of the discount factor $\gamma$. Explain why $\gamma = 1$ will work here, in contrast to Exercise 1.
Exercise 4 (15 points) Implement Q-learning. For the rest of the exercises use $\gamma = 0.5$. Try different values for the exploration rate $\varepsilon$ and the learning rate $\alpha$. Plot the difference (2-norm) between the value function estimated by Q-learning and the true value function (table above) over the number of interactions with the system. Provide plots for different values of $\varepsilon$ and $\alpha$. Describe and explain the differences in behavior.
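A minimal $\varepsilon$-greedy Q-learning sketch, assuming the step() and ACTIONS helpers sketched after the scenario; the parameters eps and alpha stand for $\varepsilon$ and $\alpha$, and the episode count and seed are arbitrary illustrative choices:

import numpy as np

def q_learning(n_episodes=500, eps=0.1, alpha=0.1, gamma=0.5):
    rng = np.random.default_rng(0)
    Q = np.zeros((7, 2))                        # states 1..6, action index 0 = left, 1 = right
    for _ in range(n_episodes):
        s = int(rng.integers(2, 6))             # start in a random non-terminal state
        done = False
        while not done:
            i = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = step(s, ACTIONS[i])
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, i] += alpha * (target - Q[s, i])    # temporal-difference update
            s = s_next
    return Q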
Exercise 5 (15 points) Now the robot is partially broken: with a probability of 30% it stays in the same state, otherwise it works correctly. Test this scenario with Q-iteration and Q-learning. What happens? Can you still use the same Q-learning parameters?
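The broken robot only needs a small change to the transition sketch from the scenario; the 30% stay-probability comes from the exercise, the rest is again illustrative and reuses step() and TERMINAL:

import numpy as np

rng = np.random.default_rng()

def step_broken(s, a, p_stay=0.3):
    """Like step(), but with probability p_stay the robot does not move."""
    if s not in TERMINAL and rng.random() < p_stay:
        return s, 0.0, False
    return step(s, a)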
Exercise 6 (40 points) Somebody tried to fix the robot. At least it is no longer stopping, but now we have a different problem: rather than moving to $s' = s + a$, the robot now moves to $s' = s + a + \mathcal{N}(0, 0.01)$, where $\mathcal{N}(0, 0.01)$ is a Gaussian distribution with mean $\mu = 0$ and variance $\sigma^2 = 0.01$. Now we need to consider continuous states and non-deterministic transitions. The terminal states are reached for $s < 1.5$ and $s \geq 5.5$, respectively. Implement value function approximation with radial basis functions (we still have 2 discrete actions) for Q-learning or another algorithm of your choice. Provide pseudo-code. How many basis functions do you need? Do the widths have an influence on the performance? What happens to the learning performance compared to the other approaches? Illustrate with plots and discuss.
Hints: As you cannot use Q-iteration any longer to find the ground truth, you will need to show how quickly the learning converges and how the performance evolves (e.g., expected return). Start with 6 basis functions centered at $\{1, 2, 3, 4, 5, 6\}$ and test your implementation with $\sigma^2 = 0$. Often normalizing the RBF network is a good idea.
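One possible structure for the approximator is a normalized Gaussian RBF network with one weight vector per action, so that $Q(s, a) \approx w_a^\top \phi(s)$; the sketch below follows the hint of 6 centers at 1, ..., 6, while the width sigma, the episode count, and the env_step argument (the continuous-state transition function) are illustrative assumptions:

import numpy as np

CENTERS = np.arange(1, 7, dtype=float)          # 6 centers at the old discrete states

def phi(s, sigma=0.5):
    """Normalized RBF feature vector for a continuous state s."""
    act = np.exp(-0.5 * ((s - CENTERS) / sigma) ** 2)
    return act / act.sum()

def rbf_q_learning(env_step, n_episodes=2000, eps=0.1, alpha=0.1, gamma=0.5):
    w = np.zeros((2, len(CENTERS)))             # one weight vector per action (left, right)
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        s, done = rng.uniform(1.5, 5.5), False  # start somewhere in the non-terminal region
        while not done:
            f = phi(s)
            q = w @ f                           # approximate Q-values of both actions
            i = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q))
            s_next, r, done = env_step(s, (-1, 1)[i])
            target = r if done else r + gamma * (w @ phi(s_next)).max()
            w[i] += alpha * (target - q[i]) * f     # semi-gradient TD update
            s = s_next
    return w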
Exercise 7 (max 40 points) Bonus points[2]: implement one additional RL algorithm (e.g., policy iteration, TD($\lambda$), SARSA, actor-critic, policy search) or an additional variant of the above algorithms (e.g., state discretization, a different function approximator, eligibility traces). Provide pseudo-code. Discuss the advantages and disadvantages of the chosen approach.
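As orientation for one of the listed options: tabular SARSA differs from Q-learning only in that the update uses the action actually taken in the next state rather than the greedy one, i.e., the standard update rule

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr].
\]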
[2] You can get a maximum of 100 points (which corresponds to a grade of 10) for Exercises 1-6. Points you missed in Exercises 1-6 can be compensated for by Exercise 7. The overall maximum grade is 10.