Description
-
Optimal Policy and Value Function [15 points]
Consider a simple 2-state MDP shown in Figure 1, S = fS1; S2g. From each state, there are two available actions A = fstay; gog. The reward received for taking an action at a state is shown in the figure for each (s; a) pair. Also, taking action go from state S2 ends the episode. All transitions are deterministic.
Figure 1: 2-state MDP
(a) Consider a policy that always chooses the stay action at every state. Compute the sum of discounted rewards obtained by this policy starting at state S1, assuming a discount of i.e.
1
compute P trt(st; at). [4 points]
t=0
-
What is the optimal policy? Write it as a tuple (a1; a2) of the optimal actions at (s1; s2) respectively. Compute the sum of discounted rewards obtained by this policy, given that the start state is S1, with discount . [4 points]
-
Recall that one update of the value iteration algorithm performs the following operation, where r(s; a) denotes the reward for taking action a at state s.
-
-
-
For each s 2 S :
V i+1(s) max X p(s0js; a) r(s; a) + V i(s0)
(1)
a
s0
-
-
Note: Since taking action go at s2 will terminate the episode, the value of taking the go action at s2 is just r(s2; go) instead of r(s2; go) + V (s1) (since you don’t care about future values after the episode has ended). Thus the update for state V (s2) will look like:
-
-
-
-
-
V i+1(s2) max r(s2; stay) + V i(s2); r(s2; go)
(2)
-
-
-
-
The value function V is stored as an array of size jSj ( jSj = 2 in our case). For example, the array V 0 = [0; 1] denotes that V 0(s1) = 0; V 0(s2) = 1. Starting with a initial V 0 = [0; 0], peform the value iteration update to compute V 1; V 2; V 3 assuming a discount factor = 1. What is the optimal V ? [7 points]
-
Value Iteration Convergence [25 points + 5 bonus points]
In part (c) of the previous question, we computed V i for a few updates of value iteration and also V , the optimal value function. What happened to the “error” in the value estimate at every update i.e. did jV i(s) V (s)j every increase for a particular state s? The answer is yes, the value estimate
x2k1,
for any s may fluctuate before eventually converging to V (s). If this is the case, do we have any hope that the value iteration algorithm will converge to V ? In this question, we will prove that value iteration indeed converges due to the monotonic decrease in a different error – the maximum
error over all states |
s2S j |
j or the |
1 norm |
V i |
1. |
||||||||||
max V i(s) |
V |
(s) |
L |
V |
|||||||||||
(a) |
For V 1; V 2; V 3 obtained in 1(c), compute |
V i |
V |
||||||||||||
monotonically. [5 points] |
1 and verify that this error decreased |
||||||||||||||
-
Let T : RjSj ! RjSj be the function that takes the vector V i as input and applies the value iteration update to produce V i+1. We will show that for any two vectors V; V 0 2 RjSj, this operator decreases the L1 norm between them.
Prove that: kT(V ) T(V 0)k1 kV V 0k1 for any V; V 0 2 RjSj, [10 points]
-
Now consider the sequence of value functions fV ng obtained by iteratively performing the value iteration update as per Eq. 1. This sequence has the interesting property of being a cauchy sequence i.e. the elements of the sequence become arbitrarily close to each other as the
sequence progresses. Formally, we say a sequence is cauchy w.r.t. the L1 norm if for every positive number , there is a positive integer N s.t. for all positive integers m; n greater than N, kxm xnk1 < . Equipped with this fact, show that the error of V n+1 w.r.t the optimal value function can be bounded as: 8 > 0; 9N > 0 s.t. 8n > N
-
V n+1
V
1
1
;
(3)
[10 points]
, then apply the Cauchy
Hint: First try to prove that
V n+1
V
V n
V n+1
sequence condition on V n
1
1
1
Discussion: Given Equation 3, we now have a guarantee that value iteration converges to V . However, we made the assumption that fV ng is a Cauchy sequence. As a bonus question below, we will prove that the condition in (a) is sufficient to conclude that fV ng is indeed a cauchy sequence.
(d) [BONUS] Given a function T : Rn ! Rn such that kT(x1) T(x2)k1 kx1
2 [0; 1), prove first that T has a fixed point i.e. 9x : T(x ) = x and second that the fixed point of T is unique. [5 bonus points]
-
Learning the Model. [20 bonus points]
This is a challenging question with all sub-parts as bonus credit.
While value iteration (VI) allows us to find the optimal policy, performing VI updates (Eq. 1) requires the transition function T(s; a) := p(s0 j s; a) 1 and the reward function R(s; a). In many practical situations, these quantities are unknown. However, we can step through the environment and observe transitions to collect data of the form (s; a; s0; r). Using this data, if we can obtain an estimate of the transition and reward functions, we can hope to still find the optimal policy via value iteration. Let M denote the true (unknown to us) model of the world (a model consists of
1we will denote T(s; a; s0) as the probability value of s0 give (s; a), whereas T(s; a) denotes the entire probability distribution p(s0 j s; a)
both a transition and reward function), and let ^ represent our approximate model of the world M
that we estimate from data.
In this question, given estimates of both the transition function T(s; a) = p(s0 j s; a) 2 jSj 1 and the reward function R(s; a), we want to study the error of acting optimally in this estimated MDP as opposed to acting optimally in the “true” MDP. Naturally, we should expect this error to be smaller as we get better estimates which in turn is proportional to how many observations we make.
(a) [BONUS] For the first part of this problem, assume that we have used our favorite learning
-
^
^
method to obtain estimates of the transition function T and the reward function R. Given
^
^
T(s; a)jj1 P , then for any policy
that maxs;a jR(s; a) R
(s; a)j R and maxs;a jjT(s; a)
: S ! A; 8s 2 S, show that [10 bonus points]
-
-
-
-
-
-
jjVM^ VM jj1
R
+
P
(4)
1
(1 )2
-
-
-
-
-
Assume that the reward function is in [0; 1].
Hint: 1) Expand VM (s) using the Bellman equation 2) Holder’s Inequality: jju vjj1 jjujj1 jjvjj1
(b) [BONUS] We just bounded the error of acting according to any policy : S ! A under the
-
^
. Now, we use this to bound the value error in
estimated model, M vs. the true model, M
acting optimally vs. acting optimally according to the estimated model ( ^ ).
M
Using (4), show that [3 bonus points]
V (s)
^
2 sup
V
V
(5)
V M (s)
M
M
jj
M^
M jj1
:S!A
(c) [BONUS] Given
-
-
-
-
-
-
R
r
(6)
2n
jS
=
1
log
4
Aj
P
= jSj r
(7)
2n log
4
jS
A Sj
1
-
-
-
-
-
of observations required for each state action pair, n? [2 bonus points]
-
[BONUS] Given a dataset Ds;a = f(s; a; s01; r1); : : : (s; a; s0n; rn)g we estimate the unknown transition and reward function using empirical frequencies as:
-
-
-
-
-
-
^(s; a; s0) =
1
n
1(s0 = s0)
Xi
(8)
T
n
i
=1
^
1
n
R(s; a) = n
Xi
r
(9)
=1
-
-
-
-
-
For notational convenience, we will use ^ to denote a vector of size j j whose element is
T(s; a) S
the probability computed in (8) for each next state, s0 2 S. Prove that R and P take values as in (6) and (7). [5 bonus points]
Hint: Hoeffding’s inequality2
-
Policy Gradients Variance Reduction [10 points]
In class (will be covered on Oct 29th, but you don’t really need to know the derivation for this question), we derived the policy gradient as
-
-
-
-
-
r J( ) = r E [R( )]
(10)
= E [R( )r log ( )]
(11)
1
N
Xi
(12)
N
R( i)r log ( i)
=1
-
-
-
-
where is a trajectory3, R( ), the associated reward and N, the number of trials to estimate the expected reward, J( ), the expected reward and , the parameters of the policy. These gradients are often very noisy and lead to unstable training. In this question, we show a simple method to reduce the variance in the gradients.
(a) Show that adjusting the reward as R( ) := R( ) b does not change the gradient estimate in Eq. 10. [2 points]
-
Compute the variance of r J( ) and notice how subtracting b helped reduce it. What value of b will lead to the least variance? [8 points]
Hint: The variance of J( ) is written as V ar(x) = E[x2] (E[x])2 where x is substituted as R( )r log ( )
-
Paper Review: [4 points]
In this course’s final paper review section, we examine an interesting project that upends con-ventional research in reinforcement learning. While most RL problems are aimed at maximizing rewards, this work proposes ’upside down reinforcement learning’, where the objective is remodelled as a supervised learning problem. Limited experiments demonstrate promising and competitive results.
The paper can be accessed here.
As in the previous assignments, please limit your reviews to 350 words.
Briefly summarize the key findings, strengths and potential limitations of this work.
-
Coding: Dynamic Programming and Deep Q-Learning [60 points + 5 bonus points]
Follow the instructions at this link to the HW4 coding webpage.
3A trajectory is defined as a sequence of states and actions i.e. (s0; a0; s1; a1; : : : sn 1; an 1; sn)
5