Let’s first describe the main setting we will be handling: continuing tasks. Continuing tasks are tasks that have no specific terminal state and therefore go on forever. As simple as it sounds, it is not a piece of cake to tackle the issues this brings with it. One example is the stock market, where there is no end and you keep getting data; another, as the book suggests, is the access-control queuing task (Example 10.2).
I will follow a simple format so that we can all stay on the same page and everything is clear-cut:
So let’s start.
First of all, we should know that discounting works well for tabular cases. The issue we will be talking about arises when we start to use approximation.
We have one long stream of experience with no beginning or end, and no way to clearly distinguish episodes. As the book suggests, we might try to use the feature vectors for this, but then the issue of clear separability arises: two feature vectors may have little to no difference between them, making them impossible to tell apart.
Since we have no start or end point, and no clear line between episodes, discounting loses its purpose. Well, it is actually still possible to use, but it is not needed: using $\gamma = 0$ will give the same policy ordering as any other value, because the discounted objective is proportional to the average reward. That’s why we will use the average reward instead. Here is the proof that both criteria rank policies the same way (discounted and without discounting):
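The argument, essentially the box from Section 10.4 of the book, goes as follows: sum the discounted value over the steady-state distribution $\mu_\pi$ and the discount factor cancels out of the policy ordering.

$$
\begin{aligned}
J(\pi) &= \sum_{s}\mu_\pi(s)\,v_\pi^\gamma(s) \\
&= \sum_{s}\mu_\pi(s)\sum_{a}\pi(a \mid s)\sum_{s',r}p(s',r \mid s,a)\big[r + \gamma v_\pi^\gamma(s')\big] \\
&= r(\pi) + \gamma\sum_{s'}v_\pi^\gamma(s')\sum_{s}\mu_\pi(s)\sum_{a}\pi(a \mid s)\,p(s' \mid s,a) \\
&= r(\pi) + \gamma\sum_{s'}v_\pi^\gamma(s')\,\mu_\pi(s') \\
&= r(\pi) + \gamma J(\pi) \;\Rightarrow\; J(\pi) = \frac{r(\pi)}{1-\gamma}.
\end{aligned}
$$

So the discounted objective is just $r(\pi)$ scaled by a constant, and $\gamma$ plays no role in which policy is best.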
The main issue with discounting in the approximate case is that, since states share the same features, we no longer have the policy improvement theorem, which stated that we can obtain a better policy just by switching each state’s action selection to the greedy one. When we could choose the probabilities for one state without affecting the others, this was easy to handle; now that we have lost that property, there is no guaranteed improvement over the current policy.
As Rich puts it, “This is an area with multiple open theoretical questions,” if you are interested.
Average reward is a popular criterion from dynamic programming that was later brought into the reinforcement learning setting. As discussed above, we use the average reward for the approximate continuing setting. No discounting means we care about every reward equally, regardless of whether it occurs in the far future.
We denote it as $r(\pi)$. Without much detail yet, here is the main definition for intuition: $$ r(\pi) \doteq \sum_{s}\mu_\pi(s)\sum_{a}\pi(a \mid s)\sum_{s', r}p(s', r \mid s, a)\, r $$ Basically, we consider the best policy to be the one with the highest $r(\pi)$. In the average-reward setting we define returns as the difference between the rewards received and $r(\pi)$; this is called the differential return: $$ G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots $$ The differential return keeps almost all the properties ordinary returns had. The only change is replacing each reward with the difference, i.e. $R_{t+1} - r(\pi)$; this goes for TD errors, Bellman equations, etc.
We already saw the formula for $r(\pi)$, but we didn’t see where it comes from or what all those pieces mean. $$ r(\pi) \doteq \lim_{h\rightarrow\infty} \frac{1}{h} \sum_{t=1}^{h}\mathbb{E}[R_t \mid S_0, A_{0:t-1} \sim \pi] $$ Let’s unpack what’s happening here. We take the first $h$ rewards, sum their expected values given the starting state and an action trajectory following the policy $\pi$, and divide by $h$ to get their average. So we simply had $h$ rewards and we averaged them. Then: $$ r(\pi) = \lim_{t\rightarrow\infty} \mathbb{E}[R_t \mid S_0, A_{0:t-1} \sim \pi] $$ The average of a sequence converges to the same limit as the sequence itself (assuming the MDP is ergodic, so this limit exists and does not depend on $S_0$), so we can drop the averaging and just let $t\rightarrow\infty$. The next jump in the book looks fuzzy, but once you open it up it is extremely easy to see how it happens.
If we have randomness over something, what we want is its expectation: once we take the expectation, we can formulate it and there is no randomness left. In an MDP, there are three kinds of randomness that can occur.
What does this mean? We are in some state, but we don’t know which; from there we take an action, but we don’t know for sure which one; and finally, even after taking that action, we don’t know which state we will end up in, because we don’t know the dynamics of the environment (and even if we do, they may be stochastic). So the formula unfolds as nested expectations: $$ \mathbb{E}\big[\,\mathbb{E}[\,\mathbb{E}[R_t \mid S_t, A_t]\,]\,\big] $$ where the innermost expectation is over the dynamics, the middle one over the actions, and the outermost over the states. From the Bellman equations we know how to write the innermost one down: $$ \mathbb{E}[R_t \mid S_t = s, A_t = a] = \sum_{s',r}p(s',r \mid s, a)\, r $$ This is the expected reward, is it not? Now let’s add the action selection on top: $$ \mathbb{E}[R_t \mid S_t = s] = \sum_{a}\pi(a \mid s)\sum_{s',r}p(s',r \mid s, a)\, r $$ One last piece is left: the state distribution. We use $\mu_\pi(s)$, the steady-state distribution of states under $\pi$ (which the book covered earlier, in Chapter 9). So the last piece of the puzzle: $$ \mathbb{E}[R_t] = \sum_{s}\mu_\pi(s)\sum_{a}\pi(a \mid s)\sum_{s',r}p(s',r \mid s, a)\, r $$ That’s all; we have now covered the average reward formula.
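As a sanity check, we can verify numerically on a tiny made-up MDP that the triple sum matches the long-run average of simulated rewards. All the numbers below (transition probabilities, rewards, policy) are invented purely for illustration:

```python
import random

# Hypothetical two-state, two-action MDP, just to check the formula.
# p1[s][a]: probability of moving to state 1 after taking a in s
p1 = [[0.9, 0.3], [0.2, 0.6]]
# r[s][a]: expected reward for taking a in s
r = [[1.0, 0.0], [2.0, -1.0]]
# pi[s][a]: a fixed stochastic policy
pi = [[0.5, 0.5], [0.25, 0.75]]

# Steady-state distribution mu_pi by power iteration
mu = [0.5, 0.5]
for _ in range(10_000):
    to1 = sum(mu[s] * pi[s][a] * p1[s][a] for s in range(2) for a in range(2))
    mu = [1.0 - to1, to1]

# Average reward via the formula: sum_s mu(s) sum_a pi(a|s) E[R|s,a]
r_pi = sum(mu[s] * pi[s][a] * r[s][a] for s in range(2) for a in range(2))

# Long-run simulated average of the same chain
random.seed(0)
s, total, T = 0, 0.0, 200_000
for _ in range(T):
    a = 0 if random.random() < pi[s][0] else 1
    total += r[s][a]
    s = 1 if random.random() < p1[s][a] else 0

print(round(r_pi, 3))  # the two numbers agree up to sampling noise
```

For this toy chain the exact answer works out to $1/11 \approx 0.091$, and the simulated mean lands within sampling noise of it.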
In practice we will use a moving mean to estimate the average reward.
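Concretely, a moving (exponentially weighted) mean nudges the estimate toward each new reward with a small step size $\beta$ (in the pseudocode later, the nudge uses the TD error $\delta$ instead of the raw difference). A minimal sketch with made-up numbers:

```python
beta = 0.1          # step size for the average-reward estimate
avg_reward = 0.0    # current estimate of r(pi)

# Hypothetical stream of rewards, just to show the update
for reward in [1.0, 0.0, 2.0, 1.0, 1.0]:
    avg_reward += beta * (reward - avg_reward)

print(round(avg_reward, 5))
```

Recent rewards dominate the estimate, which is what we want when the policy (and hence $r(\pi)$) keeps changing during learning.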
Well, I don’t have much to add here. If you read the Semi-Gradient SARSA post, this is mostly just changing the update rule for the continuing setting. The change is in $G_{t:t+n}$:
$$G_{t:t+n} \doteq R_{t+1}-\bar{R}_{t+1} + R_{t+2}-\bar{R}_{t+2} + \ldots + R_{t+n}-\bar{R}_{t+n} + \hat{q}(S_{t+n},A_{t+n},w_{t+n-1})$$
The TD error will then be:
$$ \delta_t = G_{t:t+n} - \hat{q}(S_t, A_t, w_{t+n-1}) $$
and we will use another step-size parameter $\beta$ to update the average reward estimate. Here is the pseudocode:
And here is my implementation, which I assume does not require much explanation:
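In case the embedded snippet does not render, here is a minimal sketch of what such an update could look like. The class layout and names (`DifferentialNStepSarsa`, `trajectory`, `update`) are my own illustration, not necessarily the original code; a linear $\hat{q}$ is assumed:

```python
import numpy as np

class DifferentialNStepSarsa:
    """Sketch of differential semi-gradient n-step SARSA with linear q-hat."""

    def __init__(self, n_features, n, alpha=0.1, beta=0.01):
        self.w = np.zeros(n_features)
        self.n = n
        self.alpha = alpha        # step size for the weights
        self.beta = beta          # step size for the average-reward estimate
        self.avg_reward = 0.0
        self.trajectory = []      # list of (x(s, a), reward) pairs

    def q_hat(self, x):
        return self.w @ x         # linear value estimate

    def update(self, x, reward, x_next):
        self.trajectory.append((x, reward))
        # Keep only the last n transitions
        if len(self.trajectory) > self.n:
            self.trajectory.pop(0)
        # No update until we have at least n elements
        if len(self.trajectory) < self.n:
            return
        x_tau, _ = self.trajectory[0]
        # Differential n-step return minus the current estimate
        delta = sum(r - self.avg_reward for _, r in self.trajectory) \
                + self.q_hat(x_next) - self.q_hat(x_tau)
        self.avg_reward += self.beta * delta   # moving-mean update via delta
        self.w += self.alpha * delta * x_tau   # gradient of linear q-hat is x
```

Note that $\bar{R}$ is updated with $\beta\delta$, as in the book’s pseudocode, rather than with the raw reward difference.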


It is basically almost the same as the previous version. We first check whether we have more elements than $n$, which means we need to remove the first element from storage. Then we check whether we have enough elements, because we won’t make any updates until there are at least $n$ elements in the trajectory. The rest is the same update as in the pseudocode.
Again we run an experiment using the same settings as before, which results in high-variance learning, though it does learn, which is the point here right now 😄.
I have a blog series on RL algorithms that you can check out. You can also check BetterRL, where I share raw Python RL code for both environments and algorithms. Any comments are appreciated!
If you read the prediction part for the semi-gradient methods, it is pretty easy to extend what we know to the control case. Control is almost always just adding policy improvement on top of the prediction case, and that’s exactly the situation for semi-gradient control methods as well.
We have already described and understood a formula back in the prediction part (if you read it somewhere else, that’s also fine), and now we want to widen our window a little.
For prediction we were using examples of the form $S_t \mapsto U_t$; now, since we have action-values instead of state-values (because we will pick the best action possible), we will use examples of the form $S_t, A_t \mapsto U_t$, meaning that instead of $v_\pi(S_t)$ we will be estimating $q_\pi(S_t, A_t)$.
So our general update rule would be (following the formula for prediction):
$$ w_{t+1} = w_t + \alpha [U_t - \hat{q}(S_t, A_t, w_t)] \nabla\hat{q}(S_t, A_t, w_t) $$
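With a linear $\hat{q}$, where $\nabla\hat{q}(s,a,w) = x(s,a)$, one such update step could be sketched like this (the function name and arguments are illustrative):

```python
import numpy as np

def semi_gradient_update(w, x_sa, target, alpha=0.1):
    """One semi-gradient step: w += alpha * (U_t - q_hat) * grad q_hat.
    For a linear q_hat(s, a, w) = w @ x(s, a), the gradient is just x(s, a)."""
    q_hat = w @ x_sa
    return w + alpha * (target - q_hat) * x_sa

w = np.zeros(3)
w = semi_gradient_update(w, np.array([1.0, 0.0, 1.0]), target=1.0)
print(w)  # only the active features move toward the target
```

Whatever we plug in as `target` ($n$-step return, Monte Carlo return, etc.) is the $U_t$ of the formula above.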
As always, you can replace $U_t$ with any approximation of the return you want, so it could even be a Monte Carlo target (though I believe that case does not count as semi-gradient, since without bootstrapping it is a true stochastic gradient method; the book groups it here anyway, so I am just passing the information along 😄). Therefore we can implement an $n$-step episodic SARSA with an infinite-$n$ option, which corresponds to Monte Carlo (we will learn a better way to do this in future posts).
The last piece to add is policy improvement: since we are doing control, we need to update our policy and make it better as we go, of course. This won’t be hard, because we will just use a soft policy; I will use the classic $\epsilon$-greedy policy.
One more thing to note, which I think is pretty important: for continuous action spaces, or large discrete action spaces, the methods for the control part are still not clear, meaning we don’t yet know the best way to approach them. If you think of a very large set of actions, there is no good way to apply a soft action-selection technique, as you can imagine.
For the implementation, as usual we will just go linear, as it is the best way to grasp every piece of information. But first, as usual, I will give the pseudocode from the book.
I only took the pseudocode from Section 10.2, because we don’t really need the one before it, which is only the one-step version. We are interested in the general, $n$-step version.


**Initialize**: we start by initializing the necessary things: the step size $\alpha$, plus $\gamma$ and $\epsilon$. Other than these, we need to initialize our weight vector. We will have one weight segment per action, concatenated one after another. So assume we have 4 observations, say [1 0 1 0], meaning features 0 and 2 are active; if we want the features for action 0 and there are 3 possible actions in total, we get [1 0 1 0 0 0 0 0 0 0 0 0]. This will make more sense once we get to $\epsilon$-greedy.
Next, let’s take a step, meaning we pick an action according to our action-values. We take the observations as input (these come from the environment) and, assuming we get an array of the values for each action from `_act(obs)`, all we have to do is roll the dice and decide whether to pick a random action or the action with the highest value at the current time. That is exactly $\epsilon$-greedy action selection.
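A minimal sketch of that selection step (the function name is illustrative, not necessarily the one in the repo):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon_greedy([0.2, 0.9, 0.1], epsilon=0.0))  # greedy pick
```

With $\epsilon = 0$ this always returns the greedy action; with $\epsilon = 1$ it is uniformly random.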
**Best $\hat{q}$-value**: now we need to fill in the function `_act(obs)`, which basically calls $\hat{q}(s, a, w)$ for each action, stores the results in an array, and returns it.
Continuing from there, we have $\hat{q}(s,a,w)$ to implement, which is just writing down the linear formula, since we are implementing it linearly: $\hat{q}(s,a,w) = w^\top x(s, a)$, where $x(s,a)$ is the state-action representation. In our case, as I already mentioned, this is the stacked vector in which the observations are concatenated one after another, one segment per action.
Finally, $x(s, a)$: as I already mentioned twice 😄, we build $x$ as a vector that is zero everywhere except the segment for the active action, which holds the observation.
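The construction from the example above ([1 0 1 0] with 3 actions) could be sketched as:

```python
import numpy as np

def x_sa(obs, action, n_actions):
    """Stack the observation into the segment for the chosen action,
    zeros everywhere else."""
    vec = np.zeros(len(obs) * n_actions)
    start = action * len(obs)
    vec[start:start + len(obs)] = obs
    return vec

print(x_sa(np.array([1, 0, 1, 0]), action=0, n_actions=3))
```

With this layout, $w^\top x(s,a)$ only touches the weights belonging to the chosen action, so each action effectively has its own weight vector inside the big one.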
That was the last thing we needed to be able to choose an action for a given state. So let’s take a broader perspective and assume we are using `step(obs)`; here is how it would look:
Now, what is left? The update… 🤦‍♂️ Yeah, without the update there is basically no change. It is also the part that differs with $n$. Let’s remember the formula:
$$ w_{t+1} = w_t + \alpha[R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n\hat{q}(S_{t+n},A_{t+n},w_{t}) - \hat{q}(S_{t},A_{t},w_{t})] \nabla\hat{q}(S_t, A_t, w_t) $$


There is a bit of a change here from the pseudocode I provided. Since we want a full separation between the agent, the environment, and the experiment, we need a class structure for the algorithms, so we won’t follow the pseudocode exactly.
**Update**: what happens here is actually not that different. Since we only need $n+1$ elements to make an update, we won’t keep the rest of the trajectory. Once we use an $n$-element window, its first element becomes useless for the next update, so we remove it from the trajectory and use the rest for our update.
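That windowed update could be sketched like this (a linear $\hat{q}$ is assumed, and the function name and trajectory layout are my own, not necessarily the repo’s):

```python
import numpy as np

def n_step_update(w, trajectory, q_next, alpha=0.1, gamma=0.9):
    """trajectory: list of (x(s, a), reward) pairs, oldest first.
    Updates w in place toward the n-step return and returns the
    trajectory with its now-useless first element dropped."""
    x_tau, _ = trajectory[0]
    # n-step return: discounted rewards plus the bootstrap term
    g = sum(gamma**i * r for i, (_, r) in enumerate(trajectory))
    g += gamma**len(trajectory) * q_next       # q_hat(S_{t+n}, A_{t+n}, w)
    w += alpha * (g - w @ x_tau) * x_tau       # linear: grad q_hat = x
    return trajectory[1:]
```

Each call consumes exactly one transition from the front of the window, which matches the "remove the first element" bookkeeping described above.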
**Terminal**: we also have a terminal state, and as can be seen in the pseudocode, the updates change when we reach it. Logically enough, we no longer have $n+1$ elements left to complete the calculation, so we just use the remaining rewards rather than bootstrapping with $\hat{q}(s,a,w)$. We therefore need another function to handle this, which we call `end()` in our structure:
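A sketch of how that terminal flush could look, again with a linear $\hat{q}$ and illustrative names:

```python
import numpy as np

def end_episode(w, trajectory, alpha=0.1, gamma=0.9):
    """Drain the remaining trajectory at a terminal state: each update
    uses only the observed rewards, with no q-hat bootstrap term."""
    while trajectory:
        x_tau, _ = trajectory[0]
        g = sum(gamma**i * r for i, (_, r) in enumerate(trajectory))
        w += alpha * (g - w @ x_tau) * x_tau   # linear: grad q_hat = x
        trajectory = trajectory[1:]            # shrink the window each pass
    return w
```

The window shrinks by one on every pass, so the last update is effectively a one-step Monte Carlo target.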


As we can see, we are not doing anything too different here: we just use the last elements left and remove them all from the trajectory while making the final updates to our weights.
And we are almost done, except that I haven’t shown `grad_q_hat()` yet, which gives $\nabla\hat{q}(s,a,w)$.
Surprise! Since we are using a linear function, $\nabla_w\, w^\top x(s, a) = x(s,a)$. That’s all.
Let’s see what the experiment part would look like and run the code to get some results.


I used tile coding and the grid world environment from our library. If you want, you can modify it a little to use another state representation, Rich Sutton’s tile coding library, or a gym environment.
Anyway, what we do is pretty simple if you read through it, and you can ask for clarification on any point that looks weird.
The main points here are the agent functions and how we use them. All three are used as we said: on each step we call `agent.step()`, and for each step `update()` is called, except at the terminal state, where we call `end()` instead.
I will give just one graph as the result, as usual: here are 100 runs on the stochastic grid world environment.
If you liked this post, follow BetterRL and leave a like down below. I have a blog series on RL algorithms that you can check out. You can also check the repo where I share raw Python RL code for both environments and algorithms. Any comments are appreciated!