Figure 17.1 on page 614 of Russell and Norvig shows, in part (a), the state space for a Markov Decision Process. Part (b) of the figure shows the state transition probabilities, defined relative to the action chosen by some policy, PI. For example, if the agent is in state q(i) and the policy selects action PI(i), the intended move is carried out with probability 0.8; with the remaining probability 0.2 the agent moves in one of the alternative directions instead (in the textbook's model, 0.1 for each of the two directions at right angles to the intended one).
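The transition model can be sketched in Python roughly as follows. This is only an illustration, not a required design. The grid layout (4 columns by 3 rows, with a wall at column 2, row 2), the 0.1/0.1 split of the sideways probability, and all names (move, transition_probs, sample_next) are assumptions taken from the textbook figure or invented here; check them against part (a) and part (b) before relying on them.

```python
import random

# Assumed layout of the grid world: columns 1-4, rows 1-3, one wall square.
COLS, ROWS = 4, 3
WALL = {(2, 2)}
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# The two directions at right angles to each intended direction.
SIDEWAYS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def move(state, direction):
    """Deterministic move; hitting the wall or the boundary leaves the state unchanged."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in WALL or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return state
    return nxt


def transition_probs(state, action):
    """Return {next_state: probability} for attempting `action` in `state`."""
    probs = {}
    left, right = SIDEWAYS[action]
    # Intended direction with 0.8, each sideways direction with 0.1 (assumed split).
    for direction, p in ((action, 0.8), (left, 0.1), (right, 0.1)):
        nxt = move(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


def sample_next(state, action, rng=random):
    """Draw one successor state according to transition_probs."""
    probs = transition_probs(state, action)
    states = list(probs)
    return rng.choices(states, weights=[probs[s] for s in states])[0]
```

For instance, transition_probs((1, 1), "N") puts most of the probability on the square above, and the rest on staying put (bumping into the left boundary) or slipping to the right.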
Use the temporal difference learning algorithm to determine the utility function, U(i), as follows. First, choose some policy PI(i). The obvious one will suffice. Based on this policy and the indicated state transition probabilities, generate 100 trials (training sequences) using a Monte Carlo technique, that is, by sampling each state transition at random according to those probabilities. Next, define a reward function, R(i), by assigning the values indicated in part (a) to the final states and the constant -0.04 to all other states. Then run your program for the Temporal Difference Algorithm on your training data until some reasonable convergence criterion is satisfied, for example until the largest change in any U(i) over one pass through the training data falls below a small threshold. Finally, compare your values of U(i) to those given in Figure 17.3, p. 619, and explain any differences.
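One possible shape for such a program is sketched below. It is a starting point under stated assumptions rather than the expected solution. The update applied at each observed transition from state i to state j is U(i) <- U(i) + alpha * (R(i) + gamma * U(j) - U(i)). Everything specific in the sketch is an assumption: the grid layout and final-state rewards (+1 and -1, as read from the textbook figure), the start state (1, 1), the particular policy stored in POLICY, gamma = 1, the decaying step size 60 / (59 + n), the tolerance, and all function names.

```python
import random

# Assumed grid world: columns 1-4, rows 1-3, wall at (2, 2).
COLS, ROWS, WALL = 4, 3, {(2, 2)}
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}  # assumed final-state rewards
STEP_REWARD, GAMMA, START = -0.04, 1.0, (1, 1)  # START and GAMMA are assumptions
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SIDEWAYS = {"N": "WE", "S": "EW", "E": "NS", "W": "SN"}
# Hypothetical fixed policy PI(i); one reading of "the obvious one".
POLICY = {(1, 1): "N", (1, 2): "N", (1, 3): "E", (2, 3): "E", (3, 3): "E",
          (2, 1): "W", (3, 1): "W", (4, 1): "W", (3, 2): "N"}


def reward(state):
    """R(i): the final-state value, or -0.04 for every other state."""
    return TERMINALS.get(state, STEP_REWARD)


def step(state, action, rng):
    """Sample a successor: intended move 0.8, each right-angle move 0.1 (assumed)."""
    r = rng.random()
    if r < 0.8:
        direction = action
    elif r < 0.9:
        direction = SIDEWAYS[action][0]
    else:
        direction = SIDEWAYS[action][1]
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    blocked = nxt in WALL or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS)
    return state if blocked else nxt


def generate_trial(rng):
    """Follow POLICY from START until a final state is reached."""
    state, trial = START, [START]
    while state not in TERMINALS:
        state = step(state, POLICY[state], rng)
        trial.append(state)
    return trial


def td_learn(trials, tol=1e-4, max_sweeps=1000):
    """Sweep the fixed training data repeatedly, applying the TD update."""
    U = {s: 0.0 for s in POLICY}
    U.update(TERMINALS)  # utility of a final state is its reward
    visits = {s: 0 for s in POLICY}
    for _ in range(max_sweeps):
        before = dict(U)
        for trial in trials:
            for s, s_next in zip(trial, trial[1:]):
                visits[s] += 1
                alpha = 60.0 / (59.0 + visits[s])  # assumed decaying step size
                U[s] += alpha * (reward(s) + GAMMA * U[s_next] - U[s])
        # Stop when no utility changed by more than tol during this sweep.
        if max(abs(U[s] - before[s]) for s in U) < tol:
            break
    return U
```

A typical run would be: rng = random.Random(0), then trials = [generate_trial(rng) for _ in range(100)], then U = td_learn(trials). Note that in this sketch any state the trials never visit keeps its initial value of 0.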