Why does value iteration obtain similar policies to policy iteration?

I am trying to implement the value iteration and policy iteration algorithms. The value function I get from policy iteration looks vastly different from the one I get from value iteration, but the policies obtained from both are very similar. How is this possible, and what could be the reasons for it?

Answered by Ankit yadav

Both value iteration (VI) and policy iteration (PI) are guaranteed to converge to the optimal policy, so it is expected that you get similar policies from both algorithms (if they have converged).


However, they get there differently: VI can be seen as a truncated version of PI. The pseudocode of both algorithms is given in Sutton and Barto's book, and I suggest you get familiar with it (though you probably already are if you have implemented both algorithms); minimal sketches are also given below. Policy iteration updates the policy multiple times, because it alternates a policy evaluation step with a policy improvement step, in which a better policy is derived from the current estimate of the value function.

On the other hand, value iteration derives the policy only once, at the end. In both cases, the policy is derived from the value function in the same way (greedily). So, if you obtain similar policies, you might think they must have been derived from similar final value functions. In general, however, this need not be the case, and this is actually the motivation for value iteration: you may derive an optimal policy from a non-optimal value function. Sutton and Barto's book provides an example; see figure 4.1 on page 77 (p. 99 of the pdf).
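Here is a matching sketch of value iteration, using the same hypothetical MDP representation as above. Again, `P`, `gamma` and `theta` are illustrative assumptions, not something specified in the question.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Sketch of value iteration with the same assumed MDP format P[s][a]."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: take the max over actions directly,
            # instead of evaluating a fixed policy to convergence.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            v = max(q)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # The policy is extracted only once, greedily from the final V.
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return policy, V
```

Because the greedy extraction step is the same in both sketches, two different value functions can still map to the same (or very similar) greedy policies, which is what you are observing.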


