Why are λ-returns so rarely used in policy gradient methods?

I've seen the Monte Carlo return $G_t$ used in REINFORCE and the TD(0) target $r_t + \gamma Q(s', a')$ used in vanilla actor-critic. However, I've never seen anyone use the $\lambda$-return $G_t^\lambda$ in these situations, nor in any other algorithms. Is there a specific reason for this? Could there be performance improvements if we used $G_t^\lambda$?

Answered by Alison Kelly

Regarding TD(λ) and $G_t^\lambda$: that can certainly be done. For example, Chapter 13 of the 2nd edition of Sutton and Barto's Reinforcement Learning book (page 332) gives pseudocode for "Actor-Critic with Eligibility Traces". It uses $G_t^\lambda$ returns not only for the critic (the value function estimator), but also for the actor's policy gradients. Note that you do not explicitly see the $G_t^\lambda$ returns in the pseudocode; they are used implicitly through eligibility traces, which allow for an efficient online implementation (the "backward view").
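
To make that backward view a bit more concrete, here is a minimal sketch in the same spirit (not the book's exact pseudocode): a tabular softmax actor and a tabular critic, each with its own eligibility trace, both updated with the same one-step TD error. The corridor environment, hyperparameters, and variable names are illustrative assumptions of mine.

```python
# Minimal sketch of actor-critic with eligibility traces (in the spirit of
# Sutton & Barto, 2nd ed., Section 13.6), using a tabular softmax policy and
# tabular state values on a made-up corridor task. All names, the environment,
# and the hyperparameters are illustrative, not taken from the book.
import numpy as np

n_states, n_actions = 10, 2              # corridor: action 0 = left, 1 = right
gamma = 0.99
lam_w, lam_theta = 0.8, 0.8              # trace-decay parameters (lambda)
alpha_w, alpha_theta = 0.1, 0.01         # critic / actor step sizes

w = np.zeros(n_states)                   # critic: state-value weights
theta = np.zeros((n_states, n_actions))  # actor: softmax action preferences

def policy(s):
    """Softmax policy over the action preferences theta[s]."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Toy dynamics: moving right eventually reaches the goal state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

for episode in range(500):
    s, I, done = 0, 1.0, False
    z_w = np.zeros_like(w)               # eligibility trace for the critic
    z_theta = np.zeros_like(theta)       # eligibility trace for the actor
    while not done:
        p = policy(s)
        a = np.random.choice(n_actions, p=p)
        s_next, r, done = step(s, a)

        # One-step TD error (terminal states have value 0).
        v_next = 0.0 if done else w[s_next]
        delta = r + gamma * v_next - w[s]

        # Accumulate traces: this is the backward view that implicitly
        # implements lambda-return targets.
        z_w *= gamma * lam_w
        z_w[s] += 1.0                    # gradient of v(s) w.r.t. w (tabular)

        grad_log_pi = np.zeros_like(theta)
        grad_log_pi[s] = -p              # gradient of log pi(a|s) for softmax
        grad_log_pi[s, a] += 1.0
        z_theta = gamma * lam_theta * z_theta + I * grad_log_pi

        # Both updates are scaled by the same TD error.
        w += alpha_w * delta * z_w
        theta += alpha_theta * delta * z_theta

        I *= gamma
        s = s_next
```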

I do indeed have the impression that such uses are fairly rare in recent research, though. I haven't played around with policy gradient methods enough to say from personal experience why that is. My guess is that policy gradient methods are almost always combined with deep neural networks, and variance is already a big enough problem when training those without also involving long-trajectory returns.

If you use a large $\lambda$ with $\lambda$-returns, you get low bias but high variance. For $\lambda = 1$, you basically get REINFORCE again, which isn't really used much in practice and has very high variance. For $\lambda = 0$, you just get one-step returns again. Higher values of $\lambda$ (such as $\lambda = 0.8$) tend to work very well in my experience with tabular methods or linear function approximation, but I suspect the variance may simply be too much when using DNNs. Note that it is quite popular to use $n$-step returns with a fixed, generally fairly small, $n$ in Deep RL approaches. For instance, I believe the original A3C paper used 5-step returns, and Rainbow uses 3-step returns. These often work better in practice than 1-step returns, but still have reasonably low variance due to using a small $n$.
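
For reference, these are the standard forward-view definitions behind that trade-off, with $V$ the current value estimate (notation as in Sutton & Barto):

$$
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}),
\qquad
G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
$$

with the convention that $G_t^{(n)} = G_t$ once $t + n$ reaches the end of the episode. Setting $\lambda = 0$ gives the one-step return $G_t^{(1)}$, setting $\lambda = 1$ gives the full Monte Carlo return $G_t$, and a fixed $n$-step method simply picks a single $G_t^{(n)}$ instead of mixing them all.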


