Explain the intuition behind TD Lambda.

Asked by AndrewJenkins on May 10, 2022

I'd like to better understand temporal-difference learning. In particular, I'm wondering if it is prudent to think about TD(λ) as a type of "truncated" Monte Carlo learning?

Answered by Ananya Pawar

TD(λ) is better thought of as a combination of TD and MC learning than as a truncated form of MC: it avoids having to choose one method or the other and takes advantage of both approaches.


More precisely, TD(λ) is temporal-difference learning with a λ-return as its target. The λ-return is a weighted average of all n-step returns, where an n-step return is a target built from n future rewards plus an estimate of the value of the state reached n steps later. For example, TD(0) (Q-learning is usually presented as a TD(0) method) uses the 1-step return: one future reward plus an estimate of the value of the next state.

The parameter λ controls how the n-step returns are weighted, and therefore how TD-like or MC-like the method is: with λ = 0 the λ-return reduces to the 1-step TD target, while with λ = 1 it becomes the full Monte Carlo return. So TD(λ) is less a truncated MC method than an interpolation between TD(0) and MC.

There are two equivalent perspectives on TD(λ): the forward view (the λ-return described above) and the backward view (eligibility traces). The blog post Reinforcement Learning: Eligibility Traces and TD(lambda) gives a quite intuitive overview of TD(λ); for more detail, see the corresponding chapter of the book Reinforcement Learning: An Introduction.
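If it helps to see the forward view concretely, here is a minimal sketch of the λ-return for one state in a finished episode. The function names and the use of plain Python lists of rewards and state-value estimates are just illustrative assumptions, not a reference implementation:

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: up to n discounted rewards plus the bootstrapped
    value of the state reached n steps later (no bootstrap if the episode ended)."""
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        g += gamma ** n * values[t + n]
    return g

def lambda_return(rewards, values, t, gamma, lam):
    """Forward-view lambda-return: n-step returns weighted by (1 - lam) * lam**(n - 1);
    the full (Monte Carlo) return absorbs the remaining weight lam**(T - t - 1)."""
    T = len(rewards)
    g = 0.0
    for n in range(1, T - t):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    g += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g
```

Setting lam = 0 makes this collapse to the 1-step TD target, and lam = 1 makes it the full Monte Carlo return, which is exactly the interpolation described above.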

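The backward view implements the same idea incrementally with eligibility traces: every visited state keeps a decaying trace, and each new TD error is credited to all traced states at once. A minimal tabular sketch, where the array-based V and traces representation and the function name are assumptions for illustration:

```python
import numpy as np

def td_lambda_update(V, traces, s, r, s_next, done, gamma=0.99, lam=0.9, alpha=0.1):
    """One backward-view TD(lambda) step for tabular state-value estimates.
    V and traces are 1-D arrays indexed by state; (s, r, s_next, done) is the
    transition just observed."""
    # Standard one-step TD error for the observed transition
    td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
    # Decay every state's eligibility trace, then bump the current state's
    traces *= gamma * lam
    traces[s] += 1.0                      # accumulating trace
    # The single TD error updates all recently visited states at once
    V += alpha * td_error * traces
    if done:
        traces[:] = 0.0                   # traces reset between episodes
    return V, traces

# One update on a toy 5-state problem
V = np.zeros(5)
traces = np.zeros(5)
V, traces = td_lambda_update(V, traces, s=2, r=1.0, s_next=3, done=False)
```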

