How is model-free reinforcement learning different from model-based reinforcement learning?


What's the difference between model-free and model-based reinforcement learning? It seems to me that any model-free learner, learning through trial and error, could be reframed as model-based. In that case, when would model-free learners be appropriate?

Answered by Amit raj

In Reinforcement Learning, the terms "model-based reinforcement learning" and "model-free reinforcement learning" do not refer to the use of a neural network or other statistical learning model to predict values, or even to predict the next state (although the latter may be used as part of a model-based algorithm and be called a "model" regardless of whether the algorithm is model-based or model-free).


Instead, the terms refer strictly to whether, whilst learning or acting, the agent uses predictions of the environment's response. The agent can use a single prediction of next reward and next state from the model (a sample), or it can ask the model for the expected next reward, or for the full distribution of next states and next rewards. These predictions can be provided entirely outside of the learning agent - e.g. by computer code that understands the rules of a dice or board game. Or they can be learned by the agent, in which case they will be approximate.
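
For concreteness, here is a minimal sketch (in Python; the class and method names are purely illustrative, not from any library) of the kind of interface such a model exposes - the agent can ask for a single sample, the expected reward, or the full distribution of next states and rewards:

    import random

    class DiceGameModel:
        """Toy model of a game where the state is a running total and the
        environment's response is rolling a fair six-sided die."""

        def distribution(self, state, action):
            # Full distribution of ((next_state, reward), probability) pairs.
            return [((state + roll, roll), 1.0 / 6.0) for roll in range(1, 7)]

        def sample(self, state, action):
            # A single sampled (next_state, reward) prediction.
            roll = random.randint(1, 6)
            return state + roll, roll

        def expected_reward(self, state, action):
            # Expected next reward under the model.
            return sum(prob * reward
                       for ((_, reward), prob) in self.distribution(state, action))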

Just because there is a model of the environment implemented does not mean that an RL agent is "model-based". To qualify as "model-based", the learning algorithm has to explicitly reference the model:

Algorithms that purely sample from experience, such as Monte Carlo Control, SARSA, Q-learning and Actor-Critic, are "model-free" RL algorithms. They rely on real samples from the environment and never use generated predictions of next state and next reward to alter behavior (although they might sample from experience memory, which is close to being a model).
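
As a minimal illustration of this (assuming a Gym-style environment whose step() returns (next_state, reward, done, info); all names here are illustrative), a tabular Q-learning update only ever consumes transitions actually experienced in the environment - it never asks a model for a predicted next state or reward:

    import random
    from collections import defaultdict

    def q_learning_step(Q, env, state, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """One model-free update: only real experience from env.step() is used."""
        # epsilon-greedy action selection from the current Q table
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        # Real sample from the environment - not a model prediction.
        next_state, reward, done, _ = env.step(action)
        target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        return next_state, done

    Q = defaultdict(float)  # Q[(state, action)] -> current estimate of return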

The archetypical model-based algorithms are Dynamic Programming methods (Policy Iteration and Value Iteration) - these all use the model's predictions or distributions of next state and reward in order to calculate optimal actions. Specifically in Dynamic Programming, the model must provide state transition probabilities and the expected reward from any state, action pair. Note that this is rarely a learned model.
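
Here is a minimal Value Iteration sketch, assuming the model is supplied as a function p(s, a) returning (probability, next_state, reward) triples (this interface is just illustrative). The backup sums over the model's full transition distribution, which is exactly what makes it model-based:

    def value_iteration(states, actions, p, gamma=0.99, theta=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # Back up through the model's full distribution of next states and rewards.
                v_new = max(
                    sum(prob * (r + gamma * V[s2]) for prob, s2, r in p(s, a))
                    for a in actions
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V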

Basic TD learning, using state values only, must also be model-based in order to work as a control system and pick actions. In order to pick the best action, it needs to query a model that predicts what will happen on each action, and implement a policy like π(s) = argmax_a Σ_{s′,r} p(s′,r|s,a) (r + v(s′)), where p(s′,r|s,a) is the probability of receiving reward r and next state s′ when taking action a in state s. That function p(s′,r|s,a) is essentially the model.
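
A sketch of that greedy policy, using the same illustrative model interface as above (a gamma parameter is included for the discounted case; set it to 1 to match the undiscounted formula shown):

    def greedy_policy(state, actions, p, V, gamma=1.0):
        # p(state, a) -> list of (probability, next_state, reward) triples, i.e. the model.
        return max(
            actions,
            key=lambda a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in p(state, a)),
        )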

The RL literature differentiates between a "model" of the environment, as used in "model-based" versus "model-free" learning, and the use of statistical learners such as neural networks.

In RL, neural networks are often employed to learn and generalize value functions, such as the Q value, which predicts total return (the sum of discounted rewards) given a state and action pair. Such a trained neural network is often called a "model" in, e.g., supervised learning. However, in the RL literature, you will see the term "function approximator" used for such a network to avoid ambiguity.
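
To illustrate the terminology, here is a tiny function approximator for Q values - a plain linear approximator in NumPy standing in for a neural network (purely illustrative, not from any particular library or paper). It approximates Q(s, a) from features, but it is not a model of the environment:

    import numpy as np

    class LinearQ:
        def __init__(self, n_features):
            self.w = np.zeros(n_features)

        def predict(self, features):
            # features: feature vector for a (state, action) pair
            return float(self.w @ features)

        def update(self, features, target, lr=0.01):
            # Semi-gradient step toward a bootstrapped or Monte Carlo target.
            error = target - self.predict(features)
            self.w += lr * error * features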

"It seems to me that any model-free learner, learning through trial and error, could be reframed as model-based."

I think here you are using the general understanding of the word "model" to include any structure that makes useful predictions. That would apply to, e.g., the table of Q values in SARSA.

However, as explained above, that is not how the term is used in RL. So although your understanding that RL builds useful internal representations is correct, you are not technically correct that this can be used to re-frame "model-free" learning as "model-based", because those terms have a very specific meaning in RL.

Generally, with the current state of the art in RL, if you don't have an accurate model provided as part of the problem definition, then model-free approaches are often superior.

There is a lot of interest in agents that build predictive models of the environment, and doing so as a "side effect" (whilst still being a model-free algorithm) can still be useful - it may regularise a neural network or help discover key predictive features that can also be used in policy or value networks. However, model-based agents that learn their own models for planning have the problem that inaccuracy in these models can cause instability (the inaccuracies multiply the further into the future the agent looks). Some promising inroads are being made using imagination-based agents and/or mechanisms for deciding when and how much to trust the learned model during planning.
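
As a toy illustration of that compounding error (the numbers are made up purely for illustration and are not a result from the literature), each rollout step through a learned model starts from an already slightly wrong state, so the error grows with the planning horizon:

    def rollout_error(per_step_error=0.05, growth=1.5, horizon=10):
        error, errors = per_step_error, []
        for _ in range(horizon):
            errors.append(error)
            error *= growth   # each prediction is made from an increasingly wrong state
        return errors

    print(rollout_error())  # error roughly multiplies as the agent looks further ahead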

Right now (in 2018), if you have a real-world problem in an environment without an explicit known model at the start, then the safest bet is to use a model-free approach such as DQN or A3C. That may change, as the field is moving fast and new, more complex architectures could well be the norm in a few years.


