Reinforcement Learning: The Q-Learning Algorithm with an Illustrative Example
Q-learning is a reinforcement learning algorithm. It is an off-policy, temporal difference learning method; temporal difference techniques learn by comparing temporally successive predictions.
In this blog, we shall take a look at the Q-learning algorithm and understand it with an example.
Learning the Q function is equivalent to learning the optimal policy.
The main challenge is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation.
Q(s, a) = r(s, a) + γ max_a′ Q(δ(s, a), a′)   (13.6)
The iterative approximation algorithm is based on this recursive definition of Q (Watkins, 1989).
We use the symbol Q̂ to refer to the learner’s estimate, or hypothesis, of the actual Q function. The learner represents its hypothesis Q̂ as a large table with a separate entry for each state-action pair.
Initially, the table can be populated with random values (though it is easier to understand the algorithm if one assumes initial values of zero).
The agent repeatedly observes its present state s, picks an action a, performs that action, and then observes the resultant reward r = r(s, a) and the new state s’ = δ(s, a).
Following each such transition, it updates the table entry for Q̂(s, a) according to the rule:

Q̂(s, a) ← r + γ max_a′ Q̂(s′, a′)   (13.7)
This training rule uses the agent’s current Q̂ values for the new state s′ to refine its estimate of Q̂(s, a) for the preceding state.
Equation (13.6) motivates this training rule, even though the rule concerns the agent’s approximation Q̂ rather than the actual Q function.
Although Equation (13.6) describes Q in terms of the functions δ(s, a) and r(s, a), the agent does not need to know these functions in order to apply the training rule of Equation (13.7).
Instead, it executes the action in its environment and then observes the resulting new state s′ and reward r. It can thus be viewed as sampling these functions at the current values of s and a.
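As a concrete sketch of this sampled update, the rule in Equation (13.7) can be written as a single table update. The dictionary-based table, the `GAMMA` value, and the function name below are illustrative assumptions, not details from the text.

```python
# A minimal sketch of the training rule in Equation (13.7); the table
# layout and the discount value are assumptions made for illustration.
GAMMA = 0.9  # discount factor, 0 <= gamma < 1

def q_update(q_table, s, a, r, s_prime, actions):
    """Refine the table entry for (s, a) from one observed transition.

    The agent never consults delta(s, a) or r(s, a) directly; it only
    uses the sampled reward r and the observed successor state s_prime.
    """
    best_next = max(q_table.get((s_prime, a2), 0.0) for a2 in actions)
    q_table[(s, a)] = r + GAMMA * best_next
    return q_table[(s, a)]
```

For example, if the best Q̂ value available in the successor state is 100 and the observed reward is 0, the updated entry becomes 0 + 0.9 × 100 = 90.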
The Q-learning algorithm (Table 13.1), assuming deterministic rewards and actions. The discount factor γ may be any constant such that 0 ≤ γ < 1.
- For each s, a, initialize the table entry Q̂(s, a) to zero.
- Observe the current state s.
- Do forever:
  - Select an action a and execute it.
  - Receive the immediate reward r.
  - Observe the new state s′.
  - Update the table entry for Q̂(s, a) as follows: Q̂(s, a) ← r + γ max_a′ Q̂(s′, a′)
  - s ← s′
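The loop above can be sketched end to end on a toy deterministic world. The one-dimensional grid, the reward of 100 for entering the rightmost goal cell, and all names below are illustrative assumptions rather than the environment described in the text.

```python
import random

GAMMA = 0.9
N_CELLS = 4                  # cells 0..3; cell 3 is the absorbing goal
ACTIONS = ("left", "right")

def step(s, a):
    """Deterministic transition delta(s, a) and reward r(s, a) (assumed)."""
    s2 = max(0, s - 1) if a == "left" else min(N_CELLS - 1, s + 1)
    r = 100 if (s2 == N_CELLS - 1 and s != N_CELLS - 1) else 0
    return r, s2

def q_learn(episodes=300, seed=0):
    rng = random.Random(seed)
    # For each s, a initialize the table entry to zero.
    q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randrange(N_CELLS - 1)    # random non-goal start state
        while s != N_CELLS - 1:           # episode ends at the goal
            a = rng.choice(ACTIONS)       # select an action and execute it
            r, s2 = step(s, a)            # receive reward, observe new state
            # Update the table entry per Equation (13.7).
            q[(s, a)] = r + GAMMA * max(q[(s2, a2)] for a2 in ACTIONS)
            s = s2                        # s <- s'
    return q
```

After enough episodes the rightward entries one, two, and three cells from the goal settle at 100, 90, and 81, mirroring how values propagate backward from the rewarding transition.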
If the system can be described as a deterministic Markov decision process, the reward function r is bounded, and actions are selected so that every state-action pair is visited infinitely often, then the agent’s estimate Q̂ converges to the actual Q function.
To illustrate, consider a single action taken by an agent and the corresponding update to Q̂.
In this example, the agent moves one cell to the right in its grid world and receives an immediate reward of zero for this transition.
It then applies the training rule of Equation (13.7) to refine its estimate Q̂ for the state-action transition it just executed.
The new Q̂ estimate for this transition is the sum of the received reward (zero) and the highest Q̂ value associated with the resulting state, discounted by γ.
Each time the agent moves from an old state to a new one, Q-learning propagates Q̂ estimates backward from the new state to the old.
At the same time, the immediate reward received for the transition is used to augment these propagated Q̂ values.
Consider the reward function and grid environment given in Figure 13.2, where the reward is zero everywhere except when entering the target state.
Because this world has an absorbing goal state, we will assume that training consists of a series of episodes. Each episode begins with the agent in a randomly chosen state, and the agent performs actions until it reaches the absorbing goal state.
When it does, the episode ends and the agent is transported to a new, randomly chosen initial state for the next episode.
Because all of the Q̂ values are initialized to zero, the agent makes no changes to any table entry until it happens to reach the goal state and receives a nonzero reward.
As a consequence, the Q̂ value for the single transition leading into the goal state is refined; on later episodes, transitions one step further from the goal can be updated in turn.
Given a sufficient number of training episodes, the information will propagate from the transitions with nonzero reward back through the entire state-action space available to the agent, eventually resulting in a Q̂ table that converges to the true Q values.
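Once the Q̂ table has converged, the optimal policy mentioned earlier can be read off by acting greedily with respect to the table. The helper below is a hypothetical sketch, and the example table values in the usage note are assumed for illustration.

```python
def greedy_policy(q_table, states, actions):
    """Read the policy pi(s) = argmax_a Q(s, a) off a converged table."""
    return {s: max(actions, key=lambda a: q_table[(s, a)]) for s in states}
```

For instance, a state whose entries are 81 for "right" and 72.9 for "left" yields π(s) = "right" under this greedy rule.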
The Q learning algorithm of Table 13.1 will converge to the correct Q function under certain assumptions.
Consider two general properties of this Q-learning process that hold for any deterministic MDP with non-negative rewards, provided all Q̂ values are initialized to zero.
The first property is that, under these conditions, the Q̂ values never decrease during training.