
Reinforcement Learning: The Q Learning Algorithm with an Illustrative example

 

Q-learning is a reinforcement learning algorithm. It is an off-policy temporal difference (TD) learning method; temporal difference techniques learn by comparing temporally successive predictions.

 

In this blog, we shall have a look at the algorithm of Q learning and understand it with an example.

 

Learning the Q function is equivalent to learning the optimal policy.
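To see why, note that once Q is known, the optimal policy can be read off directly by choosing, in each state, the action with the highest Q value:

$\pi^*(s) = \arg\max_{a} Q(s, a)$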

 

The main challenge is finding a reliable way to estimate training values for Q when the only information available is a sequence of immediate rewards r spread out over time. This can be achieved through iterative approximation, building on the fact that Q can be written recursively as

$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a') \qquad (13.6)$

 

This recursive definition of Q is the basis of the algorithm that iteratively approximates Q (Watkins 1989).

 

We use the symbol $\hat{Q}$ to refer to the learner’s estimate, or hypothesis, of the actual Q function.

 

In this method, the learner’s hypothesis $\hat{Q}$ is represented by a large table with a single entry for each state-action pair. The table entry for the pair (s, a) stores the value $\hat{Q}(s, a)$, the learner’s current hypothesis about the actual but unknown value Q(s, a).

 

Initially, the table can be populated with random values (though it is easier to understand the algorithm if one assumes initial values of zero).  
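As an illustration, here is a minimal sketch of such a table in Python, using a dictionary keyed by (state, action) pairs that defaults to zero; the state and action encodings are placeholders chosen for the example, not part of the algorithm itself.

from collections import defaultdict

# Q-value table: maps (state, action) pairs to the current estimate Q_hat(s, a).
# Missing entries default to 0.0, matching the zero initialization assumed above.
q_table = defaultdict(float)

# Example lookup for a hypothetical state/action encoding.
state, action = (0, 0), "right"
print(q_table[(state, action)])  # 0.0 until the entry is updated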

 

The agent repeatedly observes its present state s, picks an action a, performs that action, and then observes the resultant reward r = r(s, a) and the new state s’ = δ(s, a).

 

Following each such transition, it updates the table entry for $\hat{Q}(s, a)$ according to the rule:

$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a') \qquad (13.7)$

 

This training rule uses the agent’s current $\hat{Q}$ values for the new state s’ to refine its estimate of $\hat{Q}(s, a)$ for the preceding state s.
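A single application of this update rule might look as follows in Python, continuing the dictionary-based table sketch above; the discount factor value and the action set are illustrative assumptions.

GAMMA = 0.9            # discount factor, 0 <= gamma < 1 (assumed value)
ACTIONS = ["up", "down", "left", "right"]

def q_update(q_table, s, a, r, s_next):
    # Equation (13.7): Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
    best_next = max(q_table[(s_next, a_next)] for a_next in ACTIONS)
    q_table[(s, a)] = r + GAMMA * best_next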

 

This training rule is motivated by Equation (13.6), even though the training rule concerns the agent’s approximation $\hat{Q}$, while Equation (13.6) refers to the actual Q function.

 

Although Equation (13.6) describes Q in terms of the functions δ(s, a) and r(s, a), the agent does not need to know these functions in order to apply the training rule of Equation (13.7).

 

Instead, it executes the action in its environment and then observes the resulting new state s’ and reward r. In this way, it can be viewed as sampling these functions at the current values of s and a.

 

The Q learning algorithm, assuming deterministic rewards and actions, is given below; a short Python sketch of the full loop follows the list. The discount factor γ may be any constant such that 0 ≤ γ < 1.

 

  • For each s, a initialize the table entry $\hat{Q}(s, a)$ to zero.
  • Observe the current state s
  • Do forever:
      • Select an action a and execute it
      • Receive the immediate reward r
      • Observe the new state s’
      • Update the table entry for $\hat{Q}(s, a)$ as follows:
        $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
      • s ← s’
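Putting these steps together, the following is a minimal Python sketch of the loop under illustrative assumptions: the environment is represented by a hypothetical step(s, a) function returning the deterministic next state and reward, ACTIONS lists the available actions, and actions are selected at random for simplicity (the algorithm above does not prescribe a selection strategy).

import random
from collections import defaultdict

GAMMA = 0.9                      # discount factor, 0 <= gamma < 1
ACTIONS = ["up", "down", "left", "right"]

def q_learning(step, start_state, n_steps=10_000):
    """Tabular Q learning for a deterministic environment.

    step(s, a) is assumed to return (next_state, reward); it stands in
    for the unknown functions delta(s, a) and r(s, a).
    """
    q = defaultdict(float)       # initialize every table entry to zero
    s = start_state              # observe the current state s
    for _ in range(n_steps):     # "do forever" (truncated for the sketch)
        a = random.choice(ACTIONS)          # select an action a and execute it
        s_next, r = step(s, a)              # receive reward r, observe new state s'
        best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
        q[(s, a)] = r + GAMMA * best_next   # update rule (13.7)
        s = s_next                          # s <- s'
    return q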

 

With this approach, the agent’s estimate $\hat{Q}$ converges in the limit to the actual Q function, provided the system can be modeled as a deterministic Markov decision process, the reward function r is bounded, and actions are chosen so that every state-action pair is visited infinitely often.

 

To illustrate how the Q learning algorithm operates, consider a single action taken by an agent and the corresponding update to $\hat{Q}$.

 

In this example, the agent moves one cell to the right in its grid world and receives an immediate reward of zero.

 

It then applies the training rule of Equation (13.7) to refine its estimate $\hat{Q}$ for the state-action transition it just executed.

 

According to the training rule, the new $\hat{Q}$ estimate for this transition is the sum of the received reward (zero) and the highest $\hat{Q}$ value associated with the resulting state (100), discounted by γ (0.9).
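Written out, labeling the old and new states $s_1$ and $s_2$ for clarity, the update for this transition is:

$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \times 100 = 90$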

 

Q learning propagates estimates backward from the new state to the old each time the agent moves forward from an old state to a new one. 

 

At the same time, the agent’s immediate reward for the transition is used to augment these propagated values.

Consider the reward function and grid environment given in Figure 13.2, where the reward is zero everywhere except when entering the target state. 

 

Because this world has an absorbing goal state, we will assume that training consists of a series of episodes. Each episode starts with the agent placed in a randomly chosen state and allows it to execute actions until it reaches the absorbing goal state.

 

When it does, the episode ends and the agent is moved to a new, randomly chosen, initial state for the next episode.

Because all of the $\hat{Q}$ values are initialized to zero, the agent will not change any table entries until it reaches the goal state and receives a nonzero reward.

 

As a result, only the $\hat{Q}$ value for the single transition leading into the goal state will be refined. If, in a later episode, the agent passes through this state adjacent to the goal state, its nonzero $\hat{Q}$ value will allow the value of some transition two steps from the goal to be refined, and so on.

 

Given enough training episodes, the information will propagate from the transitions with nonzero reward back through the entire state-action space available to the agent, eventually resulting in a $\hat{Q}$ table containing the values shown in Figure 13.2.
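This backward propagation can be seen on a toy example. The sketch below assumes a simple one-dimensional corridor of states (an illustrative stand-in, not the grid of Figure 13.2, which is not reproduced here) in which entering the rightmost, absorbing state yields a reward of 100 and every other transition yields zero. After enough random-walk episodes the learned values approach 100, 0.9 × 100, 0.9² × 100, and so on as one moves away from the goal.

import random
from collections import defaultdict

GAMMA = 0.9
N_STATES = 5                     # states 0..4, with 4 as the absorbing goal
ACTIONS = [-1, +1]               # move left or right along the corridor

def step(s, a):
    """Deterministic toy environment: reward 100 only when entering the goal."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    reward = 100 if (s_next == N_STATES - 1 and s != N_STATES - 1) else 0
    return s_next, reward

q = defaultdict(float)
for _ in range(500):                        # training episodes
    s = random.randrange(N_STATES - 1)      # random non-goal start state
    while s != N_STATES - 1:                # act until the absorbing goal state
        a = random.choice(ACTIONS)
        s_next, r = step(s, a)
        q[(s, a)] = r + GAMMA * max(q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print([round(q[(s, +1)], 1) for s in range(N_STATES - 1)])  # ≈ [72.9, 81.0, 90.0, 100.0]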

 

The Q learning algorithm of Table 13.1 will converge to the correct Q function under certain assumptions.

 

Consider two general properties of this Q learning process that hold for any deterministic MDP with non-negative rewards, provided all $\hat{Q}$ values are initialized to zero.

 

The first property is that, under these conditions, the $\hat{Q}$ values never decrease during training. More formally, let $\hat{Q}_n(s, a)$ denote the learned $\hat{Q}(s, a)$ value after the nth iteration of the training procedure (i.e., after the nth state-action transition taken by the agent). Then,

$(\forall s, a, n) \quad \hat{Q}_{n+1}(s, a) \geq \hat{Q}_n(s, a)$