
Reinforcement Learning: Learning Task and Q Learning

 

Reinforcement learning addresses the problem of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals. In this blog, we’ll discuss the learning task and Q learning in reinforcement learning.

 

THE LEARNING TASK:

  • Define the problem of learning sequential control strategies.
  • We might assume that the agent’s actions are deterministic, or that they are nondeterministic.
  • We might assume that the agent can predict the next state that results from each action, or that it cannot; that it is trained by an expert who shows it examples of optimal action sequences, or that it must train itself by performing actions of its own choosing (a minimal interface sketch for the simplest of these settings follows this list).
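
A minimal sketch, in Python, of the deterministic version of this task: a small set of states, a set of actions, a transition function δ (known to the environment but not, in general, to the learner), and an immediate reward function r. The 2×2 grid, the action names, and the +100 goal reward are hypothetical choices for illustration only.

# A minimal sketch of the deterministic learning task: states, actions, a
# transition function delta, and an immediate reward function r.
STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]          # cells of a tiny 2x2 grid (assumed)
ACTIONS = ["up", "down", "left", "right"]
GOAL = (1, 1)

def delta(s, a):
    """Deterministic transition: the state that results from action a in state s."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    nxt = (s[0] + moves[a][0], s[1] + moves[a][1])
    return nxt if nxt in STATES else s             # moving off the grid leaves the state unchanged

def r(s, a):
    """Immediate reward: +100 for entering the goal state, 0 otherwise (assumed values)."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0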

 

Q Learning:

Q learning is a reinforcement learning algorithm: an off-policy, temporal difference (TD) learning method. Temporal difference methods learn by comparing temporally successive predictions, as sketched below.
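
As a quick orientation (the full algorithm is covered in the next post), here is a minimal sketch of the one-step update that gives Q learning its temporal-difference and off-policy character. The table layout, the learning rate alpha, and the discount gamma are illustrative assumptions.

from collections import defaultdict

gamma = 0.9                 # discount factor (assumed value)
alpha = 0.5                 # learning rate (assumed value)
Q = defaultdict(float)      # Q[(state, action)] -> current estimate

def td_update(s, a, reward, s_next, actions):
    """One temporal-difference update: move Q(s, a) toward a target built
    from the immediately following prediction, reward + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

The max over the next state's actions is taken regardless of which action the agent actually executes next, which is what makes the method off-policy.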

 

Because the available training data does not provide training examples of the form (s, a), it is difficult to learn the target function π*: S → A directly.

 

The only training information available to the learner is the sequence of immediate rewards r(si, ai) for i = 0, 1, 2, …

 

Given this kind of training information, it is easier to learn a numerical evaluation function defined over states and actions, and then implement the optimal policy in terms of this evaluation function.

 

One obvious evaluation function for the agent to try to learn is V*.

 

Whenever V*(s1) > V*(s2), the agent should prefer state s1 over state s2, because the cumulative future reward from s1 is greater.

 

The agent’s policy must choose among actions, not among states; however, it can use V* to choose among actions in certain settings.

 

In state s, the optimal action a is the one that maximizes the sum of the immediate reward r(s, a) and the value V* of the immediate successor state, discounted by γ.

  π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]   → Equation (13.3)

 

δ(s, a) denotes the state resulting from applying action a to state s.
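
A sketch of how Equation (13.3) would be applied when the agent does have perfect models of r and δ (the case discussed next). The helper name optimal_action, the discount value, and the assumption that r and delta are callables and V_star is a dict from states to values are all illustrative assumptions.

gamma = 0.9   # discount factor (assumed value)

def optimal_action(s, actions, r, delta, V_star):
    """Equation (13.3): argmax over a of r(s, a) + gamma * V*(delta(s, a))."""
    return max(actions, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])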

 

The agent can learn V*, and from it the optimal policy, provided it has perfect knowledge of the immediate reward function r and the state transition function δ. When the agent knows the functions r and δ that the environment uses to respond to its actions, it can use Equation (13.3) to compute the optimal action for any state s.

 

Unfortunately, learning V* is useful for choosing optimal actions only when the agent has perfect knowledge of δ and r. This requires that it be able to perfectly predict the immediate outcome (i.e., the immediate reward and the immediate successor state) of every state-action transition.

 

This assumption is comparable to the assumption of a perfect domain theory in explanation-based learning.

 

In many practical problems, such as robot control, it is impossible for the agent or its human programmer to predict in advance the exact outcome of applying an arbitrary action to an arbitrary state.

 

Consider, for example, the difficulty of describing the δ function for a robot arm shoveling dirt, when the resulting state includes the positions of the dirt particles.

 

In cases where either δ or r is unknown, learning V* is of no use for selecting optimal actions, because the agent cannot evaluate Equation (13.3).

 

The Q Function:

Let us define the evaluation function Q(s, a) so that its value is the maximum discounted cumulative reward that can be achieved by starting from state s and taking action a as the first action:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

In other words, the value of Q is the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter.

Note that Q(s, a) is exactly the quantity that is maximized in Equation (13.3) in order to choose the optimal action in state s.

 

Equation (13.3) can therefore be rewritten in terms of Q(s, a) as

π*(s) = argmax_a Q(s, a)

If the agent learns the Q function rather than the V* function, it will be able to choose optimal actions even when it does not know the functions r and δ.

 

It need only consider each available action a in its current state s and choose the one that maximizes Q(s, a).
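
A minimal sketch of this model-free selection rule, assuming the agent has learned a Q table indexed by (state, action) pairs; note that neither r nor δ appears anywhere.

def best_action(s, actions, Q):
    """Choose the action that maximizes Q(s, a) in the current state s."""
    return max(actions, key=lambda a: Q[(s, a)])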

 

It may at first seem surprising that one can identify globally optimal action sequences simply by reacting repeatedly to the local values of Q for the current state.

 

The agent can choose the optimal action without ever performing a lookahead search to explicitly consider which state results from each action.

 

Part of the beauty of Q learning is that the value of Q for the current state and action summarizes, in a single number, all the information needed to determine the discounted cumulative reward that the agent will gain in the future if it selects action a in state s.

 

Note that the Q value for each state-action transition equals the r value for that transition plus the V* value of the resulting state, discounted by γ. Note also that the optimal policy corresponds to selecting, in each state, the action with the largest Q value.
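
A tiny worked check of these two relationships, Q(s, a) = r(s, a) + γ V*(δ(s, a)) and V*(s) = max_a Q(s, a), on a hypothetical two-state deterministic world; the states, actions, rewards, and discount factor below are all illustrative assumptions.

gamma = 0.9
states, actions = ["A", "B"], ["stay", "go"]
delta = {("A", "stay"): "A", ("A", "go"): "B",
         ("B", "stay"): "B", ("B", "go"): "A"}
rewards = {("A", "go"): 100}                 # every other transition pays 0

def r(s, a):
    return rewards.get((s, a), 0)

# Compute V* by value iteration (possible here only because delta and r are known).
V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: max(r(s, a) + gamma * V[delta[(s, a)]] for a in actions) for s in states}

# Q value for each transition = immediate reward + discounted V* of the successor state.
Q = {(s, a): r(s, a) + gamma * V[delta[(s, a)]] for s in states for a in actions}

# The optimal policy picks the largest Q value in each state, and V*(s) = max_a Q(s, a).
for s in states:
    assert abs(V[s] - max(Q[(s, a)] for a in actions)) < 1e-6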

 

In the next blog, we’ll have a look at the Q learning algorithm and an illustrative example.

 
