# Convergence of Reinforcement Learning

 Not Residual Residual Fixeddistribution(on-policy) Fixeddistribution Usually-greedydistribution Fixeddistribution(on-policy) Fixeddistribution Usually-greedydistribution Markov Chain Lookup table Y Y Y Y Averager Y Y Y Y Linear Y N Y Y Nonlinear N N Y Y MDP Lookup table Y Y Y Y Y Y Averager Y Y N Y Y Y Linear N N N Y Y Y Nonlinear N N N Y Y Y POMDP Lookup table N N N Y Y Y Averager N N N Y Y Y Linear N N N Y Y Y Nonlinear N N N Y Y Y

This table gives convergence results for incremental RL algorithms such as TD(lambda), Q-learning, Advantage Learning, incremental value iteration, and SARSA. A green "Y" means the algorithm is guaranteed to converge in the same sense as Backprop. A red "N" means that counterexamples are known where it will fail to converge and will learn nothing useful, even when given infinite training time and slowly-decreasing learning rates. Blank entries are for nonsensical combinations, where the policy changes even though there is no policy.

This table suggests that residual algorithms might be useful in cases where more general function approximators are needed. It is also interesting that for most cases in the table, linear and nonlinear function approximators have the same properties, and on-policy and fixed distributions have the same properties.

The categories have the following definitions:

• Residual - using either residual gradient algorithms or residual algorithms to guarantee convergence.
• Fixed distribution - Each state-action pair is updated with a fixed frequency. This frequency may be arbitrary, not corresponding to any particular exploration policy.
• Fixed distribution (on-policy) - Each state action pair is updated with a frequency proportional to the frequency it is seen while following a fixed exploration policy.
• Usually-greedy distribution - Each state action pair is updated with a frequency proportional to the frequency it is seen while following the current exploration policy. The exploration policy changes over time, so that it performs the actions with the highest learned value most often. For most algorithms it is defined as epsilon-greedy or Boltzman. Since a greedy policy doesn't make sense for Markov chains, there are no table entries for that combination.
• Lookup table - A separate parameter is stored for each state-action pair.
• Averager - A class of function approximator defined by Geoff Gordon including KNN and memory-based systems.
• Linear - A function approximator which is linear in the weights.
• Nonlinear - An arbitrary, smooth, parameterized function approximator.
• Markov chain - This is the problem of just learning a value function for a given, fixed, stochastic policy.
• MDP - Markov Decision Process. The problem of learning a policy (mapping from state to action), assuming the state has the Markov property (memory doesn't improve the policy).
• POMDP - Partially Observable Markov Decision Process. This is the general reinforcement learning problem where the state is not observable. The optimal policy (mapping from observations and memory to action) sometimes must be stochastic. This column refers to applying Q-learning etc. directly to a POMDP, rather than reducing it to an MDP first through the use of belief states or tapped delay lines.