
Team-Partitioned, Opaque-Transition RL

 

Formally, a policy is a mapping from a state space S to an action space A such that the agent using that policy executes action a whenever in state s. At the coarsest level, when in state s, an agent compares the expected, long-term rewards for taking each action a ∈ A, choosing an action based on these expected rewards. These expected rewards are learned through experience.
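For concreteness, a minimal sketch of such a greedy, reward-comparing policy is shown below. The states, actions, reward values, and the expected_reward table are invented purely for illustration and are not taken from TPOT-RL itself.

    # Hypothetical sketch: a policy as a mapping from states to actions,
    # chosen by comparing learned expected long-term rewards per action.
    from typing import Dict, Hashable, List, Tuple

    State = Hashable
    Action = str

    # Expected long-term reward for each (state, action) pair, learned from
    # experience (the values below are placeholders).
    expected_reward: Dict[Tuple[State, Action], float] = {
        ("s0", "pass"): 0.4,
        ("s0", "shoot"): 0.1,
    }

    def policy(s: State, actions: List[Action]) -> Action:
        """Return the action with the highest learned expected reward in state s."""
        return max(actions, key=lambda a: expected_reward.get((s, a), 0.0))

    print(policy("s0", ["pass", "shoot"]))  # -> pass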

Designed to work in real-world domains with far too many states to handle individually, TPOT-RL constructs a smaller feature space V using action-dependent feature functions. The expected reward Q(v, a) is then computed from the state's corresponding entry v in feature space.
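The sketch below illustrates the idea of an action-dependent feature function under assumed names: the feature function and the "openness" field are hypothetical, and the feature functions actually used by TPOT-RL are defined in the following sections. The point is only that Q is indexed by a small (feature value, action) space rather than by raw states.

    # Hypothetical sketch: an action-dependent feature function compresses the
    # state space S into a small feature space V, and expected rewards Q are
    # indexed by (feature value, action) instead of by raw state.
    from collections import defaultdict
    from typing import Dict, Tuple

    Action = str
    Feature = int

    def feature(state: dict, action: Action) -> Feature:
        """Invented action-dependent feature: a coarse, discrete rating of how
        promising `action` looks in `state` (0 = poor, 1 = fair, 2 = good)."""
        score = state.get("openness", {}).get(action, 0.0)
        if score < 0.3:
            return 0
        return 1 if score < 0.7 else 2

    # Q[(v, a)]: learned expected long-term reward when the action-dependent
    # feature value for a is v.  Entries default to 0 before any experience.
    Q: Dict[Tuple[Feature, Action], float] = defaultdict(float)

    state = {"openness": {"pass": 0.8, "shoot": 0.2}}
    v = feature(state, "pass")
    print(v, Q[(v, "pass")])  # feature value and its (still untrained) estimate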

In short, the policy's mapping from S to A in TPOT-RL can be thought of as a 3-step process:

  1. State generalization: the state s is generalized to an entry in the feature space V using action-dependent feature functions.
  2. Value function learning: the expected reward for each available action is learned, through experience, over the feature space.
  3. Action selection: an action is chosen by comparing these expected rewards.

While these steps are common in other RL paradigms, each step has unique characteristics in TPOT-RL.
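As a rough end-to-end sketch, and not the implementation used in TPOT-RL, the three steps might be combined as follows. The feature function is a placeholder, and the running-average update stands in for whichever learning rule is actually used.

    # Hypothetical sketch of the 3-step mapping from S to A:
    # 1. generalize the state with an action-dependent feature function,
    # 2. compare the learned expected reward for each candidate action,
    # 3. select an action and later fold the observed reward into the estimate.
    from collections import defaultdict
    from typing import Dict, List, Tuple

    Action = str

    Q: Dict[Tuple[int, Action], float] = defaultdict(float)  # expected rewards
    N: Dict[Tuple[int, Action], int] = defaultdict(int)      # visit counts

    def feature(state: dict, action: Action) -> int:
        """Placeholder action-dependent feature function f: S -> V."""
        return int(state.get(action, 0.0) > 0.5)

    def select_action(state: dict, actions: List[Action]) -> Action:
        """Steps 1-3: generalize the state per action, compare the learned
        expected rewards, and choose the best-looking action."""
        return max(actions, key=lambda a: Q[(feature(state, a), a)])

    def update(state: dict, action: Action, observed_reward: float) -> None:
        """Running-average update of the (feature value, action) estimate
        (an assumption here, not the paper's learning rule)."""
        key = (feature(state, action), action)
        N[key] += 1
        Q[key] += (observed_reward - Q[key]) / N[key]

    a = select_action({"pass": 0.9, "shoot": 0.1}, ["pass", "shoot"])
    update({"pass": 0.9, "shoot": 0.1}, a, observed_reward=1.0)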




