#This file was created by Sun May 12 14:38:40 2002 #LyX 1.0 (C) 1995-1999 Matthias Ettrich and the LyX Team \lyxformat 2.15 \textclass slides \options dvips \language default \inputencoding default \fontscheme default \graphics default \paperfontsize default \spacing single \papersize Default \paperpackage a4 \use_geometry 1 \use_amsmath 1 \paperorientation landscape \paperwidth 40cm \paperheight 20cm \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 1 \paperpagestyle empty \layout Slide ad \layout Standard \align center Approximately Optimal Approximate Reinforcement Learning \layout Standard \align center Sham Kakade and John Langford \layout Standard \align center @ ICML 2002 \layout Slide adfa \layout Standard \align center The Goal: Reinforcement Learning on LARGE state spaces \layout Standard What algorithms work? \layout Slide afa \layout Standard \align center Current Methods \layout Enumerate Approximate value functions \layout Enumerate Policy Gradient \layout Slide afa \layout Standard \align center Questions \layout Enumerate Will the policy always improve? \layout Enumerate Will the program halt? \layout Enumerate Will the final policy be near-optimal? \layout Slide afaf \layout Standard \align center Approximate Value function \layout Enumerate Approximate state-action values. \layout Enumerate Derive policy from approximate values. \layout Standard \begin_inset Formula \( \max _{s}|V(s)-\hat{V}(s)|\leq \epsilon \textrm{ }\,\,\Rightarrow \) \end_inset each iteration doesn't \emph on lose \emph default much \layout Slide afaf \layout Standard \align center Policy Iteration \layout Standard Let \begin_inset Formula \( \gamma \) \end_inset = discount factor. \layout Standard Let \begin_inset Formula \( \epsilon =\max _{s}|V(s)-\hat{V}(s)| \) \end_inset = approximation error \layout Standard Theorem: The average reward of the next policy is no more than \begin_inset Formula \( O(\frac{\epsilon \gamma }{1-\gamma }) \) \end_inset worse \layout Standard Note: some guarantees on (near) optimality exist. \layout Slide adfaaf \layout Standard \align center Policy Gradient \layout Standard \begin_inset Formula \( \eta (\pi )= \) \end_inset average reward \layout Standard Evaluate \begin_inset Formula \( \frac{\partial \eta (\pi )}{\partial w} \) \end_inset and optimize \layout Standard Fundamentally: intertwine explore and exploit \layout Slide adf \layout Standard \align center Directed exploration required \layout Standard \added_space_top 0.3cm \align center \begin_inset Figure size 282 81 file paper/figs/thrun_game.eps width 4 95 flags 9 \end_inset \layout Slide afdfa \layout Standard \align center Policy Gradient Theorems \layout Standard Theorem: The policy always improves with infinitesimally small steps. \layout Standard Theorem: The gradient \emph on magnitude \emph default can be estimated quickly. \layout Standard Theorem: The gradient \emph on direction \emph default can not always be estimated quickly, even far from the optima. \layout Slide adf \layout Standard \align center Next \layout Enumerate RL terminology \layout Enumerate Conservative Policy Iteration \layout Enumerate Theorems \layout Slide adfa \layout Standard \align center Terminology \layout Enumerate \begin_inset Formula \( \mu (s) \) \end_inset = start state distribution \layout Enumerate \begin_inset Formula \( R(s,a) \) \end_inset = reward recieved for action \begin_inset Formula \( a \) \end_inset in state \begin_inset Formula \( s \) \end_inset \layout Enumerate \begin_inset Formula \( V_{\pi }(s)=(1-\gamma )E\sum _{t=0}^{\infty }\gamma ^{t}R(s_{t},a_{t}) \) \end_inset = expected reward from state \begin_inset Formula \( s \) \end_inset \layout Enumerate \begin_inset Formula \( Q_{\pi }(s,a)=(1-\gamma )R(s,a)+\gamma E_{s',a\sim P(s';s,a)\pi (a|s)}V_{\pi }(s') \) \end_inset = expected reward from state \begin_inset Formula \( s \) \end_inset and action \begin_inset Formula \( a \) \end_inset \layout Enumerate \begin_inset Formula \( A_{\pi }(s,a)=Q_{\pi }(s,a)-V_{\pi }(s) \) \end_inset = advantage of taking one action over another \layout Slide adf \layout Standard \align center Consider update rules \layout Standard \begin_inset Formula \[ \pi _{\textrm{new}}\leftarrow \alpha \pi '+(1-\alpha )\pi \] \end_inset \layout Standard \begin_inset Formula \( \alpha \rightarrow 0 \) \end_inset = policy gradient \layout Standard \begin_inset Formula \( \alpha =1 \) \end_inset = policy iteration \layout Slide adfad \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 267 241 file alpha.eps width 4 90 flags 9 \end_inset \layout Slide adfa \layout Standard \align center The idea \layout Enumerate Use uniform state distribution \begin_inset Formula \( \mu \) \end_inset \begin_inset Formula \( \Rightarrow \) \end_inset avoid exploration problems \layout Enumerate Use finite \begin_inset Formula \( \alpha >0 \) \end_inset . \layout Enumerate \begin_inset Formula \( l_{1} \) \end_inset metric instead of \begin_inset Formula \( l_{\infty } \) \end_inset metric \layout Slide asdfd \layout Standard \align center Definitions \layout Standard \begin_inset Formula \[ d_{\pi ,\mu }(s)\equiv \Pr _{\pi ,\mu ,\textrm{world}}(s)\textrm{ }=\textrm{ natural metric on states}\] \end_inset \begin_inset Formula \[ \mathcal {A}_{\pi ,\mu }(\pi ')\equiv E_{s\sim d_{\pi ,\mu }(s)}E_{a\sim \pi '(a;s)}A(s,a)=\textrm{ policy advantage}\] \end_inset \layout Standard As \begin_inset Formula \( \alpha \rightarrow 0 \) \end_inset , \begin_inset Formula \[ \eta _{\mu }(\pi _{\textrm{new}})-\eta _{\mu }(\pi )\propto \mathcal {A}_{\pi ,\mu }(\pi ')\] \end_inset \layout Slide adf \layout Standard \align center Ingredient: Supervised Learning \layout Standard Input space: \begin_inset Formula \( s,a \) \end_inset \layout Standard Output space: \begin_inset Formula \( Q(s,a) \) \end_inset \layout Standard Input distribution: \begin_inset Formula \( d_{\pi ,\mu }(s),\pi (a|s) \) \end_inset \layout Standard Output distribution: \begin_inset Formula \( \hat{Q}(s,a) \) \end_inset s.t. \begin_inset Formula \( E\hat{Q}(s,a)=Q(s,a) \) \end_inset \layout Standard Use greedy evaluation for new policy, \begin_inset Formula \( \pi ' \) \end_inset . \layout Slide dafd \layout Standard \align center Conservative Policy Iteration \layout Standard Let \begin_inset Formula \( G_{\mu }(\pi )= \) \end_inset regression algorithm finding \begin_inset Formula \( \pi ' \) \end_inset with large policy advantage. \layout Enumerate \begin_inset Formula \( \pi '=G_{\mu }(\pi ) \) \end_inset \layout Enumerate if \begin_inset Formula \( \mathcal {A}_{\pi ,\mu }(\pi ')>\epsilon \) \end_inset \begin_deeper \layout Enumerate then \begin_inset Formula \( \pi \leftarrow \alpha \pi '+(1-\alpha )\pi \) \end_inset , goto 1 \layout Enumerate else output \begin_inset Formula \( \pi \) \end_inset \end_deeper \layout Slide adfa \layout Standard \align center Analysis \layout Standard Theorem: (progress) The policy always improves: \layout Standard \begin_inset Formula \[ \eta _{\mu }(\pi _{\textrm{new}})-\eta _{\mu }(\pi )\geq \frac{\alpha }{1-\gamma }\left( \mathcal {A}_{\pi ,\mu }(\pi ')-\frac{2\alpha \gamma \epsilon }{1-\gamma (1-\alpha )}\right) \] \end_inset \layout Slide adfd \layout Standard Theorem: (termination) CPI halts in \begin_inset Formula \( O\left( R^{2}/\epsilon ^{2}\right) \) \end_inset iterations. \layout Standard Proof: Take derivative w.r.t. \begin_inset Formula \( \alpha \) \end_inset , find extrema, \layout Standard \begin_inset Formula \( \Rightarrow \eta _{\mu }(\pi _{\textrm{new}})-\eta _{\mu }(\pi )\geq \frac{\mathcal {A}_{\pi ,\mu }(\pi ')^{2}}{8R} \) \end_inset \layout Slide dfd \layout Standard Theorem: (optimality) The output policy satisfies: \begin_inset Formula \[ \eta _{D}(\pi ^{*})-\eta _{D}(\pi )\leq \frac{\epsilon }{1-\gamma }\max _{s}\frac{d_{\pi ^{*},D}(s)}{d_{\pi ,\mu }(s)}\] \end_inset \begin_inset Formula \[ \leq \frac{\epsilon }{(1-\gamma )^{2}}\max _{s}\frac{d_{\pi ^{*},D}(s)}{\mu (s)}\] \end_inset \layout Standard Note: back to \begin_inset Formula \( l_{\infty } \) \end_inset norm. \layout Standard Prior information \begin_inset Formula \( \rightarrow \) \end_inset tune \begin_inset Formula \( \mu (s) \) \end_inset \layout Slide adf \layout Standard \align center Needed \layout Enumerate (progress) Supervised learning (regression) algorithm \layout Enumerate (progress & termination) Evaluation of Policy advantage \layout Enumerate (optimality) \begin_inset Formula \( \mu \) \end_inset restart distribution \layout Slide adfa \layout Standard \align center Conclusion \layout Standard RL = supervised learning + restart distribution \layout Slide da \layout Standard \align center Work in progress \layout Enumerate Algorithms for directed exploration \begin_deeper \layout Enumerate finite, MDP: solved ( \begin_inset Formula \( E^{3} \) \end_inset + Sven/Thrun/Simmons) \layout Enumerate Large/infinite MDP: unsolved \end_deeper \layout Enumerate Approximate model building \layout Enumerate Experimental validation \the_end