Lecture 21: Sequential decision making (part 2): The algorithms

An introduction to maximum-entropy RL algorithms

Recap: Control as Inference

In the control setting:

In the inference setting:

In the classical deterministic RL setup, we have:

\begin{aligned} & V_\pi(s) := \mathbb{E}_\pi \left[ \sum_{k=0}^T \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]\\ & Q_\pi (s, a) := \mathbb{E}_\pi \left[ \sum_{k=0}^T \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right]\\ & \pi_*(a \mid s) := \delta\big(a = \underset{a}{\text{argmax}}\, Q_*(s,a)\big)\\ \end{aligned}
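To make the argmax policy concrete, here is a minimal tabular sketch in Python; the table sizes and the random stand-in for $Q_*$ are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Hypothetical tabular setup: a small discrete MDP with |S| states, |A| actions.
num_states, num_actions = 4, 3
rng = np.random.default_rng(0)
Q_star = rng.normal(size=(num_states, num_actions))  # stands in for Q_*(s, a)

def greedy_policy(Q, s):
    """pi_*(a | s): a delta distribution on argmax_a Q(s, a)."""
    probs = np.zeros(Q.shape[1])
    probs[np.argmax(Q[s])] = 1.0
    return probs

print(greedy_policy(Q_star, s=0))  # one-hot distribution over actions
```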

Let $\mathcal{O}_t$ be the optimality variable, with $p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp(r(s_t, a_t))$. Denote $\tau = (s_1, a_1, \dots, s_T, a_T)$ as the full trajectory, and denote $p(\tau) = p(\tau \mid \mathcal{O}_{1:T})$. Running inference in this graphical model allows us to compute:

\begin{aligned} & p(\tau \mid \mathcal{O}_{1:T}) \propto p(s_1) \displaystyle \prod_{t=1}^{T-1} p(s_{t+1} \mid s_t, a_t) \times \exp\Big(\displaystyle \sum_{t=1}^T r(s_t, a_t)\Big) \\ & p(a_t \mid s_t, \mathcal{O}_{1:T}) \propto \exp\big(Q_t(s_t, a_t) - V_t(s_t)\big) \\ \end{aligned}
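The second line is simply a softmax over actions: with $V_t(s_t) = \log \int \exp(Q_t(s_t, a_t))\, da_t$, the posterior policy is a Boltzmann distribution over $Q_t$. A minimal numerical sketch for a discrete action set (the Q-values below are made up for illustration):

```python
import numpy as np

# Hypothetical soft Q-values for a single state over three discrete actions.
Q_t = np.array([1.0, 2.0, 0.5])

# V_t(s_t) = log sum_a exp(Q_t(s_t, a)) -- the normalizer ("soft max") over actions.
V_t = np.log(np.sum(np.exp(Q_t)))

# p(a_t | s_t, O_{1:T}) = exp(Q_t - V_t): a Boltzmann (softmax) policy.
policy = np.exp(Q_t - V_t)
print(policy, policy.sum())  # probabilities that sum to 1
```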

From the perspective of control as inference, we optimize the maximum-entropy objective (derived in detail in the Soft Policy Gradients section below):

J(\theta) = \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim \hat{p}(s_t, a_t)}[r(s_t, a_t)] + \mathbb{E}_{s_t \sim \hat{p}(s_t)}[\mathcal{H}(\pi(a_t \mid s_t))],

where $\hat{p}$ denotes the state(-action) marginals of the trajectory distribution induced by the policy $\pi$.

Types of RL Algorithms

The objective of RL is to find the optimal parameter $\theta^*$ that maximizes the expected reward:

\theta^* = \underset{\theta}{\text{argmax}} \; \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^T r(s_t, a_t)\Big]

Policy gradients

In policy gradient methods, we directly optimize the expected reward

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^T r(s_t, a_t)\Big]

with respect to the policy $\pi_\theta$ itself, by gradient ascent on $\theta$.

The log-derivative trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$ is applied to the above equation, giving

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^T r(s_t, a_t)\Big)\Big],

which can be estimated with Monte Carlo samples of trajectories.

The REINFORCE algorithm repeats: (1) sample trajectories from $\pi_\theta$, (2) estimate $\nabla_\theta J(\theta)$ from the samples, (3) take a gradient ascent step on $\theta$.
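Below is a minimal, self-contained sketch of these steps for a tabular softmax policy; the toy environment interface (`env.reset()` / `env.step(s, a)`), sizes, and learning rate are assumptions for illustration, not part of the lecture.

```python
import numpy as np

num_states, num_actions, horizon, lr = 5, 2, 10, 0.1
theta = np.zeros((num_states, num_actions))  # softmax policy logits

def policy(s):
    """pi_theta(. | s) as a softmax over the logits theta[s]."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def run_episode(env):
    """Step 1: sample one trajectory with the current policy."""
    traj, s = [], env.reset()
    for _ in range(horizon):
        a = np.random.choice(num_actions, p=policy(s))
        s_next, r = env.step(s, a)
        traj.append((s, a, r))
        s = s_next
    return traj

def reinforce_update(traj):
    """Steps 2-3: theta += lr * grad log pi(a_t|s_t) * reward-to-go."""
    rewards = [r for _, _, r in traj]
    for t, (s, a, _) in enumerate(traj):
        G = sum(rewards[t:])        # Monte Carlo return from time t
        grad_log = -policy(s)       # d/dlogits of log softmax ...
        grad_log[a] += 1.0          # ... equals one_hot(a) - pi(.|s)
        theta[s] += lr * grad_log * G
```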

Q-Learning

Q-learning does not explicitly optimize the policy; instead, it optimizes estimates of the value functions. The optimal policy can then be recovered by $\pi_*(a \mid s) = \delta(a = \underset{a}{\text{argmax}}\, Q_*(s, a))$.

Policy iteration via dynamic programming alternates policy evaluation and policy improvement:

\begin{aligned} & \text{1. Policy evaluation:} \quad Q_\pi(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}[V_\pi(s')] \\ & \text{2. Policy improvement:} \quad \pi(a \mid s) \leftarrow \delta\big(a = \underset{a}{\text{argmax}}\, Q_\pi(s, a)\big) \\ \end{aligned}

The approach still involves an explicit representation and optimization of $\pi$. We can remove it by rewriting the iteration purely in terms of $Q$:

Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\Big[\max_{a'} Q(s', a')\Big]
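A minimal sketch of this tabular Q-iteration, assuming the transition probabilities and rewards are known arrays (all shapes and values below are made up for illustration):

```python
import numpy as np

num_states, num_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))  # P[s, a, s']
R = rng.normal(size=(num_states, num_actions))                          # r(s, a)

Q = np.zeros((num_states, num_actions))
for _ in range(100):
    V = Q.max(axis=1)       # V(s') = max_a' Q(s', a')
    Q = R + gamma * P @ V   # Q(s, a) <- r(s, a) + gamma * E_{s'}[V(s')]
```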

Fitted Q-learning:

If the state space is high-dimensional or infinite, it is not feasible to represent the value functions in tabular form. In this case, we use two parameterized functions (e.g., neural networks) to represent them. Then, we adopt fitted Q-iteration as stated in the referenced paper: iteratively regress the parameterized $Q$ onto bootstrapped targets.
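A minimal sketch of the fitted Q-iteration loop with a linear-in-features $Q_\phi$; the feature map, the dataset format `(s, a, r, s')`, and the hyperparameters are illustrative assumptions rather than the lecture's setup.

```python
import numpy as np

gamma, num_actions, state_dim = 0.99, 3, 4
feat_dim = state_dim + num_actions
rng = np.random.default_rng(0)
phi = rng.normal(scale=0.01, size=feat_dim)   # parameters of Q_phi

def features(s, a):
    """Hypothetical state-action features: raw state + one-hot action."""
    one_hot = np.zeros(num_actions)
    one_hot[a] = 1.0
    return np.concatenate([np.asarray(s, dtype=float), one_hot])

def Q(s, a, phi):
    return features(s, a) @ phi

def fitted_q_iteration(dataset, phi, num_iters=50, lr=1e-2):
    """Repeatedly regress Q_phi(s, a) onto targets y = r + gamma * max_a' Q_phi(s', a')."""
    for _ in range(num_iters):
        targets = [r + gamma * max(Q(s2, a2, phi) for a2 in range(num_actions))
                   for (_, _, r, s2) in dataset]
        for (s, a, _, _), y in zip(dataset, targets):
            x = features(s, a)
            phi = phi - lr * (x @ phi - y) * x   # gradient step on (Q_phi(s,a) - y)^2 / 2
    return phi
```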

Soft Policy Gradients

From the perspective of control as inference, we optimize for the following objective:

\begin{aligned} J(\theta) &= -D_{KL}(\hat{p}(\tau) \,\|\, p(\tau)) \\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim \hat{p}(s_t, a_t)}[r(s_t, a_t)] + \mathbb{E}_{s_t \sim \hat{p}(s_t)}[\mathcal{H}(\pi(a_t \mid s_t))]\\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t) - \log \pi(a_t \mid s_t)]\\ \end{aligned}

Now, following a policy gradient method such as REINFORCE, we just need to add an entropy bonus term to the rewards.
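Concretely, the only change to the REINFORCE sketch above is that each reward is augmented with $-\log \pi_\theta(a_t \mid s_t)$. A hypothetical update function, reusing the tabular softmax `policy` and logits `theta` from that sketch, could look like this:

```python
import numpy as np

def soft_reinforce_update(traj, theta, policy, lr=0.1):
    """traj: list of (s, a, r); policy(s) returns action probabilities under theta."""
    # Entropy-augmented rewards: r_t - log pi(a_t | s_t).
    soft_rewards = [r - np.log(policy(s)[a]) for (s, a, r) in traj]
    for t, (s, a, _) in enumerate(traj):
        G = sum(soft_rewards[t:])      # soft reward-to-go
        grad_log = -policy(s)          # grad of log softmax w.r.t. logits:
        grad_log[a] += 1.0             # one_hot(a) - pi(.|s)
        theta[s] += lr * grad_log * G
    return theta
```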

Soft Q-learning

Next, we connect the soft policy gradient above to Q-learning. Recall the objective:

\begin{aligned} J(\theta) &= -D_{KL}(\hat{p}(\tau) \,\|\, p(\tau)) \\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim \hat{p}(s_t, a_t)}[r(s_t, a_t)] + \mathbb{E}_{s_t \sim \hat{p}(s_t)}[\mathcal{H}(\pi(a_t \mid s_t))]\\ &= \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t) - \log \pi(a_t \mid s_t)]\\ \end{aligned}

Note that

\begin{aligned} \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[\log \pi(a_t \mid s_t)] &= \int \nabla_\theta \Big[ p_\theta(\tau) \sum_{t=1}^T \log \pi(a_t \mid s_t)\Big] d\tau\\ &= \int \nabla_\theta p_\theta(\tau)\sum_{t=1}^T \log \pi(a_t \mid s_t) + p_\theta(\tau) \nabla_\theta\sum_{t=1}^T \log \pi(a_t \mid s_t) \, d\tau\\ &= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\sum_{t=1}^T \log \pi(a_t \mid s_t) + p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) \, d\tau\\ &= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\Big[ \sum_{t=1}^T \log \pi(a_t \mid s_t) + 1 \Big] d\tau,\\ \end{aligned}

where the third equality uses $\nabla_\theta \sum_{t=1}^T \log \pi(a_t \mid s_t) = \nabla_\theta \log p_\theta(\tau)$, since the dynamics do not depend on $\theta$.

Recall from the previous lecture that the optimal maximum-entropy policy satisfies

\begin{aligned} \log \pi(a_t \mid s_t) = Q(s_t, a_t) - V(s_t), \qquad V(s_t) = \text{soft} \max_a Q(s_t, a) := \log \int \exp\big(Q(s_t, a)\big)\, da. \end{aligned}

Now, combining these and rearranging the terms, we get:

\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t) - \log \pi(a_t \mid s_t)]\\ &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log\pi_\theta(a_t \mid s_t) \Big[ r(s_t, a_t) + \Big(\sum_{t'=t+1}^T r(s_{t'}, a_{t'}) - \log \pi_\theta(a_{t'} \mid s_{t'})\Big) - \log \pi_\theta (a_t \mid s_t) - 1 \Big] \\ &= \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \Big(\nabla_\theta Q_\theta(s_t, a_t) - \nabla_\theta V_\theta(s_t)\Big) \Big[ r(s_t, a_t) + Q_\theta(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) + V_\theta(s_t) \Big] \\ &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta Q_\theta(s_t, a_t) \Big[ r(s_t, a_t) + \text{soft} \max_{a_{t+1}} Q_\theta(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) \Big] \\ \end{aligned}

Now the soft Q-learning update is very similar to the standard Q-learning update:

Q_\theta(s_t, a_t) \leftarrow r(s_t, a_t) + \text{soft} \max_{a_{t+1}} Q_\theta(s_{t+1}, a_{t+1}), \qquad \text{where} \quad \text{soft} \max_a Q(s, a) = \log \int \exp\big(Q(s, a)\big)\, da.

Additionally, we can introduce a temperature $\alpha$ in the soft max, $\alpha \log \int \exp(Q(s, a)/\alpha)\, da$, to control the tradeoff between entropy and rewards.
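A minimal sketch of this update for a tabular $Q$ with discrete actions; the shapes, learning rate, temperature, and the toy transition below are illustrative assumptions.

```python
import numpy as np

num_states, num_actions = 4, 3
alpha, lr = 0.5, 0.1                       # temperature and learning rate
Q = np.zeros((num_states, num_actions))

def soft_max(q_values, alpha):
    """Temperature-controlled soft max: alpha * log sum_a exp(Q(s,a) / alpha)."""
    z = q_values / alpha
    return alpha * (np.log(np.sum(np.exp(z - z.max()))) + z.max())

def soft_q_update(s, a, r, s_next):
    """Move Q(s, a) toward the soft Bellman target r + soft max_a' Q(s', a')."""
    target = r + soft_max(Q[s_next], alpha)
    Q[s, a] += lr * (target - Q[s, a])

# As alpha -> 0 the soft max approaches the hard max, recovering standard Q-learning.
soft_q_update(s=0, a=1, r=1.0, s_next=2)
```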

To summarize, there are a few benefits of soft optimality: