Lecture 21: Sequential decision making (part 2): The algorithms
An introduction to maximum-entropy RL algorithms
Recap: Control as Inference
In the control setting:
- Initial state: $p(s_1)$
- Transition: $p(s_{t+1} \mid s_t, a_t)$
- Policy: $\pi(a_t \mid s_t)$
- Reward: $r(s_t, a_t)$
In the inference setting (the graphical model):
- Initial state: $p(s_1)$, same as in control
- Transition: $p(s_{t+1} \mid s_t, a_t)$, same as in control
- Policy: an action prior $p(a_t \mid s_t)$, typically taken to be uniform
- Reward: not represented directly; it enters through the optimality variables
- Optimality: binary variables $\mathcal{O}_t$ with $p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp(r(s_t, a_t))$
In the classical deterministic RL setup, we have:
$$a_{1:T}^\star = \arg\max_{a_{1:T}} \sum_{t=1}^{T} r(s_t, a_t) \quad \text{subject to } s_{t+1} = f(s_t, a_t).$$
Let $\tau = (s_1, a_1, \ldots, s_T, a_T)$ denote the full trajectory, and abbreviate the posterior as $p(\tau) = p(\tau \mid \mathcal{O}_{1:T})$. Running inference in this GM allows us to compute:
- the backward messages $\beta_t(s_t, a_t) = p(\mathcal{O}_{t:T} \mid s_t, a_t)$;
- the posterior policy $p(a_t \mid s_t, \mathcal{O}_{1:T})$;
- the forward messages $\alpha_t(s_t) = p(s_t \mid \mathcal{O}_{1:t-1})$.
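Concretely, expanding the graphical model (a standard identity, reconstructed here for reference) gives the trajectory posterior

$$p(\tau \mid \mathcal{O}_{1:T}) = \frac{p(\tau, \mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})} \propto p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \exp\Big( \sum_{t=1}^{T} r(s_t, a_t) \Big),$$

so under deterministic dynamics the posterior weights each feasible trajectory by its exponentiated total reward.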
From the perspective of control as inference, we optimize for the following objective:
$$J(\pi) = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \mathcal{H}(\pi(\cdot \mid s_t)) \big]$$
- For deterministic dynamics, we get this objective directly.
- For stochastic dynamics, we obtain it from the ELBO on the evidence $\log p(\mathcal{O}_{1:T})$ (a one-line sketch follows this list).
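As a one-line sketch (assuming, as in the previous lecture, a variational distribution $q(\tau)$ that fixes the true dynamics and learns only the policy $\pi$):

$$\log p(\mathcal{O}_{1:T}) \ge \mathbb{E}_{\tau \sim q}\big[ \log p(\tau, \mathcal{O}_{1:T}) - \log q(\tau) \big] = \sum_{t=1}^{T} \mathbb{E}\big[ r(s_t, a_t) + \mathcal{H}(\pi(\cdot \mid s_t)) \big],$$

where the initial-state and dynamics terms cancel because $p$ and $q$ share them.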
Types of RL Algorithms
The objective of RL is to find the optimal parameters $\theta^\star$ that maximize the expected reward:
$$\theta^\star = \arg\max_\theta\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big]$$
- Policy gradients: directly optimize the above stochastic objective.
- Value-based: estimate the V-function or Q-function of the optimal policy (no explicit policy; the policy is derived from the value function).
- Actor-critic: estimate the V/Q-function of the current policy and use it to improve the policy (not covered).
- Model-based methods: not covered.
Policy gradients
In policy gradient methods, we directly optimize the above expected-reward objective with respect to the policy $\pi_\theta$ itself.
The log-derivative trick, $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, is applied to the above equation, giving
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[ \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) \Big( \sum_{t=1}^{T} r(s_t, a_t) \Big) \Big];$$
the dynamics terms vanish because they do not depend on $\theta$.
The REINFORCE algorithm:
1. Sample trajectories $\{\tau^i\}_{i=1}^{N}$ by running $\pi_\theta$.
2. Estimate the gradient: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \big( \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \big) \big( \sum_t r(s_t^i, a_t^i) \big)$.
3. Update: $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$.
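As a concrete illustration, here is a minimal REINFORCE sketch in NumPy on a hypothetical two-state, two-action MDP with a tabular softmax policy; the environment (`P`, `R`), horizon, and hyperparameters are illustrative, not from the lecture:

```python
import numpy as np

# Hypothetical toy MDP: P[s, a] is the next-state distribution, R[s, a] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
T, N, lr = 20, 64, 0.05
theta = np.zeros((2, 2))                     # policy logits, one row per state

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
for _ in range(200):
    grad = np.zeros_like(theta)
    for _ in range(N):                       # 1. sample trajectories from pi_theta
        s, score, ret = 0, np.zeros_like(theta), 0.0
        for _ in range(T):
            pi = softmax(theta[s])
            a = rng.choice(2, p=pi)
            g = -pi.copy()
            g[a] += 1.0                      # grad of log pi(a|s) w.r.t. logits of s
            score[s] += g
            ret += R[s, a]
            s = rng.choice(2, p=P[s, a])
        grad += score * ret                  # 2. Monte Carlo gradient estimate
    theta += lr * grad / N                   # 3. gradient-ascent step on theta
```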
Q-Learning
Q-learning does not explicitly optimize the policy $\pi$; it optimizes the estimates of the value functions $Q(s, a)$ and $V(s)$. The optimal policy can then be calculated by
$$\pi^\star(a_t \mid s_t) = \mathbf{1}\big[ a_t = \arg\max_{a} Q(s_t, a) \big].$$
Policy iteration via dynamic programming alternates two steps:

Policy iteration (improvement): update the policy greedily,
$$\pi(s) \leftarrow \arg\max_{a} Q^\pi(s, a).$$

Policy evaluation: compute the value of the current policy,
$$Q^\pi(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V^\pi(s') \big], \qquad V^\pi(s) = Q^\pi(s, \pi(s)).$$

The approach still involves explicit optimization of $\pi$. We can rewrite the iteration as value iteration, which skips the explicit policy:
$$Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V(s') \big], \qquad V(s) \leftarrow \max_{a} Q(s, a).$$
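A minimal tabular value-iteration sketch in NumPy, reusing the illustrative `P` (transitions) and `R` (rewards) from the REINFORCE snippet above; the discount `gamma` is an assumed hyperparameter:

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q(s, a) <- r(s, a) + gamma E_{s'}[V(s')]
    V_new = Q.max(axis=1)          # V(s) <- max_a Q(s, a)
    if np.abs(V_new - V).max() < 1e-8:
        break                      # stop once the values have converged
    V = V_new

pi = Q.argmax(axis=1)              # greedy policy recovered from Q, no explicit pi
```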
Fitted Q-learning:
If the state space is high-dimensional or infinite, it is not feasible to represent $Q(s, a)$ and $V(s)$ in tabular form. In this case, we use two parameterized functions $Q_\phi$ and $V_\phi$ to represent them. Then we adopt fitted Q-iteration, as stated in the paper cited in lecture: collect a dataset of transitions $\{(s_i, a_i, r_i, s_i')\}$, compute targets $y_i = r_i + \gamma \max_{a'} Q_\phi(s_i', a')$, and fit
$$\phi \leftarrow \arg\min_\phi \sum_i \big( Q_\phi(s_i, a_i) - y_i \big)^2.$$
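A sketch of fitted Q-iteration under simple assumptions: a linear Q-function over one-hot state-action features (a stand-in for a neural network), a dataset of random-exploration transitions from the toy MDP above, and closed-form least squares in place of gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, n_s, n_a = 0.9, 2, 2

def feat(s, a):
    # One-hot feature for the (s, a) pair; stands in for a richer featurizer.
    x = np.zeros(n_s * n_a)
    x[s * n_a + a] = 1.0
    return x

# Collect a dataset of transitions (s, a, r, s') under random exploration.
data = []
for _ in range(2000):
    s, a = rng.integers(n_s), rng.integers(n_a)
    s2 = rng.choice(n_s, p=P[s, a])
    data.append((s, a, R[s, a], s2))

phi = np.zeros(n_s * n_a)                    # weights of the linear Q-function
X = np.stack([feat(s, a) for s, a, _, _ in data])
for _ in range(50):
    # Targets y_i = r_i + gamma * max_a' Q_phi(s'_i, a')
    y = np.array([r + gamma * max(phi @ feat(s2, a2) for a2 in range(n_a))
                  for _, _, r, s2 in data])
    # phi <- argmin_phi sum_i (Q_phi(s_i, a_i) - y_i)^2, solved in closed form
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
```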
Soft Policy Gradients
From the perspective of control as inference, we optimize for the entropy-augmented objective:
$$J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta}\big[ r(s_t, a_t) - \log \pi_\theta(a_t \mid s_t) \big].$$
Now, following a policy gradient method such as REINFORCE, we just need to add an entropy bonus term to the rewards: replace $r(s_t, a_t)$ with $r(s_t, a_t) - \log \pi_\theta(a_t \mid s_t)$.
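As a sketch of the change, the hypothetical helper below computes the resulting entropy-augmented return for one sampled trajectory (names and inputs are illustrative):

```python
import numpy as np

def soft_return(rewards, action_probs):
    """Entropy-augmented return: sum_t [ r_t - log pi_theta(a_t | s_t) ].

    rewards:      per-step rewards r_t along one sampled trajectory
    action_probs: pi_theta(a_t | s_t) evaluated at the sampled actions
    """
    return float(np.sum(rewards - np.log(action_probs)))

# The same rewards score higher when the sampled actions were less likely,
# which is what rewards exploration under the soft objective.
print(soft_return(np.array([1.0, 0.0]), np.array([0.9, 0.5])))
```

In the REINFORCE sketch above, this amounts to accumulating `ret += R[s, a] - np.log(pi[a])` instead of `ret += R[s, a]`.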
Soft Q-Learning
Next, we connect the previous policy gradient to Q-learning. We can rewrite the policy gradient as follows:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{Q}(s_t^i, a_t^i), \qquad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \sum_{t' > t} \big( r(s_{t'}, a_{t'}) - \log \pi_\theta(a_{t'} \mid s_{t'}) \big).$$
Note that for a policy of the soft-optimal form, $\pi_\theta(a_t \mid s_t) = \exp\big( Q_\theta(s_t, a_t) - V(s_t) \big)$, so $\nabla_\theta \log \pi_\theta(a_t \mid s_t) = \nabla_\theta Q_\theta(s_t, a_t) - \nabla_\theta V(s_t)$.
Recall from the previous lecture that $V(s_t) = \operatorname{soft}\max_{a} Q_\theta(s_t, a) = \log \int \exp\big( Q_\theta(s_t, a) \big)\, da$.
Now, combining these and rearranging the terms, we get:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta Q_\theta(s_t^i, a_t^i)\, \Big( r(s_t^i, a_t^i) + \operatorname{soft}\max_{a'} Q_\theta(s_{t+1}^i, a') - Q_\theta(s_t^i, a_t^i) \Big).$$
Now the soft Q-learning update is very similar to Q-learning:
$$Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V(s') \big], \qquad \text{where} \quad V(s') = \operatorname{soft}\max_{a'} Q(s', a').$$
Additionally, we can set the temperature $\alpha$ in the softmax, $V(s') = \alpha \log \int \exp\big( Q(s', a') / \alpha \big)\, da'$, to control the trade-off between entropy and rewards.
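A minimal tabular sketch: relative to the value-iteration snippet above, the only change is replacing the hard max with a temperature-scaled log-sum-exp (`alpha` is an assumed temperature, and `P`, `R` are the same illustrative arrays):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha = 0.9, 0.5            # discount and temperature (assumed values)

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V          # soft Bellman backup: Q <- r + gamma E[V(s')]
    # V(s) <- alpha * log sum_a exp(Q(s, a) / alpha): the "soft max"
    V_new = alpha * np.log(np.exp(Q / alpha).sum(axis=1))
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

# Soft-optimal policy: pi(a|s) = exp((Q(s, a) - V(s)) / alpha); rows sum to 1.
pi = np.exp((Q - V[:, None]) / alpha)
```

As `alpha` goes to 0 the log-sum-exp approaches the hard max, recovering standard Q-learning; larger `alpha` weights entropy more heavily.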
To summarize, there are a few benefits of soft optimality:
- Improved exploration and prevention of entropy collapse
- Easier to specialize (fine-tune) policies for more specific tasks
- A principled approach to breaking ties
- Better robustness (due to wider coverage of states)
- Reduces to hard optimality as the reward magnitude increases