Lecture 21: Sequential decision making (part 2): The algorithms
An introduction to maximum-entropy RL algorithms
Recap: Control as Inference
In the control setting:
- Initial state: $p(s_1)$
- Transition: $p(s_{t+1} \mid s_t, a_t)$
- Policy: $\pi(a_t \mid s_t)$
- Reward: $r(s_t, a_t)$
In the inference setting (the graphical model):
- Initial state: $p(s_1)$, same as in control
- Transition: $p(s_{t+1} \mid s_t, a_t)$, same as in control
- Policy: an action prior $p(a_t \mid s_t)$, typically taken to be uniform
- Reward: not represented directly; it enters through the optimality variables
- Optimality: binary variables $\mathcal{O}_t$ with $p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp(r(s_t, a_t))$
In the classical deterministic RL setup, we have:
$$a_{1:T}^\star = \arg\max_{a_{1:T}} \sum_{t=1}^{T} r(s_t, a_t) \quad \text{subject to } s_{t+1} = f(s_t, a_t).$$
Let $\tau = (s_1, a_1, \ldots, s_T, a_T)$ denote the full trajectory, and abbreviate the posterior as $p(\tau) = p(\tau \mid \mathcal{O}_{1:T})$. Running inference in this GM allows us to compute:
- the backward messages $\beta_t(s_t, a_t) = p(\mathcal{O}_{t:T} \mid s_t, a_t)$;
- the posterior policy $p(a_t \mid s_t, \mathcal{O}_{1:T})$;
- the forward messages $\alpha_t(s_t) = p(s_t \mid \mathcal{O}_{1:t-1})$.
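Concretely, expanding the graphical model (a standard identity, reconstructed here for reference) gives the trajectory posterior

$$p(\tau \mid \mathcal{O}_{1:T}) = \frac{p(\tau, \mathcal{O}_{1:T})}{p(\mathcal{O}_{1:T})} \propto p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \exp\Big( \sum_{t=1}^{T} r(s_t, a_t) \Big),$$

so under deterministic dynamics the posterior weights each feasible trajectory by its exponentiated total reward.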
From the perspective of control as inference, we optimize for the following objective:
$$J(\pi) = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \mathcal{H}(\pi(\cdot \mid s_t)) \big]$$
- For deterministic dynamics, we get this objective directly.
- For stochastic dynamics, we obtain it from the ELBO on the evidence $\log p(\mathcal{O}_{1:T})$ (a one-line sketch follows this list).
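As a one-line sketch (assuming, as in the previous lecture, a variational distribution $q(\tau)$ that fixes the true dynamics and learns only the policy $\pi$):

$$\log p(\mathcal{O}_{1:T}) \ge \mathbb{E}_{\tau \sim q}\big[ \log p(\tau, \mathcal{O}_{1:T}) - \log q(\tau) \big] = \sum_{t=1}^{T} \mathbb{E}\big[ r(s_t, a_t) + \mathcal{H}(\pi(\cdot \mid s_t)) \big],$$

where the initial-state and dynamics terms cancel because $p$ and $q$ share them.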
Types of RL Algorithms
The objective of RL is to find the optimal parameters $\theta^\star$ that maximize the expected reward:
$$\theta^\star = \arg\max_\theta\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big]$$
- Policy gradients: directly optimize the above stochastic objective.
- Value-based: estimate the V-function or Q-function of the optimal policy (no explicit policy; the policy is derived from the value function).
- Actor-critic: estimate the V/Q-function of the current policy and use it to improve the policy (not covered).
- Model-based methods: not covered.
Policy gradients
In policy gradient methods, we directly optimize the above expected-reward objective with respect to the policy $\pi_\theta$ itself.
The log-derivative trick, $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, is applied to the above equation, giving
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[ \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) \Big( \sum_{t=1}^{T} r(s_t, a_t) \Big) \Big];$$
the dynamics terms vanish because they do not depend on $\theta$.
The REINFORCE algorithm:
1. Sample trajectories $\{\tau^i\}_{i=1}^{N}$ by running $\pi_\theta$.
2. Estimate the gradient: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \big( \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \big) \big( \sum_t r(s_t^i, a_t^i) \big)$.
3. Update: $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$.
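As a concrete illustration, here is a minimal REINFORCE sketch in NumPy on a hypothetical two-state, two-action MDP with a tabular softmax policy; the environment (`P`, `R`), horizon, and hyperparameters are illustrative, not from the lecture:

```python
import numpy as np

# Hypothetical toy MDP: P[s, a] is the next-state distribution, R[s, a] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
T, N, lr = 20, 64, 0.05
theta = np.zeros((2, 2))                     # policy logits, one row per state

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
for _ in range(200):
    grad = np.zeros_like(theta)
    for _ in range(N):                       # 1. sample trajectories from pi_theta
        s, score, ret = 0, np.zeros_like(theta), 0.0
        for _ in range(T):
            pi = softmax(theta[s])
            a = rng.choice(2, p=pi)
            g = -pi.copy()
            g[a] += 1.0                      # grad of log pi(a|s) w.r.t. logits of s
            score[s] += g
            ret += R[s, a]
            s = rng.choice(2, p=P[s, a])
        grad += score * ret                  # 2. Monte Carlo gradient estimate
    theta += lr * grad / N                   # 3. gradient-ascent step on theta
```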
Q-Learning
Q-learning does not explicitly optimize the policy $\pi$; it optimizes the estimates of the value functions $Q(s, a)$ and $V(s)$. The optimal policy can then be calculated by
$$\pi^\star(a_t \mid s_t) = \mathbf{1}\big[ a_t = \arg\max_{a} Q(s_t, a) \big].$$
Policy iteration via dynamic programming alternates two steps:

Policy iteration (improvement): update the policy greedily,
$$\pi(s) \leftarrow \arg\max_{a} Q^\pi(s, a).$$

Policy evaluation: compute the value of the current policy,
$$Q^\pi(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V^\pi(s') \big], \qquad V^\pi(s) = Q^\pi(s, \pi(s)).$$

The approach still involves explicit optimization of $\pi$. We can rewrite the iteration as value iteration, which skips the explicit policy:
$$Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V(s') \big], \qquad V(s) \leftarrow \max_{a} Q(s, a).$$
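A minimal tabular value-iteration sketch in NumPy, reusing the illustrative `P` (transitions) and `R` (rewards) from the REINFORCE snippet above; the discount `gamma` is an assumed hyperparameter:

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q(s, a) <- r(s, a) + gamma E_{s'}[V(s')]
    V_new = Q.max(axis=1)          # V(s) <- max_a Q(s, a)
    if np.abs(V_new - V).max() < 1e-8:
        break                      # stop once the values have converged
    V = V_new

pi = Q.argmax(axis=1)              # greedy policy recovered from Q, no explicit pi
```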
Fitted Q-learning:
If the state space is high-dimensional or infinite, it is not feasible to represent $Q(s, a)$ and $V(s)$ in tabular form. In this case, we use two parameterized functions $Q_\phi$ and $V_\phi$ to represent them. Then we adopt fitted Q-iteration, as stated in the paper cited in lecture: collect a dataset of transitions $\{(s_i, a_i, r_i, s_i')\}$, compute targets $y_i = r_i + \gamma \max_{a'} Q_\phi(s_i', a')$, and fit
$$\phi \leftarrow \arg\min_\phi \sum_i \big( Q_\phi(s_i, a_i) - y_i \big)^2.$$
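A sketch of fitted Q-iteration under simple assumptions: a linear Q-function over one-hot state-action features (a stand-in for a neural network), a dataset of random-exploration transitions from the toy MDP above, and closed-form least squares in place of gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, n_s, n_a = 0.9, 2, 2

def feat(s, a):
    # One-hot feature for the (s, a) pair; stands in for a richer featurizer.
    x = np.zeros(n_s * n_a)
    x[s * n_a + a] = 1.0
    return x

# Collect a dataset of transitions (s, a, r, s') under random exploration.
data = []
for _ in range(2000):
    s, a = rng.integers(n_s), rng.integers(n_a)
    s2 = rng.choice(n_s, p=P[s, a])
    data.append((s, a, R[s, a], s2))

phi = np.zeros(n_s * n_a)                    # weights of the linear Q-function
X = np.stack([feat(s, a) for s, a, _, _ in data])
for _ in range(50):
    # Targets y_i = r_i + gamma * max_a' Q_phi(s'_i, a')
    y = np.array([r + gamma * max(phi @ feat(s2, a2) for a2 in range(n_a))
                  for _, _, r, s2 in data])
    # phi <- argmin_phi sum_i (Q_phi(s_i, a_i) - y_i)^2, solved in closed form
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
```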
Soft Policy Gradients
From the perspective of control as inference, we optimize for the entropy-augmented objective:
$$J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta}\big[ r(s_t, a_t) - \log \pi_\theta(a_t \mid s_t) \big].$$
Now, following a policy gradient method such as REINFORCE, we just need to add an entropy bonus term to the rewards: replace $r(s_t, a_t)$ with $r(s_t, a_t) - \log \pi_\theta(a_t \mid s_t)$.
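As a sketch of the change, the hypothetical helper below computes the resulting entropy-augmented return for one sampled trajectory (names and inputs are illustrative):

```python
import numpy as np

def soft_return(rewards, action_probs):
    """Entropy-augmented return: sum_t [ r_t - log pi_theta(a_t | s_t) ].

    rewards:      per-step rewards r_t along one sampled trajectory
    action_probs: pi_theta(a_t | s_t) evaluated at the sampled actions
    """
    return float(np.sum(rewards - np.log(action_probs)))

# The same rewards score higher when the sampled actions were less likely,
# which is what rewards exploration under the soft objective.
print(soft_return(np.array([1.0, 0.0]), np.array([0.9, 0.5])))
```

In the REINFORCE sketch above, this amounts to accumulating `ret += R[s, a] - np.log(pi[a])` instead of `ret += R[s, a]`.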
Soft Q-Learning
Next, we connect the previous policy gradient to Q-learning. We can rewrite the policy gradient as follows:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{Q}(s_t^i, a_t^i), \qquad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \sum_{t' > t} \big( r(s_{t'}, a_{t'}) - \log \pi_\theta(a_{t'} \mid s_{t'}) \big).$$
Note that for a policy of the soft-optimal form, $\pi_\theta(a_t \mid s_t) = \exp\big( Q_\theta(s_t, a_t) - V(s_t) \big)$, so $\nabla_\theta \log \pi_\theta(a_t \mid s_t) = \nabla_\theta Q_\theta(s_t, a_t) - \nabla_\theta V(s_t)$.
Recall from the previous lecture that $V(s_t) = \operatorname{soft}\max_{a} Q_\theta(s_t, a) = \log \int \exp\big( Q_\theta(s_t, a) \big)\, da$.
Now, combining these and rearranging the terms, we get:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta Q_\theta(s_t^i, a_t^i)\, \Big( r(s_t^i, a_t^i) + \operatorname{soft}\max_{a'} Q_\theta(s_{t+1}^i, a') - Q_\theta(s_t^i, a_t^i) \Big).$$
Now the soft Q-learning update is very similar to Q-learning:
$$Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V(s') \big], \qquad \text{where} \quad V(s') = \operatorname{soft}\max_{a'} Q(s', a').$$
Additionally, we can set the temperature $\alpha$ in the softmax, $V(s') = \alpha \log \int \exp\big( Q(s', a') / \alpha \big)\, da'$, to control the trade-off between entropy and rewards.
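A minimal tabular sketch: relative to the value-iteration snippet above, the only change is replacing the hard max with a temperature-scaled log-sum-exp (`alpha` is an assumed temperature, and `P`, `R` are the same illustrative arrays):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha = 0.9, 0.5            # discount and temperature (assumed values)

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V          # soft Bellman backup: Q <- r + gamma E[V(s')]
    # V(s) <- alpha * log sum_a exp(Q(s, a) / alpha): the "soft max"
    V_new = alpha * np.log(np.exp(Q / alpha).sum(axis=1))
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

# Soft-optimal policy: pi(a|s) = exp((Q(s, a) - V(s)) / alpha); rows sum to 1.
pi = np.exp((Q - V[:, None]) / alpha)
```

As `alpha` goes to 0 the log-sum-exp approaches the hard max, recovering standard Q-learning; larger `alpha` weights entropy more heavily.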
To summarize, there are a few benefits of soft optimality:
- Improved exploration and prevention of entropy collapse
- Easier to specialize (fine-tune) policies for more specific tasks
- A principled approach to breaking ties
- Better robustness (due to wider coverage of states)
- Reduces to hard optimality as the reward magnitude increases