Be able to work through multiple iterations of particle filtering.
Implement the Forward-Backward Algorithm for HMMs.
Implement particle filtering for a variety of Bayesian Networks.
Apply smoothing to HMM queries for each time step.
Hidden Markov Models are used to describe time or spatial series data; i.e for situations that dynamically change over space or time such as speech recognition. To model such situations, we look at the states at each point in time. Another important aspect to consider while modeling the world is that not all states are observable. Rather, we may only be able to observe some proxy for the state, known as evidence. For example, if you are tracking the movement of an object on a grid, you may only be able to observe the general region it is located in.
\(X_t\) denotes the set of state variables at time \(t\) that are not observable, or hidden. \(E_t\) likewise denotes the set of observable state variables.
            We define HMMs with three components:
\(P(X_0)\): Initial Distribution
\(P(X_t | X_{t - 1})\) Transition (Dynamics) Model
\(P(E_t | X_t)\): Sensor (Observation) Model

Consistent with Bayes Nets, we can represent the joint probability of HMMs as: \[P(X_0, E_1, X_1, ..., E_t,
            X_t) =
            P(X_0)\prod_t P(X_t | X_{t-1})P(E_t | X_t)\] Consider the example discussed in lecture. You want to know
            whether or
            not it rains every day. However, being quarantined in a windowless home, the only outside information about
            outside
            world is whether your friend comes to your home with an umbrella or not. In this case, the time interval
            \(t\) will be
            each day, \(X_t\) is set of variables denoting whether it rains in day \(t\), and \(E_t\) is set of
            variables denoting
            whether you see umbrella in day \(t\).
HMMs specifically describe the process with single discrete random variables. If you happen to have more than one state variable, for instance \(Rain_t\), and \(Sunny_t\), to make it consistent with HMM framework, we would combine them to a single variable by using tuple as a value. Although this is out of scope for this class, the reason for this specific structure is for simple matrix implementation for inference tasks.
The following section is a brief overview of inference tasks that could be done on HMMs. More mathematical details about the filtering and smoothing algorithms will be discussed in the next section.
The goal of filtering is to find the belief state given all evidence to date: \(P(X_t |
              e_{1:t})\). In other words, we want to know what the current state is, given all of the current and
            past
            evidence. In the following picture, we query \(P(X_4 | e_{1:4})\). The
            belief state
            is \(X_4\). 
The goal of prediction is to learn about some future state, given all of the current and past evidence:
            \(P(X_{t+k} | e_{1:t})\), where \(k >
              0\).  
          
The goal of smoothing is to calculate a posterior distribution over the past state, given all of the
            current and past
            evidence: \(P(X_k | e_{1:t})\), where \(1 \leq k
              <
              t\). The reason for doing smoothing is that we now can make much better estimates of the past
            states with
            more evidence after the fact, which plays an important role for learning tasks. 
We might also want to infer about the whole sequence instead of single state. Given all the observations until \(t\), we might want to find the sequence of states that is most likely to have produced the observations. To do so, we compute most likely explanation: \(argmax_{X_{1:t}} P (X_{1:t} | e_{1:t})\). One application for this inference task is speech recognition.
 
        The forward algorithm is designed to answer filtering queries. It is based on the following expansion of
            the term
            \(P(X_t | e_{1:t})\): \[\begin{align*}
              P(X_t | e_{1:t}) &= P(X_t | e_t, e_{1:t-1}) \\
              &Expanding\ shorthand\ e_{1:t} \\
              &= \alpha P(X_t, e_t | e_{1:t-1}) \\
              &Definition\ of\ Conditional\ Probability \\
              &= \alpha \sum_{x_{t - 1}} P(X_t, e_t, x_{t - 1} | e_{1:t-1}) \\
              &Sum\ over\ x_{t - 1} \\
              &= \alpha \sum_{x_{t - 1}} P(x_{t - 1} | e_{1:t-1})P(X_t | x_{t - 1}, e_{1:t - 1})P(e_t | X_t, x_{t -
              1}, e_{1:t
              - 1}) \\
              &Chain\ Rule \\
              &= \alpha \sum_{x_{t - 1}} P(x_{t - 1} | e_{1:t-1})P(X_t | x_{t - 1})P(e_t | X_t) \\
              &Conditional\ Independence\ from\ BN \\
              &= \alpha P(e_t | X_t) \sum_{x_{t - 1}} P(x_{t - 1} | e_{1:t-1})P(X_t | x_{t - 1})
              \end{align*}\] Notice that the term \(P(x_{t - 1} | e_{1:t - 1})\)
            appears
            in the expression for \(P(X_t | e_{1:t})\). This suggests a recursive
            method of
            answering filtering queries, where we "unroll" a query as a combination of smaller versions of the same
            query. The
            cost of this procedure is \(O(|X|^2)\), where \(|X|\) is the number of states (you can verify this by solving a recurrence
            relation if
            interested). Quadratic time is prohibitive for problems with large state spaces, which leads us to
            approximate methods
            we will discuss later.
          
Also note that filtering is essentially two step process: prediction and update. Notice \(\sum_{x_{t - 1}} P(x_{t - 1} | e_{1:t-1})P(X_t | x_{t - 1}) = P(X_t |
              e_{1:t-1})\)
            represents the one step prediction of the next state based on current distribution at \(t-1\). Then, it is multiplied with \(P(e_t |
              X_t)\) to
            update with the new evidence \(e_t\) just
            observed.
          
Example
For a more concrete example, consider the rain and umbrella example discussed in 2.2. Initially, we assume
            the
            uniform distribution. In other words, \(P(X_0 = 0) = 0.5\) and \(P(X_0 = 1) = 0.5\). Also, consider the
            following
            transition and sensor models: \[\begin{array}{|c|c|c|} 
 \hline 
 X_t & X_{t-1} & P(X_t |
            X_{t-1}) \\
            \hline 
 1 & 0 & 0.3 \\\hline 
 1 & 1 & 0.7 \\\hline 
 0 & 0 & 0.7 \\\hline
            
 0
            & 1 & 0.3 \\\hline 
 \end{array} 
 \begin{array}{|c|c|c|} 
 \hline 
 X_t & E_{t}
            & P(E_t |
            X_t) \\ \hline 
 1 & 0 & 0.1\\\hline 
 1 & 1 & 0.9 \\\hline 
 0 & 0 & 0.8
            \\\hline
            
 0 & 1 & 0.2 \\\hline 
 \end{array}\]
          
On day 1, we observe your friend brought an umbrella, i.e \(E_1 = 1\). First, we perform prediction:
            \[P(X_1=1) =
            \sum_{x_0} P(X_1 | x_0)P(x_0) = 0.3*0.5 + 0.7*0.5 = 0.5\] (Intuitively, what we’re doing here is considering
            all
            possibilities of \(x_0\), whether it rained on day 0, and computing the likelihood of it raining on day 1
            using the
            transition model.)
Then, we can update this prediction with the new observation: \[P(X_1 = 1 |
            E_1 = 1) =
            \alpha P(E_1=1 | X_1=1)P(X_1=1) = \alpha 0.9*0.5 = \alpha 0.45\] We can get the final posterior distribution
            for
            \(X_1\) by repeating for \(P(X_1 = 0 | E_1 = 1)\) and normalizing to incorporate \(\alpha\) back.
          
Having calculated these forward queries for filtering, we can use these results to retrieve smoothing query
            results
            as well. This is the "backwards" part of the algorithm. Recall that smoothing queries take the form \(P(X_k | e_{1:t}), 1 \leq k < t\).
          
We first observe that \(P(X_k | e_{1:t}) = \alpha P(e_{k + 1: t} | X_k) P(X_k | e_{1:k})\) (\(\alpha\) is not necessarily the same proportionality constant as is used in the forward pass). We derive this from the definition of conditional probability and the fact that \(e_{k + 1: t} \perp\!\!\!\perp e_{1:k} | X_k\). Notice that we have the second term from the forward algorithm. The first term is known as the "backwards message". \[\begin{align*} P(e_{k + 1: t} | X_k) = \sum_{x_{k + 1}}P(x_{k + 1} | X_k)P(e_{k + 1} | x_{k + 1}) P(e_{k+ 2: t}| x_{k + 1}) \end{align*}\] Where the first two terms in the sum can be obtained from the model itself, and the last term is recursive as in the forward algorithm.
As an exercise, try to derive the formula for the "backward message".
In Particle Filtering, we use likelihood-weighted sampling to instead approximate (rather than exactly infer) queries. The basic idea is as follows: we start with some initial distribution of samples (referred to as particles), and iteratively "move them around" according to the transition model to estimate the probability distribution \(X_t\) at each timestep \(t\). To incorporate observed evidence at each timestep, we also weight each sample using the sensor model. More concretely,
Our representation of \(P(X_t)\) is a list of \(N\) particles/samples. We approximate \(P(X_t = x)\) by \[\hat{P}(X_t=x) = \frac{\text{number of particles with value }x}{N}\]
Generally, \(N << ~\mid \text{domain}(X) \mid\)
So, many \(x\) will have \(\hat{P}(x) =
            0\).
With more particles, we will have higher accuracy with respect to the actual values.
If \(N=1\), there is no need for weighting or the resample step below (unless the weight is 0)
Particle Filtering Algorithm: Starting with our belief of some distribution \(\hat{P}(X_t)\),
Propagate Forward
For each particle \(x_t\) draw samples for \(X_{t+1}\) from \(P(X_{t+1} \mid x_t)\).
Observe (i.e., weight samples based on evidence)
Construct \(\hat{P}(X,e_{t+1})\). For each possible \(x\),
Weight \(w = P(e_{t+1} | x)\)
Compute \(\hat{P}(x, e_{t+1})\) by multiplying our current belief of \(X_{t+1}\)’s distribution (which doesn’t currently account for evidence) by \(w\): \(\hat{P}(x,e_{t+1}) = \hat{P}(x)P(e_{t+1} \mid x)\)
Normalize \(\hat{P}(X,e_{t+1})\) to get \(\hat{P}(X \mid e_{t+1}) = \frac{\hat{P}(X,e_{t+1})}{\sum_{x} \hat{P}(x, e_{t+1})}\). This is our updated belief which now incorporates observed evidence.
Resample
Resample \(N\) times from the new sample distribution \(\hat{P}(X \mid e_{t+1})\)
Note: We’re not picky on the \(P\) vs. \(\hat{P}\) notation - we just make the distinction here to explicitly show what’s a belief (i.e., our approximation) vs. a true value/probability distribution.