# Probability Reference Sheet

### Learning Objectives

There are no official learning objectives for this section. It can be used as a reference sheet for probability in this course!

### Probability Notation

Suppose we have three random variables $$A, B$$, and $$C$$. Consider the expression $$P(+b, C) = \sum_{a \in \{a_1, a_2, a_3\}} P(a, +b, C)$$

In this course, we denote discrete random variables by capital letters and use them to represent all possible disjoint outcomes. In the above example, $$A, B,$$ and $$C$$ are random variables.

We use lower case letters to denote outcomes, i.e. possible values our variables can take on, such as $$+b$$ for the variable $$B$$, or $$a_1, a_2,$$ and $$a_3$$ for the variable $$A$$ in the above example.

We also use value variables like $$a$$, which stand for a single, unspecified outcome of a random variable. Note that these are also written with lower case letters, but unlike random variables they represent only one outcome at a time.

### Basic Rules

#### Definition of Conditional Probability

\begin{align*} P(X \mid Y) = \frac{P(X,Y)}{P(Y)} \end{align*}
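As a quick illustration, here is a minimal Python sketch of this definition on a toy joint distribution (the outcome names and probability values are illustrative assumptions, not part of the course material):

```python
# Toy joint distribution P(X, Y), stored as a dict keyed by outcome pairs.
joint = {
    ("+x", "+y"): 0.2, ("+x", "-y"): 0.3,
    ("-x", "+y"): 0.1, ("-x", "-y"): 0.4,
}

def p_y(y):
    """P(Y = y), obtained by summing the joint over all outcomes of X."""
    return sum(p for (_x, y2), p in joint.items() if y2 == y)

def p_x_given_y(x, y):
    """Definition of conditional probability: P(x | y) = P(x, y) / P(y)."""
    return joint[(x, y)] / p_y(y)

print(p_x_given_y("+x", "+y"))  # 0.2 / 0.3, i.e. about 0.667
```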

#### Product Rule

\begin{align*} P(X, Y) &= P(X \mid Y) P(Y) \\[0.5em] &= P(Y \mid X) P(X) \\[0.5em] P(X_1,X_2, X_3) &= P(X_1, X_2 \mid X_3)P(X_3) \\[0.5em] &= P(X_1 \mid X_2,X_3)P(X_2, X_3) \end{align*}

#### Bayes' Theorem

\begin{align*} P(Y \mid X) = \frac{P(X \mid Y)P(Y)}{P(X)} \end{align*}
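A quick numeric sanity check of Bayes' theorem in Python; all probability values here are made-up illustrative numbers:

```python
# Prior P(Y) and likelihood P(+x | Y = y), with illustrative values.
prior = {"+y": 0.4, "-y": 0.6}
likelihood = {"+y": 0.9, "-y": 0.2}   # P(+x | y)

# Evidence P(+x), via the law of total probability.
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes' theorem: P(+y | +x) = P(+x | +y) P(+y) / P(+x).
posterior = likelihood["+y"] * prior["+y"] / evidence
print(posterior)  # ~ 0.75
```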

#### Normalization

\begin{align*} P(Y \mid X) &= \frac{P(X, Y)}{P(X)} = \frac{P(X, Y)}{\sum_{y} P(X, y)} \\[0.5em] P(Y \mid X) &\propto P(X, Y) \\[0.5em] P(Y \mid X) &= \alpha P(X, Y) \textit{~~~~~Note the difference between }\propto\textit{ and }\alpha \\[0.5em] \alpha &= \frac{1}{P(X)} = \frac{1}{\sum_{y} P(X, y)} \end{align*}
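A small sketch of normalization in Python, recovering $$P(Y \mid +x)$$ from the joint entries $$P(+x, y)$$; the numbers are illustrative assumptions:

```python
# Joint entries P(+x, y) for each outcome y (illustrative values).
joint_plus_x = {"+y": 0.36, "-y": 0.12}

# alpha is one over the normalization constant: alpha = 1 / sum_y P(+x, y) = 1 / P(+x).
alpha = 1 / sum(joint_plus_x.values())

# P(y | +x) = alpha * P(+x, y); the result is a proper distribution.
posterior = {y: alpha * p for y, p in joint_plus_x.items()}
print(posterior)                 # ~ {'+y': 0.75, '-y': 0.25}
print(sum(posterior.values()))   # ~ 1.0
```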

#### Chain Rule

\begin{align*} P(X_1,X_2, X_3) &= P(X_1 \mid X_2,X_3)P(X_2, X_3) \\[0.5em] &= P(X_1 \mid X_2,X_3)P(X_2 \mid X_3)P(X_3) \\[0.5em] P(X_1, ..., X_N) &= \prod_{n=1}^{N} P(X_n \mid X_1, ..., X_{n-1}) \end{align*}
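The three-variable factorization above can be checked numerically; every conditional value below is an illustrative assumption:

```python
# Chain rule on three binary variables: one joint entry is the product
# of the conditionals. All conditional values below are illustrative.
p_x3 = 0.5                  # P(+x3)
p_x2_given_x3 = 0.4         # P(+x2 | +x3)
p_x1_given_x2_x3 = 0.9      # P(+x1 | +x2, +x3)

# P(+x1, +x2, +x3) = P(+x1 | +x2, +x3) P(+x2 | +x3) P(+x3)
joint_entry = p_x1_given_x2_x3 * p_x2_given_x3 * p_x3
print(joint_entry)  # 0.9 * 0.4 * 0.5, i.e. ~ 0.18
```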

#### Law of Total Probability

\begin{align*} P(A) = P(A \mid b_1) P(b_1) + P(A \mid b_2) P(b_2) \end{align*} where events $$b_1, b_2$$ partition the sample space of events in the world (i.e., they are disjoint and their union makes up the entire sample space). More generically, \begin{align*} P(A) = \sum_b P(A \mid b) P(b) = \sum_b P(A, b) \end{align*}
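A numeric sketch of the two-event case in Python, with illustrative probability values:

```python
# Law of total probability: b1, b2 partition the sample space, so
# P(+a) = P(+a | b1) P(b1) + P(+a | b2) P(b2). Values are illustrative.
p_b = {"b1": 0.3, "b2": 0.7}
p_a_given_b = {"b1": 0.5, "b2": 0.1}   # P(+a | b)

p_a = sum(p_a_given_b[b] * p_b[b] for b in p_b)
print(p_a)  # 0.5*0.3 + 0.1*0.7, i.e. ~ 0.22
```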

All of these basic probability rules still hold when conditioning on a set of random variables or outcomes; the conditioning variables simply need to appear in every term of the rule.

#### Example

Take Bayes' Theorem from above, but now conditioned upon variables $$A$$ and $$B$$: \begin{align*} P(Y \mid X, A, B) = \frac{P(X \mid Y, A, B)P(Y \mid A, B)}{P(X \mid A, B)} \end{align*}

### Marginalization

Marginalization uses the law of total probability to "sum out" variables from a joint distribution. This is useful when we are given the joint probability distribution and want to find the probability distribution over just a subset of the variables. Marginalization has the following forms:

To sum out a single variable: \begin{align*} P(X) = \sum_{y}P(X, y) \end{align*}

To sum out multiple variables: \begin{align*} P(X) = \sum_{z} \sum_{y} P(X, y, z) \end{align*}
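Summing out a single variable can be sketched in a few lines of Python; the toy joint below is an illustrative assumption:

```python
# Summing out Y from a toy joint P(X, Y) to get the marginal P(X).
joint = {("+x", "+y"): 0.2, ("+x", "-y"): 0.3,
         ("-x", "+y"): 0.1, ("-x", "-y"): 0.4}

def marginal_x(x):
    """P(X = x) = sum_y P(x, y)."""
    return sum(p for (x2, _y), p in joint.items() if x2 == x)

print(marginal_x("+x"))  # 0.2 + 0.3, i.e. 0.5
print(marginal_x("-x"))  # 0.1 + 0.4, i.e. 0.5
```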

This also works for conditional distributions when summing out a variable that is not conditioned upon, i.e. a variable to the left of the $$\mid$$: \begin{align*} P(A \mid C, d) = \sum_{b} P(A, b \mid C, d) \end{align*}

This does NOT work when summing over a variable that is conditioned upon, i.e. a variable to the right of the $$\mid$$: \begin{align*} P(A, b \mid C) \neq \sum_{d} P(A, b \mid C, d) \end{align*}

### Independence

If two variables $$X$$ and $$Y$$ are independent ($$X \perp\mkern-10mu\perp Y$$), by definition the following are true:

• $$P(X,Y) = P(X)P(Y)$$

• $$P(X) = P(X \mid Y)$$

• $$P(Y) = P(Y \mid X)$$

If two variables $$X$$ and $$Y$$ are conditionally independent given $$Z$$ ($$X \perp\mkern-10mu\perp Y \mid Z$$), by definition the following are true:

• $$P(X,Y \mid Z) = P(X \mid Z)P(Y \mid Z)$$

• $$P(X \mid Y,Z) = P(X \mid Z)$$

• $$P(Y \mid X,Z) = P(Y \mid Z)$$
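The unconditional case can be checked mechanically from a joint table: X and Y are independent exactly when $$P(x, y) = P(x)P(y)$$ for every pair of outcomes. A small Python sketch, with both example joints being illustrative assumptions:

```python
def is_independent(joint, xs, ys, tol=1e-12):
    """True iff P(x, y) factors as P(x) P(y) for all outcome pairs."""
    p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    return all(abs(joint[(x, y)] - p_x[x] * p_y[y]) <= tol
               for x in xs for y in ys)

xs, ys = ("+x", "-x"), ("+y", "-y")
# This joint factors: P(X) = (0.6, 0.4), P(Y) = (0.25, 0.75).
factored = {("+x", "+y"): 0.15, ("+x", "-y"): 0.45,
            ("-x", "+y"): 0.10, ("-x", "-y"): 0.30}
# This joint does not factor: its marginals are uniform, but 0.5*0.5 != 0.4.
coupled = {("+x", "+y"): 0.40, ("+x", "-y"): 0.10,
           ("-x", "+y"): 0.10, ("-x", "-y"): 0.40}

print(is_independent(factored, xs, ys))  # True
print(is_independent(coupled, xs, ys))   # False
```

The same check works for conditional independence: apply it to each conditional table $$P(X, Y \mid z)$$ separately, one per outcome $$z$$.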

### Answering a Query from a CPT

Given

$$P(B \mid A)$$, $$P(A)$$

To query

$$P(A \mid b)$$

1. Construct joint distribution (use product rule or chain rule)

Product Rule: $$P(B,A) = P(B \mid A)P(A)$$

2. Answer query from joint distribution (use conditional probability or law of total probability)

By definition of Conditional Probability, $$P(A \mid b) = \frac{P(b,A)}{P(b)}$$

By the Law of Total Probability, $$P(A \mid b) = \frac{P(b,A)}{\sum_{a}P(b,a)}$$
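The two steps above can be run end-to-end in a short Python sketch; the CPT numbers here are illustrative assumptions:

```python
# Given tables: the prior P(A) and the CPT P(B | A) (illustrative numbers).
p_a = {"+a": 0.1, "-a": 0.9}
p_b_given_a = {("+b", "+a"): 0.8, ("-b", "+a"): 0.2,
               ("+b", "-a"): 0.3, ("-b", "-a"): 0.7}

# Step 1: build the joint with the product rule, P(b, a) = P(b | a) P(a).
joint = {(b, a): p * p_a[a] for (b, a), p in p_b_given_a.items()}

# Step 2: answer P(A | +b). The denominator P(+b) comes from the law of
# total probability: P(+b) = sum_a P(+b, a).
p_plus_b = sum(p for (b, _a), p in joint.items() if b == "+b")
posterior = {a: joint[("+b", a)] / p_plus_b for a in p_a}
print(posterior)  # ~ {'+a': 0.229, '-a': 0.771}
```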

Note: the product rule is a smaller instance of the chain rule.

### Probability Tables

When representing probabilities with capital letters, e.g. $$P(A, B)$$, we are referring to all the combinations of outcomes that the discrete random variables can take on. Thus, we have a table of probabilities rather than a single value. This is also true for conditional probabilities, e.g. $$P(A, B \mid C)$$. When there is a mixture of capital and lower case letters, e.g. $$P(A, b \mid C, d)$$, the table contains all the combinations of outcomes for the random variables $$A$$ and $$C$$, while the outcomes $$b$$ and $$d$$ are held fixed.

#### Important Note about CPTs

It is important to understand when a probability table contains the complete distribution, or in other words, when a probability table sums to one.

A probability table will sum to one when:

1. there is exactly one specific combination of outcomes that is conditioned upon and

2. we are considering all possible combinations of the other random variables.

Another way to phrase this: a probability table will sum to one when:

1. there are no capital letters on the right-hand side of the $$\mid$$, and

2. there are only capital letters on the left-hand side.
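A short Python sketch of this check, using two illustrative tables (all entries are made-up values):

```python
# A table P(A, B | +c) over all outcomes of A and B, conditioned on the
# single outcome +c: capital letters only on the left, one fixed outcome
# on the right, so the entries must sum to one. Values are illustrative.
table_a_b_given_c = {("+a", "+b"): 0.1, ("+a", "-b"): 0.2,
                     ("-a", "+b"): 0.3, ("-a", "-b"): 0.4}
print(sum(table_a_b_given_c.values()))  # ~ 1.0

# By contrast, a table like P(+a, B | C) fixes an outcome on the left and
# keeps a random variable on the right, so it need not sum to one.
table_plus_a_b_given_c = {("+b", "+c"): 0.1, ("+b", "-c"): 0.05,
                          ("-b", "+c"): 0.2, ("-b", "-c"): 0.15}
print(sum(table_plus_a_b_given_c.values()))  # 0.5 here, not 1
```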