### Probability Theory

• Probability theory can be classified into two categories:

• Non-measure theoretic (or descriptive) probability theory

• primarily based on combinatorics, calculus and frequentist interpretation of probability (approximated by relative frequency due to law of large numbers).

• Measure-theoretical probability theory

• Probability is one kind of measure, satisfying sigma-additivity.

• Founded by Kolmogorov, the theory is developed from a few axioms.  So measure-theoretical probability theory has a rigorous foundation while descriptive probability theory does not.

Some frequentists, like Neyman, want to add some qualifications to such statements. Mathematical probability is a property of some formal, abstract mathematical objects. It so happens, as an empirical fact, that these objects can be used, with a fair degree of accuracy, to represent various bits and pieces of the real world. The long-run limits invoked are to be understood as a way of speaking about what would happen if we repeated experiments indefinitely, about (as Neyman puts it) imaginary random experiments.'' Rather than explore this topic in the depth it deserves, which would lead to discussing, on the one hand, the axioms of probability, measure theory, sigma-algebras and Kolmogorov complexity, and on the other hand the connection between mathematics and physical reality, so probability has two interpretations: rigorous mathematical interpretation (measure), and intuitive physical interpretation (relative frequency), which is simple but somewhat inaccurate.

Mean, median, mode characterize the central tendency of a r.v.  Variance and standard deviation characterize dispersion of a r.v.

• Sample space is a set of all possible outcomes of an experiment.   An event is a subset of a sample space.  Note that a sample space is not just a set; moreover, it contains random events.   The randomness is inherent in the experiment.
• An elementary event consists of a single element in a sample space.
• Geometric distribution:
• Geometric mean vs. arithmetic mean:
• Geometric mean is so called since it is the n-th root of the product of n numbers (say, a_1, ..., a_n), which has a direct geometric interpretation; that is, the volume of a hypercube with equal side length of the geometric mean, is equal to the volume of an orthotope with side lengths of a_1, ..., a_n.
• Geometric series
• b_n = c^n, which is the volume of a hypercube with equal side length of c.  The volume exponentially increases with the sequence number n.   So this series has a geometric interpretation.
• Geometric distribution
• P_n = pq^{n-1}, n=1, 2, ...
• It is a product of a constant and a geometric series.
• E[X]<infinity if and only if E[|X|]<infinity.   On the contrary, absolute integrability/summability implies conditional integrability/summability.
• Probability mass function (PMF), probability density function (PDF)
• come from physics.  Mass is an integral of density over an area/volume.   For discrete r.v., PDF is a Dirac delta function; but the integral of PDF is finite, i.e., PMF.
• Statistical mean = expectation = ensemble average; empirical mean = sample average/mean = time average.
• If Y = f(X), where f is a one-to-one correspondence, then p{X=x, Y=f(x)}=p{X=x}=p{Y=f(x)}.

### Types of Convergence

• Convergence everywhere
• We say that a random sequence x_n converges everywhere if the sequence of numbers x_n(\zeta) converges for every $\zeta$ in the sample space.   That is, every realization (a sequence of outcomes) of the process x_n converges.   Or every sample path converges.  The limit of each realization of the process x_n is a number that depends, in general, on $\zeta$.   That is, the limit of x_n is a random variable x.
• One may be confused by $\zeta$.   Here, $\zeta$ is just a symbol representing an outcome of an experiment; $\zeta$ is unknown before an experiment.   Once the experiment is done, $\zeta$ is known.  Then, x_n(\zeta) simply maps the outcome $\zeta$ to a real number.  So each realization of the random sequence will result in a sequence of deterministic number.
• An example:  Denote x an arbitrary r.v.   Let x_n= x+1/n.   Then any realization of the process x_n converges to the realization of x.  So x_n converges everywhere to x.
• Convergence almost everywhere (a.e.) or almost sure (a.s.), also called convergence with probability 1 or strong convergence
• Convergence in probability or weak convergence
• P(lim inf A_n) <= lim inf P(A_n) <= lim sup P(A_n) <= P( lim sup A_n)
• Since P( lim sup A_n)>= P(A_n) for every n,  we have  P( lim sup A_n)>= lim sup P(A_n).   For a rigorous proof, see Billingsley's book "Probability and Measure".

Transforms of PDF/PMF

• Characteristic function: E[exp(jwX)]
• Similar to Fourier transform of PDF: E[exp(-jwX)]
• So, when you compute characteristic function of a PDF using existing Fourier transform pairs, don't forget to put a negative sign before jw in the Fourier transform of the PDF.
• Moment generating function:
• For continuous r.v.: E[exp(sX)]
• Similar to Laplace transform of PDF: E[exp(-sX)]
• So, when you compute moment generating function of a PDF using existing Laplace transform pairs, don't forget to put a negative sign before s in the Laplace transform of the PDF.
• For discrete r.v.: E[z^(X)]
• Similar to Z-transform of PMF: E[z^(-X)]
• So, when you compute moment generating function of a PMF using existing Z-transform pairs, don't forget to put a negative sign before X in the Z-transform of the PMF.

Important distributions:

• Continuous distributions
• Uniform: simplest distribution for support = bounded region (a,b); for a scalar r.v.,  need two parameters, i.e., the two end points of the boundary, (a,b).
• Exponential: simplest distribution for support = [0,\infty); need one parameter, i.e., mean.
• Normal: simplest distribution for support = [-\infty,\infty); need two parameters, i.e., mean and variance.
• Log-normal:
• Two r.v.'s X and Y have the relation Y=exp(X) or X=log(Y); if X is a normal r.v., then Y is a log-normal r.v.
• Note that log-normal comes from the fact that the log function of a r.v. (e.g., Y) is normal-distributed.   The log function of a normal r.v. is not log-normal (the pdf in this case can be easily obtained).
• Log-normal distribution is used to model shadowing effect (large-scale fading) in wireless channel.
• Chi-square distribution
• When df=2, the chi-square distribution is exponential distribution.
• Chi distribution (Nakagami-m distribution):
• It is normal distribution if df=1
• It is Rayleigh distribution if df=2
• Used to characterized fading channel.
• Student t:
• Z=X/Y.   If X is normal and Y is chi distributed, then Z is Student t distributed.   If Y is normal (chi distribution with one degree of freedom), Z is Cauchy distributed.
• Used as a test statistic in CFAR problems (matched filter + unknown noise variance, for linear statistical model).
• Weibull distribution:
• Its PDF for any k > 0: f(t) = (k tk-1 / bk) exp[(t / b)k], t > 0.

• Note that when k = 1, the Weibull distribution reduces to the exponential distribution with scale parameter b. The special case k = 2, is called the Rayleigh distribution with scale parameter b, named after William Strutt, Lord Rayleigh.
• Discrete distributions
• Binomial distribution: pmf for configurations of two sets.
• Multinomial distribution: pmf for configurations of m sets.
• Mixture distribution: normalized weighted sum/integral of distributions.  For example, I have 10 Gaussian pdf f_i(x), \sum_{i=1}^10 a_i=1; a mixture Gaussian is \sum_{i=1}^10 a_i*f_i(x).   Another example of a mixture distribution is \int g(\theta)* f(x|\theta) d\theta, where g(\theta) is a pdf of \theta and f(x|\theta) is a pdf of x, parameterized by \theta.

Sterling's formula: n! \approx  (n/e)^n = e^{n log n - n},  the approximation is accurate as n goes to infinity.

What is inverse chi-square distribution?   It is just the inverse function of chi-square distribution, actually, the set of percentiles.

### The Goodness of Fit Test

#### Preliminaries

Suppose that we have a random experiment with a random variable X of interest. Assume additionally that X is discrete with density function f on a finite set S. We repeat the experiment n times go generate a random sample of size n from the distribution of X:

X1, X2, ..., Xn.

Recall that these are independent variables, each with the distribution of X.

In this section, we assume that the distribution of X is unknown. For a given density function f0, we will test the hypotheses

H0: f = f0 versus H1: ff0,

The test that we will construct is known as the goodness of fit test for the conjectured density f0. As usual, our challenge in developing the test is to find a good test statistic--one that gives us information about the hypotheses and whose distribution, under the null hypothesis, is known, at least approximately.

#### Derivation of the Test

Suppose that S = {x1, x2, ..., xk}. To simplify the notation, let

pj = f0(xj) for j = 1, 2, ..., k.

Now let Nj = #{i in {1, 2, ..., n}: Xi = xj} for j = 1, 2, ..., k.

1. Show that under the null hypothesis,

1. N = (N1, N2, ..., Nk) has the multinomial distribution with parameters n and p1, p2, ..., pk.
2. E(Nj) = npj.
3. var(Nj) = npj(1 − pj).

Exercise 1 indicates how we might begin to construct our test: for each j we can compare the observed frequency of xj (namely Nj) with the expected frequency of value xj (namely npj), under the null hypothesis. Specifically, our test statistic will be

V = (N1np1)2 / np1 + (N2np2)2 / np2 + ，，， + (Nknpk)2 / npk.

Note that the test statistic is based on the squared errors (the squares of the differences between the expected frequencies and the observed frequencies). The reason that the squared errors are scaled as they are is the following crucial fact, which we will accept without proof: Under the null hypothesis, as n increases to infinity, the distribution of V converges to the chi-square distribution with k − 1 degrees of freedom.

As usual, for m > 0 and r in (0, 1), we will let vm, r denote the quantile of order r for the chi-square distribution with k degrees of freedom. For selected values of m and r, vm, r can be obtained from the table of the chi-square distribution.

2. Show that the following test has approximate significance level α:

Reject H0: f = f0 versus H1: ff0, if and only if V > vk − 1, 1 − α.

Again, the test is an approximate one that works best when n is large. Just how large n needs to be depends on the pj; the rule of thumb is that the test will work well if the expected frequencies npj are at least 1 and at least 80% are at least 5.

• Chain rule:
• P(X,Y)=P(Y)*P(X|Y)
• P(X,Y|Z)=P(Y|Z)*P(X|Y,Z)
• P(X_1,X_2,..., X_n)=P(X_1)*P(X_2|X_1)*P(X_3|X_1,X_2)*...*P(X_n|X_1,...,X_{n-1})
• For Markov process, P(X_1,X_2,..., X_n)=P(X_1)*P(X_2|X_1)*P(X_3|X_2)*...*P(X_n|X_{n-1}) = P(X_1)* \prod_{i=2}^n P(X_i|X_{i-1})
• Law of total probability:
• Total probability can be computed by weighted sum of conditional probabilities, conditioned on partitioned causes: P(X)= \sum_{i=1}^N P(X|H_i)*P(H_i).
• The law of total probability consists of two computation steps:
1. Chain rule: P(X,H_i)=P(H_i)*P(X|H_i)
2. Marginalization (marginalizing the joint probability): P(X)= \sum_{i=1}^N P(X,H_i)= \sum_{i=1}^N P(X|H_i)*P(H_i).
• Bayes formula (Bayes rule) is to compute posterior probability P(H_i|X) from likelihood function P(X|H_i) and prior probability P(H_i):
• P(H_i|X)=P(X|H_i)*P(H_i)/(\sum_{i=1}^N P(X|H_i)*P(H_i)), where \sum_{i=1}^N P(X|H_i)*P(H_i)=P(X), i.e., total probability is used.
• Bayes rule consists of two steps:
1. Chain rule: P(X,H_i)=P(H_i)*P(X|H_i)
2. Use total probability: P(H_i|X)=P(X,H_i)/P(X) =P(X|H_i)*P(H_i)/(\sum_{i=1}^N P(X|H_i)*P(H_i))

• It is possible to have two r.v.'s that are each individually Gaussian, but are not jointly Gaussian.