Probability Theory

Probability theory admits two interpretations: a
rigorous mathematical interpretation (probability as a measure) and an
intuitive physical interpretation (probability as a relative frequency),
which is simple but somewhat inaccurate.
Some frequentists, like Neyman, add some qualifications to such
statements. Mathematical probability is a property of
some formal, abstract mathematical objects. It so happens, as an empirical fact,
that these objects can be used, with a fair degree of accuracy, to represent
various bits and pieces of the real world. The long-run limits invoked are to be
understood as a way of speaking about what would happen if we repeated
experiments indefinitely, about (as Neyman puts it) "imaginary random
experiments." Exploring this topic in the depth it deserves would lead to
discussing, on the one hand, the axioms of probability, measure theory, and
sigma-algebras (Kolmogorov's formalization), and, on the other hand, the
connection between mathematics and physical reality.
Mean, median, mode characterize the central
tendency of a r.v. Variance and standard deviation characterize dispersion
of a r.v.
 Sample space is the set of all possible
outcomes of an experiment. An event is a subset of the
sample space. Note that a sample space is not just an arbitrary set:
its elements are outcomes of a random experiment, and the randomness is
inherent in the experiment itself, not in the set.
 An elementary event consists of a
single element in a sample space.
 Geometric distribution:
 Geometric mean vs. arithmetic mean:
 The geometric mean is so called because it
is the nth root of the product of n numbers (say, a_1, ..., a_n),
which has a direct geometric interpretation: the volume of
a hypercube whose side length equals the geometric mean is equal
to the volume of an orthotope with side lengths a_1, ..., a_n.
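The hypercube/orthotope equivalence above can be sketched numerically (the side lengths below are arbitrary illustrations):

```python
import math

# Hypothetical side lengths a_1, ..., a_n for illustration.
sides = [2.0, 4.0, 8.0]
n = len(sides)

arith_mean = sum(sides) / n
geo_mean = math.prod(sides) ** (1.0 / n)   # nth root of the product

# Volume of the orthotope with sides a_1, ..., a_n ...
orthotope_volume = math.prod(sides)
# ... equals the volume of the hypercube with side = geometric mean.
hypercube_volume = geo_mean ** n

print(geo_mean)           # ~4.0 for [2, 4, 8]
print(arith_mean)         # ~4.67; arithmetic mean >= geometric mean
```

Note that the arithmetic mean is never smaller than the geometric mean (AM-GM inequality), with equality only when all the numbers coincide.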
 Geometric series
 b_n = c^n is the volume of an
n-dimensional hypercube with side length c. The volume
grows (or shrinks) exponentially with the index n. So
this series has a geometric interpretation.
 Geometric distribution
 P_n = pq^{n-1}, n=1, 2, ...
 It is a product of a constant and a
geometric series.
 E[X] is well defined and finite if and only if E[|X|]<infinity. In other words, absolute integrability/summability
implies conditional integrability/summability, but not conversely.
 Probability mass function (PMF), probability
density function (PDF)
 The terms come from physics: mass is the integral
of density over an area/volume. For a discrete r.v., the PDF is a train of Dirac
delta functions; the integral of the PDF across each delta is finite, giving the PMF.
 Statistical mean = expectation = ensemble
average; empirical mean = sample average/mean = time average.
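The equality of the sample average and the expectation (in the limit) can be illustrated with a seeded simulation; a fair six-sided die, with E[X] = 3.5, is an assumed example:

```python
import random

# Sketch: the empirical (sample/time) average of many independent
# realizations approaches the statistical mean (ensemble average,
# i.e., the expectation).  Here the r.v. is a fair die, E[X] = 3.5.
random.seed(0)
samples = [random.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # ~3.5
```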
 If Y = f(X), where f is a one-to-one
correspondence, then p{X=x, Y=f(x)} = p{X=x} = p{Y=f(x)}.
Types of Convergence
 Convergence everywhere
 We say that a random sequence x_n
converges everywhere if the sequence of numbers x_n(\zeta) converges
for every $\zeta$ in the sample space. That is, every
realization (a sequence of outcomes) of the process x_n
converges. Or every sample path converges. The limit
of each realization of the process x_n is a number that depends,
in general, on $\zeta$. That is, the limit of x_n is
a random variable x.
 One may be confused by
$\zeta$. Here, $\zeta$ is just a symbol representing an
outcome of an experiment; $\zeta$ is unknown before an
experiment. Once the experiment is done, $\zeta$ is
known. Then, x_n(\zeta) simply maps the outcome $\zeta$ to a
real number. So each realization of the random sequence
results in a sequence of deterministic numbers.
 An example: Denote x an
arbitrary r.v. Let x_n= x+1/n. Then any
realization of the process x_n converges to the realization of x.
So x_n converges everywhere to x.
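A minimal sketch of this example, with one realization of x drawn as a standard normal (an arbitrary choice):

```python
import random

# Sketch of the example above: x_n = x + 1/n converges, for every
# outcome zeta, to x.  Fix one realization of x and watch the
# deviation |x_n - x| = 1/n shrink deterministically.
random.seed(1)
x = random.gauss(0.0, 1.0)   # one realization x(zeta)

def x_n(n):
    return x + 1.0 / n

print(abs(x_n(10) - x))    # ~0.1
print(abs(x_n(1000) - x))  # ~0.001
```

Since the deviation is 1/n for every outcome, every sample path converges, which is exactly convergence everywhere.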
 Convergence almost everywhere (a.e.) or
almost sure (a.s.), also called convergence with probability 1 or strong
convergence
 Convergence in probability or weak
convergence
 P(lim inf A_n) <= lim inf P(A_n) <= lim
sup P(A_n) <= P( lim sup A_n)
 Since \cup_{k>=n} A_k \supseteq A_m for every m >= n, we have
P(\cup_{k>=n} A_k) >= sup_{m>=n} P(A_m); letting n go to infinity
gives P( lim sup A_n) >= lim sup P(A_n). The lim inf inequality is proved dually.
For a rigorous proof, see Billingsley's book "Probability and
Measure".
Transforms of PDF/PMF
 Characteristic function: E[exp(jwX)]
 Similar to Fourier transform of PDF:
E[exp(jwX)]
 So, when you compute characteristic
function of a PDF using existing Fourier transform pairs, don't forget to
put a negative sign before jw in the Fourier transform of the PDF.
 Moment generating function:
 For continuous r.v.: E[exp(sX)]
 Similar to Laplace transform of PDF:
E[exp(sX)]
 So, when you compute moment generating
function of a PDF using existing Laplace transform pairs, don't forget to
put a negative sign before s in the Laplace transform of the PDF.
 For discrete r.v.: E[z^X]
 Similar to Z-transform of PMF: E[z^X]
 So, when you compute the moment generating
function of a PMF using existing Z-transform pairs, don't forget to put a
negative sign before X (i.e., use z^{-X}) in the Z-transform of the PMF.
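The sign-flip convention above can be checked on a concrete case. Assuming an exponential pdf f(x) = lam*exp(-lam*x) with a hypothetical rate lam, the Laplace transform is L(s) = lam/(lam+s) while the MGF is M(s) = lam/(lam-s), so M(s) = L(-s):

```python
# Sketch: for an exponential pdf, the MGF equals the Laplace transform
# of the pdf with the sign of s flipped, as noted above.
lam = 2.0  # hypothetical rate parameter

def laplace_pdf(s):
    return lam / (lam + s)   # Laplace transform of lam*exp(-lam*x)

def mgf(s):
    return lam / (lam - s)   # E[exp(sX)] for the same pdf

s = 0.7
print(mgf(s) == laplace_pdf(-s))  # True: same value with sign flipped

# Moments from the MGF: the derivative at s = 0 gives E[X] = 1/lam.
h = 1e-6
deriv = (mgf(h) - mgf(-h)) / (2 * h)
print(deriv)  # ~0.5 = 1/lam
```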
Important distributions:

Continuous
distributions
 Uniform: simplest distribution for support =
bounded region (a,b); for a scalar r.v., need two parameters, i.e., the
two end points of the boundary, (a,b).
 Exponential: simplest distribution for
support = [0,\infty); need one parameter, i.e., mean.
 Normal: simplest distribution for support =
(-\infty,\infty); need two parameters, i.e., mean and variance.
 Lognormal:
 Two r.v.'s X and Y have the relation Y=exp(X)
or X=log(Y); if X is a normal r.v., then Y is a lognormal r.v.
 Note that the name lognormal comes from the fact that
the log of the r.v. (here, log Y = X) is normally distributed. The
log of a normal r.v. is not lognormal (the pdf in that case can be
easily obtained).
 The lognormal distribution is used to model
the shadowing effect (large-scale fading) in wireless channels.
 Chisquare distribution
 When df=2, the chi-square distribution is the
exponential distribution (with mean 2).
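A seeded Monte Carlo sketch of this fact (sample size and tolerances are arbitrary): a chi-square r.v. with 2 degrees of freedom is Z1^2 + Z2^2 for independent standard normals, and should behave like an exponential with mean 2.

```python
import random

# Sketch: chi-square with df=2 is exponential with mean 2
# (pdf (1/2)exp(-x/2)).  Check the mean and one tail probability.
random.seed(42)
N = 200_000
samples = [random.gauss(0, 1) ** 2 + random.gauss(0, 1) ** 2
           for _ in range(N)]

mean = sum(samples) / N
# For an exponential with mean 2, P(X > 2) = exp(-1) ~ 0.3679.
tail = sum(s > 2 for s in samples) / N
print(mean)  # ~2.0
print(tail)  # ~0.368
```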
 Chi distribution (related to the Nakagami-m distribution):
 It is the half-normal distribution if df=1
 It is the Rayleigh distribution if df=2
 Used to characterize fading channels.
 Student t:
 Z=X/Y. If X is standard normal and Y is
a chi r.v. divided by the square root of its degrees of freedom (independent of X),
then Z is Student t distributed. If Y is the absolute value of an independent
standard normal (a chi r.v. with one degree of freedom), Z is Cauchy distributed.
 Used as a test statistic in CFAR problems
(matched filter + unknown noise variance, for linear statistical model).
 Weibull distribution:
 Discrete distributions
 Binomial distribution: pmf for the counts of
outcomes falling into two categories.
 Multinomial distribution: pmf for the counts of
outcomes falling into m categories.
 Mixture distribution: normalized weighted
sum/integral of distributions. For example, given 10 Gaussian pdfs f_i(x) and
weights with \sum_{i=1}^{10} a_i=1, a Gaussian mixture is \sum_{i=1}^{10} a_i*f_i(x).
Another example of a mixture distribution is \int g(\theta)* f(x|\theta)
d\theta, where g(\theta) is a pdf of \theta and f(x|\theta) is a pdf of x,
parameterized by \theta.
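A minimal sketch of a finite Gaussian mixture (the weights and component parameters are hypothetical):

```python
import math

# Sketch: a two-component Gaussian mixture, the normalized weighted
# sum a_1*f_1(x) + a_2*f_2(x) with a_1 + a_2 = 1.
def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi))

weights = [0.4, 0.6]                 # a_i, summing to 1
params = [(-1.0, 0.5), (2.0, 1.0)]   # (mean, std) of each component

def mixture_pdf(x):
    return sum(a * normal_pdf(x, mu, s)
               for a, (mu, s) in zip(weights, params))

# Because the weights sum to 1, the mixture integrates to 1
# (checked here by a crude Riemann sum over (-10, 10)).
dx = 0.01
total = sum(mixture_pdf(-10 + i * dx) for i in range(2000)) * dx
print(total)  # ~1.0
```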
Stirling's formula: n! \approx (n/e)^n =
e^{n log n - n}; the approximation becomes accurate (on a logarithmic scale) as n goes to infinity.
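A quick numerical comparison of log(n!) with n log n - n (the sample values of n are arbitrary):

```python
import math

# Sketch: the ratio of log((n/e)^n) to log(n!) approaches 1 as n grows.
# (The full Stirling formula adds a sqrt(2*pi*n) factor.)
ratios = {}
for n in (5, 50, 500):
    exact = math.lgamma(n + 1)       # log(n!)
    approx = n * math.log(n) - n     # log((n/e)^n)
    ratios[n] = approx / exact
    print(n, ratios[n])              # increases toward 1
```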
What is the inverse chi-square distribution function?
It is just the inverse of the chi-square CDF, i.e., the quantile
(percentile) function tabulated in chi-square tables.
The Goodness of Fit Test
Suppose that we have a random experiment with a
random variable X of interest. Assume additionally
that X is discrete with density function f
on a finite set S. We repeat the experiment n times to
generate a random sample of size n from the distribution of
X:
X_{1},
X_{2}, ..., X_{n}.
Recall that these are independent variables,
each with the distribution of X.
In this section, we assume that the distribution
of X is unknown. For a given density function f_{0},
we will test the hypotheses
H_{0}:
f = f_{0} versus H_{1}: f
≠ f_{0}.
The test that we will construct is known as the
goodness of fit test for the conjectured density f_{0}.
As usual, our challenge in developing the test is to find a good test
statistic, one that gives us information about the hypotheses and whose
distribution, under the null hypothesis, is known, at least approximately.
Suppose that S = {x_{1},
x_{2}, ..., x_{k}}.
To simplify the notation, let
p_{j} =
f_{0}(x_{j}) for
j = 1, 2, ..., k.
Now let N_{j} = #{i
in {1, 2, ..., n}: X_{i}
= x_{j}} for j = 1, 2,
..., k.
1. Show
that under the null hypothesis,
 N = (N_{1},
N_{2}, ..., N_{k}) has the multinomial
distribution with parameters n and p_{1}, p_{2},
..., p_{k}.
 E(N_{j}) =
np_{j}.
 var(N_{j}) = np_{j}(1
− p_{j}).
Exercise 1 indicates how we might begin to
construct our test: for each j we can compare the observed
frequency of x_{j} (namely N_{j})
with the expected frequency of value x_{j} (namely
np_{j}), under the null hypothesis. Specifically, our test
statistic will be
V = (N_{1}
− np_{1})^{2} / np_{1} + (N_{2}
− np_{2})^{2} / np_{2} + ... + (N_{k}
− np_{k})^{2} / np_{k}.
Note that the test statistic is based on the
squared errors (the squares of the differences between the expected
frequencies and the observed frequencies). The reason that the squared errors
are scaled as they are is the following crucial fact, which we will accept
without proof: Under the null hypothesis, as n increases to infinity,
the distribution of V converges to the chisquare distribution with
k − 1 degrees of freedom.
As usual, for m > 0 and r
in (0, 1), we will let v_{m, r} denote the quantile of order
r for the chi-square distribution with m degrees of
freedom. For selected values of m and r, v_{m, r}
can be obtained from the table of the chi-square distribution.
2. Show
that the following test has approximate significance level α:
Reject H_{0}:
f = f_{0} versus H_{1}: f
≠ f_{0}, if and only if V > v_{k − 1, 1 − α}.
Again, the test is an approximate one that works
best when n is large. Just how large n needs to be depends
on the p_{j}; the rule of thumb is that the test will work
well if the expected frequencies np_{j} are at least 1 and at
least 80% are at least 5.
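The statistic V above can be sketched directly; the die-roll data and the fair-die conjecture below are hypothetical:

```python
from collections import Counter

# Sketch of the goodness-of-fit statistic:
#   V = sum_j (N_j - n*p_j)^2 / (n*p_j)
def gof_statistic(sample, p0):
    """sample: observed values; p0: dict value -> conjectured prob."""
    n = len(sample)
    counts = Counter(sample)
    return sum((counts.get(x, 0) - n * p) ** 2 / (n * p)
               for x, p in p0.items())

# 60 rolls of a die; conjecture H0: fair, p_j = 1/6 for each face.
rolls = [1]*8 + [2]*12 + [3]*9 + [4]*11 + [5]*10 + [6]*10
p0 = {face: 1 / 6 for face in range(1, 7)}
v = gof_statistic(rolls, p0)
print(v)  # ~1.0; compare with the chi-square quantile with k-1 = 5 df
```

Here each expected frequency np_j = 10 satisfies the rule of thumb above, and V is far below typical chi-square(5) critical values, so H0 would not be rejected.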
 Chain rule:
 P(X,Y)=P(Y)*P(X|Y)
 P(X,Y|Z)=P(Y|Z)*P(X|Y,Z)
 P(X_1,X_2,..., X_n)=P(X_1)*P(X_2|X_1)*P(X_3|X_1,X_2)*...*P(X_n|X_1,...,X_{n-1})
 For a Markov process, P(X_1,X_2,..., X_n)=P(X_1)*P(X_2|X_1)*P(X_3|X_2)*...*P(X_n|X_{n-1})
= P(X_1)* \prod_{i=2}^n P(X_i|X_{i-1})
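The Markov factorization above can be sketched with a hypothetical two-state chain (all probabilities invented for illustration):

```python
# Sketch: P(x_1,...,x_n) = P(x_1) * prod_i P(x_i | x_{i-1})
# for a Markov chain, using a made-up two-state weather model.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {                       # P(next | current)
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def path_probability(path):
    prob = initial[path[0]]
    for prev, cur in zip(path, path[1:]):
        prob *= transition[prev][cur]
    return prob

p = path_probability(["sunny", "sunny", "rainy", "rainy"])
print(p)  # 0.6 * 0.8 * 0.2 * 0.5 = ~0.048
```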
 Law of total probability:
 Total probability can be computed by a
weighted sum of conditional probabilities, conditioned on a partition of
causes: P(X)= \sum_{i=1}^N P(X|H_i)*P(H_i).
 The law of total probability consists of
two computation steps:
 Chain rule: P(X,H_i)=P(H_i)*P(X|H_i)
 Marginalization (marginalizing the joint
probability): P(X)= \sum_{i=1}^N P(X,H_i)= \sum_{i=1}^N P(X|H_i)*P(H_i).
 Bayes formula (Bayes rule) computes the posterior
probability P(H_i|X) from the likelihood function P(X|H_i) and the prior probability P(H_i):
 P(H_i|X)=P(X|H_i)*P(H_i)/(\sum_{j=1}^N
P(X|H_j)*P(H_j)), where \sum_{j=1}^N P(X|H_j)*P(H_j)=P(X), i.e., the total
probability is used.
 Bayes rule consists of two steps:
 Chain rule: P(X,H_i)=P(H_i)*P(X|H_i)
 Use total probability: P(H_i|X)=P(X,H_i)/P(X)
=P(X|H_i)*P(H_i)/(\sum_{j=1}^N
P(X|H_j)*P(H_j))
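The two steps above can be sketched directly (the priors and likelihoods are hypothetical):

```python
# Sketch: posterior P(H_i|X) from priors P(H_i) and likelihoods P(X|H_i).
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}          # P(H_i), sum to 1
likelihoods = {"H1": 0.1, "H2": 0.4, "H3": 0.9}     # P(X | H_i)

# Step 1 (chain rule): joint probabilities P(X, H_i).
joint = {h: priors[h] * likelihoods[h] for h in priors}

# Step 2 (total probability): P(X) = sum_i P(X, H_i), then normalize.
p_x = sum(joint.values())
posterior = {h: joint[h] / p_x for h in joint}

print(p_x)        # total probability of the evidence, ~0.35
print(posterior)  # posterior probabilities, summing to 1
```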
 It is possible
to have two r.v.'s that are each individually Gaussian, but are not jointly
Gaussian.
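A standard counterexample can be sketched numerically: take X standard normal and Y = S*X with an independent random sign S. Each of X and Y is Gaussian, but X + Y has a point mass at 0, which is impossible for a jointly Gaussian pair.

```python
import random

# Sketch: X ~ N(0,1), S = +/-1 with prob 1/2 independent, Y = S*X.
# Y is again standard normal, but (X, Y) is not jointly Gaussian:
# X + Y = 0 exactly whenever S = -1, i.e., with probability 1/2.
random.seed(7)
N = 100_000
pairs = []
for _ in range(N):
    x = random.gauss(0, 1)
    s = random.choice((-1, 1))
    pairs.append((x, s * x))

mean_y = sum(y for _, y in pairs) / N
var_y = sum(y * y for _, y in pairs) / N
frac_zero_sum = sum(x + y == 0 for x, y in pairs) / N

print(mean_y)         # ~0: Y looks standard normal
print(var_y)          # ~1
print(frac_zero_sum)  # ~0.5: an atom at 0, so (X, Y) is not jointly Gaussian
```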