The goal of hypothesis test (detection) is to decide which hypothesis (among n hypotheses) is true, where n can be any integer greater than 1. A hypothesis is a statement about a random event. Note a hypothesis need not be a random variable, which is a function mapping an elementary event to a real number.

The goal of estimation is to extract quantity of interest from noisy data/signal. Estimation problems can be classified into the following categories:

- Parameter/point estimation: a parameter is a point in n-dimensional vector space (e.g, Euclidean space, Hilbert space, and Krein space), where n can be any positive integer. A parameter could be deterministically continuous, or discrete random variable, or continuous random variable, but it cannot be deterministically discrete. If the parameter is a discrete random variable, the estimation problem becomes a hypothesis test. Note a hypothesis test is not a subset of parameter estimation since a hypothesis need not be a random variable.
- (Time-varying) signal estimation: the signal could be discrete-time or continuous-time, and it could also be deterministic or stochastic. A basic method for signal estimation is Kalman filer. For continuous-time signal estimation, Ito statistical calculus will be used, which is fairly advanced.
- Density estimation: there are two approaches, i.e., parametric and non-parametric. Parametric methods are the same as those in parameter estimation. Non-parametric methods use various kernel functions (like weighted windowing used in FIR filter design, spectral estimation, non-linear image processing, etc.)
- Confidence interval estimation

- Maximum likelihood criterion: (used for both
detection and estimation)
- LRT (Likelihood Ratio Test) for simple hypothesis test: the threshold is 1, that is, decide the one with a larger likelihood. Can be used for multiple hypothesis test.
- GLRT (Generalized LRT) for composite hypothesis test (take the max/super-norm of the likelihood function over the parameter set, as the generalized likelihood function).
- ML estimator for estimation: (used for the case when prior distribution is unknown or when the parameter to be estimated is deterministic). An ML estimate is biased, but is consistent (asymptotically unbiased) and asymptotically efficient (asymptotically achieving the Cramer-Rao bound).

- Neyman-Pearson criterion: (used for detection
only)
- Maximize the detection probability
(called power) under the constraint on the false alarm probability
(called size). Generally, it's a
**constrained optimization**problem and can be solved by Lagrange multiplier method. - A typical solution to the Neyman-Pearson problem (the above constrained optimization problem): LRT/GLRT. That is, determine the threshold, based on the false alarm probability; use this threshold for LRT/GLRT.
- Insight: Neyman-Pearson criterion achieves a tradeoff between two contradictory requirements: detection probability and false alarm probability. The larger the false alarm probability, the larger the detection probability.
- Fundamental implication: whenever there are two contradictory design requirements, Neyman-Pearson criterion can be used to formulate a constrained optimization problem, and the optimal solution results in a best tradeoff between the two contradictory requirements.

- Maximize the detection probability
(called power) under the constraint on the false alarm probability
(called size). Generally, it's a
- Minimum variance unbiased (MVUB) estimation
criterion:
- Every efficient estimator is MVUB, but the converse is not true. That is, an estimator may be MVUB but not achieve the Cramer-Rao bound.
- Fisher information matrix: expectation of
second derivative of Shannon codeword length log(1/p(X|\theta).
- I(\theta)= E[s(\theta,X)*s^T(\theta,X)] = E[d^2 log(1/p(X|\theta)/d\theta^2], where the score s(\theta,X)=dlog(p(X|\theta))/d\theta, s^T is a transpose of s.

- Cramer-Rao bound: the estimation error covariance matrix C=E[(\hat{\theta}-\theta)*(\hat{\theta}-\theta)^T]>=I^{-1}(\theta), where I^{-1}(\theta) is an inverse of I(\theta).

- Bayesian criterion: (used for both detection
and estimation)
- Used for estimating a random parameter (not for estimating a deterministic parameter).
- Assume that the prior distribution is known and the cost/risk function is given.
- Minimize the average cost/risk. This is because the cost is generally a random variable (due to randomness of noise) and cannot be meaningfully optimized. So we choose to optimize the average cost. If the cost is a discrete random variable, we have another option: minimax criterion.
- If the cost function is quadratic, the
Bayes estimator, called Minimum Mean Square Error (MMSE) estimator, is
the conditional mean of the posterior PDF (i.e., conditional
distribution of the parameter given the observations).
- The nice thing about MMSE is that the MMSE estimator is linear if the signal and observation are jointly Gaussian. This has similar flavor of quadratic programming (an optimization theory), which also involves solving linear equations since the derivative/gradient of a quadratic cost is linear.
- For non-Gaussian signal/noise, MMSE estimator is not linear. Due to the simplicity of linear estimation, we choose to use linear MMSE estimator, which is sub-optimal compared to MMSE.

- If the cost function is uniform, the Bayes estimator reduces to MAP (Maximum A Posteriori probability) estimator. A point at which a density achieves its maximum value is termed a mode of the PDF. So MAP estimate is the mode of the posterior PDF. Note that the conversion from Bayes to MAP requires that the threshold for the cost function be very small.
- If the cost function is absolute, the Bayes estimate, termed Minimum Mean Absolute Error (MMAE) estimator, is the median of the posterior PDF.
- In summary, the MMSE, MMAE, and MAP
estimates are the
**mean, median, and mode**of the posterior PDF, respectively. In general, Bayes estimates are features of the posterior PDF. This is one reason why Bayesian estimators are widely used in pattern recognition.- For Gaussian posterior PDF, MMSE=MMAE=MAP.

- Minimax criterion: (used for both detection
and estimation)
- Game-theoretical approach: minimax detector/estimator uses zero-sum game theory. Suppose there are two players: an experimentalist and Mother Nature. Mother Nature chooses the least favorable prior PMF (probability mass function). Then the minimax detector/estimator reduces to Bayes detector/estimator with respect to the least favorable prior. So minimax detector/estimator can be regarded as a special case of Bayes detector/estimator.
- Minimax detector/estimator is applicable to both discrete and continuous random variable detection/estimation. For continuous r.v., play the strategies with arbitrary pdf. (In other words, discretize the range of the r.v. and get a pmf for the resulting discrete r.v.. The limit of this discrete r.v. is the continuous r.v. So we can play with this pmf of the discrete r.v. in place of the pdf of the continuous r.v.)

Types of estimators

- Maximum likelihood estimator
- Bayesian estimator
- Moment estimator:
- let sample moments equal to ensemble moments; solve the equations, resulting in estimates of parameters of interest.

- Least square estimator:
- does not require prior distribution
- minimize the quadratic cost, for given
observations Y, over the parameter of interest \theta.
- minimize (Y-h(\theta))^2

**
Minimum Description Length (MDL) model
selection**

If someone asks you whether a Markov chain is a reasonable model to do the time series analysis, you should ask him/her comparing to what? Without a baseline solution, anything could be reasonable to start with. So the first step in modeling is selecting a model. Then, the second step is to estimate the parameter of the model.

MDL can be used as a criterion for model selection. It measures the universal code length of the data, given a model. One can compare two different models based on their MDLs using the data available and favor one with shorter code length. However, it is not clear how to optimize from a model set (functional space) with infinite models based on MDL. Note that MDL is used for comparing given models in a parameterized family, rather than searching for the best model over the whole functional space.

Kalman filters requires 1) linear system equation, 2) Gaussian noise. The linearity of system equation preserves the Gaussianity of the process (Gauss Markov process, actually, AR process). Since the state process is a Gauss Markov process, Kalman filters only have to propagate mean and covariance. Kalman filters can be used to estimate the state of time-varying linear system. Just change the system matrices from constant ones to time varying ones.

*"Graphical models are a marriage between
probability theory and graph theory. They provide a natural tool for dealing
with two problems that occur throughout applied mathematics and engineering --
uncertainty and complexity -- and in particular they are playing an increasingly
important role in the design and analysis of machine learning algorithms.
Fundamental to the idea of a graphical model is the notion of modularity -- a
complex system is built by combining simpler parts. Probability theory provides
the glue whereby the parts are combined, ensuring that the system as a whole is
consistent, and providing ways to interface models to data. The graph theoretic
side of graphical models provides both an intuitively appealing interface by
which humans can model highly-interacting sets of variables as well as a data
structure that lends itself naturally to the design of efficient general-purpose
algorithms. *

*Many of the classical multivariate probabalistic
systems studied in fields such as statistics, systems engineering, information
theory, pattern recognition and statistical mechanics are special cases of the
general graphical model formalism -- examples include mixture models, factor
analysis, hidden Markov models, Kalman filters and Ising models. The graphical
model framework provides a way to view all of these systems as instances of a
common underlying formalism. This view has many advantages -- in particular,
specialized techniques that have been developed in one field can be transferred
between research communities and exploited more widely. Moreover, the graphical
model formalism provides a natural framework for the design of new systems."
*--- Michael Jordan, 1998.

EM algorithm vs. BCJR

EM is a sub-optimal algorithm to compute maximum likelihood or MAP. What EM obtains is a fixed-point, rather than the desired global optimum. It is guaranteed that each step in EM converges to the fixed-point.

BCJR is an algorithm to compute a posteri probabilty exactly.

EM is an iterative algorithm while BCJR is a sequential algorithm.