Andrew W. Moore
The following links point to a set of tutorials on many
statistical data mining, including the foundations of probability, the
foundations of statistical data analysis, and most of the classic
machine learning and data mining algorithms.
These include classification algorithms such as decision
nets, Bayesian classifiers, Support Vector Machines and cased-based
(aka non-parametric) learning. They include regression algorithms
such as multivariate polynomial regression, MARS, Locally Weighted
Regression, GMDH and neural nets. And they include other data mining
operations such as clustering (mixture models, k-means and
hierarchical), Bayesian networks and Reinforcement Learning.
I hope they're useful (and please let me know if they are, or
if you have suggestions or error-corrections). Click
here for a short list of topics.
The Decision Tree is one of the most popular classification algorithms
in current use in Data Mining and Machine Learning. This tutorial can
be used as a self-contained introduction to the flavor and
terminology of data mining without needing to review many statistical
or probabilistic prerequisites. If you're new to data mining you'll
enjoy it, but your eyebrows will raise at how simple it all is!
After having defined the job of classification, we explain how
information gain (next Andrew Tutorial) can be used to find predictive
input attributes. We show how applying this procedure recursively
us to build a decision tree to predict future events. We then look
carefully at a question so fundamental, it is the basis for much of all
statistics and machine learning theory: how do you choose between a
complicated model that fits the data really well and an "Occam's razor"
model that is succinct yet not so good at fitting data (this topic will
be revisited in later Andrew Lectures, including "Cross-validation" and
"VC-dimension"). We also discuss the very wide world of improvements
and tweaks on the basic decision tree idea.
Gain. This tutorial steps through the ideas from
Information Theory that eventually
lead to Information Gain...one of the most popular measures of
association currently used in data mining. We visit the ideas
of Entropy and Conditional Entropy along the way. Look at the lecture
on Gaussians for discussion of Entropy in the case of continuous
probability density functions.
Data Miners. This tutorial reviews Probability
starting right at ground level.
It is, arguably, a useful investment to be completely happy with
before venturing into advanced algorithms from data mining, machine
learning or applied statistics. In addition to setting the stage for
techniques to be used over and over again throughout the remaining
tutorials, this tutorial introduces the notion of Density Estimation
as an important operation, and then introduces Bayesian Classifiers
as the overfitting-prone Joint-Density Bayes Classifier, and the
over-fitting-resistant Naive Bayes Classifier.
Density Functions. A review of a world that you've
probably encountered before: real-valued
random variables, probability density functions, and how to deal
with multivariate (i.e. high dimensional) probability densities. Here's
where you can review things like Expectations, Covariance Matrices,
Independence, Marginal Distributions and Conditional Distributions.
you're happy with this stuff you won't be a data miner, but you'll have
the tools to very quickly become one.
Gaussians, both the friendly univariate kind, and the
kind are extremely useful in many parts of statistical data
mining, including many data mining models in which the underlying data
assumption is highly non-Gaussian. You need to be friend with
Estimation. MLE is a solid tool for learning
parameters of a data mining model.
It is a methodology which tries to do two things. First, it is a
reasonably well-principled way to work out what computation you should
doing when you want to learn some kinds of model from data. Second, it
is often fairly computationally tractable. In any case, the important
is that in order to understand things like polynomial regression,
nets, mixture models, hidden Markov models and many other things it's
going to really help if you're happy with MLE.
Classifiers. Once you are friends with Gaussians,
it it easy to use them as subcomponents
of Bayesian Classifiers. This tutorial shows you how.
Cross-validation is one of several approaches to estimating how well
the model you've just learned from some training data is going to
on future as-yet-unseen data. We'll review testset validation,
cross validation (LOOCV) and k-fold cross-validation, and we'll discuss
a wide variety of places that these techniques can be used. We'll also
discuss overfitting...the terrible phenomenon that CV is supposed to
present. And at the end, our hairs will stand on end as we realize that
even when using CV, you can still overfit arbitrarily badly.
We begin by talking about linear regression...the ancestor of neural
nets. We look at how linear regression can use simple matrix
operations to learn from data. We gurgle with delight as we see
why one initial assumption leads inevitably to the decision to
try to minimize sum squared error. We then explore an alternative way
to compute linear parameters---gradient descent. And then we exploit
gradient descent to allow classifiers in addition to regressors, and
to allow highly non-linear models---full neural nets in all their glory.
learning (aka Case-based or Memory-based or non-parametric).
Over a century old, this form of data mining is still being used
very intensively by statisticians and machine learners alike. We
nearest neighbor learning, k-nearest-neighbor, kernel methods and
locally weighted polynomial regression. Software and data for the
algorithms in this tutorial
are available from
http://www.cs.cmu.edu/~awm/vizier. The example figures in
this slide-set were created with the same software and data.
Regression Algorithms. You'll have to wait to find
out Andrew's ordering on them, but based on
all the foundations you've covered so far we will quickly be able to
Regression Trees, Cascade Correlation, Group Method Data Handling
(GMDH), Multivariate Adaptive Regression Splines (MARS),
Multilinear Interpolation, Radial Basis Functions, Robust Regression,
Cascade Correlation + Projection Pursuit
Real-valued Outputs: An introduction to regression.
This lecture is made up entirely from material from the start of the
Neural Nets lecture and a subset of the topics in the "Favorite
Regression Algorithms" lecture. We talk about linear regression, and
then these topics: Varying noise, Non-linear regression (very
briefly), Polynomial Regression, Radial Basis Functions, Robust
Regression, Regression Trees, Multilinear Interpolation and MARS.
Networks. The tutorial first reviews the
fundamentals of probability (but to do
that properly, please see the earlier Andrew lectures on Probability
Data Mining). It then discusses the use of Joint Distributions for
representing and reasoning about uncertain knowledge. Having discussed
the obvious drawback (the curse of dimensionality) for Joint
as a general tool, we visit the world of clever tricks involving
independence and conditional independence that allow us to express our
uncertain knowledge much more succinctly. And then we beam with
as we realize we've got most of the knowledge we need to understand
and appreciate Bayesian Networks already. The remainder of the tutorial
introduces the important question of how to do inference with Bayesian
Networks (see also the next Andrew Lecture for that).
Bayesian Networks (by Scott Davies and Andrew Moore).
The majority of these slides were conceived and created by Scott
Davies. Once you've got hold of a Bayesian
Network, there remains the question of how you do inference with
it. Inference is the operation in which some subset of the attributes
are given to us with known values, and we must use the Bayes net to
estimate the probability distribution of one or more of the remaining
attributes. A typical use of inference is "I've got a temperature of
101, I'm a 37-year-old Male and my tongue feels kind of funny but I
have no headache. What's the chance that I've got bubonic plague?".
Bayesian Networks. This short and simple tutorial
overviews the problem of learning
Bayesian networks from data, and the approaches that are used. This
is an area of active research by many research groups, including
Andrew and his students (see the Auton
Website for more details).
Short Intro to naive Bayesian classifiers. I recommend using Probability
Data Mining for a more in-depth introduction to density
and general use of Bayes classifiers, with naive Bayes classifiers as
a special case. But if you just want the executive summary bottom line
on learning and using naive Bayes classifiers on categorical
attributes then these are the slides for you.
Overview of Bayes Nets. This is a very short 5
minute "executive overview" of the
intuition and insight behind Bayesian Networks. Read the
Tutorial for more information.
Models. Gaussian Mixture Models (GMMs) are among
the most statistically mature
methods for clustering (though they are also used intensively for
density estimation). In this tutorial, we introduce the concept of
clustering, and see how one form of clustering...in which we assume
that individual datapoints are generated by first choosing one of a
set of multivariate Gaussians and then sampling from them...can be a
well-defined computational operation. We then see how to learn such a
thing from data, and we discover that an optimization approach not
used in any of the previous Andrew Tutorials can help considerably
here. This optimization method is called Expectation Maximization
(EM). We'll spend some time giving a few high level explanations and
demonstrations of EM, which turns out to be valuable for many other
algorithms beyond Gaussian Mixture Models (we'll meet EM again in the
later Andrew Tutorial on Hidden Markov Models). The wild'n'crazy
algebra mentioned in the text can be found (hand-written) here.
Hierarchical Clustering. K-means is the most famous
clustering algorithm. In this tutorial
we review just what it is that clustering is trying to achieve, and we
show the detailed reason that the k-means approach is cleverly
something very meaningful. Oh yes, and we'll tell you (and show you)
what the k-means algorithm actually does. You'll also learn about
another famous class of clusterers: hierarchical methods (much beloved
in the life sciences). Phrases like "Hierarchical Agglomerative
and "Single Linkage Clustering" will be bandied about.
Models. In this tutorial we'll begin by reviewing
Markov Models (aka Markov
Chains) and then...we'll hide them! This simulates a very common
phenomenon... there is some underlying dynamic system running along
according to simple and uncertain dynamics, but we can't see it. All
we can see are some noisy signals arising from the underlying system.
From those noisy observations we want to do things like predict the
most likely underlying system state, or the time history of states, or
the likelihood of the next observation. This has applications in fault
diagnosis, robot localization, computational biology, speech
understanding and many other areas. In the tutorial we will describe
how to happily play with the mostly harmless math surrounding HMMs and
how to use a heart-warming, and simple-to-implement, approach called
dynamic programming (DP) to efficiently do most of the HMM computations
you could ever want to do. These operations include state estimation,
estimating the most likely path of underlying states, and as a grand
(and EM-filled) finale, learning HMMs from data.
This tutorial concerns a well-known piece of Machine Learning Theory.
If you've got a learning algorithm in one hand and a dataset in
the other hand, to what extent can you decide whether the
learning algorithm is in danger of overfitting or underfitting? If you
want to put some formal analysis into the fascinating question of
how overfitting can happen, then this is the tutorial for you. In
to getting good understanding of the overfitting phenomenon, you also
end up with a method for estimating how well an algorithm will
perform on future data that is solely based on its training set error,
and a property (VC dimension) of the learning algorithm. VC-dimension
thus gives an alternative to cross-validation, called Structural Risk
Minimization (SRM), for choosing classifiers. We'll discuss that. We'll
also very briefly compare both CV and SRM to two other model selection
methods: AIC and BIC.
Machines. We review the idea of the margin of a
classifier, and why that may
be a good criterion for measuring a classifier's desirability. Then we
consider the computational problem of finding the largest margin linear
classifier. At this point we look at our toes with embarrassment and
note that we have only done work applicable to noise-free data. But we
cheer up and show how to create a noise resistant classifier, and then
a non-linear classifier. We then look under a microscope at the two
things SVMs are renowned for---the computational ability to survive
projecting data into a trillion dimensions and the statistical
ability to survive what at first sight looks like a classic overfitting
PAC stands for "Probably Approximately Correct" and concerns a nice
formalism for deciding how much data you need to collect in order for
a given classifier to achieve a given probability of correct
on a given fraction of future test data. The resulting estimate is
somewhat conservative but still represents an interesting avenue by
which computer science has tried to muscle in on the kind of analytical
problem that you would normally find in a statistics department.
Processes. How do you plan efficiently if the
results of your actions are
uncertain? There is some remarkably good news, and some significant computational hardship. We begin by discussing Markov
Systems (which have no actions) and the notion of Markov Systems with
Rewards. We then motivate and explain the idea of infinite horizon
discounted future rewards. And then we look at two competing approaches
to deal with the following computational problem: given a Markov
System with Rewards, compute the expected long-term discounted rewards.
The two methods, which usually sit at opposite corners of the ring and
snarl at each other, are straight linear algebra and dynamic
We then make the leap up to Markov Decision Processes, and find that
we've already done 82% of the work needed to compute not only the
long term rewards of each MDP state, but also the optimal action to
take in each state.
Learning. You need to be happy about Markov
Decision Processes (the previous
Andrew Tutorial) before venturing into Reinforcement Learning. It
the fascinating question of whether you can train a controller to
perform optimally in a world where it may be necessary to suck up
some short term punishment in order to achieve long term reward. We
will discuss certainty-equivalent RL, the Temporal Difference (TD)
learning, and finally Q-learning. The curse of dimensionality will
be constantly learning over our shoulder, salivating and cackling.
An example. We review methods described in other
biosurveillance slides as applied to hospital admissions data from the
Walkerton Cryptosporidium outbreak of 2000. This is work performed as
part of the ECADS project.
probability and Naive Bayes classifiers. This slide
repeats much of the material of the main Probability Slide from
Andrew's tutorial series, but this slide-set focuses on disease
surveillance examples, and includes a very detailed description for
non-experts about how Bayes rule is used in practice, about Bayes
Classifiers, and how to learn Naive Bayes classifiers from data.
Surveillance. This tutorial discusses Scan
Statistics, a famous epidemiological method for discovering
overdensities of disease cases.
Series Methods. This tutorial reviews some
elementary univariate time series methods, with a focus on using the
time series for alerting when a sequence of observations is starting to
Algorithms, including Alpha-Beta Search.
Introduction to algorithms for computer game playing. We describe the
assumptions about two-player zero-sum discrete finite deterministic
games of perfect information. We also practice saying that noun-phrase
in a single breath. After the recovery teams have done their job we
talk about solving such games with minimax and then alpha-beta search.
We also discuss the dynamic programming approach, used most commonly
for end-games. We also debate the theory and practice of heuristic
evaluation functions in games.
Game Theory. Want to know how and why to bluff in
poker? How games can be compiled
down to a matrix form? And general discussion of the
basics of games of hidden information? Then these are the slides for
It might help you to begin by reading
the slides on game-tree search.
Game Theory. Auctions and electronic negotiations
are a fascinating topic. These
slides take you through most of the basic story of the assumptions,
the formalism and the mathematics behind non-zero-sum game theory.
It might help you to begin by reading
the slides on game-tree search
Zero-sum Game theory with Hidden information
available from this same set of tutorials.
In this tutorial we cover the definition
of a multiplayer non-zero-sum game, domination of strategies, Nash
Equilibria. We deal with discrete games, and also games in which
include real numbers, such as your bid in a two player double auction
negotiation. We cover prisoner's dilemma, tragedy of the commons,
double auctions, and multi-player auctions such as the first price
sealed auction and the second price auction.
The math for the double auction analysis can be found at
overview of time-series-based anomaly detection algorithms.
This simple tutorial overviews some methods for detecting anomalies
in biosurveillance time series. The slides are incomplete: verbal
from the presentation has not yet been included as explanatory
Please let me (email@example.com) know if you would be interested in more
detail on these slides and/or access to the software that implements
graphs the various univariate methods. If I receive enough requests I
will try to make both of the above available.
introduction. A very quick informal discussion of
the different kinds of AI research motivations out there
Algorithms. What is a search algorithm? What job
does it do and where can it be
applied? We introduce various flavors of Breadth First Search and
Depth First search and then looks at alternatives and improvements
that include Iterative Deepening and Bidirectional Search. Then we
look with furrowed brow at an idea called Best First Search. This will
be our first view of a search algorithm that is able to exploit a
Search. The classic algorithm for finding shortest
paths given an admissible
heuristic. We'll deal with the notion of admissibility (summary:
optimistic). We show how you can prove properties of A*. We'll also
briefly discuss IDA* (iterative deepening A*).
Satisfaction Algorithms, with applications in Computer Vision and
Scheduling. The tutorial teaches concepts from the
AI literature on Constraint
Satisfaction. Accompanying animations are in
This is a special case of uninformed search in which we
want to find a solution configuration for some set of variables that
satisfies a set of constraints. Example problems including graph
coloring, 8-queens, magic squares, the Waltz algorithm for
interpreting line drawings, many kinds of scheduling and most
important of all, the deduction phase of minesweeper. The algorithms
we'll look at include backtracking search, forward checking search and
constraint propagation search. We'll also look at general-purpose
heuristics for additional search accelerations.
Planning. We review some algorithms for clever path
planning once we arrive in
real-valued continuous space instead of the safe and warm discrete
space we've been sheltering in so far. We look at configuration spaces,
visibility graphs, cell-decomposition, voronoi-based planning and
potential field methods. Unfortunately some of the figures are missing
from the PDF version.
Simulated Annealing and Genetic Algorithms. Some
very useful algorithms, to be used only in case of emergency.