All of these models can be described within the framework of probabilistic graphical models, which we will briefly introduce. In this framework it becomes easy to explore variants and hybrids (such as mixtures of factor analyzers and switching state-space models) which are potentially powerful tools. This framework also makes it clear that the same general probability propagation algorithm can be used to infer the hidden (i.e. latent) variables in all these models, and that the EM algorithm can be used to learn the maximum likelihood (ML) parameters. In the latter part of the tutorial we will focus on approximate inference techniques for models in which probability propagation is intractable, and on variational methods for Bayesian model averaging which can overcome the overfitting and model selection problems in ML learning. Matlab demos will be used to demonstrate some of the models and algorithms.
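To make the EM idea above concrete, here is a minimal sketch of EM for maximum-likelihood learning in a two-component one-dimensional Gaussian mixture, one of the simplest latent-variable models in this family. The function name, initialization heuristic, and toy data are ours, not part of the tutorial; a real implementation would use a library such as scikit-learn.

```python
import math
import random

def em_gmm_1d(xs, iters=50):
    # Illustrative EM for a 2-component 1-D Gaussian mixture (our toy sketch).
    # Initialize the two means at the data extremes (a simple heuristic).
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]  # mixing weights
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate parameters from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return pi, mu, var

random.seed(0)
data = ([random.gauss(-2.0, 0.5) for _ in range(200)]
        + [random.gauss(3.0, 0.5) for _ in range(200)])
pi, mu, var = em_gmm_1d(data)
```

Each iteration provably does not decrease the data log-likelihood; here the means converge to roughly -2 and 3, the centers of the two clusters.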
The loss of the on-line algorithm on a sequence of examples is typically larger than the loss of the best off-line model. The goal of the on-line learner is therefore to minimize this additional loss, i.e. the loss of the on-line algorithm minus the loss of the best off-line model. Bounds relating the on-line loss to the best off-line loss are called `relative loss bounds'. Such bounds hold for arbitrary sequences of examples and quantify the price of hiding the future examples from the learner.
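The quantity bounded by a relative loss bound can be computed directly in a toy setting. The sketch below (our own illustration, not from the tutorial) runs an exponentially weighted average forecaster over two fixed experts under square loss and compares its cumulative loss to that of the best expert in hindsight; the standard bound for this forecaster is ln(N)/eta + eta*T/8 for losses in [0, 1].

```python
import math

def ewa_regret(outcomes, expert_preds, eta=0.5):
    """Exponentially weighted average forecaster, square loss in [0, 1].

    Returns (on-line loss, loss of the best off-line expert)."""
    n = len(expert_preds)
    w = [1.0] * n                  # one weight per expert
    alg_loss = 0.0
    expert_loss = [0.0] * n
    for t, y in enumerate(outcomes):
        total = sum(w)
        # Predict with the weighted average of the experts' predictions.
        pred = sum(w[i] * expert_preds[i][t] for i in range(n)) / total
        alg_loss += (pred - y) ** 2
        for i in range(n):
            li = (expert_preds[i][t] - y) ** 2
            expert_loss[i] += li
            w[i] *= math.exp(-eta * li)  # multiplicative weight update
    return alg_loss, min(expert_loss)

outcomes = [1, 0, 1, 1, 0, 1, 1, 1]
experts = [[1] * 8,   # expert 0: always predicts 1
           [0] * 8]   # expert 1: always predicts 0
alg, best = ewa_regret(outcomes, experts)
regret = alg - best   # the "additional loss" a relative loss bound controls
```

Note that the bound holds for this arbitrary outcome sequence: no statistical assumption about how the examples were generated is needed.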
We will review methods for proving relative loss bounds and give an overview of applications. We will emphasize a method that starts with a Bregman divergence, which measures the `distance' between parameterized models. This divergence is used to derive a parameter update rule for the on-line learner, and it also serves as the potential function in the proof of the relative loss bound for this update rule. We will discuss related methods for proving bounds on the generalization error. We will then introduce families of update algorithms that are characterized by different divergence functions. The two main families are gradient descent and exponentiated gradient. The former family includes all the kernel-based algorithms, and the latter family is motivated by the minimum relative entropy principle. We will discuss the merits of the two families. No background is required.
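The two update families can be written in a few lines. In this sketch (our own illustration), the additive gradient descent update arises from the squared Euclidean distance as the divergence, while the multiplicative exponentiated gradient update arises from the relative entropy and keeps the weights a probability vector.

```python
import math

def gd_update(w, grad, eta=0.1):
    """Gradient descent: additive update from the squared Euclidean divergence."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def eg_update(w, grad, eta=0.1):
    """Exponentiated gradient: multiplicative update from the relative entropy.

    The normalization keeps w on the probability simplex."""
    u = [wi * math.exp(-eta * gi) for wi, gi in zip(w, grad)]
    z = sum(u)
    return [ui / z for ui in u]

w = [0.25, 0.25, 0.5]
grad = [1.0, 0.0, -1.0]   # a hypothetical loss gradient
w_gd = gd_update(w, grad)
w_eg = eg_update(w, grad)
```

Both rules shrink weights with positive gradient components and grow those with negative ones, but EG does so multiplicatively, which is what connects it to the minimum relative entropy principle.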