Staged learning is the problem of finding an appropriate bias for a learning machine in order to maximize generalization. When related tasks exist, a bias for the learning machine can be found using examples drawn from each of the related tasks.
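A minimal sketch of this idea (my own illustration, not drawn from the write-ups): treat each candidate "bias" as a restricted hypothesis space, and pick the bias whose best in-class hypothesis has the lowest error averaged over several related tasks. Here each bias is "classify using only feature i", and the tasks share one informative feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(informative_feature, n=50, d=3):
    # Related tasks: in every task, the label depends only on one shared feature.
    X = rng.normal(size=(n, d))
    y = (X[:, informative_feature] > 0).astype(int)
    return X, y

tasks = [make_task(informative_feature=1) for _ in range(5)]

def best_error_under_bias(feature, X, y):
    # The bias "use only this feature" leaves a small hypothesis space:
    # threshold classifiers on that feature. Sweep thresholds at sample values.
    errors = []
    for t in np.unique(X[:, feature]):
        pred = (X[:, feature] > t).astype(int)
        errors.append(np.mean(pred != y))
    return min(errors)

def avg_best_error(feature):
    # Quality of a bias = average, over related tasks, of its best hypothesis's error.
    return np.mean([best_error_under_bias(feature, X, y) for X, y in tasks])

chosen_bias = min(range(3), key=avg_best_error)
print(chosen_bias)  # selects the shared informative feature, 1
```

A new task from the same family can then be learned inside the chosen (much smaller) hypothesis space, which is the source of the sample-complexity savings discussed below.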

Staged learning could have a large impact across machine learning: learning a good bias for a set of related tasks would reduce the number of examples required to learn each task, shortening learning time on new tasks and making the learning process cheaper overall.

There have been many experimental results showing the value of 'learning to learn' but little theoretical work. The experiments show that learning a bias is both possible and useful, while the small body of theory suggests why such results should be expected.

My approach so far has been mostly theoretical: deriving upper bounds on the number of examples per task and the number of tasks required for 'learning to learn' with support vector machines.

I've also found a way to introduce a parameterization into Jon Baxter's proof of 'learning to learn' bounds which trades off the number of tasks against the number of examples per task required to learn a particular bias.
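The flavor of such a trade-off can be written schematically (this is an illustrative form only, not the precise statement of any bound; the capacity terms are placeholders):

```latex
% Schematic: n tasks, m examples per task, accuracy parameter \epsilon.
% c_{\mathcal{H}} measures the capacity of a single hypothesis space,
% c_{\mathbb{H}} the capacity of the family of hypothesis spaces (the biases).
m \;=\; O\!\left(\frac{1}{\epsilon^{2}}
      \left( c_{\mathcal{H}} + \frac{c_{\mathbb{H}}}{n} \right)\right)
```

The qualitative content: as the number of related tasks n grows, the cost of searching the family of biases is amortized across tasks, and the examples needed per task approach the cost of learning with the right bias already in hand.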

I've also worked on extending Baxter's work to learning a 'hyper-bias' in order to more easily learn a 'bias', which in turn makes it easier to learn a task. In this model, you have a set of sets of sets of hypotheses, and you want to choose a set of sets of hypotheses given a set of sets of related tasks.

A similar extension to arbitrary degrees of removal from the base task has been completed. The model further abstracts those above to 'learning to learn ... to learn': you must choose, from a set of sets of ... sets of hypotheses, a set of ... sets of hypotheses, given a set of ... sets of tasks.
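The nesting can be made concrete with a small recursive sketch (again my own illustration, with made-up names): a depth-0 object is a single hypothesis, a depth-k object is a set of depth-(k-1) objects, and tasks mirror the nesting. Selection at any depth scores an object by averaging, over task groups, the score of its best sub-object.

```python
# Hypothetical sketch of depth-k selection for 'learning to learn ... to learn'.
# A hypothesis is a function x -> label; a dataset is a pair (xs, ys).

def error(h, dataset):
    # Empirical error of one hypothesis on one dataset.
    xs, ys = dataset
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)

def score(obj, tasks, depth):
    if depth == 0:
        return error(obj, tasks)  # obj is a hypothesis, tasks is one dataset
    # obj is a set of depth-(depth-1) objects; tasks is a list of nested task groups.
    # Average, over task groups, of the best sub-object's score on that group.
    return sum(min(score(sub, t, depth - 1) for sub in obj)
               for t in tasks) / len(tasks)

def choose(objs, tasks, depth):
    # Pick the depth-d object (e.g. a set of sets of hypotheses at depth 2)
    # that makes its related task groups easiest on average.
    return min(objs, key=lambda o: score(o, tasks, depth))
```

At depth 1 this reduces to ordinary bias learning (choosing a hypothesis space given datasets); at depth 2 it chooses a set of hypothesis spaces (a 'hyper-bias') given sets of datasets, and so on.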

The point of all this work is to use outside knowledge about the 'relatedness' of tasks to design a learning architecture in which information from each task can be combined to improve learning on future tasks or to reduce the number of examples required per task.

I intend to follow up these theoretical results with experimental results.

For more detailed write-ups of this work, see: http://www.cs.cmu.edu/~jcl/ltol/index.html

I'm interested in extending this work with both experiments and further theory. Jon Baxter has also worked out some 'learning to learn' bounds in a Bayesian framework; I'm interested in doing something similar for staged learning (learning bias at arbitrary remove from the examples). I also expect to work with Thorsten Joachims on experimental results with support vector machines.
