"Automatically Building Domain-Specific Search Engines
		       using Machine Learning"

			   Andrew McCallum
			 Just Research & CMU
		   http://www.cs.cmu.edu/~mccallum


Abstract:

Domain-specific search engines are growing in popularity because they
offer increased accuracy and extra functionality not possible with
general, Web-wide search engines.  For example, www.campsearch.com
allows complex queries over summer camps by age-group, size, location
and cost.  Yahoo is another example, in that it is a collection of
many domain-specific search engines.  Unfortunately, these search
engines are difficult and time-consuming to maintain.

This talk proposes using machine learning techniques to automate much
of the creation and maintenance of domain-specific search engines.  I
will briefly describe recent work applying reinforcement
learning to efficient topic-directed spidering, and applying hidden
Markov models to information extraction.  Then I will concentrate on
the problem of automatically categorizing documents into a Yahoo-like
topic hierarchy, demonstrating near-human accuracy by leveraging only
a few keywords, a large amount of unlabeled data, and the hierarchy
itself.

A common thread throughout this work is the integration of supervised
and unsupervised learning in order to avoid the expense of labeling
training data.

Using these techniques, we have built a demonstration system: a search
engine for computer science research papers.  It already contains over
50,000 papers and is publicly available at www.cora.justresearch.com.
Research for the engine was performed this past summer at Just
Research.

--------
Joint work with Kamal Nigam, Tom Mitchell, Sebastian Thrun, Roni
Rosenfeld, Andrew Ng, Larry Wasserman, Kristie Seymore and Jason
Rennie.