"Automatically Building Domain-Specific Search Engines using Machine Learning" Andrew McCallum Just Research & CMU http://www.cs.cmu.edu/~mccallum Abstract: Domain-specific search engines are growing in popularity because they offer increased accuracy and extra functionality not possible with the general, Web-wide search engines. For example, www.campsearch.com allows complex queries over summer camps by age-group, size, location and cost. Yahoo is another example, in that it is a collection of many domain-specific search engines. Unfortunately, these search engines are difficult and time-consuming to maintain. This talk proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. I will briefly describe recent work applying reinforcement learning to efficient topic-directed spidering, and applying hidden Markov models to information extraction. Then I will concentrate on the problem of automatically categorizing documents into a Yahoo-like topic hierarchy---demonstrating near-human accuracy by leveraging only a few keywords, a lot of unlabeled data and the hierarchy. A common thread throughout this work is the integration of supervised and unsupervised learning in order to avoid the expense of labeling training data. Using these techniques, we have built a demonstration system: a search engine for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. Research for the engine was performed this past summer at Just Research. -------- Joint work with Kamal Nigam, Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng, Larry Wasserman, Kristie Seymore and Jason Rennie.