Dayne Freitag

http://www.cs.cmu.edu/~dayne

Career Experience

February 2000 - Present
Principal Scientist & VP, Technology at Burning Glass Technologies
Led research and development of the extraction engine at the core of the company's most successful product, a resume parsing system. The engine, which converts plaintext resumes to an XML schema identifying approximately 70 different contexts, was trained on 4000 hand-labeled resumes. Primarily a statistical system using a hidden Markov model, it also incorporates symbolic machine learning methods and discrete finite-state techniques common in traditional information extraction. Work on this engine led to several innovations, including a patented mechanism, called re-entry penalization, which mitigates some of the problems caused by the Markov assumption.

Wrote the indexing and retrieval engine upon which the Boolean query processor in the company's resume corpus management product is based. The context operators supported by this engine, combined with the granular structure recovered by the extraction engine, facilitate precise searching of resume collections.

Spearheaded development of the company's service offering, called Aperture, a resume management system for emailed and faxed resumes. Aperture accepts emailed resumes, screens out spam and duplicate submissions, separates the resume from the message containing it, performs extraction, scores the resume against the posting of the job for which it was submitted, and sends back the resume in a special fixed format designed to support easy ranking and review in the client's mail reader.

November 1998 - February 2000
Research Scientist at Just Research
Headed research efforts in information extraction and text mining. Applied machine learning and statistical techniques to the problem of information extraction from text. Research integrated a variety of text dimensions, such as term co-occurrence, document formatting, and linguistic structure in a machine learning framework. Investigated the prospect of high-precision text retrieval using information extraction.

Research Interests

Machine learning for information extraction
  • Statistical models, relational learning, grammar inference, and vector-space learning for IE.
  • How can a system learn to perform IE in ungrammatical text?
  • How can we recognize and exploit visual devices, such as tables?

Information retrieval
  • How can IE enable more effective IR and question answering?
  • If we have a set of "correct" query/response pairs as training data, can we use machine learning to optimize IR?
  • How does the IR problem change when working with a corpus, the contents of which are known (e.g., personal email)?

Text classification and user interest modeling
  • How does text fit into the classical classification framework?
  • How are user interests best modeled for text classification?
  • What alternatives are there to vector-space models?

Machine learning in hypertext
  • How can we exploit hypertext structure to improve text classification?
  • How can we exploit user access patterns?

Relevance and feature selection
  • How can we select feature sets for learning in spaces with many irrelevant or redundant features?
  • How is this applicable to text classification?

Publications

D. Freitag, "Toward Unsupervised Whole-Corpus Tagging," Proceedings of Coling 2004.

D. Freitag, "Trained Named Entity Recognition Using Distributional Clusters," Proceedings of EMNLP 2004.

A. McCallum, D. Freitag, and F. Pereira, "Maximum entropy Markov models for information extraction and segmentation," Proceedings of ICML-2000.

D. Freitag and N. Kushmerick, "Boosted wrapper induction," Proceedings of AAAI-2000.

D. Freitag and A. McCallum, "Information extraction with HMM structures learned by stochastic optimization," Proceedings of AAAI-2000.

A. Berger, R. Caruana, D. Cohn, D. Freitag, and V. Mittal, "Bridging the lexical chasm: statistical approaches to answer-finding," Proceedings of SIGIR-2000.

D. Freitag and A. McCallum, "Information extraction using HMMs and shrinkage," Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction.

D. Freitag, "Machine Learning for Information Extraction in Informal Domains," Ph.D. dissertation, November, 1998.

D. Freitag, "Multistrategy learning for information extraction," ICML-98.

D. Freitag, "Information extraction from HTML: application of a general machine learning approach," AAAI-98.

D. Freitag, "Using grammatical inference to improve precision in information extraction," ICML-97 Workshop on Automata Induction, Grammatical Inference, and Language Acquisition, Nashville, July, 1997.

T. Joachims, D. Freitag, and T. Mitchell, "WebWatcher: A Tour Guide for the World Wide Web," Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97).

J. Boyan, D. Freitag, and T. Joachims, "A Machine Learning Architecture for Optimizing Web Search Engines," AAAI-96 Workshop on Internet-based Information Systems, Portland, August 1996.

D. Freitag, T. Joachims, and T. Mitchell, "WebWatcher: Knowledge Navigation in the World Wide Web," 1995 AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, Boston, November 1995.

T. Joachims, T. Mitchell, D. Freitag, and R. Armstrong, "WebWatcher: Machine Learning and Hypertext," Fachgruppentreffen Maschinelles Lernen, Dortmund, Germany, August 1995.

R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell, "WebWatcher: A Learning Apprentice for the World Wide Web," 1995 AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, Stanford, March 1995.

R. Caruana and D. Freitag, "How Useful is Relevance?" 1994 AAAI Fall Symposium on Relevance, New Orleans, 1994.

T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, "Experience with a Learning Personal Assistant," Communications of the ACM, July, 1994.

R. Caruana and D. Freitag, "Greedy Attribute Selection," Proceedings of the 11th International Conference on Machine Learning, 1994.

Thesis

Machine Learning for Information Extraction from Online Documents (proposed December, 1996)

The goal of my thesis research is to explore the applicability of several machine learning paradigms to the problem of information extraction from typical online documents, and to identify opportunities for improved performance by combining diverse learners. Traditional IE approaches break down in the face of the ungrammatical and ill-formed text that is common in informally constructed electronic documents, such as email messages, news posts, and Web pages. Nevertheless, the format of such "broken" text is rarely arbitrary. I hypothesize that by exploiting alternative (i.e., non-linguistic) sources of information, such as layout, and using learners that do not require linguistic information, such as term vector-space learners, we can achieve acceptable performance in many domains otherwise inaccessible to us. Moreover, because the proposed system is built around machine learning methods, it should be possible to adapt it to new domains rapidly. Please see my paper proposal and preliminary experiments.

Projects

1996-present World Wide Web Knowledge Base Project
with Tom Mitchell (PI), Jaime Carbonell (PI), Mark Craven, Andrew McCallum, and Kamal Nigam
Most research to date on representing information contained in the Web either has required manual annotation of Web pages with symbolic categories, or has contented itself with TFIDF-style models of categories as word-weight vectors. This project effectively aims to bridge the gap between these two approaches. Given an ontology and instantiations of ontologic entities and relations realized as Web pages, the system should learn to perform such instantiations automatically.
1995-present Learning Architecture for Search Engine Retrieval
with Justin Boyan and Thorsten Joachims
LASER is a Web search engine that attempts to improve retrieval performance by noting which links a user selects after entering a query. Instead of viewing an HTML page as a flat collection of terms, LASER pays attention to the context in which a term occurs (e.g., in a title field). It associates a coefficient with each context it recognizes, as well as a number of other factors affecting the retrieval status value of pages given a query. Learning is then an optimization problem in this space of coefficients.
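
The idea of per-context coefficients tuned from click data can be illustrated with a minimal sketch. This is not LASER's actual scoring function or optimizer: the page representation, the names `score`, `rank_of`, `mean_rank`, and `hill_climb`, the mean-rank objective, and the simple coordinate hill-climbing search are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Page = Dict[str, List[str]]          # context name -> terms found in that context
Log = List[Tuple[List[str], int]]    # (query terms, index of the page the user clicked)


def score(page: Page, query: List[str], weights: Dict[str, float]) -> float:
    """Weighted count of query-term occurrences, one coefficient per context."""
    total = 0.0
    for context, terms in page.items():
        w = weights.get(context, 0.0)
        total += w * sum(terms.count(q) for q in query)
    return total


def rank_of(clicked: int, pages: List[Page], query: List[str],
            weights: Dict[str, float]) -> int:
    """1-based rank of the clicked page; only strictly higher scores rank above it."""
    scores = [score(p, query, weights) for p in pages]
    return 1 + sum(s > scores[clicked] for i, s in enumerate(scores) if i != clicked)


def mean_rank(log: Log, pages: List[Page], weights: Dict[str, float]) -> float:
    """Average rank of the clicked pages; lower is better."""
    return sum(rank_of(c, pages, q, weights) for q, c in log) / len(log)


def hill_climb(log: Log, pages: List[Page], weights: Dict[str, float],
               step: float = 0.5, rounds: int = 20):
    """Coordinate search: keep any single-coefficient change that lowers mean rank."""
    best = dict(weights)
    best_cost = mean_rank(log, pages, best)
    for _ in range(rounds):
        improved = False
        for context in list(best):
            for delta in (step, -step):
                trial = dict(best)
                trial[context] = max(0.0, trial[context] + delta)  # keep weights >= 0
                cost = mean_rank(log, pages, trial)
                if cost < best_cost:
                    best, best_cost, improved = trial, cost, True
        if not improved:
            break
    return best, best_cost


# Toy data (hypothetical): the user clicked the page whose title matches the query.
pages = [
    {"title": ["machine", "learning"], "body": ["course", "notes"]},
    {"title": ["home", "page"], "body": ["machine", "learning", "machine", "learning"]},
]
log = [(["machine", "learning"], 0)]
learned, cost = hill_climb(log, pages, {"title": 1.0, "body": 1.0})
```

On this toy log the search lowers the body coefficient relative to the title coefficient, so the clicked page is no longer outranked. A real system would need far more click data and a less brittle objective.
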
1994-1996 WebWatcher
with Tom Mitchell and Thorsten Joachims
WebWatcher attempts to serve as a tour guide to Web neighborhoods. Users invoke WebWatcher by following a hyperlink to the WebWatcher server, then continue browsing as WebWatcher accompanies them, providing advice along the way. WebWatcher gains expertise by analyzing user actions, statements of interest, and the set of pages visited by users. Our studies suggested that WebWatcher could achieve close to the human level of performance on the rather difficult problem of predicting which link a user will follow given a page and a statement of interest.
Summer 1994 Newton Agent Architecture
with Siegfried Boconek, Siemens
Designed and implemented an architecture for communicating software agents on the Apple Newton, as part of the Software Secretary initiative. In this context, agents were "learning apprentice" applications, personal software, such as a calendar manager, that learned through interaction with the user. At issue was how these agents should communicate and what sorts of knowledge they might usefully exchange.
1992-1994 Calendar APprentice Project
with Tom Mitchell, Rich Caruana, David Zabowski, and others
CAP was conceived as a learning apprentice system, a software application that unobtrusively learns to improve performance through user interaction. I was part of efforts to make CAP more aware of its network environment and to add to its set of prediction tasks. As part of the latter initiative, Rich Caruana and I developed Greedy Attribute Selection, a method for selecting learning features in spaces with many redundant and irrelevant features. GAS included optimizations for decision-tree learners that yielded exponential speedup.
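
Greedy selection over feature subsets can be sketched as follows. This is a minimal sketch of plain forward selection, one common greedy scheme, and not necessarily the procedure we published. It swaps in a leave-one-out 1-nearest-neighbor evaluator in place of a decision-tree learner and omits the decision-tree optimizations mentioned above. The names `forward_select` and `loo_accuracy` and the toy data are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(n_features: int,
                   evaluate: Callable[[FrozenSet[int]], float]) -> FrozenSet[int]:
    """Add one feature at a time, stopping when no addition improves the score."""
    selected: FrozenSet[int] = frozenset()
    best = evaluate(selected)
    while True:
        candidates = [(evaluate(selected | {f}), f)
                      for f in range(n_features) if f not in selected]
        if not candidates:
            break
        top_score, top_feature = max(candidates)
        if top_score <= best:          # no strict improvement: stop
            break
        selected, best = selected | {top_feature}, top_score
    return selected


def loo_accuracy(X: Sequence[Sequence[float]], y: Sequence[int],
                 features: FrozenSet[int]) -> float:
    """Leave-one-out accuracy of 1-nearest-neighbor restricted to `features`."""
    correct = 0
    for i in range(len(X)):
        others: List[int] = [j for j in range(len(X)) if j != i]
        nearest = min(others,
                      key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in features))
        correct += y[nearest] == y[i]
    return correct / len(X)


# Toy data: feature 0 determines the label, feature 1 is noise, feature 2 is constant.
X = [(0, 5, 1), (0, 1, 1), (1, 4, 1), (1, 2, 1)]
y = [0, 0, 1, 1]
chosen = forward_select(3, lambda fs: loo_accuracy(X, y, fs))
```

Here the search keeps only feature 0: adding the noisy feature hurts accuracy, and adding the constant feature does not improve it.
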
1990 Scheme Toolkit for Modeling Systolic Arrays
supervised by Sanjay Rajopadhye, then at the University of Oregon
I developed Scheme code for the display and manipulation of 3-dimensional graphical models of systolic arrays.
1991 OREGAMI
supervised by Virginia Lo, University of Oregon
The OREGAMI group studied the possibility of automatically mapping known algorithms to message-passing parallel architectures. I developed a compiler for a language called LaRCS, which was designed to describe regularities of algorithms exploitable by parallel architectures.

Honors & Awards

1992-1995 NSF Graduate Student Fellowship
1990-1991 Dean's List, University of Oregon
1986 Elected to Phi Beta Kappa
1981-1986 Commendations for Excellence in Scholarship

Education

1992-1998 Graduate student in CS, Carnegie Mellon. Ph.D., November, 1998.
1990-1991 Undergraduate in CS, University of Oregon
1984-1985 Enrolled in Lewis & Clark College's Year in Munich exchange program
1981-1986 Undergraduate at Reed College. B.A. in English Literature, May, 1986