Research
I work at the intersection of machine learning and human-computer interaction, particularly on interactive learning systems with applications in natural language processing, biological research, and social computing. For more detail, see a complete list of publications by year.
Active and Interactive Machine Learning
Annotating training data for machine learning is often slow, expensive, and difficult. My research involves systems that learn more economically by playing an active role in the learning process, e.g., by asking questions about the learning task. I have written a fairly comprehensive literature survey of this field, and studied several important subtopics.
- Can we train semi-supervised systems more effectively by allowing the learner to solicit domain knowledge (e.g., labels for feature-based rules)? [EMNLP11, EMNLP09]
- Is it advantageous for learners to query labels at mixed granularities? [NIPS08]
- What if different queries have different costs? [NIPS08ws]
- How do query algorithms for structured learning problems (e.g., information extraction) compare? [EMNLP08]
See also: DUALIST (software)
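To make the idea of a learner "asking questions" concrete, here is a minimal sketch of uncertainty sampling, one common query strategy in the active learning literature. It is an illustration only, not necessarily the method used in the papers cited above. The function names and the example probabilities are hypothetical: the learner looks at its own predictions over a pool of unlabeled instances and asks for a label on the one it is least sure about.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(pool_predictions):
    """Return the index of the unlabeled instance whose predicted
    class distribution has the highest entropy."""
    return max(range(len(pool_predictions)),
               key=lambda i: entropy(pool_predictions[i]))


# Hypothetical model outputs for three unlabeled instances (two classes each).
pool_predictions = [
    [0.95, 0.05],  # confident prediction
    [0.55, 0.45],  # nearly a coin flip
    [0.20, 0.80],  # fairly confident prediction
]

# The learner would request a label for this instance next.
query_index = most_uncertain(pool_predictions)
print(query_index)
```

In a full loop, the newly labeled instance would be added to the training set, the model retrained, and the selection repeated until the labeling budget runs out.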
Natural Language Processing
Language technology is a fascinating application area for machine learning. In particular, I am interested in statistical "machine reading" systems that extract information from large text collections and make use of it in various ways, as well as corpus-based generative models that foster creative thinking in people.
- Can computers learn (forever) to extract information from the Web? [AAAI10]
- Can we leverage existing corpora to build "creativity tools" that assist human writers? [CALC10]
See also: Read the Web project website, @cmunell on Twitter, and The Muse creativity tools
Computational Biology and Bioinformatics
Biology is an increasingly data-driven (vs. hypothesis-driven) science. Today we can harness intelligent computer systems to help us predict, explain, and explore biological phenomena. I also believe we can exploit the biomedical literature in such systems to aid in biological discovery.
- How can machine learning help explain and predict enzyme activity from high-throughput peptide array measurements? [ACS11]
- Can we improve biomedical information retrieval by focusing on localized passages of text and encouraging diversity in results? [TREC07, TREC06, TREC05]
- State-of-the-art biomedical information extraction with conditional random fields [Bioinf05, NLPBA04]
Social Computing
Our modern web-based society creates a lot of data as a byproduct of daily interactions. I study ways of using such arbitrary (often noisy) data to train useful and informative machine learning systems.
Research-Related Software
- DUALIST: Utility for Active Learning with Instances and Semantic Terms, an interactive text annotation and learning system.
- AMIL: Active Multiple-Instance Library, a Java library for multiple-instance learning.
- ABNER: A Biomedical Named Entity Recognizer, a state-of-the-art biomedical information extraction tool.
I've also released a few research data sets.