Matthew W. Bilotti, Ph.D.
Now a Search Engineer at Twitter, Inc., as of January 2010
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213 USA
Email: mbilotti ·at· cs · cmu · edu
[Photo credit: @ded]
About me: I am a sixth-year Ph.D. student at the LTI, advised by Eric Nyberg. I was formerly at MIT CSAIL, where I completed my undergraduate degree and my M.Eng., supervised by Dr. Boris Katz.
Research interests: Information Retrieval over semi-structured text annotated with linguistic and semantic content; machine learning approaches to ranking annotated text with respect to complex information needs; Question Answering.
If Question Answering (QA) systems are ever to reach the level of speed and accuracy required to be competitive with the web search engines that are ubiquitous in the lives of today's internet users, the quality of the underlying text retrieval process must be improved.
Most Information Retrieval (IR) systems, including those that are embedded in QA systems, are optimized to provide a quality ad hoc retrieval experience for a human user, but fail to address the unique needs of QA systems. QA systems often have a much more complete specification of what they are looking for than human users do. This specification consists of linguistic and semantic constraints that the system knows must hold for a piece of text to contain the answer to the question.
My dissertation research focuses on improving the quality of retrieved text within the context of a Question Answering (QA) system by applying learning-to-rank techniques to a feature space derived from the linguistic and semantic constraints of interest to the system.
The approach involves re-ranking the retrieval output of a bag-of-words and Named Entity baseline, whose queries consist of keywords drawn from the question and a Named Entity type placeholder representing the expected answer type. This is considered a strong passage retrieval baseline for QA because, when the retrieval unit is small, it approximates density-based methods.
The trained linear model is used to re-rank the retrieved passages based on the degree to which they partially satisfy the linguistic and semantic constraints derived from the question. Recent experiments with TREC QA data show that this method realizes significant improvements in Mean Average Precision over a bag-of-words and Named Entity baseline.
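To make the re-ranking step and the evaluation metric concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the system described above: the two-element feature vectors, the passage ids, and the weights are all hypothetical, and the weights are set by hand where a real system would learn them from training data. Each feature is assumed to be the fraction of one kind of question-derived constraint that a passage satisfies.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical representation: (passage_id, constraint-satisfaction features).
# Each feature is assumed to lie in [0, 1]; 1.0 means fully satisfied.
Passage = Tuple[str, List[float]]


def linear_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Score a passage as the dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def rerank(weights: Sequence[float], passages: List[Passage]) -> List[str]:
    """Return passage ids sorted by descending linear score."""
    ordered = sorted(
        passages,
        key=lambda p: linear_score(weights, p[1]),
        reverse=True,
    )
    return [pid for pid, _ in ordered]


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k taken at each rank k holding a relevant passage."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, pid in enumerate(ranking, start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], relevance: Dict[str, Set[str]]
) -> float:
    """Mean of per-question average precision over all questions."""
    if not rankings:
        return 0.0
    return sum(
        average_precision(ranking, relevance.get(qid, set()))
        for qid, ranking in rankings.items()
    ) / len(rankings)


# Toy data (made up): the baseline returned p1, p2, p3 in that order.
baseline: List[Passage] = [
    ("p1", [0.0, 0.5]),
    ("p2", [1.0, 1.0]),
    ("p3", [0.5, 0.0]),
]
weights = [0.7, 0.3]  # hand-picked for illustration, not learned
relevant = {"p2"}

baseline_order = [pid for pid, _ in baseline]
reranked_order = rerank(weights, baseline)
baseline_ap = average_precision(baseline_order, relevant)
reranked_ap = average_precision(reranked_order, relevant)
```

In this toy case the only relevant passage moves from rank 2 to rank 1, so its average precision rises from 0.5 to 1.0. Mean Average Precision is the same quantity averaged over all questions in a test set.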