Language Technologies Thesis Proposal
- Gates Hillman Centers
- Reddy Conference Room 4405
- DI WANG
- Ph.D. Student
- Language Technologies Institute
- Carnegie Mellon University
Example-Driven Question Answering
Open-domain question answering (QA) is an emerging information-seeking paradigm that automatically generates accurate and concise answers to natural language questions from humans. It has become one of the most natural and efficient ways to interact with the web, and it is especially desirable in hands-free, speech-enabled environments. Building QA systems, however, requires either relying on off-the-shelf natural language processing tools that are not optimized for the QA task or training domain-specific modules (e.g., question type classification) with annotated data. Additionally, optimizing QA systems with hand-crafted procedures or feature engineering is costly and time-consuming, and the resulting systems are laborious to transfer to new domains and languages.
This dissertation studies the idea of example-driven question answering, which focuses on learning to search, select, and generate answers to unseen questions solely by observing existing noisy question-answer examples along with a text corpus or knowledge base. To achieve this goal, we develop novel neural network architectures throughout the QA pipeline that can be trained directly from question-answer examples. First, we propose candidate retrieval models that use distant supervision to produce dense indexing for a text corpus (proposed) and to generate structured queries for knowledge graphs (in progress). Second, we developed generative answer passage selection models (completed) that do not require annotated negative QA pairs, and discriminative answer context ranking models (completed) that can utilize pseudo-negative examples. Third, we improved encoder-decoder models for response text generation so that they can accept external guidance on language style and topic (completed). The integrated QA pipeline aims to generate answer-like embedding vectors for search, select the most relevant passages, and compose a natural-sounding response based on the selected passages.
This dissertation demonstrates the feasibility of creating open-domain example-driven QA pipelines based on neural networks without any feature engineering or dedicated manual annotations for each QA module. Experiments show that our models achieve state-of-the-art or competitive performance on several real-world ranking and generation tasks in the domains of QA and conversation generation. When applied in the TREC LiveQA competitions, our approach received the highest average scores among automatic systems in the main tasks of 2015, 2016, and 2017, and the highest average score in the medical subtask of 2017.
Eric Nyberg (Chair)
Jaime G. Carbonell
Nebojsa Jojic (Microsoft Research)