Language Technologies Thesis Defense
- Gates & Hillman Centers
- HIDEKI SHIMA
- Ph.D. Student
- Language Technologies Institute
- Carnegie Mellon University
Paraphrase Pattern Acquisition by Diversifiable Bootstrapping
Texts that convey the same or a similar meaning can be written in many different ways. Because of this, computer programs are not good at recognizing meaning similarity between short texts. To address this problem, researchers have been investigating methods for automatically acquiring paraphrase templates from a corpus (paraphrase extraction). State-of-the-art approaches to paraphrase extraction have limited ability to detect variation beyond closely related wordings (e.g. "X died of Y", "X has died of Y", "X was dying of Y", "X died from Y", "X was killed in Y"). For practical use, for instance in Information Extraction, a paraphrase resource should ideally have higher coverage so that it can recognize more ways of conveying the same meaning in text (e.g. "X succumbed to Y", "X fell victim to Y", "X suffered a fatal Y", "X was terminally ill with Y", "X lost his long battle with Y", "X (writer) wrote his final chapter Y"), without adding noisy patterns or instances whose meaning differs from that of the original seeds (semantic drift).
The goal of this thesis work is to develop a paraphrase extraction algorithm that can acquire lexically diverse binary-relation paraphrase templates, given a relatively small number of seed instances for a certain relation and an unstructured monolingual corpus. The proposed algorithm runs iteratively: the seed instances are used to extract paraphrase patterns, these patterns are then used to extract more instances, and the new instances serve as seeds for the next iteration.
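The iterative loop described above can be illustrated with a minimal sketch. It is not the thesis's algorithm: the toy corpus, the function names, and the whole-sentence regular-expression matching are assumptions made for illustration, and the sketch omits pattern scoring, diversity control, and any safeguard against semantic drift.

```python
import re

# Hypothetical toy corpus; the thesis works on a large unstructured corpus.
CORPUS = [
    "Mozart died of fever",
    "Mozart succumbed to fever",
    "Chopin succumbed to tuberculosis",
    "Chopin fell victim to tuberculosis",
    "Curie wrote about radium",
]
SEEDS = {("Mozart", "fever")}


def extract_patterns(corpus, instances):
    """Collect the text between X and Y in sentences that mention a known pair."""
    patterns = set()
    for sentence in corpus:
        for x, y in instances:
            m = re.fullmatch(re.escape(x) + r" (.+) " + re.escape(y), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns


def extract_instances(corpus, patterns):
    """Collect (X, Y) pairs from sentences that match a known pattern."""
    instances = set()
    for sentence in corpus:
        for pattern in patterns:
            m = re.fullmatch(r"(\w+) " + re.escape(pattern) + r" (\w+)", sentence)
            if m:
                instances.add((m.group(1), m.group(2)))
    return instances


def bootstrap(corpus, seeds, iterations=2):
    """Alternate pattern extraction and instance extraction for a fixed number of rounds."""
    instances, patterns = set(seeds), set()
    for _ in range(iterations):
        patterns |= extract_patterns(corpus, instances)
        instances |= extract_instances(corpus, patterns)
    return patterns, instances


patterns, instances = bootstrap(CORPUS, SEEDS, iterations=2)
print(sorted(patterns))
print(sorted(instances))
```

In this toy run, the single seed pair yields "died of" and "succumbed to" in the first round; "succumbed to" then brings in a second instance, which in turn yields "fell victim to" in the second round. The unrelated sentence shares no instance with the seeds and contributes nothing.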
The proposed work is unique in the sense that the lexical diversity of the resulting paraphrase patterns can be controlled with a parameter, and that semantic drift is deferred by identifying erroneous instances using a distributional type model. We also propose a new metric, DIMPLE, which measures the quality of paraphrases while taking lexical diversity into consideration.
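To give a sense of what controlling lexical diversity with a single parameter could look like, here is a small sketch. It does not reproduce the thesis's method or the DIMPLE metric. It swaps in a generic greedy selection in the style of Maximal Marginal Relevance, with Jaccard word overlap as the redundancy measure. The confidence scores, the `diversity` parameter, and all function names are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two patterns."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


def select_patterns(scored, k, diversity):
    """Greedily pick k patterns, trading confidence against lexical redundancy.

    scored:    dict mapping pattern -> confidence (hypothetical values)
    diversity: 0.0 ranks by confidence only; larger values penalize
               overlap with patterns that were already selected
    """
    selected = []
    remaining = dict(scored)
    while remaining and len(selected) < k:
        def utility(p):
            # Overlap with the most similar pattern picked so far (0.0 if none yet).
            redundancy = max((jaccard(p, s) for s in selected), default=0.0)
            return (1 - diversity) * remaining[p] - diversity * redundancy

        best = max(remaining, key=utility)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical confidence scores for candidate patterns.
CANDIDATES = {
    "died of": 0.9,
    "has died of": 0.85,
    "died from": 0.8,
    "succumbed to": 0.6,
    "fell victim to": 0.5,
}

print(select_patterns(CANDIDATES, k=3, diversity=0.0))
print(select_patterns(CANDIDATES, k=3, diversity=0.5))
```

With `diversity=0.0` the selection keeps the three near-identical "died" variants. With `diversity=0.5` the lexically distinct "succumbed to" displaces "has died of".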
Our hypothesis is that a model that explicitly controls diversity and includes a distributional type constraint will outperform the state of the art as measured by precision, relative recall, and DIMPLE. We also present experimental results to support this hypothesis.
Teruko Mitamura (Chair)
Patrick Pantel (Microsoft Research)