Language Technologies Ph.D. Thesis Defense
- Gates Hillman Centers
- Traffic21 Classroom 6501
- PEDRO JOSÉ dos REIS MOTA
- Ph.D. Student
- Language Technologies Institute
- Carnegie Mellon University
BeamSeg: A Joint Model for Multi-Document Text Segmentation and Topic Identification
The work in this thesis is motivated by the problem of navigating the content of a collection of related documents, which is cumbersome if only a list of documents is given. Automatically structuring the content of a dataset, by identifying topically cohesive segments and linking segments that describe the same topic, addresses this issue. Previous work deals with this problem by using a multi-document joint model for segmentation and topic identification at the dataset level, a perspective we also take. This multi-document approach contrasts with approaches that segment documents individually. The advantage of a multi-document model is that segmentation can leverage repeated descriptions of the same topic across different documents. We continue this work by hypothesizing that vocabulary relationships between different segments can be used to obtain more accurate segmentation and topic identification.
We also hypothesize that documents sharing the same modality (video, PowerPoint, etc.) have similar characteristics that can be modeled to improve performance in these tasks. To study these hypotheses, we propose BeamSeg, a joint model for multi-document segmentation and topic identification in which segments are assumed to have vocabulary usage relationships. BeamSeg implements segmentation and topic identification in an unsupervised Bayesian setting by drawing segments with the same topic from the same multinomial language model. We assume that the language models are not independent: we place a dynamic Dirichlet prior over them that takes into account data contributions from other topics.
To test our hypotheses, we carry out a data collection task, as datasets from previous work have few documents with short segments, leaving little room to observe vocabulary relationships. The evaluation on the collected dataset shows that BeamSeg obtains the best results, yielding practical improvements in both segmentation and topic identification.
Maxine Eskenazi (Chair)
Maria Luísa Coheur (Co-Chair, Instituto Superior Técnico)
Anselmo Peñas (Instituto Superior Técnico/INESC)
Chris Dyer (DeepMind, previously LTI)
Bruno Emanuel da Graça Martins (Instituto Superior Técnico)