With the availability of web-scale corpora of semi-structured data in the form of HTML tables and unstructured data in the form of text, there is a need for information extraction techniques that work with such datasets. One such approach is the Never-Ending Language Learning (NELL) system, which classifies entities and entity pairs into a large ontology of concepts. However, for a large enough corpus, any existing categorization will be incomplete, as there will always be unanticipated classes that are present in the unlabeled Web datasets but not represented in the ontology. This thesis focuses on semi-supervised learning in the presence of such unanticipated classes. We develop unsupervised or weakly supervised information extraction techniques to extract facts from semi-structured data on the Web. We also develop extensions of semi-supervised learning approaches that use seed examples as weak supervision and dynamically induce new clusters of datapoints that do not belong to any of the seeded classes.
Furthermore, concepts present in such datasets are related to each other through inclusion, mutual exclusion, and overlapping class constraints. Finally, entities on the Web are likely to be present in multiple data views, such as HTML tables, text, and Hearst patterns. This thesis addresses the problem of semi-supervised learning in the presence of unanticipated classes and extends it to complex, real-world categorization tasks.
William Cohen (Chair)
Alon Halevy (Google)