Information Retrieval Lab
Document Structure Extraction and Thesaurus Building



Project Description

This project will attempt to extract semantic concept structures from unstructured documents.  The extracted structures for each document will then be merged to create a graph of associations for a given domain.  Using this merged graph of associations, a thesaurus will be automatically built for query expansion in information retrieval.

One common technique for automatically building a thesaurus for query expansion is to slide a window of fixed size over the relevant documents that make up the training corpus.  The co-occurrence of terms within each window is counted, and the terms with the highest level of co-occurrence (measured by mutual information, entropy, or another statistic) are added to the thesaurus.
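The sliding window technique can be sketched as follows.  This is a minimal illustration, not the project's implementation: the function names, the window size, the `min_count` cutoff, and the choice of pointwise mutual information as the association statistic are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tokens, window_size):
    """Count terms and term pairs over fixed-size sliding windows.

    Each window contributes at most one count per term and per pair.
    """
    term_counts = Counter()
    pair_counts = Counter()
    n_windows = max(len(tokens) - window_size + 1, 1)
    for start in range(n_windows):
        window = set(tokens[start:start + window_size])
        term_counts.update(window)
        # Sort so each pair has a single canonical (a, b) ordering.
        pair_counts.update(combinations(sorted(window), 2))
    return term_counts, pair_counts, n_windows


def pmi(pair, term_counts, pair_counts, n_windows):
    """Pointwise mutual information of a pair, estimated from window counts."""
    a, b = pair
    p_ab = pair_counts[pair] / n_windows
    p_a = term_counts[a] / n_windows
    p_b = term_counts[b] / n_windows
    return math.log2(p_ab / (p_a * p_b))


def top_associations(tokens, window_size=4, min_count=2, k=5):
    """Return the k highest-scoring pairs seen in at least min_count windows."""
    term_counts, pair_counts, n = cooccurrence_counts(tokens, window_size)
    scored = [
        (pmi(pair, term_counts, pair_counts, n), pair)
        for pair, count in pair_counts.items()
        if count >= min_count
    ]
    scored.sort(reverse=True)
    return scored[:k]
```

Calling `top_associations(tokens)` on a tokenized corpus returns candidate thesaurus entries as `(score, (term_a, term_b))` tuples.  The `min_count` cutoff is there because mutual information estimates are unreliable for pairs that appear in very few windows.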

The problem with this approach is that documents in some contexts contain a large amount of irrelevant information.  For example, on the World Wide Web, pages often contain both formatting and content information.  For retrieval purposes, however, the formatting is often irrelevant.  Counting it distorts the co-occurrence statistics and degrades the performance of query expansion.

Similarly, some documents contain many related concepts and are organized hierarchically rather than linearly.  For these documents, the linear sliding window technique (i.e., sliding a window through a document one line at a time) can also distort the co-occurrence statistics.  If, on the other hand, the system knew how the document's concepts are organized hierarchically, a window could be slid down this semantic hierarchy instead, improving the co-occurrence counts.
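One way to picture sliding a window down a hierarchy is sketched below.  It assumes the document structure has already been extracted into a tree; the `Node` class, the `depth` parameter, and the rule that a window covers a node plus its nearest ancestors are hypothetical choices for illustration, not the method the project settled on.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """A concept in the extracted document hierarchy (hypothetical structure)."""
    terms: list
    children: list = field(default_factory=list)


def hierarchical_windows(root, depth=2):
    """Yield one window per node: its terms plus those of up to depth-1 ancestors."""
    stack = [(root, [])]
    while stack:
        node, ancestors = stack.pop()
        # Keep only the last `depth` nodes on the path from the root.
        chain = (ancestors + [node])[-depth:]
        yield {term for n in chain for term in n.terms}
        for child in node.children:
            stack.append((child, chain))


def hierarchical_pair_counts(root, depth=2):
    """Count term pairs that co-occur within a hierarchical window."""
    pair_counts = Counter()
    for window in hierarchical_windows(root, depth):
        pair_counts.update(combinations(sorted(window), 2))
    return pair_counts
```

Under this scheme, terms in a parent and its child co-occur, while terms in sibling subtrees do not, even if they happen to sit on adjacent lines in the original text.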


People

The project is currently a lab for the Advanced Information Retrieval course at the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University.  The people involved in the project include:

Faculty Advisor:  Jamie Callan
Graduate Students:  Peter Suen




Updated on September 26, 2003.
http://www.cs.cmu.edu/~petesuen/