This project will attempt to extract semantic concept structures from unstructured documents. The extracted structures for each document will then be merged to create a graph of associations for a given domain. Using this merged graph of associations, a thesaurus will be automatically built for query expansion in information retrieval.
One common technique for automatically building a thesaurus for query expansion is to slide a window of fixed size over the relevant documents that make up the training corpus. The co-occurrence of terms within each window is counted, and the terms with the highest level of co-occurrence (which can be measured by mutual information, entropy, or another statistic) are added to the thesaurus.
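The windowed counting described above can be sketched as follows. This is only an illustration, not the project's implementation: the function names, the choice of pointwise mutual information as the statistic, and the treatment of each window as a set of terms are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tokens, window_size):
    """Count terms and unordered term pairs over fixed-size sliding windows.

    Hypothetical helper: each window is treated as a set, so a term is
    counted at most once per window.
    """
    term_counts = Counter()
    pair_counts = Counter()
    n_windows = max(len(tokens) - window_size + 1, 1)
    for start in range(n_windows):
        window = set(tokens[start:start + window_size])
        term_counts.update(window)
        # Sorting makes each pair a canonical (a, b) tuple with a < b.
        pair_counts.update(combinations(sorted(window), 2))
    return term_counts, pair_counts, n_windows


def pmi_scores(term_counts, pair_counts, n_windows):
    """Pointwise mutual information of each pair, estimated from window counts."""
    scores = {}
    for (a, b), joint in pair_counts.items():
        scores[(a, b)] = math.log2(
            joint * n_windows / (term_counts[a] * term_counts[b])
        )
    return scores


def top_pairs(scores, k):
    """Return the k highest-scoring pairs as thesaurus candidates."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A caller would tokenize each training document, accumulate the counts, and keep the top-scoring pairs as candidate thesaurus entries. Another statistic could replace `pmi_scores` without changing the counting step.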
The problem with this approach is that documents in some contexts contain a high level of irrelevant information. For example, on the World Wide Web, pages often contain both formatting and content information. For retrieval purposes, however, the formatting is often irrelevant. This leads to distortion of the co-occurrence statistics and degrades the performance of query expansion.
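To make the web-page case concrete, the sketch below separates content from formatting before any counting takes place. It assumes HTML input and uses Python's standard `html.parser` module; the class and function names are hypothetical, and a real system would likely need more careful handling of page structure.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect lowercase content tokens, ignoring tags and script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside skipped elements counts as content.
        if self._skip_depth == 0:
            self.tokens.extend(data.lower().split())


def content_tokens(html):
    """Return the content tokens of an HTML string (illustrative helper)."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Without this step, tag names and style rules would enter the windows alongside content terms and inflate their co-occurrence counts.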
Similarly, some documents contain many related concepts and are organized hierarchically rather than linearly. For these documents, the linear sliding window technique (i.e., sliding a window linearly through a document, one line at a time) could also distort the co-occurrence statistics. If, on the other hand, the system knew how the document's concepts are organized hierarchically, then a window could be slid down this semantic hierarchy, improving the counting of co-occurrence statistics.
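One possible reading of sliding a window down the hierarchy is sketched below. It assumes the concept structure is already available as a tree, and it defines a window as a node together with its nearest ancestors. The `ConceptNode` type, the `depth` parameter, and this window definition are assumptions for illustration; the project description does not specify them.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class ConceptNode:
    """Hypothetical node of a document's concept hierarchy."""
    terms: list
    children: list = field(default_factory=list)


def hierarchical_cooccurrence(root, depth):
    """Count term pairs that fall within `depth` consecutive levels of a path.

    The window at each node holds the terms of that node and of up to
    depth - 1 of its nearest ancestors.
    """
    pair_counts = Counter()

    def visit(node, ancestors):
        path = (ancestors + [node])[-depth:]
        window = set()
        for member in path:
            window.update(member.terms)
        # Sorting makes each pair a canonical (a, b) tuple with a < b.
        pair_counts.update(combinations(sorted(window), 2))
        for child in node.children:
            visit(child, path)

    visit(root, [])
    return pair_counts
```

Under this definition, terms in sibling subtrees never share a window, even if they sit next to each other in the linear text, while a term and the concept directly above it always do.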
People
The project is currently a lab for the Advanced Information Retrieval course at the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University. The people involved in the project include:
| Faculty Advisor | Graduate Students |
| Jamie Callan | Peter Suen |
References