AUTHOR: Thomas Hofmann, Artificial Intelligence Laboratory, Center for Biological and Computational Learning, M.I.T.

TITLE: Structuring Document Databases by Hierarchical Clustering and Abstraction

ABSTRACT: Statistical mixture models are versatile tools for many tasks in machine learning, exploratory data analysis, and pattern recognition. They are well suited to detecting structures such as clusters and data hierarchies, and they have a sound foundation in probability theory. In this talk, I will present a novel hierarchical mixture model for co-occurrence data and demonstrate its benefits in the domain of information retrieval. The described learning architecture generates hierarchical organizations of document databases from word occurrence statistics. The key feature of the model is the combination of the document hierarchy with an abstractive organization of keywords. This supports the identification of discriminant terms and has proven to be very useful for interactive retrieval, as well as for problems like cluster summarization. To train the model, I propose an annealed variant of the Expectation-Maximization (EM) algorithm for maximum likelihood estimation.
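
To make the idea of annealed EM more concrete, the following is a minimal sketch and not the model presented in the talk. It assumes a flat (non-hierarchical) mixture of multinomials over document word counts, and it tempers the E-step by raising the unnormalized posteriors to an inverse temperature beta that is increased stepwise to 1. The function name `annealed_em`, the beta schedule, the smoothing constant, and the toy counts are all illustrative assumptions.

```python
import math
import random


def annealed_em(counts, n_clusters, betas=(0.2, 0.4, 0.6, 0.8, 1.0),
                iters_per_beta=20, smoothing=1e-2, seed=0):
    """Annealed EM for a flat mixture of multinomials (illustrative sketch).

    counts: list of documents, each a list of word counts of equal length.
    Returns (mixing weights, per-cluster word distributions, responsibilities).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    # Uniform mixing weights; slightly perturbed word distributions so that
    # the clusters are not exactly identical at the start.
    pi = [1.0 / n_clusters] * n_clusters
    theta = []
    for _ in range(n_clusters):
        row = [1.0 + 0.1 * rng.random() for _ in range(n_words)]
        total = sum(row)
        theta.append([v / total for v in row])

    resp = [[1.0 / n_clusters] * n_clusters for _ in range(n_docs)]
    for beta in betas:  # inverse temperature (assumed schedule), ending at 1
        for _ in range(iters_per_beta):
            # Tempered E-step: posterior proportional to (pi_k * p(doc | k)) ** beta.
            for d, doc in enumerate(counts):
                logp = []
                for k in range(n_clusters):
                    loglik = sum(c * math.log(theta[k][w]) for w, c in enumerate(doc))
                    logp.append(beta * (math.log(pi[k]) + loglik))
                top = max(logp)  # subtract the maximum for numerical stability
                unnorm = [math.exp(v - top) for v in logp]
                z = sum(unnorm)
                resp[d] = [u / z for u in unnorm]

            # M-step: re-estimate parameters from the soft assignments,
            # with a small additive smoothing to keep all probabilities positive.
            for k in range(n_clusters):
                weight = sum(resp[d][k] for d in range(n_docs))
                pi[k] = (weight + smoothing) / (n_docs + n_clusters * smoothing)
                row = [
                    smoothing + sum(resp[d][k] * counts[d][w] for d in range(n_docs))
                    for w in range(n_words)
                ]
                total = sum(row)
                theta[k] = [v / total for v in row]
    return pi, theta, resp


# Toy word counts (made up for illustration): two documents dominated by
# words 0 and 1, and two documents dominated by words 2 and 3.
toy_counts = [[5, 4, 0, 0], [6, 3, 0, 1], [0, 0, 5, 4], [0, 1, 4, 6]]
pi, theta, resp = annealed_em(toy_counts, n_clusters=2)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
print(labels)
```

With beta fixed at 1 from the start, the loop reduces to standard EM. Starting at a smaller beta flattens the posteriors early on, which is commonly used to reduce sensitivity to initialization. The hierarchical model and the abstractive keyword organization described in the abstract are not covered by this sketch.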