Students who want to do an independent study or IR Lab with me can either i) propose their own topic, or ii) choose a topic from the list below.
LTI Site Search: We recently deployed an LTI Site Search capability, using Lemur. It's a good start, but it could be improved in many ways. If you are interested in working on improved ranking algorithms, query processing, improved crawling algorithms, search user interfaces, search log analysis, or other aspects of local Web search, this might be an interesting project for you.
Associating BitTorrent Filenames with Content: Researchers at the Heinz College have 18 months of data from a prominent BitTorrent tracker site, which includes the name of the tracker file and the description of the tracker. The problem with this data is that it is not well tagged for the purpose of determining which piece of content (e.g., song, album, movie, TV program, book, game) each torrent refers to. Mike Smith and I would like to work with a motivated student to use machine learning techniques to associate the BitTorrent file names with content in databases of available music, movies, and books.
Build your Own Search Service (BOSS): Yahoo! recently deployed a new service called BOSS that allows you to build search interfaces on top of Yahoo!'s search engine. I would be interested in projects that use this interface to deliver more accurate search, personalized search, or better organization and display of search results.
Web Page Structure Annotation: Modern web pages are complex. They contain advertising, navigation links, links to related content, the main content, and other material, all mixed together. Before the page can be indexed for search or used for text mining, the web page must be annotated with additional markup that identifies each type of material so that each type can be handled appropriately. I am interested in supervised and unsupervised techniques that can provide reliable annotation of large datasets (e.g., at least a hundred million web pages).
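To give a feel for the web page annotation task, here is a minimal sketch of a simple link-density baseline. It is not one of the supervised or unsupervised techniques the project has in mind, only a rule-based starting point that such techniques would be expected to beat. The names BlockCollector and annotate, the set of block tags, the label names, and both thresholds are assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "li", "td", "nav", "article", "section"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and tracks how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) pairs
        self._text = []         # raw text pieces of the current block
        self._link_chars = 0    # characters of the current block inside <a>
        self._in_link = 0       # nesting depth of open <a> tags

    def flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(" ".join(data.split()))


def annotate(html, link_threshold=0.5, min_content_chars=40):
    """Label each block as 'navigation', 'content', or 'other'.

    Both thresholds are arbitrary illustrative values, not tuned ones.
    """
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text outside a block tag

    labeled = []
    for text, link_chars in parser.blocks:
        density = link_chars / len(text)
        if density >= link_threshold:
            label = "navigation"   # mostly link text
        elif len(text) >= min_content_chars:
            label = "content"      # long run of ordinary text
        else:
            label = "other"        # short, unlinked fragments
        labeled.append((label, text))
    return labeled
```

A hand-written rule like this breaks down quickly on real pages (for example, it cannot tell advertising from related-content links), which is part of why learned approaches are of interest here.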