Students who want to do an independent study or IR Lab with me can either i) propose their own topic, or ii) choose a topic from the list below.
LTI Site Search: The LTI has a Site Search capability built on the Indri search engine. It's a good start, but it could be improved in many ways. If you are interested in working on ranking algorithms, query processing, crawling algorithms, search user interfaces, search log analysis, or other aspects of local Web search, this might be an interesting project for you.
Web Page Structure Annotation: Modern web pages are complex. They contain advertising, navigation links, links to related content, the main content, and other material, all mixed together. Before the page can be indexed for search or used for text mining, the web page must be annotated with additional markup that identifies each type of material so that each type can be handled appropriately. I am interested in supervised and unsupervised techniques that can provide reliable annotation of large datasets (e.g., at least a hundred million web pages).
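As a rough illustration of what an unsupervised baseline for this kind of annotation might look like, the sketch below labels each block of a page as "navigation" or "content" based on link density (the fraction of a block's text that sits inside anchor tags). The block-level tag set, the 0.5 threshold, and the names BlockCollector and annotate are assumptions made for this example only; a real project would need far richer features and labels.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as block boundaries for the sketch.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "tr", "table",
              "h1", "h2", "h3", "h4", "nav", "section", "article"}
SKIP_TAGS = {"script", "style"}


def _normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts characters inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars)
        self._text = []
        self._link_chars = 0
        self._anchor_depth = 0
        self._skip_depth = 0

    def flush(self):
        text = _normalize("".join(self._text))
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        if self._skip_depth > 0:
            return  # ignore script and style bodies
        self._text.append(data)
        if self._anchor_depth > 0:
            self._link_chars += len(_normalize(data))


def annotate(html, threshold=0.5):
    """Return a list of (label, text) pairs, one per non-empty block."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # capture any trailing text outside a block tag
    result = []
    for text, link_chars in parser.blocks:
        density = min(1.0, link_chars / len(text))
        label = "navigation" if density > threshold else "content"
        result.append((label, text))
    return result


page = ('<div><a href="/">Home</a> <a href="/about">About</a></div>'
        '<p>This is the main article text, which is long and has no links.</p>')
for label, text in annotate(page):
    print(label, "|", text)
```

A single hand-set threshold like this is only a baseline to compare against; the supervised and unsupervised techniques mentioned above would replace it.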
LTI Publications Database: Create a publications database and search facility for the LTI (or any department). In the intended design, WebISO controls access, and document attributes (metadata) are stored in a MySQL database. The documents themselves are stored in one of several standard formats (e.g., txt, docx, pdf) and are indexed and searched by an open source search engine (e.g., Indri). Several forms of visualization are available to summarize search results (e.g., timeline, histogram, word cloud).
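To make the metadata side of this project more concrete, here is a minimal sketch of one possible schema, together with a query that produces the per-year counts a timeline or histogram view would need. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table names, column names, and helper functions are all hypothetical, and the sample rows are placeholders rather than real publications.

```python
import sqlite3

# Hypothetical schema: one row per document, plus an ordered author list.
SCHEMA = """
CREATE TABLE publication (
    id          INTEGER PRIMARY KEY,
    title       TEXT    NOT NULL,
    year        INTEGER NOT NULL,
    file_format TEXT    NOT NULL CHECK (file_format IN ('txt', 'docx', 'pdf')),
    file_path   TEXT    NOT NULL
);
CREATE TABLE author (
    publication_id INTEGER NOT NULL REFERENCES publication(id),
    position       INTEGER NOT NULL,
    name           TEXT    NOT NULL,
    PRIMARY KEY (publication_id, position)
);
"""


def open_db():
    """Create an in-memory database (a stand-in for the MySQL server)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_publication(conn, title, year, file_format, file_path, authors):
    """Insert one publication and its authors; return the new row id."""
    cur = conn.execute(
        "INSERT INTO publication (title, year, file_format, file_path) "
        "VALUES (?, ?, ?, ?)",
        (title, year, file_format, file_path),
    )
    pub_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO author (publication_id, position, name) VALUES (?, ?, ?)",
        [(pub_id, i, name) for i, name in enumerate(authors, start=1)],
    )
    conn.commit()
    return pub_id


def papers_per_year(conn):
    """Return (year, count) pairs, e.g. as input to a histogram."""
    return conn.execute(
        "SELECT year, COUNT(*) FROM publication GROUP BY year ORDER BY year"
    ).fetchall()


conn = open_db()
# Placeholder rows for illustration only.
add_publication(conn, "Example Paper A", 2010, "pdf", "docs/a.pdf", ["Author One"])
add_publication(conn, "Example Paper B", 2010, "txt", "docs/b.txt", ["Author Two"])
add_publication(conn, "Example Paper C", 2011, "docx", "docs/c.docx",
                ["Author One", "Author Two"])
print(papers_per_year(conn))
```

Full-text search would still be handled by the search engine; the database in this sketch only holds metadata and the path to each stored document.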
Cloud-Based Text Analytics: Create a cloud-based text analytics service that integrates a variety of open-source software packages that provide lexical processing, feature selection, text categorization, text clustering, sentiment analysis, and other capabilities. Example open-source tools might include the Stanford Named Entity Recognizer, the Stanford Part-of-Speech Tagger, Weka, and LingPipe.
Movie Sentiment Tracking: Track mentions of movies in social media. Use sentiment analysis to determine whether people like a movie, and what they like or don't like about it. Examine how the measured sentiment correlates with box office revenue.
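The last step could start from something as simple as a Pearson correlation between a per-movie sentiment score and revenue. The sketch below assumes each mention has already been scored in the range -1 to 1 by some sentiment classifier. The movie names, scores, and revenue figures are made-up placeholders, and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_sentiment(scores):
    """Average the per-mention sentiment scores for one movie."""
    return sum(scores) / len(scores)


# Placeholder data for illustration only, not real measurements.
mentions = {
    "movie_a": [0.8, 0.6, 0.9],
    "movie_b": [-0.2, 0.1, 0.0],
    "movie_c": [0.4, 0.5, 0.3],
}
revenue_musd = {"movie_a": 120.0, "movie_b": 15.0, "movie_c": 60.0}

titles = sorted(mentions)
sentiment = [mean_sentiment(mentions[t]) for t in titles]
revenue = [revenue_musd[t] for t in titles]
r = pearson(sentiment, revenue)
print(f"Pearson r between mean sentiment and revenue: {r:.3f}")
```

A real study would also need to control for factors such as the volume of mentions, release timing, and marketing budget, since a raw correlation over a handful of movies says very little on its own.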