Students will choose from projects designed by the instructors as listed below. Students will also have the option of designing their own projects, subject to instructor approval. Generate a web page of your online report (see examples) for your interaction with the instructors.

instructor: Yang
Write your own web crawler to collect bilingual documents (see Mining the Web for Bilingual Text by Resnik, ACL'99) in a specific domain (scientific literature or eCommerce products, for example). Use the collected parallel corpus to train a corpus-based CLIR method -- Pseudo-Relevance Feedback (PRF). Tune this system using a pair of languages that you know (e.g., Chinese/English), and evaluate the system with a different pair of languages. The only parallel corpus we have with human relevance judgments on queries in the area of scientific literature is the NACSIS corpus (Japanese/English abstracts of journal articles), so this is the only language pair you can use for evaluation. Use a standard retrieval measure (such as 11-pt average precision) and compare the CLIR performance with monolingual retrieval performance.
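For reference, 11-pt average precision for a single query can be computed roughly as follows. This is a minimal sketch, not a provided tool: the function name and the input format (a ranked list of document ids plus a set of relevant ids) are assumptions made for illustration.

```python
def eleven_point_average_precision(ranked_ids, relevant_ids):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query.

    ranked_ids: document ids in the order the system returned them (assumed format).
    relevant_ids: ids judged relevant for the query (assumed format).
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")

    # Record (recall, precision) each time a relevant document is found.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level - 1e-12]
        total += max(candidates, default=0.0)
    return total / 11


# Toy example: two relevant documents, found at ranks 1 and 3.
score = eleven_point_average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Averaging this value over all queries gives one number per system, which makes the cross-language and monolingual runs directly comparable.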
Provided:

instructor: Yang
Write an acquisition system to collect bilingual text from the Web in scientific literature or eCommerce products, and use the resulting parallel collections for training in cross-language text categorization (CLTC) using a k-nearest neighbor approach. We assume that we only have labelled training data in one language (L1) but not in another (L2). Given a new document (the "query") in L2, we find its kNNs in L2 and the corresponding documents in L1 via the parallel corpus; we then use the category labels of the kNNs in L1 to assign categories to the query in L2. You should choose a pair of languages both of which you know (e.g., Chinese/English). For evaluation, you need to manually confirm a test set of document pairs where the two documents in each pair belong to the same category in a given taxonomy (e.g., by "Yahoo!"). Evaluate and compare the monolingual categorization performance and the crosslingual categorization performance of the kNN classifier on the documents in the test set, using standard text categorization measures.
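The kNN step described above might look like the sketch below. It assumes that every pair in the parallel corpus is stored as the token list of its L2 half together with the category labels of its L1 half, and that neighbors vote with their cosine similarity; both choices, and all names, are illustrative assumptions rather than a prescribed design.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cltc_knn(query_tokens, parallel_pairs, k=3):
    """Rank categories for an L2 query.

    parallel_pairs: list of (l2_tokens, l1_labels), where l1_labels are the
    categories of the L1 half of the pair (assumed to be available).
    """
    query = Counter(query_tokens)
    scored = [(cosine(query, Counter(l2_tokens)), labels)
              for l2_tokens, labels in parallel_pairs]
    # Keep the k pairs whose L2 half is most similar to the query.
    neighbours = sorted(scored, key=lambda s: s[0], reverse=True)[:k]

    # Each neighbour votes for the labels of its L1 half, weighted by similarity.
    votes = Counter()
    for sim, labels in neighbours:
        for label in labels:
            votes[label] += sim
    return votes.most_common()


# Toy parallel corpus with romanized placeholder tokens on the L2 side.
pairs = [
    (["jiqi", "xuexi", "suanfa"], ["computing"]),
    (["jiqi", "fanyi"], ["computing"]),
    (["zuqiu", "bisai"], ["sports"]),
    (["zuqiu", "qiudui", "bisai"], ["sports"]),
]
ranking = cltc_knn(["zuqiu", "bisai", "jieguo"], pairs, k=2)
```

The monolingual baseline can reuse the same voting code, with the query and the labelled documents in the same language.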
Provided:

instructor: Yang
Apply a set of existing CLIR systems (prototypes) to the NACSIS Japanese/English collection, debug the systems and data, evaluate the systems, and compare their effectiveness and scalability.
Provided:

instructor: Yang
The CLIR task is to use German queries to retrieve English documents, or vice versa, in the MUCHMORE collection (documents categorized using the Medical Subject Headings). The idea is to reduce the CLIR problem to a text categorization (TC) problem in each of the two languages respectively, and then establish a mapping between the categorized queries and documents. You can use an existing (state-of-the-art) classifier (SVM, kNN, LLSF, etc.) and apply it to the MUCHMORE data. You need to implement the idea and examine the effectiveness and scalability.
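One possible form of the mapping is to compare category score vectors directly. The sketch below assumes that monolingual classifiers have already assigned each German query and each English document a dictionary of category scores, and it ranks documents by cosine similarity in category space. The function names, the data layout and the choice of cosine similarity are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse category-score vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_documents(query_categories, doc_categories):
    """Rank documents by how closely their categories match the query's.

    query_categories: {category: score} from the query-language classifier.
    doc_categories: {doc_id: {category: score}} from the document-language classifier.
    """
    scored = [(doc_id, cosine(query_categories, cats))
              for doc_id, cats in doc_categories.items()]
    # Highest similarity first; ties broken by document id for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy scores; the category names are only placeholders.
query = {"Neoplasms": 0.9, "Heart Diseases": 0.1}
docs = {
    "en1": {"Neoplasms": 0.8},
    "en2": {"Heart Diseases": 1.0},
    "en3": {"Virus Diseases": 0.7},
}
ranked = rank_documents(query, docs)
```

Because the two languages never meet at the word level, retrieval quality in this setup depends entirely on the accuracy of the two classifiers.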
Provided:

instructor: Yang
Apply the idea of boosting to multiple classification algorithms including Decision Tree (DTree), Naive Bayes (NB), k-Nearest Neighbor (kNN) and Support Vector Machines (SVM). Evaluate the effect of boosting on individual classifiers using different data collections, including Reuters-21578, 20-NG (by Mitchell's group) and TDT1 corpora (or your favorite collection). Analyze the effects of boosting with respect to the statistical properties of training data (e.g., the training-set frequencies of categories) or the inductive bias of classifiers, if any inference can be made based on empirical evidence. For most of the above classifiers you can use publicly available code, with one exception: you will need to write your own DTree code, which should be efficient for sparse vector representations of documents (with pseudo code and complexity analysis).
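As a starting point, the core boosting loop can be sketched with discrete AdaBoost for a single binary category. The weak learner here is a one-term decision stump, a stand-in for the full DTree you are asked to write, and it scans every term over every document, so it is not the efficient sparse implementation the project requires. All names and the set-of-terms document format are assumptions.

```python
import math


def train_stump(docs, labels, weights):
    """Weak learner: the single term-presence test with the lowest weighted error.

    docs are sets of terms; labels are +1 or -1. Returns (error, term, polarity),
    where the stump predicts `polarity` if the term is present, else `-polarity`.
    """
    best = None
    vocab = set().union(*docs)
    for term in sorted(vocab):
        for polarity in (1, -1):
            err = sum(w for d, y, w in zip(docs, labels, weights)
                      if (polarity if term in d else -polarity) != y)
            if best is None or err < best[0]:
                best = (err, term, polarity)
    return best


def adaboost(docs, labels, rounds=5):
    """Discrete AdaBoost over decision stumps; returns [(alpha, term, polarity)]."""
    n = len(docs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, term, polarity = train_stump(docs, labels, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, term, polarity))
        # Increase the weight of misclassified documents, decrease the rest.
        weights = [w * math.exp(-alpha * y * (polarity if term in d else -polarity))
                   for d, y, w in zip(docs, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, doc):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (polarity if term in doc else -polarity)
                for alpha, term, polarity in ensemble)
    return 1 if score >= 0 else -1


# Toy data: positive if the document contains "a" or "b"; no single stump fits it.
docs = [{"a"}, {"b"}, {"a", "b"}, {"c"}, {"d"}, {"c", "d"}]
labels = [1, 1, 1, -1, -1, -1]
model = adaboost(docs, labels, rounds=3)
```

Replacing `train_stump` with another weighted learner (NB, kNN, SVM) leaves the reweighting loop unchanged, which is what makes a per-classifier comparison of boosting straightforward.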
Provided:

instructor: Yang
Implement a feature selection algorithm based on different criteria including information gain, chi-squared statistic and document frequency (Yang & Pedersen, ICML'97). Test the effect of feature selection on multiple classifiers including Decision Tree (DTree), Naive Bayes (NB), k-Nearest Neighbor (kNN) and Support Vector Machines (SVM) on different data collections including Reuters-21578, 20-NG (by Mitchell's group) and TDT1 corpora (or your favorite collection). Analyze observations and conclude your findings with respect to effectiveness and efficiency improvements (if any).
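For one of these criteria, the chi-squared statistic of a term t and a category c can be computed from a 2x2 contingency table: A documents contain t and belong to c, B contain t but do not belong to c, C belong to c but do not contain t, and D do neither. The sketch below ranks terms for one category; it assumes documents are given as sets of terms with one label each, and the function names are illustrative.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-squared statistic from the 2x2 term/category contingency table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def rank_terms(docs, labels, category):
    """Rank all terms by chi-squared with respect to one category.

    docs: list of sets of terms; labels: one category name per document.
    """
    n = len(docs)
    n_cat = sum(1 for y in labels if y == category)
    # Document frequency of each term overall and within the category.
    df = Counter(t for d in docs for t in d)
    df_cat = Counter(t for d, y in zip(docs, labels) if y == category for t in d)

    scores = {}
    for term, total in df.items():
        a = df_cat[term]      # term present, in category
        b = total - a         # term present, not in category
        c = n_cat - a         # term absent, in category
        d = n - total - c     # term absent, not in category
        scores[term] = chi_square(a, b, c, d)
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "match"}, {"stock", "bank"}]
labels = ["sports", "sports", "finance", "finance"]
ranked = rank_terms(docs, labels, "sports")
```

The `df` counter already gives the document-frequency criterion for free, so the same pass over the data can feed two of the three selection methods.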
Provided:

instructor: Yang
Implement a web interface on top of our clustering system (Group Average Clustering or GAC) to support Scatter/Gather-style navigation. Design the user interaction and define the evaluation for a user study. Try it with a few users on the TDT corpora, or on subsets from the Web as a further challenge. Analyze the properties of these algorithms (clustering bias, scalability, etc.), and compare the quality and usefulness of the resulting hierarchies and clusters. Summarize your findings.
Provided (Tom Pierce needs to complete this) :

instructor: Yang
You are given a text summarizer (sentence-based or keyword-based) which works for English documents. Adapt this summarizer to one or more languages (Spanish, Chinese, Japanese, etc.) that you know. You need to incorporate language-specific properties (e.g., syntactic clues for sentence segmentation or name/phrase identification, statistics like TF-IDF and so forth) in your system design and optimization.
To examine the effectiveness, you need to explore possible measures and suggest an evaluation method. We suggest that you use a parallel text that has English as one half, and compare the summaries in English with those in the other language(s). We have access to several parallel corpora, including the TDT collections (containing English documents and their Chinese translations by SYSTRAN), the UNICEF subset of the United Nations Multilingual Corpus (English and Spanish parallel) and the NACSIS corpus (English and Japanese).
Provided:

instructor: Yang
Design and implement your own text summarizer built on top of a given clustering system (GACINCR) that generates clusters on-the-fly given a query. The summarization should extract sentences from the documents in the cluster based on the following criteria:
Provided:

instructor: Yang
Problem definition: Detect news events described by documents in two languages -- English and Chinese. This task can be reduced to two subtasks: document clustering and cross-language information retrieval (CLIR). For example, one can first cluster the documents in English and Chinese respectively, then establish a mapping between the English clusters and the Chinese clusters using a CLIR system (corpus-based, dictionary-based or MT-based). We provide a clustering system that works on both English and Chinese documents on the TDT corpora (TDT3). We will also provide corpus-based CLIR systems that were tested in English-Spanish and English-Japanese retrieval. To make these CLIR methods work with English-Chinese, you need to write (or adapt an existing) HTML parser and document alignment algorithm for the extraction of parallel document pairs from the TDT3 multilingual corpus (not aligned), and use the resulting corpus as the training set in CLIR. You also need to evaluate the effectiveness of your approach.
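The cluster-mapping step could be prototyped as below. The sketch assumes a function `cross_sim(en_doc, zh_doc)` that returns a cross-language similarity score from whichever CLIR system you use; that function, the average-link scoring and the threshold are hypothetical choices made for illustration, not part of the provided systems.

```python
def map_clusters(en_clusters, zh_clusters, cross_sim, threshold=0.0):
    """Map each English cluster to its most similar Chinese cluster.

    en_clusters, zh_clusters: {cluster_id: [doc, ...]}.
    cross_sim: hypothetical CLIR scoring function for an (English, Chinese) pair.
    A cluster maps to None if no candidate scores above the threshold.
    """
    mapping = {}
    for en_id, en_docs in en_clusters.items():
        best_id, best_score = None, threshold
        for zh_id, zh_docs in zh_clusters.items():
            pairs = [(e, z) for e in en_docs for z in zh_docs]
            if not pairs:
                continue  # skip empty clusters
            # Average similarity over all cross-language document pairs.
            score = sum(cross_sim(e, z) for e, z in pairs) / len(pairs)
            if score > best_score:
                best_id, best_score = zh_id, score
        mapping[en_id] = best_id
    return mapping


# Toy similarity table standing in for a real CLIR system.
sims = {("e1", "z1"): 0.8, ("e2", "z1"): 0.6, ("e3", "z2"): 0.7, ("e3", "z3"): 0.9}


def toy_sim(en_doc, zh_doc):
    return sims.get((en_doc, zh_doc), 0.0)


en_clusters = {"E1": ["e1", "e2"], "E2": ["e3"]}
zh_clusters = {"Z1": ["z1"], "Z2": ["z2", "z3"]}
mapping = map_clusters(en_clusters, zh_clusters, toy_sim)
```

Scoring all document pairs is quadratic in cluster size; comparing cluster centroids instead is a cheaper alternative worth evaluating.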
Provided:

instructor: Yang
Apply your favorite classification method (SVM, kNN, Naive Bayes, etc.) to the categorization of "Yahoo!" web pages (normal web pages or those for eCommerce products in the "Yahoo! Shopping" section). Investigate whether a hierarchical organization of classifiers can improve the effectiveness and/or scalability, compared to using non-hierarchical classification. You can use any publicly available classifiers (SVM, kNN, Naive Bayes, for example) as the core engine. However, you need to write procedures to collect training data from "Yahoo!" and to parse web pages into a required format, and you need a scheme to combine the predictions made by multiple classifiers at different levels of the hierarchy. You also need to consider how frequently your system needs to be re-trained with up-to-date web pages, and make sure your choice of method is scalable for that reason. You should complete your evaluation on a relatively large test set of documents, for example, 10,000 pages with 500 categories. You should also randomize your selection of test documents and categories, so that the statistical distribution is representative of real-world applications, not of a substantially simplified subset (for example, consisting of the 100 most common categories only). The scale-up challenge is relatively new in the TC literature, requiring substantial work for algorithm implementation and integration. It is OK to approach this as a team effort, with each member working on some components, such as:

Yiming Yang (yiming@cs.cmu.edu)