Chenmin Liang's Homepage

Master's Student
Language Technologies Institute
SCS (School of Computer Science)
Carnegie Mellon University

Office: 4626 Newell Simon Hall
Carnegie Mellon University
Pittsburgh, PA 15213, USA

Email: chenminl AT cs.cmu.edu

My advisors are Professor Jaime Carbonell and Ralf Brown.

publications | projects | blog
This site is still under construction. You are welcome to come back later for more information. Last update: Feb. 18, 2009

Introduction

I am a second-year Master's student in the Language Technologies Institute (LTI), School of Computer Science (SCS), CMU. My research interests are in Information Retrieval, Machine Learning, Machine Translation, Natural Language Processing, and Knowledge Management.

Selected Projects

	Link analysis for spam filtering: Many spam web pages form link farms, linking to each other in order to obtain a high PageRank value. We tried to solve this by assigning spam values to web pages and semi-automatically selecting potential spam pages. We first manually select a small set of spam pages as seeds. Then, based on the link structure of the web, the initial R-SpamRank values assigned to the seed pages propagate through links and are distributed over the whole set of web pages. After the pages are sorted by their R-SpamRank values, those with high values are selected.
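As a rough illustration of the propagation step (not the project code), the sketch below spreads spam values from a few seed pages over a toy link graph. The function name, the damping factor, the number of iterations, the direction of propagation (a page inherits spam value from the pages it links to) and the normalization by in-link count are all assumptions made for this example.

```python
def propagate_spam_scores(out_links, seeds, damping=0.85, iterations=20):
    """Spread spam values from seed pages over a link graph.

    out_links maps each page to the pages it links to.
    All parameters and the update rule are illustrative assumptions.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Count how many links point at each page.
    in_degree = {page: 0 for page in pages}
    for targets in out_links.values():
        for target in targets:
            in_degree[target] += 1

    # Seeds start with value 1, every other page with 0.
    seed_value = {page: (1.0 if page in seeds else 0.0) for page in pages}
    score = dict(seed_value)

    for _ in range(iterations):
        new_score = {}
        for page in pages:
            # A page that links to spammy pages inherits part of their value.
            inherited = sum(score[t] / in_degree[t] for t in out_links.get(page, ()))
            new_score[page] = (1 - damping) * seed_value[page] + damping * inherited
        score = new_score
    return score


toy_links = {
    "farm_a": ["farm_b"],
    "farm_b": ["farm_a"],
    "booster": ["farm_a"],
    "news": ["blog"],
    "blog": ["news"],
}
scores = propagate_spam_scores(toy_links, seeds={"farm_a"})
# Pages sorted from most to least suspicious.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph, pages that are not connected to the seed keep a value of zero, so only the pages around the link farm rise in the ranking.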

	Collaborative filtering: How can we predict how much a user will like an item? For example, on a rating scale of 1-5, which score is the user most likely to assign to the item? Collaborative filtering predicts user preferences for items by learning from past user-item relationships. A predominant approach to collaborative filtering is neighborhood-based (such as "k-nearest neighbors"), where a user-item preference rating is interpolated from the ratings of similar items and/or users.
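A minimal sketch of the neighborhood idea, assuming a user-based variant with cosine similarity over co-rated items. The function names, the toy ratings and the choice of similarity measure are assumptions for illustration, not the method used in the project.

```python
from math import sqrt


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item, k=2):
    """Similarity-weighted average of the k most similar users' ratings."""
    neighbors = []
    for other, other_ratings in ratings.items():
        if other != user and item in other_ratings:
            sim = cosine_similarity(ratings[user], other_ratings)
            if sim > 0:
                neighbors.append((sim, other_ratings[item]))
    neighbors.sort(reverse=True)  # most similar users first
    top = neighbors[:k]
    if not top:
        return None  # nobody comparable has rated this item
    return sum(sim * r for sim, r in top) / sum(sim for sim, _ in top)


# Hypothetical ratings on a 1-5 scale.
toy_ratings = {
    "ann": {"m1": 5, "m2": 4},
    "bob": {"m1": 5, "m2": 4, "m3": 2},
    "cat": {"m1": 5, "m2": 4, "m3": 4},
}
print(predict_rating(toy_ratings, "ann", "m3"))
```

An item-based variant would follow the same pattern with the roles of users and items swapped.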

	Relevance Feedback: Have you noticed that when you search with keywords, you sometimes don't get the most relevant documents? There can be a gap between the keywords and the user's real information need. By learning from the user's feedback, we can either adjust the weights of the terms in the original query or add more words to the query.
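One classic way to do both (reweight query terms and add new ones) is a Rocchio-style update. The sketch below is only an illustration of that technique, not necessarily what this project used; the function name, the weights alpha, beta and gamma, and the toy term vectors are assumptions.

```python
from collections import Counter


def rocchio_update(query, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return a new query vector from feedback documents.

    query and every document are dicts mapping term -> weight.
    The alpha/beta/gamma values are common illustrative defaults.
    """
    updated = Counter()
    for term, weight in query.items():
        updated[term] += alpha * weight
    # Move the query toward the documents the user marked as relevant.
    for doc in relevant_docs:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant_docs)
    # Move it away from the documents marked as non-relevant.
    for doc in nonrelevant_docs:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(nonrelevant_docs)
    # Terms whose weight is not positive are dropped from the query.
    return {term: weight for term, weight in updated.items() if weight > 0}


new_query = rocchio_update(
    {"jaguar": 1.0},
    relevant_docs=[{"jaguar": 1.0, "cat": 2.0}],
    nonrelevant_docs=[{"jaguar": 1.0, "car": 3.0}],
)
print(new_query)
```

Here the term "cat" enters the query because it occurs in the relevant document, while "car" is kept out because it only occurs in the non-relevant one.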

	Document Ranking: Given a query, how do we rank documents by their similarity to the query? How do we build a prototype system for this from scratch? Besides keywords, what other kinds of queries can we input? I built and played with this mini system to explore these questions.
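A small sketch of one standard way to rank documents against a keyword query, using TF-IDF weights and cosine similarity. This is an assumed setup for illustration (whitespace tokenization, smoothed IDF, hypothetical function names), not a description of the mini system itself.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (an illustrative choice).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document indices, most similar to the query first."""
    vectors, idf = tfidf_vectors(docs)
    query_tf = Counter(query.lower().split())
    # Query terms that never occur in the collection are ignored.
    query_vec = {t: tf * idf[t] for t, tf in query_tf.items() if t in idf}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


toy_docs = [
    "machine translation of chinese text",
    "spam filtering with link analysis",
    "chinese word segmentation for machine translation",
]
print(rank("link analysis spam", toy_docs))
```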

	Bioinformatics Database System: Protein-protein interaction databases use various database structures and adopt different naming schemes for proteins. Moreover, they are not linked to each other. This work defined an efficient database structure for integrating data from different databases, to facilitate searching and using protein-protein interaction data.

	Ontology Integration Model for Name Expansion: Dictionary-based name expansion methods are widely used for biomedical terms. However, dictionaries don't consider the relationships between different terms, while an ontology can represent them. An ontology is very helpful, because people usually don't use the exact name of an item, but a name that covers a broader or narrower scope. For example, ApoE has three alleles (ApoE2, ApoE3 and ApoE4), and ApoE4 is the major genetic risk factor for Alzheimer's Disease. However, in the medical literature, people usually write ApoE instead of ApoE4. With an ontology, we can locate the position of ApoE in the ontology tree and expand it further to get ApoE4.
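The expansion step in the ApoE example can be sketched as a walk down an ontology tree. The tiny tree below, its parent node and the function name are hypothetical and only serve to illustrate the idea.

```python
def expand_name(ontology, name):
    """Return the name followed by all narrower terms below it.

    ontology maps a term to the list of its direct children.
    """
    expanded, seen = [], set()
    stack = [name]
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # guard against repeated or cyclic entries
        seen.add(term)
        expanded.append(term)
        # Push children in reverse so they are visited in listed order.
        stack.extend(reversed(ontology.get(term, [])))
    return expanded


# Hypothetical fragment of an ontology tree.
toy_ontology = {
    "apolipoprotein": ["ApoE"],
    "ApoE": ["ApoE2", "ApoE3", "ApoE4"],
}
print(expand_name(toy_ontology, "ApoE"))
```

A query for ApoE expanded this way also covers documents that mention ApoE4, which a flat dictionary lookup would not connect.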

	Feature Selection & AdaBoost Machine Learning for Protein-Protein Interaction Verification: Traditional protein-protein interaction detection methods produce experimental results quite slowly. With the high-throughput experimental methods developed in recent years, biologists now generate protein-protein interaction data much faster. However, the accuracy of high-throughput data is not high, and the data need to be verified before further use. Based on data from multiple sources, we selected a set of features and used the AdaBoost algorithm to verify protein-protein interaction data.
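For readers unfamiliar with AdaBoost, here is a compact sketch with decision stumps as weak learners. The two features and the labels are made up for illustration (they are not the features selected in the project), and the function names are hypothetical.

```python
import math


def train_adaboost(X, y, rounds=5):
    """Train AdaBoost with threshold stumps.

    X is a list of feature lists, y a list of labels (+1 or -1).
    Returns a list of (alpha, feature, threshold, polarity) stumps.
    """
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for threshold in sorted({row[f] for row in X}):
                for polarity in (1, -1):
                    preds = [polarity if row[f] >= threshold else -polarity
                             for row in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, f, threshold, polarity, preds)
        err, f, threshold, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, threshold, polarity))
        # Increase the weight of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * t * p)
                   for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, row):
    """Weighted vote of all stumps; returns +1 or -1."""
    score = sum(alpha * (pol if row[f] >= thr else -pol)
                for alpha, f, thr, pol in model)
    return 1 if score >= 0 else -1


# Made-up features per candidate interaction: [evidence score, same location flag].
toy_X = [[0.9, 1], [0.8, 0], [0.2, 1], [0.1, 0]]
toy_y = [1, 1, -1, -1]  # +1 = verified interaction, -1 = likely false positive
toy_model = train_adaboost(toy_X, toy_y)
print(predict(toy_model, [0.85, 0]))
```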

	Image Processing: Many cool machine learning algorithms can be applied to image processing and pattern recognition. Which algorithm should we adopt for a specific problem? How do we perform dimensionality reduction and reconstruct an image? There are many interesting things to explore in this field.

	Chinese Segmenter: This project compares several Chinese segmenters and picks one to use and refine for our machine translation work. With a very large dictionary, we found that Joy's lrSegmenter performs very well. You can download Joy's original Perl version of the segmenter from Joy's Segmentation Evaluation Website. I have reimplemented the segmenter in Java, and you can download my code from Java version of lrSegmenter.