Research Projects
Since ancient times, people have organized information into ontologies, also known as concept hierarchies or taxonomies. Ontologies are often detailed, task-independent, user-independent, long-lived data models, such as MeSH and WordNet, that represent and standardize sets of concepts and the relations among them. However, some situations call for lightweight, task-specific, and user-specific ontologies with short lifespans, which we call personal ontologies. For example, in lawsuits and regulatory reforms, lawyers or government employees must quickly organize large amounts of material into task-specific concept hierarchies that will later be discarded. Sophisticated ontologies in these situations may be unnecessary or may even create information overload. My thesis examines personal ontology learning. It focuses on creating lightweight personal ontologies that allow users to quickly understand the range of issues raised and to "drill down" into documents that discuss a specific topic. My work has been done in collaboration with my Ph.D. advisor, Professor Jamie Callan.
Specifically, my dissertation addresses how to construct ontologies from text collections automatically or with a human in the loop. It proposes a human-guided ontology learning framework that combines personalization, human-computer interaction, and on-the-fly machine learning to quickly construct personal ontologies (Yang & Callan, IEEE Intelligent Systems'09; Yang & Callan, DG.O'08; Yang & Callan, HCIR'08; Yang & Callan, CIKM'08). In this framework, periodic manual guidance directs machine learning toward personal preferences. Human and machine take turns organizing concepts into hierarchies. In each interaction cycle, the user makes only a few edits to organize concepts. The machine learns a customized Mahalanobis distance function from the relations indicated by the user's edits and quickly predicts which concepts this user would group next, and how. The user evaluates the changes made by the machine and edits or groups more concepts in the next cycle. The machine learns again, and the loop continues until the user is satisfied with the ontology. The machine's predictions not only save the user effort but also make sensible suggestions when the user knows little about a domain. This is the first work to apply on-the-fly machine learning to organizing personalized, task-specific information in an interactive paradigm.
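The core learning step of this loop can be illustrated with a small sketch. This is not the dissertation's actual algorithm, only a minimal illustration of the idea: pairs of concepts the user groups together define directions in feature space that should count as "close", and the learned Mahalanobis matrix down-weights exactly those directions. All names and feature values below are invented for illustration.

```python
import numpy as np

def learn_mahalanobis(X, must_link_pairs, eps=0.1):
    """Fit a simple Mahalanobis matrix M from user-indicated pairs.

    Directions that vary a lot within user-grouped pairs are treated as
    uninformative, so grouped concepts end up close under the new metric.
    (Toy closed-form sketch, not the dissertation's exact formulation.)
    """
    diffs = np.array([X[i] - X[j] for i, j in must_link_pairs])
    # Within-pair covariance; eps regularizes the inverse.
    cov = diffs.T @ diffs / len(diffs) + eps * np.eye(X.shape[1])
    return np.linalg.inv(cov)

def mahalanobis(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

# Hypothetical 2-D concept features; the user grouped concepts 0 and 1.
X = np.array([[0.0, 0.0],    # "mammal"
              [0.1, 2.0],    # "dog"  (grouped with "mammal" by the user)
              [2.0, 0.1]])   # "ozone"
M = learn_mahalanobis(X, [(0, 1)])
```

After learning, the grouped pair is much closer than the ungrouped one, so the machine would suggest merging similar pairs in the next cycle.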
Aside from interactive methods, automatic methods are also valuable for proposing initial ontologies. My thesis also studies an automatic metric-based ontology learning method that transforms ontology construction into a multi-criterion optimization problem (Yang & Callan, ACL'09). Specifically, it incrementally clusters concepts to minimize the evolution of the ontology structure, jointly optimizing objectives derived from models of concept abstractness and long-distance relations. This is the first work to model abstract concepts, such as "science" and "issues", differently from concrete concepts, such as "basketball" and "polar bears", producing more sensible ontologies. Moreover, this work represents the semantic distance between concepts with a wide range of features, each corresponding to a state-of-the-art ontology learning technique, such as lexico-syntactic patterns, contextual information, and co-occurrence. Using multiple features permits a further study of how features interact with different types of semantic relations, and with concepts at different abstraction levels. We found that co-occurrence and lexico-syntactic patterns are good features for is-a, sibling, and part-of relations, while contextual and syntactic features are good only for sibling relations. Contextual, co-occurrence, pattern, and syntactic features work well for concrete concepts, while only co-occurrence works well for abstract concepts (Yang & Callan, SIGIR'09).
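The incremental flavor of the method can be sketched in a few lines. This is a deliberately simplified toy, not the ACL'09 algorithm: per-feature distances (e.g. pattern, context, co-occurrence scores) are blended into one semantic distance, and each new concept is greedily attached under the existing node that adds the least edge length, a crude stand-in for the minimum-evolution criterion. All concept names and feature values are hypothetical.

```python
def combined_distance(feats_a, feats_b, weights):
    """Blend several per-feature distances into one semantic distance."""
    return sum(w * abs(feats_a[k] - feats_b[k]) for k, w in weights.items())

def grow_ontology(concepts, features, weights):
    """Incrementally attach each concept under the existing node that adds
    the least edge length (a greedy minimum-evolution-style step)."""
    parent = {concepts[0]: None}          # first concept is the root
    for c in concepts[1:]:
        best = min(parent, key=lambda p: combined_distance(
            features[c], features[p], weights))
        parent[c] = best
    return parent

# Hypothetical single co-occurrence feature per concept.
features = {"animal": {"cooc": 0.0},
            "dog":    {"cooc": 1.0},
            "poodle": {"cooc": 1.1}}
parent = grow_ontology(["animal", "dog", "poodle"], features, {"cooc": 1.0})
```

The real method additionally scores candidate positions with abstractness and long-distance-relation terms; here greediness alone decides.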
My work also studies user behavior during ontology construction, exploring whether people create ontologies more quickly or more consistently using the proposed framework, whether there are consistent dataset-specific or user-specific differences in the ontologies that people construct, whether people are self-consistent, and how these factors interact with the construction methods. The user study demonstrates that human-guided ontology learning is promising: it not only reduces time and effort as expected but also provides assistance when a user knows little about a domain. The study also uncovers interesting findings, for example that computer science (CS) majors and males tend to construct ontologies using a variety of features, while non-CS majors and females tend to rely on lexico-syntactic patterns alone to identify relations among concepts, and that females working on easier datasets with the interactive method are more self-consistent.
This work was recently submitted to SIGCHI 2011, and earlier results were published in IEEE Intelligent Systems 2009, ACL 2009, SIGIR 2009, DG.O 2008, and the CIKM 2008 Workshop on Ontology Learning.
Search Engine Training and Evaluation (Summer 2009, Internship at Microsoft)
The accuracy of a learned model depends on both the quality of the training labels and the number of training examples. As expected, the higher the quality of the labels and the more numerous the examples, the better the accuracy of the learned model. I spent the summer of 2009 working at Microsoft Research and Bing, where I proposed a new method to improve data quality and search engine accuracy (Yang et al., SIGIR'10). My work explores whether, when, and for which data points one should obtain multiple expert training labels, as well as what to do with the labels once they have been obtained. Collecting multiple overlapping labels only for the subset of training samples already labeled relevant is far more effective than blindly relabeling all training samples. This selective labeling scheme yields higher-quality labels and improves the accuracy of several learning-to-rank models by gathering more opinions from different judges on exactly the samples that most need to be noise-free. The proposed labeling scheme is currently employed by the Bing search engine.
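The consolidation step of selective repeated labeling can be sketched as follows. This is a toy illustration under my own naming, not Bing's production pipeline: only samples judged relevant on the first pass receive extra judgments, and a majority vote over all judgments decides each final label.

```python
from collections import Counter

def consolidate_labels(first_pass, extra_labels):
    """Selective repeated labeling: samples judged 'relevant' on the first
    pass get extra judgments; a majority vote resolves each of them."""
    final = {}
    for doc, label in first_pass.items():
        if label == "relevant" and doc in extra_labels:
            votes = Counter([label] + extra_labels[doc])
            final[doc] = votes.most_common(1)[0][0]
        else:
            final[doc] = label            # single label is kept as-is
    return final

# Hypothetical judgments for three query-document pairs.
first = {"q1-d1": "relevant", "q1-d2": "irrelevant", "q1-d3": "relevant"}
extra = {"q1-d1": ["relevant", "relevant"],
         "q1-d3": ["irrelevant", "irrelevant"]}
final = consolidate_labels(first, extra)
```

Here the extra judges confirm q1-d1 but overturn q1-d3, so a noisy positive label is corrected without relabeling the (far larger) irrelevant set.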
This work has been published in SIGIR 2010.
Sentiment Detection and Opinion Detection, Carnegie Mellon University (May 2006 - Aug 2006)
Sentiment and opinion detection is challenging due to the richness of natural language. Often framed as a classification task, it determines the polarity of a given text at the document or sentence level. The task first appeared in the TREC'06 Blog Track, where the domain of interest was blog posts. At that time no training data was available for blogs, so my research focused on transfer learning for sentiment and opinion detection: the training documents are movie and product reviews, while the testing documents are blog posts. Common linguistic features and statistical language features in the training data are captured by a non-diagonal prior covariance matrix and used as shared knowledge to build informative priors for a Gaussian logistic regression model (Yang et al., TREC'06).
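The modeling idea can be sketched as MAP logistic regression with a Gaussian prior. This is a minimal illustration, not the TREC'06 system: a prior covariance (which in the real work is non-diagonal and estimated from the source-domain reviews) enters the gradient through its inverse, shrinking the weights toward behavior learned on the source domain. The data here is a tiny invented example with an isotropic prior.

```python
import numpy as np

def fit_logreg_gaussian_prior(X, y, prior_cov, lr=0.1, steps=500):
    """MAP logistic regression with a zero-mean Gaussian prior N(0, prior_cov).

    A non-diagonal prior_cov couples related features, which is how
    source-domain (review) knowledge transfers to the target domain (blogs).
    """
    prior_prec = np.linalg.inv(prior_cov)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
        grad = X.T @ (p - y) + prior_prec @ w     # NLL gradient + prior term
        w -= lr * grad / len(y)
    return w

# Toy separable data; the prior simply regularizes here.
X = np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = fit_logreg_gaussian_prior(X, y, prior_cov=10.0 * np.eye(2))
preds = 1.0 / (1.0 + np.exp(-X @ w)) > 0.5
```

With a covariance estimated from reviews instead of the identity, correlated sentiment features would be jointly shrunk, which is the transfer mechanism.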
This work participated in the TREC 2006 Blog Track evaluation and was published in TREC 2006.
Near-Duplicate Detection in eRulemaking, Carnegie Mellon University (2004-2007)
U.S. regulatory agencies are required to solicit and read every public comment on a proposed rule. To reduce the human effort in the rulemaking process, we developed near-duplicate detection via a semi-supervised clustering approach that flexibly incorporates constraints into the clustering process to achieve better clustering accuracy.
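A toy sketch of constraint-aware clustering, under invented names and a simple Jaccard similarity, shows how instance-level constraints steer the grouping: cannot-link pairs veto a cluster, must-link pairs force one, and otherwise plain similarity decides. This greedy single pass is far simpler than the actual semi-supervised method.

```python
def cluster_with_constraints(docs, similar, must_link, cannot_link):
    """Greedy single-pass clustering with instance-level constraints:
    must-link pairs always share a cluster; cannot-link pairs never do."""
    clusters = []
    for d in docs:
        placed = False
        for c in clusters:
            if any((d, m) in cannot_link or (m, d) in cannot_link for m in c):
                continue                      # constraint vetoes this cluster
            if any((d, m) in must_link or (m, d) in must_link for m in c) \
               or any(similar(d, m) for m in c):
                c.append(d)
                placed = True
                break
        if not placed:
            clusters.append([d])
    return clusters

def similar(a, b):
    """Token-level Jaccard similarity threshold (toy near-duplicate test)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) >= 0.6

# Invented public comments: the first two are near-duplicate form letters.
docs = ["please withdraw this rule",
        "please withdraw this rule now",
        "i fully support the rule"]
clusters = cluster_with_constraints(docs, similar, set(), set())
```

Adding a cannot-link constraint between the two form letters would split them despite their high similarity, which is how analyst feedback overrides the metric.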
This work was reported in Digital News Journal (Aug 2004) and was published in SIGIR 2006, DG.O 2006, and DG.O 2005.
Multimedia Information Retrieval, Carnegie Mellon University (2004), National University of Singapore (2003-2004)
A news video collection contains thousands of hours of video, combining text scripts, audio, images, and video sequences. To find video sequences that match a user query, the system applies text analysis, audio analysis, speech recognition, and image processing. The work compares uni-modal, multi-modal, and multi-concept classifiers for feature extraction, and explores both visual-only and multi-modal video features in the search process.
This work participated in the TRECVID 2003 and TRECVID 2004 evaluations, winning first place (National University of Singapore) and second place (Carnegie Mellon University). It was published in TRECVID 2003 and TRECVID 2004.
Question Answering, National University of Singapore (2001-2004)
My research in QA centered on the tasks and evaluations initiated by TREC. The TREC QA Track addresses open-domain factoid, list, and definitional questions. The event-based question answering approach that Professor Tat-Seng Chua and I proposed exploits general ontologies and external resources, such as WordNet glosses and synonyms and search result snippets, to gather additional world knowledge about the question-answer event in which the answer lies. The constraints imposed by this additional knowledge enable more effective passage retrieval and answer extraction (Yang & Chua, TREC'02; Yang & Chua, EACL'03; Yang et al., TREC'03; Yang et al., SIGIR'03; Yang et al., WWW'03; Yang & Chua, COLING'04; Yang & Chua, SIGIR'04). The system participated in TREC'02, '03, and '04, and consistently won second place in the TREC QA competitions among systems from all over the world.
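The retrieval side of the idea can be sketched in miniature. The hand-written synonym table below is a placeholder for knowledge mined from WordNet glosses and search snippets, and the overlap scoring is a stand-in for the real passage retrieval model; none of this reproduces the actual system.

```python
# Toy table standing in for WordNet glosses / snippet-derived event terms.
SYNONYMS = {"erupt": {"eruption", "explode"}, "volcano": {"vesuvius"}}

def expand(terms):
    """Add event-related terms gathered from external resources."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    return expanded

def rank_passages(question_terms, passages):
    """Rank passages by overlap with the expanded question-event terms."""
    q = expand(set(question_terms))
    return sorted(passages, key=lambda p: -len(q & set(p.split())))

passages = ["vesuvius eruption buried pompeii",
            "the senate passed a law"]
ranked = rank_passages(["volcano", "erupt"], passages)
```

Without expansion, neither passage shares a literal question term; the event knowledge is what surfaces the correct passage.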
This work was published in SIGIR 2003, TREC 2002, TREC 2003, COLING 2004, EACL 2003, WWW 2003, and SIGIR 2004.
VideoQA: Question Answering on News Video, National University of Singapore (2003-2004)
Question Answering for Video (VideoQA; Yang et al., ACM Multimedia'03) extends my QA research from text to multimedia, in particular news video. News video collections usually contain thousands of hours of video, combining text scripts, audio, images, and video sequences. VideoQA answers short natural language questions with implicit constraints on the content, context, duration, and genre of the expected video segments, returning short, precise video summaries as answers. It takes advantage of visual, audio, textual, and external-resource features to correct speech recognition errors and locate precise answers.
This work was published in ACM MM 2003.
Online Streaming Video Broadcasting and Recording (2000-2001)
This work set up an online video station by capturing the analogue video signals broadcast by local television stations. It converts the analogue signals into digital signals and lets users view and switch between stations. A video recording feature further demonstrates the system's potential as a ready-to-be-commercialized product. The main research effort was in synchronizing speech audio with video.
This work was an undergraduate final-year research project, which received a perfect score of 100/100 in the project evaluation.