
LTI Seminar Abstracts Fall 2000

Sep 18, 2000 -- Vanathi Gopalakrishnan, Department of Medicine, University of Pittsburgh
Parallel Experiment Planning, Macromolecular Crystallization and Computational Biology

This talk describes a Parallel Experiment Planning (PEP) framework developed from an abstraction of how experiments are performed in macromolecular crystallization. The goal in this domain is to find at least one set of experimental conditions that produces a good quality crystal of a macromolecule. The parameter space to be searched is large, and crystallographers typically resort to trial-and-error experimentation. Experimentation is also tedious, since several experiments have to be set up in parallel in order to search efficiently. Moreover, resources such as protein are scarce or expensive to obtain and purify.

The PEP system developed in this work (1) provides a computational representation and a set of tools to manage information about parallel experiments (or trials), and (2) provides intelligent assistance for decision-making by suggesting promising regions of the search space for new trials and identifying regions unlikely to yield results so that they can be closed. The sufficiency of the PEP framework is demonstrated with a prototype implementation of the complex environment that supports parallel experiment planning. This prototype PEP system is evaluated in experiments with human decision-making agents who manipulate the environment to achieve their goals. The framework and system are evaluated with respect to usability, completeness or sufficiency, utility, time complexity, and flexibility.

This work may also be viewed as a form of reinforcement learning, in which an agent tries to learn a policy that achieves its goals by manipulating a complex environment and receiving feedback from it. Much of the prior work on reinforcement learning in artificial intelligence uses only real-valued feedback or reward from sequential probes of the environment. Within the framework developed here, an agent receives more than one type of feedback, including a set of symbolic rules, and receives feedback from multiple probes of (copies of) the same environment. The symbolic feedback describes the boundary between observed classes of partial results, which can be used to prune the vast search space, and is therefore more useful than point values.
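As a very rough illustration of this idea, and not the PEP system itself, the following Python sketch plans batches of parallel trials over a toy two-parameter space. The run_trial function, the parameter ranges, and the rule it returns are hypothetical stand-ins for a crystallization experiment; the point is that symbolic feedback, modeled here as predicates over the parameter space, lets the planner close off regions rather than merely record point rewards.

```python
# Hypothetical sketch of parallel experiment planning with symbolic feedback.
# Nothing here comes from the actual PEP system; parameter names and ranges
# are invented for illustration.
import random

def run_trial(params):
    """Stand-in for one crystallization trial: returns (reward, symbolic_rule)."""
    ph, salt = params
    reward = 1.0 if 6.0 <= ph <= 7.5 and salt < 0.4 else 0.0
    rule = None
    if reward == 0.0 and ph > 8.0:
        # Symbolic feedback: a predicate describing a region observed to fail.
        rule = lambda p: p[0] > 8.0
    return reward, rule

def plan_parallel_trials(n_batches=20, batch_size=8, seed=0):
    random.seed(seed)
    pruned = []                                   # accumulated symbolic rules
    for _ in range(n_batches):
        # Sample a batch of candidate trials, skipping regions already closed.
        batch = []
        while len(batch) < batch_size:
            cand = (random.uniform(3.0, 10.0), random.uniform(0.0, 1.0))
            if not any(rule(cand) for rule in pruned):
                batch.append(cand)
        # Probe copies of the environment "in parallel" (sequential here).
        for params in batch:
            reward, rule = run_trial(params)
            if reward > 0.0:
                return params                     # one good condition suffices
            if rule is not None:
                pruned.append(rule)
    return None

if __name__ == "__main__":
    print(plan_parallel_trials())
```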

The talk also describes the motivation for this work and how it fits within the goals of the emerging discipline of computational biology.


Sep 22, 2000 -- Chris Manning, Stanford University
Probabilistic Head-driven Parsing

Two central intuitions within constraint-based head-driven parsing are that a head is a locus of constraining information, and that around the head is an "island of certainty" -- verb arguments, adjectival modifiers, and the like. While the first intuition is adequately captured by the kind of head percolation commonly used in statistical parsing models, the second is not. This talk discusses "bottom-up" generative statistical models, which work outwards from heads. Among the results: this conception of parsing can easily make the model immune to some fairly arbitrary aspects of tree construction while simultaneously capturing some of the length-based preference effects that have been studied more in functional approaches to linguistics, and the model has a cleaner interpretation in dependency-grammar terms.
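As a toy illustration of generating structure outwards from a head, and not the specific model presented in the talk, the sketch below draws left and right dependents for a head word until a stop symbol is sampled; the probability table and vocabulary are invented.

```python
# Toy head-outward generation: dependents are generated to each side of the
# head until STOP is drawn. The probability table below is invented and is
# not taken from the talk.
import random

DEP_PROBS = {
    ("ate", "left"):  {"kids": 0.7, "STOP": 0.3},
    ("ate", "right"): {"pizza": 0.5, "quickly": 0.2, "STOP": 0.3},
}

def sample(dist):
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r <= acc:
            return item
    return "STOP"

def generate_dependents(head):
    tree = {"head": head, "left": [], "right": []}
    for side in ("left", "right"):
        while True:
            dep = sample(DEP_PROBS.get((head, side), {"STOP": 1.0}))
            if dep == "STOP":
                break
            tree[side].append(dep)
    return tree

if __name__ == "__main__":
    random.seed(1)
    print(generate_dependents("ate"))
```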

Christopher Manning is assistant professor of computer science and linguistics at Stanford University. He previously held faculty positions at Carnegie Mellon University and the University of Sydney. His research interests include statistical models of language, syntax, information extraction, and computational lexicography. He is co-author of Foundations of Statistical Natural Language Processing (MIT Press, 1999).


Sep 29, 2000 -- Abdelhadi Soudi, Visiting Researcher, LTI
A Computational Lexeme-based Treatment of Arabic Morphology

This talk is in part a sequel to Aronoff's (1994) treatment of Hebrew verb morphology, which analyzes the binyanim as inflectional classes, and offers further support for the conclusion reached there, namely that the analysis can be extended to other Semitic languages. We believe that Arabic conjugations are inflectional classes in that they determine the forms of verb stems. We use lexeme-based morphological theory to represent the linguistic resources, and Morphe as a tool to implement them. We argue that such a morphological theory captures generalizations in the Arabic morphological system. We claim that rules of referral are suitable for capturing generalizations about the syncretism exhibited in stems as well as in inflectional affixes. The adequacy of the linguistic analysis is reflected in the results obtained at the implementational/computational level, namely improved space efficiency and maintainability of the system.

Abdelhadi Soudi is a "Professeur Assistant" at the Ecole Nationale de l'Industrie Minerale, in the Center for Languages and Communication and the Computer Science Department. This is his third visit to CMU as a Fulbright visiting researcher. He has worked with an LTI team of researchers and developed an Arabic morphological generator using MORPHE, a tool for modeling morphology based on discrimination trees and regular expressions. He has also been exploring English-to-Arabic machine translation using the KANT system.


Oct 27, 2000 -- Michael Kohlhase, Visiting Researcher, LTI
Model Generation as a Model for Inference in Natural Language Understanding

This talk is about inference processes in and for natural language understanding. However, instead of looking at a formalization of inference as a deductive or abductive process in first-order logic, I will present a variant of automated theorem proving called model generation and use this inference technique as a basis for semantics-based NL understanding.

I will argue for the necessity and potential of inference-based processes in NL understanding and show applications of the basic inference techniques in first-order, higher-order and dynamic logics. Subsequently, I will refine the model generation framework for the linguistic application by introducing resource-sensitivity and saliences into the computation. This makes it possible to circumvent some of the shortcomings (e.g. monotonicity, worst-case complexity, or omniscience) of current inference-based approaches.
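To give a concrete, if drastically simplified, picture of model generation (this is not Dr. Kohlhase's system), the sketch below encodes a tiny discourse as propositional clauses over invented atoms, enumerates the satisfying interpretations, and keeps only the minimal models.

```python
# Minimal propositional model generation. A discourse like "A man walks.
# He whistles." is encoded with hypothetical atoms; the clauses and atoms
# are invented for illustration.
from itertools import product

ATOMS = ["man", "walks", "whistles"]
# Each clause is a list of (atom, polarity) literals; a clause is satisfied
# if at least one literal holds under the interpretation.
CLAUSES = [
    [("man", True)],                          # there is a man
    [("man", False), ("walks", True)],        # man -> walks
    [("walks", False), ("whistles", True)],   # walks -> whistles (anaphora resolved)
]

def satisfies(interp, clause):
    return any(interp[atom] == polarity for atom, polarity in clause)

def models():
    for values in product([False, True], repeat=len(ATOMS)):
        interp = dict(zip(ATOMS, values))
        if all(satisfies(interp, clause) for clause in CLAUSES):
            yield interp

def minimal_models():
    ms = list(models())
    def true_atoms(m):
        return {a for a, v in m.items() if v}
    # Keep models whose set of true atoms is not a strict superset of another's.
    return [m for m in ms
            if not any(true_atoms(n) < true_atoms(m) for n in ms)]

if __name__ == "__main__":
    for m in minimal_models():
        print(m)
```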

Dr. Michael Kohlhase is a visiting researcher at the School of Computer Science at Carnegie Mellon University on a long-term Heisenberg Stipend from the German Research Foundation. He studied pure mathematics at the University of Bonn (1989) and wrote his dissertation on higher-order unification and automated theorem proving (1994, Saarland University, Saarbrücken). During his postdoctoral period at Saarland University (1994-2000) he took up research applying techniques from automated deduction to natural language semantics, leading projects on both automated theorem proving and computational linguistics in the Sonderforschungsbereich 378 (special research action "Resource-Adaptive Cognitive Processes"). His current research interests include automated theorem proving and knowledge representation for mathematics, and higher-order and dynamic reasoning in natural language processing. He has pursued these interests during extended visits to Carnegie Mellon University, SRI International, and the Universities of Amsterdam and Edinburgh.


Nov 21, 2000 -- John Guidi, Chief Scientist, Lycos
Naive Bayes Classification of Web Pages

This talk discusses the application of Bayesian learning methods to classify web pages at Terra Lycos. Results are compared and contrasted with similar efforts involving a canonical set of Usenet news articles. Various feature subset selection methods are explored, including information gain, cross entropy, and odds ratio. The paucity of labeled web pages led to an attempt to augment the existing training data with articles from appropriate Usenet news groups. Typical web pages and news articles are substantially different. Yet this augmented approach, coupled with a simple, threshold-based feature subset selection method, yielded a classification accuracy of 60% for placing authored web pages into 15 categories.
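The sketch below gives the flavor of such a system under stated assumptions; it is not the Terra Lycos classifier. It trains a Naive Bayes model on a toy labeled corpus after selecting features by information gain; the documents, category names, and the number of features kept are placeholders.

```python
# Hypothetical sketch: Naive Bayes text classification with information-gain
# feature selection. The toy corpus and categories below are placeholders,
# not Lycos data.
import math
from collections import Counter, defaultdict

def information_gain(docs, labels, vocab):
    """Score each term by the reduction in label entropy from conditioning
    on the term's presence or absence."""
    n = len(docs)

    def entropy(counts):
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values() if c > 0)

    base = entropy(Counter(labels))
    scores = {}
    for term in vocab:
        present = [lbl for d, lbl in zip(docs, labels) if term in d]
        absent = [lbl for d, lbl in zip(docs, labels) if term not in d]
        cond = sum((len(side) / n) * entropy(Counter(side))
                   for side in (present, absent) if side)
        scores[term] = base - cond
    return scores

def train_naive_bayes(docs, labels, features):
    prior = Counter(labels)
    term_counts = defaultdict(Counter)            # class -> term -> count
    for doc, lbl in zip(docs, labels):
        for t in doc:
            if t in features:
                term_counts[lbl][t] += 1
    return prior, term_counts, features

def classify(doc, model):
    prior, term_counts, features = model
    total_docs = sum(prior.values())
    best, best_lp = None, float("-inf")
    for lbl in prior:
        denom = sum(term_counts[lbl].values()) + len(features)  # Laplace smoothing
        lp = math.log(prior[lbl] / total_docs)
        for t in doc:
            if t in features:
                lp += math.log((term_counts[lbl][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = lbl, lp
    return best

if __name__ == "__main__":
    docs = [{"stocks", "market", "shares"}, {"goal", "match", "team"},
            {"market", "shares", "profit"}, {"team", "goal", "league"}]
    labels = ["finance", "sports", "finance", "sports"]
    vocab = set().union(*docs)
    gains = information_gain(docs, labels, vocab)
    top = {t for t, _ in sorted(gains.items(), key=lambda kv: -kv[1])[:6]}
    model = train_naive_bayes(docs, labels, top)
    print(classify({"market", "profit"}, model))   # expected: "finance"
```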


Dec 1, 2000 -- David Kaufer, Dept of English; and Suguru Ishizaki, School of Design, CMU
Text Visualization

We are part of a larger team [Kaufer, S. Ishizaki, K. Ishizaki, Butler] that has created a tool for visualizing texts. The tool categorizes a text against a dictionary of hand-coded phrasal patterns [patterns about 1-5 words in length, though there is no length limit and some patterns are much longer], accumulated over three years of intense reading and journal-keeping across a wide sampling of English prose. While the patterns have been hand-coded, we have been helped by tools that can build variability around a pattern, allowing us to have at this point about 250 million phrases. The phrases have been further classified, by hand, into over one hundred functional classes. One example of a functional class is interactivity. Many English phrases containing the word "you" will be classified as interactive (e.g., "I want you to understand"). The tool has been used in writing classrooms for two years, allowing students to "see" (across many interface views) the functional distributions of the texts they read or produce. It also allows them to see how distributions vary by the different genres they read or write. Interactivity, for example, is relatively high in instructions but relatively low in narrative history writing. The interface visualizes, by color and placement, the distribution of functional phrase classes both in single texts and in text collections. In the spring, we will be using the tool in a course on the analysis of text collections, team-taught between the departments of English and Statistics. We will discuss the language and visualization ideas behind the tool, demo it, and report on applications.
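As a rough sketch of just the matching step (the actual tool, dictionary, and interface are far richer, and the patterns below are invented rather than drawn from the authors' dictionary), the code matches a handful of phrasal patterns against a text and tallies hits per functional class.

```python
# Invented mini-dictionary of phrasal patterns, grouped by functional class.
# The real tool uses a large hand-coded dictionary and a visual interface;
# this only shows the match-and-tally idea.
import re
from collections import Counter

PATTERNS = {
    "interactivity": [r"\bI want you to \w+", r"\byou will\b", r"\byour\b"],
    "temporal":      [r"\bin \d{4}\b", r"\bafterwards\b"],
}

def classify_phrases(text):
    tally = Counter()
    spans = []                                    # (start, end, class) for display
    for cls, patterns in PATTERNS.items():
        for pat in patterns:
            for m in re.finditer(pat, text, flags=re.IGNORECASE):
                tally[cls] += 1
                spans.append((m.start(), m.end(), cls))
    return tally, sorted(spans)

if __name__ == "__main__":
    sample = "I want you to understand how your report reads. In 1999 it changed."
    counts, spans = classify_phrases(sample)
    print(counts)
    for start, end, cls in spans:
        print(f"{cls:>14}: {sample[start:end]!r}")
```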

Our purpose in presenting this tool, and the research behind it, to the LTI is to get specific feedback (and invite collaborators!) in the following important areas of LTI [where we know little!]:

1. Automating the knowledge acquisition process of identifying phrases characteristic of particular text functions. An English professor can do this by hand but it is laborious. If there were automatic methods for learning about and extending a functional class of English, like interactivity, that would be good.

2. Validating the acquired knowledge. One of the goals of our work is to build an electronic archive of English based on functional phraseology. Using this reference, writers could search on a particular functional class of English prose (e.g. interactivity) and have returned a list of all the various phrasings (both second person and beyond) that skilled writers, across a corpus, use to produce that effect. The value of this archive depends on being able to validate the phrasal associations. We have been thinking of print-based techniques for doing this (user feedback, standards committees). But are there also automatic means to help in this process?

Bios:

David Kaufer is Professor and Head of the Department of English at CMU. His areas of interest are writing education and functional approaches to textual composition. His most recent book, Principles of Writing as Representational Composition (Erlbaum, 2000; with Brian Butler), provides a framework for the functional categories underlying the visualization tool we will be discussing.

Suguru Ishizaki is an Associate Professor of Communication Design at the School of Design, CMU. His areas of interest are kinetic typography and interactive/intelligent visualization. His professional work includes interaction design and early-stage digital-product development.


Dec 1, 2000 -- Sanda Harabagiu, Southern Methodist University
Boosting Knowledge for Open-Domain Answer Engines

The design of open-domain answer engines is guided by two thrusts. First, natural language processing (NLP) methods are used to derive the question's semantics, in order to identify the candidate answers in the text collections. These methods are integrated with specially crafted information retrieval (IR) techniques that return all text paragraphs of interest. Second, bag-of-words approaches are not always sufficient to extract the correct answers. They are replaced by surface-based NLP methods that are boosted with pragmatic knowledge that filters out incorrect answers.

The boosting methodology relies on several new sources of pragmatic knowledge. First, we considered it likely that an answer engine would be presented with reformulations of previously posed questions. Thus we devised an approach for recognizing question reformulations and caching their corresponding answers. Second, we designed a new paragraph retrieval mechanism that enables keyword alternations, such that paraphrases of question concepts and even some related concepts are included in the search for the textual answer. Finally, instead of operating at the word level, we have escalated our extraction methods to operate at the level of dependencies between words, thus better approximating the semantics of questions and answers. Without any loss of robustness and without downgrading the elegance of our answer engine, we enable the representation of questions and answers as semantic forms based on information brought forward by fast, wide-coverage probabilistic parsers. Furthermore, by translating the semantic forms into logical forms, we enable a justification option relying on minimal abductive knowledge. The proof mechanism is easily extensible for special domains or situations.
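As an illustration of only the reformulation-caching idea (not the SMU engine), the sketch below treats a new question as a reformulation of a cached one when their normalized content words overlap strongly; the stopword list, the Jaccard measure, and the threshold are assumptions made for the example.

```python
# Hypothetical reformulation cache: a new question is matched to a previously
# answered one by overlap of content words. Stopwords, the overlap measure,
# and the 0.6 threshold are illustrative assumptions.
STOPWORDS = {"what", "who", "is", "the", "a", "of", "was", "did", "how"}

def content_words(question):
    return {w.strip("?.,").lower() for w in question.split()} - STOPWORDS

class AnswerCache:
    def __init__(self, threshold=0.6):
        self.entries = []                  # (content-word set, cached answer)
        self.threshold = threshold

    def store(self, question, answer):
        self.entries.append((content_words(question), answer))

    def lookup(self, question):
        q = content_words(question)
        for words, answer in self.entries:
            overlap = len(q & words) / max(len(q | words), 1)   # Jaccard
            if overlap >= self.threshold:
                return answer              # treated as a reformulation
        return None

if __name__ == "__main__":
    cache = AnswerCache()
    cache.store("Who invented the telephone?", "Alexander Graham Bell")
    print(cache.lookup("What person invented the telephone?"))
```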

Sanda Harabagiu is an Assistant Professor in the Department of Computer Science and Engineering at Southern Methodist University, Dallas TX. She received a PhD in Computer Engineering from the University of Southern California, Los Angeles in 1997 and a Doctorate in Computer Science from the University of Rome "Tor Vergata", Italy, in 1994. Prior to joining SMU, Dr. Harabagiu was a researcher in the Artificial Intelligence Center at SRI International, Menlo Park, California. Dr. Harabagiu is a recipient of the National Science Foundation CAREER award.





LTI is part of the School of Computer Science at Carnegie Mellon University.
This page is maintained by teruko+@cs.cmu.edu, and was last updated 28 Nov 2000.