Eduard Hovy
Carnegie Mellon University
tel: +1-412-268-6592
Projects webpage: http://www.edvisees.cs.cmu.edu
Current Positions
Research Directions and Projects
Work Experience
Honors
Current Positions
Prof. Hovy currently holds the following positions:
Research Directions and Projects
Research can be organized into three principal overlapping directions:
(1) Natural Language Processing / Computational Linguistics / Human Language
Technology
(2) Deep (neural), Distributional, and Lexical Semantics, Ontologies, and Text
Mining/Harvesting
(3) Digital Government and Homeland Security
DARPA's programs AIDA, World Modelers, Big Mechanism, DEFT, and Machine Reading all
aim to develop NLP and knowledge representation and reasoning techniques for deeper
semantic analysis of text and the resulting automated learning of domain information.
Prof. Hovy leads or has led the following projects:
OPERA (AIDA Program, 2018--, domain: automated hypothesis formation and reasoning
based on multilingual news and reports);
STORM/SOFIA (World Modelers Program, 2017--, domain: reading to help construct
in-depth causal models of world situations) as part of the STORM project headed by
researchers at the University of Pittsburgh;
RUBICON (Big Mechanism program, 2014--2017, domain: research articles on cancer),
which includes researchers at Carnegie Mellon University (CMU), the University of
Southern California's Information Sciences Institute (USC/ISI), and Elsevier Inc.;
SAFT (Semantic Analysis and Filtering of Text) (DEFT program, 2012--2015, domain:
news articles and reports on violent and legal events), which includes researchers at
CMU and USC/ISI;
from 2008--12, Prof. Hovy's groups participated in two of DARPA's MRP teams: RACR
(headed by IBM, the team that developed the Watson QA game-playing engine) and
ERUDITE (headed by BBN; the OntoNotes corpus was developed as part of
this project);
the SASO (2004--11) and MRE (2001--04) projects at the Institute for Creative
Technologies of the University of Southern California developed virtual humans in
virtual reality simulations, employing text-to-semantics parsers and
opposite-direction generators developed by Prof. Hovy and students.
Associated with the above are several QA systems developed at ISI, such as Textmap and
Webclopedia (with Dr. Daniel Marcu, Dr. Ulf Hermjakob, Dr. Chin-Yew Lin, and others).
This work employed information retrieval, clustering, text summarization, parsing, and
text harvesting methods described elsewhere.
Summarization engines developed by Prof. Hovy, Dr. Chin-Yew Lin, and others at ISI
include SUMMARIST (single documents), NeATS (multiple documents), and GOSP (producing
headlines). Summarization was used in multilingual text access and management systems
such as C*ST*RD and MuST.
Summarization evaluation systems include ROUGE (2003--04), developed by Dr. Chin-Yew
Lin of ISI with Prof. Hovy, and the BE package (2005--08), developed by Dr. Stephen
Tratz and Prof. Hovy.
For MT evaluation, work in 2002--04 includes a systematization of all major machine
translation evaluation measures (the FEMTI survey) with Prof. Maghi King and Dr. Andre
Popescu-Belis at the University of Geneva, as well as students and researchers at other
universities and commercial MT companies.
Work on machine translation included development of the Pangloss MT system
(1990--94) together with researchers at CMU and New Mexico State University, which
helped establish ISI's
Gazelle
system headed by Dr. Kevin Knight.
The NSF-sponsored IL-Annotation project
IAMTC (2003--04), joint with
researchers at CMU, University of Maryland, MITRE, Columbia University, and New
Mexico State University, focused on Interlingua design and text annotation; see
under lexical semantics below.
One project focuses on identifying the personality and interaction patterns of people
active on social media. This work, conducted with researchers at ISI and elsewhere in
the Social Media project, develops techniques for classifying participants in online
discussions into roles such as Leader, Follower, Idiot, etc., and for quantifying
their degree of persuasiveness and persuadability. It builds upon prior research in the
MKIDS-ISI project (2002--05), which developed methods to analyze emails for
expertise (of people and groups) and relative social status, using topic signatures and
speech act recognition.
Another project develops techniques to identify new events of interest and track
their evolution, building on work done with Dr. Don Metzler and others at ISI in
2010--11 to recognize important events by analyzing the Twitter stream.
Earlier, the Psyop project (2004--08) employed information extraction and
sentiment analysis technology to extract from online texts entities, events, beliefs,
goals, opinions, and other information of interest, and to compose the results into
psychologically informative descriptions of people.
Prof. Hovy worked with researchers at USC's Institute for Creative Technologies to
develop a parser and generator for the software agents in the virtual reality
simulations SASO (2004--09) and Mission Rehearsal Exercise (2001--04), in
collaboration with Dr. David Traum, Dr. Anton Leuski, Dr. David DeVault, and others.
The project Quick!Help focuses on the generation of tailored recipes for poor people
(this work in collaboration with Prof. Peter Clarke and Dr. Susan Evans from USC and
Andrew Philpot from ISI). This work relates to language tailoring done earlier in the
HealthDoc project with Prof. Chrysanne DiMarco from the University of Waterloo,
Canada, and Prof. Graeme Hirst from the University of Toronto.
Earlier work focused on the development of discourse relations and planners that
employ them to ensure the production of coherent multisentential text. This includes
a taxonomization of all available discourse relations collected from various sources
(1992) and the RST Text Structurer (1987--92).
Prof. Hovy's work in 1987--92 included participation in the Penman sentence
generator project with researchers in various countries, to develop the then-largest
sentence generator in the world.
Prof. Hovy's Ph.D. work focused on the development of a text generation program
PAULINE that took into account the pragmatic aspects of communication, since the
absence of sensitivity toward hearer and context has been a serious shortcoming of
generator programs written to date. In general, he is interested in all facets of
communication, especially language, as situated in the wider context of intelligent
behavior. Related areas include Artificial Intelligence (work on planning and learning),
Linguistics (semantics and pragmatics), Psychology, Philosophy (ontologies), and Theory
of Computation.
This work (1989--2002), conducted with Dr. Yigal Arens of ISI and students, focused
specifically on the question of dynamic planning and allocation of
information to media during presentation design.
Several separate projects explore ways to make embeddings explainable/understandable
(for example, by retrofitting them against semantic ontologies or making them sparse)
or to understand the specific information transformation done by neural networks.
In the in-depth reading projects OPERA (2018--), SAFT (2012--17), and
earlier projects (funded under DARPA's AIDA, DEFT, and earlier programs), Prof. Hovy
is exploring various ways of creating and working with distributional and deep (neural
embedding) semantic models that are learned from text in various domains and used to
enable automated reasoning and other NLP-based tasks.
The DARPA-funded OntoNotes project (2008--2012), joint with Dr. Ralph Weischedel
and Dr. Lance Ramshaw of BBN, Prof. Mitch Marcus of the University of Pennsylvania,
and Prof. Martha Palmer of the University of Colorado, focused on the creation of a
large corpus of texts in English, Chinese, and Arabic that was annotated with shallow
semantic information (word senses and some coreference). The word sense information
was incorporated into the Omega ontology (see below).
The NSF-funded IL-Annot project IAMTC
(2003--04), joint with researchers at CMU, University of Maryland, MITRE, Columbia
University, and New Mexico State University, focused on stepwise Interlingua design
and verification by annotation of texts in 7 languages. In both these projects, the
Omega ontology (see below) provided the symbol set for semantic annotation.
The Omega ontology, built at ISI since 2003, contains
over 120,000 concept terms and several million instances, in addition to various other
information, acquired from a variety of sources, including Princeton's
WordNet, NMSU's Mikrokosmos, and
ISI's earlier ontology SENSUS
(1996--2000). During 2008--2011, in the OntoNotes project (see above), a new Upper
Model was built for Omega, and its Entities were thoroughly re-organized. Work on
Omega has been performed by Prof. Hovy in collaboration with Mr. Andrew Philpot, Dr.
Patrick Pantel, Mr. Michael Fleischman, and Dr. Jerry
Hobbs from ISI.
At ISI, Dr. Zornitsa Kozareva and Prof. Hovy
developed the Double-Anchored Pattern (DAP) text harvesting technique and
demonstrated its effectiveness for collecting terms and relations, and for organizing
them hierarchically, over large amounts of domain texts. (This work was partially
done in collaboration with Prof. Ellen Riloff from the University of Utah.)
In several earlier projects since 1996, Prof. Hovy, students, and collaborators
developed a series of text mining and information extraction engines, and built
collections comprising several million facts (about people, locations, objects, etc.).
This information, stored in a database, was in many cases connected to the Omega ontology
(see above). The Learning by Reading and Möbius (2005--08) experiments
attempted to combine tagging, parsing, semantic analysis, and inference techniques to
create a knowledge base automatically from a high school textbook of Chemistry and from
texts about the heart and engines, and to answer high school-level test questions about
these topics.
The development of methods to identify instances of human trafficking, to help locate
victims, and to collect and synthesize enough information that trends and patterns can
be discovered and used to combat the problem. The system WAT developed in this project
performed information extraction, data synthesis, and pattern analysis to help
US law enforcement agencies locate and assist underage victims of trafficking.
The development of an environment that allows non-experts to learn about cybersecurity,
to identify experts and/or publications relevant to specific topics and questions, and
to check any software they have received.
The development of systems to automatically find alignments or aliases across and
within databases (2003--06). The SiFT system used mutual information techniques to
detect patterns in the distribution of data values. The government partner in this
NSF-funded project was the Environmental Protection Agency (EPA), which provided
databases with air quality measurement data. (This work was done at ISI with
Mr. Andrew Philpot and Dr. Patrick Pantel.)
Several projects from 2000--07 addressed a problem faced by government regulation
writers: they regularly receive tens to hundreds of thousands of emails and other
comments about proposed regulations from the public. Funded by the NSF, the eRule
projects were a collaboration between Prof. Stuart Shulman (a political scientist then
at the University of Pittsburgh and the University of Massachusetts Amherst),
Prof. Jamie Callan (a computer scientist at CMU), Prof. Steven Zavestoski (a
sociologist at the University of San Francisco), and Prof. Hovy. Government
partners providing data were the Environmental Protection Agency (EPA) and the
Department of Transportation (DOT). Research at ISI focused on technology to perform
opinion detection and argument structure extraction. This research fed into the analysis
of text for psychological profiling in the Psyop project mentioned above.
Funded by the NSF (1999--2003), a series of projects addressed a problem that many
government agencies face: their data is distributed in various formats over different
databases, and evolves to include slightly different variations over the years. The
EDC and AskCal systems provided access to over 50,000 tables of information about
gasoline, produced by various Federal Statistics agencies, including the Census Bureau,
the Bureau of Labor Statistics, and the Energy Information Administration. The system
included a large ontology and a natural language question interpreter. This work was
done at ISI in collaboration with Mr. Andrew Philpot and Dr. Jose-Luis Ambite. External
partners in this project were the DGRC team at Columbia University, New York, headed by
Dr. Judith Klavans.