Alon Lavie |
Statistical MT
with Syntax and Morphology: Challenges and Some Solutions
The talk will focus on three main lines of work:
(1) Learning of syntax-based synchronous context-free grammars from large
volumes of parsed parallel corpora; (2) Morphological segmentation of Arabic
and its impact on English-to-Arabic phrase-based SMT; and (3) MT between
Hebrew and Arabic - two morphologically-rich related languages with very
limited parallel data resources. Bio: |
|
Robert
Frederking |
Going beyond Identifinder: Extraction of common named entities (Named Entity
Recognition, or NER) from normal English documents is largely a solved
problem. As part of an ongoing
research project here, we are attempting to go beyond standard NER in several
dimensions. We have investigated
ensemble approaches to improving NER quality, rapid development of NER in
other languages, and NER co-reference both within and across documents. We have also made an initial investigation
into NER in short informal messages (Twitter "tweets"). I will describe the context of this work,
our approaches and results so far, and then mention future directions that we
are considering. This talk presents
joint work with Anatole Gershman,
Rushin Shah, and Bo Lin. Bio: |
|
Lori
Levin |
Modal
Constructions in Machine Translation: This talk describes an experiment in which
semantic information about modality (might, could, can, etc.) was added to a syntactically
augmented machine translation system.
The work is conducted in the frameworks of Semantically Informed MT
(SIMT) and Linguistic Core MT (LCMT), which enable us to add
interlingua-style semantic information to statistical MT while avoiding some
of the shortcomings of old-style interlingua systems. An example will be given from the SIMT
framework on modality in Urdu-English MT (NIST 2009 task). Functional information about modality was
added to a syntax based MT system by relabeling VP nodes in English parse
trees with more specific categories such as VP-Require, resulting in a small
increase in BLEU score. I will discuss
the implications of this experiment more generally with respect to the role
of semantics in machine translation and the goal of language technologies for
low-resource languages. Bio: Lori Levin has a B.A. in Linguistics from
University of Pennsylvania (1979) and a Ph.D. in Linguistics from MIT
(1986). She has been involved in
machine translation since 1986, starting in the Center for Machine
Translation (precursor of the LTI).
She has worked on MT and other language technologies for major
European and Asian languages as well as native American languages and African
languages. |
|
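The VP relabeling step described in the abstract above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the SIMT system's code: the tuple-based tree encoding, the `MODAL_LABELS` table, and the rule "relabel a VP whose first word is a modal trigger" are all hypothetical. Only the `VP-Require` category comes from the abstract; `VP-Possible` is an invented label used for illustration.

```python
# Hypothetical trigger-word table. Only "Require" appears in the abstract;
# "Possible" is an invented category used purely for illustration.
MODAL_LABELS = {"must": "Require", "might": "Possible"}


def leaves(tree):
    """Return the words under a tree encoded as (label, child, ...) tuples."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def relabel_vps(tree):
    """Return a copy of the tree in which each VP that starts with a
    modal trigger word is relabeled as VP-<category>."""
    if isinstance(tree, str):
        return tree
    label = tree[0]
    children = [relabel_vps(child) for child in tree[1:]]
    if label == "VP":
        words = leaves(tree)
        first = words[0].lower() if words else ""
        if first in MODAL_LABELS:
            label = "VP-" + MODAL_LABELS[first]
    return (label, *children)


parse = ("S", ("NP", "You"),
         ("VP", ("MD", "must"), ("VP", ("VB", "leave"))))
relabeled = relabel_vps(parse)
```

In this toy example only the outer VP, which begins with "must", is relabeled; the embedded VP keeps its plain label.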
Joseph
Keshet |
Loss
Minimization for Voice Onset Time (VOT) Measurement: In discriminative learning one is interested in
training a system to optimize a certain desired measure of performance, or
task loss. In binary classification one typically tries to minimize the error
rate. But in prediction for more complex tasks, such as phoneme recognition
or voice onset time (VOT) measurement, each task has its own loss. Phoneme
recognition performance is measured in terms of phoneme error rate (edit
distance) and VOT measurement is quantitatively assessed by the mean
deviation from the manually labeled VOT. In the talk I will present two
algorithms applied to VOT measurement, phoneme alignment, and phoneme
recognition, where the goal is to minimize the specific loss for each task. In the first part of the talk I will present the
problem of automatic VOT measurement and define its loss. I will describe an
algorithm which is based on structural support vector machines (SVMs) to
minimize this loss. Applied to initial voiceless stops from four corpora
(read and conversational speech), the agreement between automatic and manual
measurements was found to be near human inter-judge agreement. The
experimental results also show that this algorithm provides an accurate and
efficient technique for large-scale phonetic analysis. While algorithms based on structural SVMs are
aimed at minimizing the task loss, they actually minimize a surrogate to the
task loss, and there is no guarantee about the actual task loss. In the
second part of the talk, I will describe a new theorem stating that a general
learning update rule directly corresponds to the gradient of the task loss.
Based on this update rule I will present a new algorithm for minimizing the
unique task loss of phoneme alignment. I will present empirical results on
phoneme alignment of a standard test set from the TIMIT corpus, which surpass
all previously reported results on this problem. I will show how this update
rule can be applied to continuous-density HMMs and will present empirical
results on phoneme recognition of TIMIT, showing our approach outperforms
previous results on large-margin training of HMMs. This is joint work with Chih-Chieh
Cheng, Tamir Hazan, David
McAllester, Morgan Sonderegger,
and Mark Stoehr. Bio: Dr. Keshet received his B.Sc.
and M.Sc. degrees in Electrical Engineering in 1994 and 2002, respectively,
from Tel Aviv University. He received his Ph.D. in Computer Science from The
School of Computer Science and Engineering at The Hebrew University of
Jerusalem in 2007. From 1995 to 2002 he was a researcher at IDF, and won the
prestigious Israeli award, "Israel Defense Prize", for outstanding
research and development achievements. From 2007 to 2009 he was a
post-doctoral researcher at IDIAP Research Institute in Switzerland. Since
2009 he has been a research assistant professor at TTI-Chicago, a philanthropically
endowed academic computer science institute on the campus of the University
of Chicago. Dr. Keshet's research interests are in
speech and language processing and machine learning. His current research
focuses on the design, analysis and implementation of machine learning
algorithms for the domain of speech and language processing. |
|
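The two task losses named in the abstract above, phoneme error rate (edit distance) and mean deviation from the manually labeled VOT, can be sketched as follows. This is a minimal illustration, not the speaker's implementation. It assumes that the error rate is normalized by the reference length and that "mean deviation" means mean absolute deviation in milliseconds. All function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by reference length (assumed normalization)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def mean_vot_deviation(manual_ms, automatic_ms):
    """Mean absolute deviation between manual and automatic VOT values (ms)."""
    if not manual_ms or len(manual_ms) != len(automatic_ms):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(abs(m - a) for m, a in zip(manual_ms, automatic_ms))
    return total / len(manual_ms)


per = phoneme_error_rate(["k", "ae", "t"], ["k", "ah", "t"])  # one substitution
dev = mean_vot_deviation([50.0, 70.0], [55.0, 64.0])
```

The point of the example is that each task gets its own loss: a sequence-level count of edits for recognition, and a real-valued timing error for VOT measurement.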
Huan Liu |
When
Connected, Are We Still in Control of Our Destinies? The prevalence of social media offers a new kind
of laboratory for behavioral study. In the era of the social Web, we are presented
with unparalleled opportunities and novel challenges. In this talk, we will
introduce some of our recent studies of collective behavior with social
media. In particular, we will discuss some projects that illustrate our
endeavors to improve the understanding of collective behaviors in social
media. We look into user migration patterns in the presence of seemingly
unlimited choices of social media services; and investigate ways of
exploiting vulnerability to protect user privacy on a social networking site.
We benefit from sociological theories and methodologies in carrying out
interdisciplinary research that sheds light on our study of collective
behavior in social media. The improved understanding of collective behavior
can help develop social media services that encourage more user participation
with better experience in social media activities. Joint work with former and current DMML members. Bio: Dr. Huan Liu is a
professor of Computer Science and Engineering at Arizona State University. He
obtained his Ph.D. in Computer Science at the University of Southern California
and B.Eng. in Computer Science and Electrical Engineering at Shanghai Jiao Tong University. He is recognized for excellence in
teaching and research in Computer Science and Engineering at Arizona State
University. His research interests are data mining, machine learning, social
computing, and artificial intelligence.
His research focus is centered on investigating problems that arise in
many real-world applications with high-dimensional data of disparate forms
such as analyzing social media, group interaction and modeling, data
preprocessing (feature selection), and
text/web mining. His well-cited publications include books, book chapters,
encyclopedia entries as well as conference and journal papers. He serves on
journal editorial boards and numerous conference program committees, and is a
founding organizer of the International Conference Series on Social
Computing, Behavioral-Cultural Modeling, and Prediction (http://sbp.asu.edu/).
He is an ACM Distinguished Scientist. For contact information and links to
recent publications, please visit http://www.public.asu.edu/~huanliu/. |
|
Eduard
Hovy |
Toward a New
Semantics: Despite hundreds of years of study on semantics,
theories and representations of semantic content—the
actual meaning of the symbols used in semantic propositions—remain
impoverished. The traditional
extensional and intensional models of semantics are
difficult to actually flesh out in practice, and no large-scale models of
this kind exist. Recently, researchers
in Natural Language Processing (NLP) have increasingly treated word
distributions (also called ‘context vectors’, ‘topic models’, ‘language
models’, etc.) as a de facto placeholder for semantics at various levels of
granularity. This talk explores a new
kind of semantics that combines traditional symbolic logic-based
proposition-style semantics (of the kind used in older NLP) with
(computation-based) statistical word distribution information (what is being
called Distributional Semantics in modern NLP). The core resource is a single lexico-semantic ‘lexicon’ that can be used for a variety
of tasks. I outline how to define and
build such a lexicon and how to use it for various tasks. Combining the two
views of semantics opens many fascinating questions that beg study, including
the operation of logical operators such as negation and modalities over word (sense) distributions, the nature of ontological
facets required to define concepts, and the action of compositionality over
statistical concepts. Bio: Eduard Hovy directs the
Human Language Technology Group at the Information Sciences Institute of the
University of Southern California, and holds several adjunct professorships
at universities in China, Korea, and Canada.
He is also a research associate professor of USC’s Computer Science
Department, and is co-Director of Research for the Center for Command,
Control, and Interoperability Data Analytics, funded by DHS. Dr. Hovy
completed a Ph.D. in Computer Science (Artificial Intelligence) at Yale
University in 1987. His research
addresses many areas in Natural Language Processing, including machine
reading of text, question answering, information extraction, automated text summarization, the semi-automated construction
of large lexicons and ontologies, and machine
translation. His work combines
statistical machine learning methods with insights from Linguistics,
Sociology, and other disciplines to develop models of language-based
phenomena that go beyond simple statistical word correspondences. Dr. Hovy is the author or co-editor of six books and over 300
technical articles and is a popular invited speaker. In 2001 Dr. Hovy
served as President of the Association for Computational Linguistics (ACL)
and in 2001–03 as President of the International Association of Machine
Translation (IAMT); for 2007–2009 he served as President of the Digital
Government Society of North America (DGSNA). Dr. Hovy
has also worked on various aspects of Digital Government. Dr. Hovy regularly
co-teaches a specialized course in the Computer Science Department of the
University of Southern California, as well as occasional short courses on
machine translation, ontologies, text annotation,
information extraction, and other topics at universities and
conferences. He serves on Advisory
Boards for institutes and funding organizations in Germany, Italy, the
Netherlands, and the USA. URLs: http://www.isi.edu/natural-language/nlp-at-isi.html
http://www.isi.edu/~hovy.html |
|
Martha
Palmer |
Beyond Shallow
Semantics: Shallow semantic analyzers, such as semantic role
labelers and sense taggers, are increasing in accuracy and becoming commonplace.
However, they only provide limited and local representations of words and
individual predicate-argument structures. This talk will address some of the
current opportunities and challenges in producing deeper, richer
representations of coherent eventualities. Available resources, such as VerbNet, that can assist in this process will also be
discussed, as well as some of their limitations. Bio: |
|
Sarah
Cohen |
Computation
and Watchdog Journalism: While Twitter and social media have already
fundamentally altered news that happens in the public square, uncovering
information that powerful institutions wish to keep secret is as hard as ever.
But journalism is on the cusp of a new revolution in reporting methods
spurred by advances in unstructured data analysis and cheap access to
computing resources. The changes could be as significant as advances spurred
by the copy machine in the 1970s and the widespread use of relational
databases in the 1990s to find and document stories. This talk will review
some of the methods used and challenges faced by 21st-century
investigative reporters. Bio: Sarah Cohen is the Knight Professor of the Practice
of Journalism at Duke University's DeWitt Wallace Center for Media and
Democracy. She joined Duke in 2009 after nearly 20 years as a beat and
investigative reporter, including more than a decade as a member of
investigative units at The Washington Post. Her awards include the Pulitzer
Prize in Investigative Reporting, the Goldsmith Prize in Investigative
Reporting and the Robert F. Kennedy Public Service Award for Journalism.
Cohen serves as an officer of the 4,200-member Investigative Reporters and Editors
Inc., to whose board of directors she was elected in 2010. At Duke, she is the founder of the new Duke Project for the Advancement of Public
Affairs Reporting (or the Reporter's Lab for short), which curates, develops,
deploys and adapts free and open source software and tools for public affairs
reporting. |
|
Yiming Yang |
Modeling
Novelty in Multi-session Retrieval: An open challenge in information retrieval is to
detect the novel information from sequenced ranked lists, and to optimize
the system's utility with respect to both relevance and novelty. Modeling novelty is difficult because novelty
depends on user browsing history, and user behavior over ranked lists is
non-deterministic. We propose a new probabilistic framework for
stochastic modeling of user interactions with multi-session ranked lists, an
algorithmic solution (based on sub-modularity) for efficient approximation of
expected utility (an NP-hard problem), and new search strategies for
retrieval optimization based on nugget detection and nugget-level
relevance/novelty estimation. Our framework provides a strong
foundation for new methodologies both in retrieval evaluation and in
retrieval optimization. It allows significant utility improvements by
leveraging realistic stochastic assumptions about user behavior, without
requiring cost-intensive and time-consuming studies with human subjects. Our
evaluations on benchmark datasets (TDT and TREC) show significant performance
improvements with the proposed approach, over the results of other state-of-the-art
methods. Bio: Yiming Yang is a
professor in the Language Technologies Institute and the Machine Learning
Department in the School of Computer Science at Carnegie Mellon University.
Her research has centered on statistical learning methods and their
applications to a broad range of challenging problems, including large-scale
text categorization, utility (relevance and novelty) based retrieval and
adaptive filtering, personalization and active learning for recommendation
systems, social network analysis for personalized email prioritization, etc. |
|
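The abstract above mentions sub-modularity and nugget-level novelty without giving details, so the following is only a loosely related sketch, not the authors' framework. It assumes a deterministic utility equal to the number of distinct nuggets covered (a standard submodular coverage function) and ranks documents greedily by marginal gain. The document identifiers, nugget sets, and function name are hypothetical.

```python
def greedy_novelty_ranking(doc_nuggets, k):
    """Greedily pick up to k documents, each time taking the one that adds
    the most not-yet-covered nuggets (ties broken by document id)."""
    covered = set()
    ranking = []
    remaining = dict(doc_nuggets)
    while remaining and len(ranking) < k:
        # Marginal gain = number of nuggets in the document not seen so far.
        best = max(sorted(remaining),
                   key=lambda doc: len(remaining[doc] - covered))
        ranking.append(best)
        covered |= remaining.pop(best)
    return ranking, covered


docs = {
    "d1": {"a", "b"},
    "d2": {"a", "b", "c"},
    "d3": {"d"},
}
ranking, covered = greedy_novelty_ranking(docs, k=2)
```

Here `d1` is relevant but adds nothing once `d2` has been shown, so the greedy step prefers `d3`. That is the sense in which novelty depends on what the user has already seen.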
Michael
"Fuzzy" Mauldin |
Turning your
ideas into money by taking your company public: Between 1994 and 1996 I created the Lycos search engine
while a faculty member at the LTI at Carnegie Mellon. Under the auspices of
the CMU Technology Transfer Office, the University and I successfully sold
Lycos to a venture capital firm. Less than nine months later, Lycos was the
fastest company ever to go public in the United States. Using my experiences
with Lycos as an example, I will discuss company ownership from the
entrepreneur's point of view. I will also describe the process of taking a
private company public through the IPO process. Bio: Michael "Fuzzy" Mauldin earned his BA in
CompSci from Rice University in 1981 and his PhD in
CompSci from Carnegie Mellon in 1989. While a
junior faculty member of the Language Technologies Institute, Dr. Mauldin
created the Lycos search engine. Today he is retired and raises beef cattle
at his Austin, Texas ranch. He also builds combat robots for the BattleBots and RoboGames
competitions. |
|
Larry
Birnbaum Knight
News Innovation Laboratory |
From Contextual
Search to Automatic Content Generation: Systems that present people with information
inescapably make editorial judgments in determining what information to show
and how to show it. However, the
editorial values used to make these determinations are generally invisible to
users and in many cases even to the engineers who design the systems. This talk describes some of the problems
that this creates, and presents some approaches to providing explicit and
visible editorial control in news and media information systems. I’ll also talk about our recent work on
automatically generating stories from data using human-authored editorial
models. A system based on this
technology is already generating more than 10 thousand stories weekly in
areas ranging from sports to business.
This system is the nation’s most prolific and
published author of, among other things, women’s collegiate softball
stories. The stories compare favorably
to those written by human beings. Bio: Larry Birnbaum is
Associate Professor of Electrical Engineering and Computer Science and of
Journalism at Northwestern University, Director of the Medill-McCormick
Center for Innovation in Technology, Media and Journalism at Northwestern,
and serves on the management committee of its Knight News Innovation
Laboratory. He is also the Chief
Scientific Advisor of Narrative Science, a Chicago-based startup. Larry received his B.S., M.S., and Ph.D. in
Computer Science from Yale. |
|
Kevin
Knight ISI/USC |
Language
Translation and Code-Breaking: In 1949, Warren Weaver suggested applying
cryptanalysis methods to the problem of automatic language translation. He said: "When I look at an article in
Russian, I say: this is really written in English, but it has been coded in
some strange symbols. I will now proceed to decode." Weaver's inspiration has borne fruit in this
century, as statistical techniques have enabled us to build translation
systems for many languages, with increasing accuracy. But other fruitful connections between
code-breaking and translation are only starting to emerge. This talk will examine some: estimating the
amount of data required to break a cipher, building translation systems
without parallel data, and solving a previously undeciphered
manuscript from the 1730s. Bio: Kevin Knight is a Senior Research Scientist and
Fellow at the University of Southern California's Information Sciences
Institute, and a Research Professor in the Computer Science Department at
USC. He received a Ph.D. in computer
science from Carnegie Mellon University and a bachelor's degree from Harvard
University. His research interests
include natural language processing, statistical modeling, machine
translation, language generation, and decipherment. He currently serves as president of the
Association for Computational Linguistics. |
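The abstract above does not say how the amount of data needed to break a cipher is estimated. One classical way to think about the question is Shannon's unicity distance, the key entropy divided by the per-character redundancy of the plaintext language. The sketch below applies it to a simple substitution cipher over 26 letters. The redundancy figure of 3.2 bits per letter for English is an assumed, commonly quoted estimate, and the function names are hypothetical.

```python
import math


def key_entropy_bits(alphabet_size):
    """Entropy of a uniformly chosen substitution key: log2(alphabet_size!)."""
    return math.log2(math.factorial(alphabet_size))


def unicity_distance(key_entropy, redundancy_bits_per_char):
    """Approximate ciphertext length at which the key becomes unique."""
    if redundancy_bits_per_char <= 0:
        raise ValueError("redundancy must be positive")
    return key_entropy / redundancy_bits_per_char


ENGLISH_REDUNDANCY = 3.2  # assumed estimate, in bits per letter

entropy = key_entropy_bits(26)
distance = unicity_distance(entropy, ENGLISH_REDUNDANCY)
```

Under these assumptions the estimate comes out at somewhat under 30 letters of ciphertext. It is an information-theoretic bound, not a statement about how much text a practical decipherment algorithm needs.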