Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!howland.reston.ans.net!pipex!uunet!bcstec!bronte!snake!rwojcik
From: rwojcik@atc.boeing.com (Richard Wojcik)
Subject: Re: best parser???
Message-ID: <1994Dec12.202609.3641@grace.rt.cs.boeing.com>
Sender: usenet@grace.rt.cs.boeing.com (For news)
Reply-To: rwojcik@atc.boeing.com
Organization: Research & Technology
References: <MAGERMAN.94Dec3152308@snoopy.bbn.com>
Date: Mon, 12 Dec 1994 20:26:09 GMT
Lines: 63

In article <MAGERMAN.94Dec3152308@snoopy.bbn.com>, magerman@bbn.com (David Magerman) writes:
[This is in reference to David's response to Jeffrey Siskind's disparaging comments
  on so-called "parser evaluation" efforts that take place independently of an
  application domain.  Briefly, my views agree more with Siskind's.]

>However, there are some problems which are interesting basic research
>problems for which there is no clear-cut or easy-to-build application.
>For instance, people have done work on prepositional phrase
>attachment, word sensing or categorization, and phrase chunking in
>generic domains (newswire, etc.).  I see parser evaluation as a way of
>determining how one can do when you combine all these problems and try
>to solve them concurrently.  For instance, PP-attachment is an
>important problem to solve for some tasks.  But sometimes PP
>information is contained in an adverbial phrase, so it isn't
>sufficient to simply look at where only the PP's are attached.  And,
>if you attach PP's correctly, but don't identify the categories of the
>words in the phrase, then it isn't clear what you've accomplished.

Attachment ambiguity is an excellent problem to discuss.  As you know, there is
no way to resolve every attachment locally, i.e. within the context of the
sentence alone.  You must rely on information about the imagined context
in which the sentence resides.  Having gone through this exercise myself (on
a set of roughly 300 aircraft maintenance sentences), I know that you can get
the bracketing wrong in many cases even when you do know what domain you are
working with.  I learned this by checking some of my bracketings against the
intuitions of a domain expert.  Nevertheless, most of my attachment judgments
were pretty good because I knew some things about the domain I was working
with.  It would have been silly to take my bracketings and use them to judge the
performance of other grammar-checking applications, since our application has
been specifically tuned to prefer resolutions in our domain.  (You might, however,
find out useful things by comparing it to GSI-Erli's new system, which is being
built for Aerospatiale to parse the same domain.)
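To make the point concrete, here is a toy sketch (sentence, bracketings, and
domain table all invented for illustration, not taken from my actual corpus)
of how the same trailing PP admits two bracketings, with only domain knowledge
deciding between them:

```python
# Hypothetical example: one sentence, two candidate attachment sites for
# the PP "with the borescope".  Which bracketing is "correct" depends on
# what you know about the domain, not on the sentence alone.

SENTENCE = "inspect the valve with the borescope"

NP_ATTACH = "(inspect (the valve (with the borescope)))"    # the valve has a borescope
VP_ATTACH = "((inspect (the valve)) (with the borescope))"  # the borescope is the instrument

# Invented domain-preference table: an aircraft-maintenance system "knows"
# borescopes are inspection tools; a generic system has no such knowledge.
DOMAIN_PREFERENCE = {
    "aircraft-maintenance": VP_ATTACH,
    "generic-newswire": NP_ATTACH,
}

def preferred_bracketing(domain):
    """Return the bracketing this (toy) system prefers in the given domain."""
    return DOMAIN_PREFERENCE.get(domain, NP_ATTACH)

print(preferred_bracketing("aircraft-maintenance"))
print(preferred_bracketing("generic-newswire"))
```

A gold-standard corpus built by the maintenance system would mark VP_ATTACH
"correct"; a corpus built by the generic system would mark the other one.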

What would really be interesting, IMHO, is to evaluate the ability of different
parser/grammars to *switch* domains and prefer different bracketings of
the same sentence in different contexts.  To the best of my knowledge, nobody
has really tried to build a system that does this.  The best that you can do in
the modern world of NLP is to produce a parser that performs as a kind of
lowest common denominator, producing the same bracketings across all
domains.  That, IMHO, is why you get such comparatively good results from
statistically-based systems, which produce no sense-based analyses at all.  If
such systems are cleverly constructed, they can do a quicker, more efficient job
of figuring out the "lowest common denominator" across several domains.
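The "lowest common denominator" strategy can be sketched in a few lines
(the counts and word pairs below are invented, and real statistical systems
are of course far more sophisticated): attach the PP by raw co-occurrence
counts pooled over all domains, with no sense-based analysis at all, so the
answer is the same whatever the context.

```python
# Toy lowest-common-denominator attacher: decide verb- vs. noun-attachment
# of a preposition by comparing co-occurrence counts pooled across domains.
# All counts here are invented for illustration.
from collections import Counter

verb_prep = Counter({("inspect", "with"): 40, ("see", "with"): 55})
noun_prep = Counter({("valve", "with"): 5, ("man", "with"): 30})

def attach(verb, noun, prep):
    """Return 'verb' or 'noun' attachment by pooled counts alone.

    The decision never consults the domain or the senses of the words,
    so it yields one fixed bracketing for every context.
    """
    if verb_prep[(verb, prep)] >= noun_prep[(noun, prep)]:
        return "verb"
    return "noun"

print(attach("inspect", "valve", "with"))
print(attach("see", "man", "with"))
```

Such a system scores well on a mixed-domain bracketed corpus precisely
because the corpus itself averages over contexts.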

My point here is not to discourage anyone from evaluating parsers, but to
discourage people from the pretense that such evaluations are somehow
"application" or "theory" neutral.  There is no sense in which a given bracketing
is "correct" independently of a context.  When you produce a bracketed corpus,
you are producing a set of context-specific analyses.  Any comparison of
different parsing systems on that type of corpus will only tell you which parser
performs best in that context, not which parser is "the best" in a general sense.
If you have a parser/grammar that performs magnificently in some specific domain
but acts like a brain-dead turtle in some other domain (or set of domains), does
that mean that it is based on bad NLP?  What does it mean?  How would that
change the way we pursue NLP goals in the future?  Are we always to prefer the
lowest common denominator?

---

Disclaimer:  Opinions expressed above are not those of my employer.

    Rick Wojcik   (rick.wojcik@boeing.com)   Seattle, WA

