Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!newsxfer.itd.umich.edu!nntp.cs.ubc.ca!unixg.ubc.ca!vanbc.wimsey.com!news.bc.net!newsserver.sfu.ca!fornax!jamie
From: jamie@cs.sfu.ca (Jamie Andrews)
Subject: Re: best parser???
Message-ID: <1994Nov25.183438.23764@cs.sfu.ca>
Organization: Faculty of Applied Science, Simon Fraser University
References: <MAGERMAN.94Nov15175620@platypus.bbn.com> <MAGERMAN.94Nov22110438@snoopy.bbn.com> <3aujei$95g@cantaloupe.srv.cs.cmu.edu> <QOBI.94Nov23151624@qobi.ai>
Date: Fri, 25 Nov 1994 18:34:38 GMT
Lines: 39

In article <QOBI.94Nov23151624@qobi.ai>,
Jeffrey Mark Siskind <Qobi@CS.Toronto.EDU> wrote:
>Precise quantitative evaluation of parser performance on large corpora is a
>misguided enterprise. The whole notion of `parsing' and `parse tree' is a
>theory internal notion....
>One can only evaluate a system on a task which has a theory independent
>external evaluation criterion.

     This seems to me like a pretty harsh and almost
"behaviourist" approach to evaluation.  No, parse trees do not
exist in the real world, but they capture a lot of the structure
of language.  If a system can parse "rubber baby buggy bumpers"
in the correct way, whatever the underlying theory, then it will
almost certainly be more successful in whatever it has to do
with the phrase than one which doesn't.
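
     To make that concrete, here is a rough sketch (Python; the
nested-tuple tree format and the function name are just my own
illustrative choices, not anybody's actual system) of what picking
the "correct" parse amounts to.  Even a four-word compound already
has several possible binary bracketings, and a parser has to choose
among them:

def bracketings(words):
    # Yield every binary bracketing of a list of words as nested tuples.
    if len(words) == 1:
        yield words[0]
        return
    for i in range(1, len(words)):
        for left in bracketings(words[:i]):
            for right in bracketings(words[i:]):
                yield (left, right)

for tree in bracketings("rubber baby buggy bumpers".split()):
    print(tree)

Five structures come out, and only something like
('rubber', (('baby', 'buggy'), 'bumpers')) lines up with the
"rubber bumpers for a baby buggy" reading.  A parser that reliably
settles on that sort of structure has captured real information
about the phrase, whatever theory it was built on.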

     Also, when we evaluate entire systems only along the lines
Jeff suggests --

>1. what percentage of the time a system gives the desired or useful response
>to a query
>2. what percentage of the time a system finds the desired information
>3. how much editing must one do to the output of a speech-to-text system to
>arrive at the required document entered using speech
>4. how much editing must one do to the output of a machine translation system
>to produce a document that meets certain specified standards

-- we are doomed to always having to build an entire system,
put it into operation on a real task, and do laborious human
measurements in order to evaluate it.  This certainly has its
place if we are making claims for direct applicability of our
systems, but if we are just testing a parser (which is pretty
much guaranteed to be a component of any NL system), some
abstract measurement of parser performance should be very
helpful.
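
     For concreteness, one such abstract measurement could be plain
bracketing precision/recall of a parser's output against
hand-annotated trees.  The following is only a rough sketch -- the
nested-tuple tree encoding and the function names are invented here
for illustration, and the bracket-scoring schemes actually used on
large treebanks differ in their details:

def spans(tree, start=0):
    # Collect the (begin, end) positions of the constituents in a
    # nested-tuple tree; single words are not counted as brackets.
    if isinstance(tree, str):
        return set(), 1
    found, length = set(), 0
    for child in tree:
        child_found, child_len = spans(child, start + length)
        found |= child_found
        length += child_len
    found.add((start, start + length))
    return found, length

def bracket_scores(candidate, gold):
    # Precision and recall of the candidate's brackets against the gold ones.
    cand, _ = spans(candidate)
    ref, _ = spans(gold)
    hits = len(cand & ref)
    return hits / len(cand), hits / len(ref)

gold  = ('rubber', (('baby', 'buggy'), 'bumpers'))
guess = (('rubber', 'baby'), ('buggy', 'bumpers'))
print(bracket_scores(guess, gold))   # both scores well below 1.0

Something of this sort can be run over thousands of sentences
without ever fielding a complete application, which is exactly the
kind of cheap, repeatable measurement I have in mind.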

--Jamie.
  jamie@cs.sfu.ca
"Make sure Reality is not twisted after insertion"
