Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!news.kei.com!travelers.mail.cornell.edu!cornell!chrisb
From: chrisb@cs.cornell.edu (Chris Buckley)
Subject: Re: Books on Intro. natural language proces
Message-ID: <1994Sep27.020633.16498@cs.cornell.edu>
Organization: Cornell Univ. CS Dept, Ithaca NY 14853
References: <TED.94Sep23121916@kyklopon.crl.nmsu.edu> <3608t2$eo9@news.cais.com> <1994Sep25.174751.28787@cs.cornell.edu> <1994Sep25.230427.29778@iitmax.iit.edu>
Date: Tue, 27 Sep 1994 02:06:33 GMT
Lines: 73

sanders@iitmax.iit.edu (Greg Sanders) writes:

>I believe that IR programs based on traditional techniques have improved
>steadily, they just haven't caught up to the statistical approaches.  Is
>that not correct?  If so, we have gotten improvement.

Hard to say.  They seem to have improved in direct proportion to the
degree to which they incorporate statistical information.  The most
successful IR system at TREC that calls itself an NLP system is CMU's
CLARIT system.  It has a purely statistical first pass, and then
refines the ranking of the top docs using both statistical and
standard NLP techniques.
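
To make the two-pass idea concrete, here is a minimal Python sketch.
It is only an illustration of "rank statistically, then re-order the
top docs", not a description of how CLARIT actually works.  The
function names, the raw term-count score, and the contiguous-phrase
check standing in for "NLP techniques" are all assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def first_pass_score(query, doc):
    """Purely statistical score: total occurrences of query terms in doc."""
    tf = Counter(tokenize(doc))
    return sum(tf[t] for t in set(tokenize(query)))

def phrase_bonus(query, doc):
    """Hypothetical stand-in for a linguistic check:
    1 if the query appears as a contiguous phrase in doc, else 0."""
    q, d = tokenize(query), tokenize(doc)
    return int(any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1)))

def two_pass_rank(query, docs, k=10, bonus_weight=2.0):
    """Rank every doc statistically, then re-order only the top k."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: first_pass_score(query, docs[i]),
                    reverse=True)
    head, tail = ranked[:k], ranked[k:]
    # Second pass: the more expensive check is applied to the top k only.
    head.sort(key=lambda i: first_pass_score(query, docs[i])
              + bonus_weight * phrase_bonus(query, docs[i]),
              reverse=True)
    return head + tail

docs = ["retrieval retrieval of information",
        "information retrieval systems",
        "cooking with apples"]
print(two_pass_rank("information retrieval", docs, k=2))
```

The point of the design is cost: the cheap score sees the whole
collection, while the expensive check only ever sees k documents.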

>>As of right now, I would claim the most useful representation of the meaning
>>of a NL text is an un-ordered, weighted set of words.

>The most useful for IR, but not for understanding the document (which is
>what the phrase "representation of the meaning" denotes).

It most certainly is "a representation of the meaning".  I entirely
agree it is not a complete representation of the meaning; but I don't
know anybody in NLU who considers a complete representation of the meaning
to be feasible (since that requires a complete representation of the
shared world knowledge of the writer/reader).
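
For readers who have not seen one, an un-ordered, weighted set of words
can be sketched in a few lines of Python.  The tf*idf weighting with
idf = log(N/df) is just one common choice and is an assumption here, as
are the function names and the toy collection.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def weighted_terms(doc, collection):
    """Return {term: weight} for doc; word order plays no role.

    weight = (term frequency in doc) * log(N / document frequency),
    where N is the number of documents in the collection.
    """
    n_docs = len(collection)
    df = Counter()
    for d in collection:
        df.update(set(tokenize(d)))  # count each term once per document
    tf = Counter(tokenize(doc))
    # Terms never seen in the collection are skipped to avoid dividing by 0.
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df[t] > 0}

collection = ["nlu and ir",
              "ir systems rank documents",
              "apples and oranges"]
weights = weighted_terms(collection[0], collection)
for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(w, 3))
```

Shuffling the words of the document gives exactly the same result,
which is what "un-ordered" means here.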

>  Let's recast the problem.  Suppose we don't just
>want to retrieve the document.  Suppose the system must be able to 
>answer questions about it and to generate a paragraph saying *why* 
>you will think the document is of interest.  I believe this shows how 
>the arguments you are advancing are misstated.  You seem to mean only 
>that the approaches you are defending result in the right selection of 
>documents.  

>It is just plain wrong to say these approaches constitute any sort of 
>understanding of the documents.  Recasting the task as I have done
>above makes clear how these approaches cannot perform any task that
>requires real understanding of the text.  

It's not at all clear! Even using your somewhat slanted tasks:
   1. The system can certainly answer a lot of questions about the
document correctly.  E.g., about your response here: "Did the message
discuss NLU?" or "Was your message about apples?"
   2. A possible paragraph: "Your article was about NLU and IR because
'NLU' was the most highly weighted term in the article and 'IR' was
the fifth highest weighted term".

Again, the system is not doing your tasks nearly as well as a human could,
but it is doing your tasks.
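
Both tasks can be sketched in Python directly from the term weights.
This is a toy under stated assumptions: raw term frequency with a tiny
stopword list stands in for a real weighting scheme, and the function
names and sample message are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a longer one.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "was", "to", "in", "it"}

def term_weights(text):
    """Raw term frequencies with stopwords dropped."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def discusses(text, term):
    """Answer 'Did the message discuss <term>?' from the weights alone."""
    return term_weights(text)[term.lower()] > 0

def explain(text, n=2):
    """Build a one-sentence justification from the n top-weighted terms."""
    top = [t for t, _ in term_weights(text).most_common(n)]
    return ("Your article was about " + " and ".join(top)
            + " because those were its most highly weighted terms.")

msg = "NLU is hard. NLU and IR differ. IR ranks documents; NLU understands."
print(discusses(msg, "NLU"))     # question 1
print(discusses(msg, "apples"))  # question 2
print(explain(msg))              # generated paragraph
```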

You can do a lot with purely statistical approaches.  E.g., an amazingly
good job of summarizing a long article can be done by finding the
central passages of the article - those statistically related to the
paragraphs in the rest of the article.  (See our "Automatic Analysis,
Theme Generation, and Summarization of Machine-Readable Texts" by
Salton, Allan, Buckley, Singhal, in Science, June 3rd, 1994.)
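
A much simplified Python sketch of the "central passages" idea follows.
It is not the procedure from the Science paper: it simply scores each
paragraph by its total cosine similarity to every other paragraph and
keeps the top n.  The raw term-count vectors and the function names are
assumptions made for illustration.

```python
import math
import re
from collections import Counter

def vector(paragraph):
    """Term-count vector for one paragraph."""
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))

def cosine(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def central_passages(paragraphs, n=1):
    """Return the n paragraphs most similar, in total, to all the others.

    The selected paragraphs are returned in their original document order.
    """
    vecs = [vector(p) for p in paragraphs]
    centrality = [sum(cosine(v, w) for j, w in enumerate(vecs) if j != i)
                  for i, v in enumerate(vecs)]
    best = sorted(range(len(paragraphs)),
                  key=lambda i: centrality[i], reverse=True)[:n]
    return [paragraphs[i] for i in sorted(best)]

paragraphs = ["statistical retrieval ranks documents by term weights",
              "term weights come from statistical counts",
              "my cat sleeps all day",
              "retrieval systems rank documents"]
print(central_passages(paragraphs, n=1))
```

A paragraph that shares vocabulary with many others scores high, while
an off-topic aside scores near zero and drops out of the summary.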

>Similarly, these techniques clearly deserve to be called an IR success,
>but they in no way constitute a NLU success.  Let's not confuse the two.

Can you give me an example of a system operating on general text that
you consider more of an NLU success?

Summarizing what I've said: Statistical IR approaches will not
satisfactorily solve the NLU problem.  But they have been very
successful at certain NLP tasks where traditional NLP methods have not
proved helpful.  The NLP field's almost total neglect of statistical
IR, especially in introductory textbooks (the subject of this
discussion, after all!), seems like a mistake.

                                ChrisB
-- 
Chris Buckley                   Dept of Computer Science
chrisb@cs.cornell.edu           Upson Hall, Cornell University
(609)  275-4691                 Ithaca, NY   14852
