Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!europa.eng.gtefsd.com!howland.reston.ans.net!vixen.cso.uiuc.edu!uchinews!iitmax!sanders
From: sanders@iitmax.iit.edu (Greg Sanders)
Subject: Re: Books on Intro. natural language proces
Message-ID: <1994Sep25.230427.29778@iitmax.iit.edu>
Organization: Illinois Institute of Technology / Academic Computing Center
References: <TED.94Sep23121916@kyklopon.crl.nmsu.edu> <3608t2$eo9@news.cais.com> <1994Sep25.174751.28787@cs.cornell.edu>
Date: Sun, 25 Sep 94 23:04:27 GMT
Lines: 87

In article <1994Sep25.174751.28787@cs.cornell.edu> chrisb@cs.cornell.edu (Chris Buckley) writes:
>crawford@cais.cais.com (Randolph Crawford) writes:
>
>>In article <TED.94Sep23121916@kyklopon.crl.nmsu.edu>, Ted Dunning <ted@crl.nmsu.edu> wrote:
>>>In article <ASHWIN.94Sep23102450@pravda.cc.gatech.edu>, ashwin@cc.gatech.edu (Ashwin Ram) writes:
>>>
>>>   This does not mean that fields such as information retrieval are
>>>   unimportant or that methods such as statistical matching are
>>>   useless.  All this means is that those particular methods, while
>>>   resulting in useful technology, tell us very little about
>>>   *intelligence* in general or *natural language understanding* in
>>>   particular.
>>>
>>>that the best method currently known (after 35 years of work) to
>>>automatically classify documents based on their meaning is *not* based
>>>on syntax, knowledge representation or symbol pushing is a very
>>>significant experimental result.
>>
>>I think this whole debate revolves around the mistaken equating of
>>NLU with NLP.  Document classification is *not* NLU.
>
>I agree with you 100%.  But whose mistake is it?  Ted has made no
>claims that statistical IR is a good approach for solving the NLU
>problem.  It obviously isn't.  He <has> claimed that it's the most
>important NLP success.  I would agree.
>
>It's Ted's responders from the AI community that seem to be claiming
>that it doesn't deserve attention in an NLP book because it can't
>solve the NLU problem.  This is quite perplexing to me.  One would
>expect a field to build off of its successes, or at least try and 
>understand them.

I definitely agree that trying to understand why these statistical
methods succeed is important, but so is understanding how limited or
how generalizable they are.

>I would think almost any practical advance in NLU would improve IR
>performance.  IR is mostly an attempt to match the meaning of a query
>against the meaning of a document.  Any advances in how meanings can be
>represented should translate into improved retrieval.  As Ted said,
>after 35 years we have gotten no improvement in IR due to traditional
>NLP/NLU, and that raises fundamental questions that I would have
>thought would have been important to anybody in NLP or looking at NLP.

I believe that IR programs based on traditional techniques have improved
steadily; they just haven't caught up with the statistical approaches.  Is
that not correct?  If so, we have gotten improvement.

>As of right now, I would claim the most useful representation of the meaning
>of a NL text is an un-ordered, weighted set of words.

The most useful for IR, but not for understanding the document (which is
what the phrase "representation of the meaning" denotes).
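
For concreteness, here is a toy sketch of that representation: a text
reduced to an un-ordered bag of term weights, and a query matched against
a document by cosine similarity.  (Purely an illustration under my own
assumptions -- a made-up stoplist and raw term-frequency weights, not
SMART's actual weighting scheme.)

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "is", "to", "and", "in"}  # toy stoplist

def vectorize(text):
    """Reduce a text to an un-ordered, weighted set of words."""
    words = [w.strip(".,").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

def cosine(q, d):
    """Score a query vector against a document vector."""
    dot = sum(q[w] * d[w] for w in set(q) & set(d))
    norm = math.sqrt(sum(v * v for v in q.values())) \
         * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

doc   = vectorize("Statistical methods dominate information retrieval.")
query = vectorize("statistical information retrieval")
print(cosine(query, doc))   # about 0.775
```

Note that nothing here knows what any of the words mean -- which is
exactly the point at issue.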

>                                                        It's very 
>difficult to improve on it.  The SMART IR system has been around for
>over 30 years, and during most of that time people have been trying to
>augment this list of weighted terms with syntactic and semantic
>information.  None of the augmentations have helped significantly; and
>any place they have helped (eg using POS-tagged noun phrases), a
>pure statistical approach (eg consider any two adjacent non-stopwords
>to be a "phrase") works better.
>
>I don't expect this situation to last.  In 5 years I expect parsing
>and NLU techniques will improve IR performance noticeably.  But I said
>that 5 years ago also ... and people were saying that in the '60s.
>Perhaps it's time to consider a paradigm shift?

Oh, I think not.  Let's recast the problem.  Suppose we don't just
want to retrieve the document.  Suppose the system must also be able
to answer questions about it and to generate a paragraph saying *why*
you will find the document interesting.  I believe this recasting
shows how the arguments you are advancing are misstated: you seem to
mean only that the approaches you are defending select the right
documents.

It is just plain wrong to say these approaches constitute any sort of
understanding of the documents.  Recasting the task as I have done
above makes clear that these approaches cannot perform any task
requiring real understanding of the text.  Thus a paradigm shift
toward them would, IMHO, be a mistake.

Similarly, these techniques clearly deserve to be called an IR success,
but they in no way constitute an NLU success.  Let's not confuse the two.
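
And to make concrete just how cheap the winning technique is: the pure
statistical "phrase" Chris describes (any two adjacent non-stopwords)
is a few lines of code.  (Again a toy Python sketch under my own
assumptions: a made-up stoplist, and "adjacent" taken to mean adjacent
after stopword removal.)

```python
STOPWORDS = {"the", "a", "an", "of", "is", "to", "and", "in"}  # toy stoplist

def statistical_phrases(text):
    """Treat each pair of adjacent non-stopwords as a 'phrase'."""
    words = [w.strip(".,").lower() for w in text.split()]
    content = [w for w in words if w and w not in STOPWORDS]
    return list(zip(content, content[1:]))

print(statistical_phrases("the meaning of a natural language text"))
# [('meaning', 'natural'), ('natural', 'language'), ('language', 'text')]
```

That something this shallow outperforms POS-tagged noun phrases for
retrieval is striking -- but it tells us nothing about what the phrases
mean.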

-- Greg Sanders  (gsanders@nimue.hood.edu)
   Assistant Professor of Computer Science, Hood College

