Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!news.kei.com!travelers.mail.cornell.edu!cornell!chrisb
From: chrisb@cs.cornell.edu (Chris Buckley)
Subject: Re: Statistical vs. NLU approaches to IR
Message-ID: <1994Sep29.051716.27025@cs.cornell.edu>
Organization: Cornell Univ. CS Dept, Ithaca NY 14853
References: <1994Sep25.174751.28787@cs.cornell.edu> <1994Sep25.230427.29778@iitmax.iit.edu> <1994Sep27.020633.16498@cs.cornell.edu> <1994Sep27.214521.30117@iitmax.iit.edu>
Date: Thu, 29 Sep 1994 05:17:16 GMT
Lines: 81

sanders@iitmax.iit.edu (Greg Sanders) writes:

>Well, my own reaction is that statistical analysis of the document simply
>isn't understanding in any real sense.  We appear to disagree here, and
>I think I do understand what you are saying.

I would claim it's a matter of degree, not of kind.

>>Can you give me an example of a system operating on general text that
>>you consider more of an NLU success?

>I think the message-understanding conferences provide various
>examples. 

I was asking about general text.  The MUC/TIPSTER folks are still
operating in very limited domains.  They've made great progress in the
past couple of years in separating out the domain-specific parts of
their approaches, but the last I heard they were still aiming at
an expert man-month to switch domains. 

>Statistical approaches have some impressive successes, and the reasons
>why deserve careful study, but you can also see what sorts of 
>tasks they will *never* perform.  We probably want a hybrid approach.

I agree.  I think the TIPSTER Phase 2 projects (basically joining IR
and MUC participants in one system) will eventually yield some
nice results.  (But I also think advances in pure statistical methods
will offer much more improvement during that same time period.)

>The question of whether or not statistical analysis of text constitutes
>"understanding" seems to be an interesting philosophical divide.

One thing that has to be remembered is that statistical IR works
because documents, paragraphs, etc., are always being analyzed within
some context.  The power lies not in splitting text up into words, but
in being able to tell what the important words and combinations of
words are, based upon the context.  It's in this ability to determine
importance that "understanding" occurs, at least to some extent.


The question of where meaning occurs within a text can be debated
endlessly.  But it's clear that words plus a context can give you an
awful lot of meaning, even without a parse.  As a non-IR example,
consider speed-readers.  They don't parse the text, but an
ongoing context plus occasional words are sufficient to get most
of the meaning out of a text.

Statistical IR can give you ways of attacking the question of how
much meaning lies where, and what sorts of words are helpful in
representing meaning.  Traditional NLP has problems doing this.

For instance, in some circumstances, IR can predict how much the
quality of our meaning representation will improve by adding more
terms to the representation:

Suppose we have a query plus some number of documents whose
relevance to the query is known.  Improve the query by adding the X
terms which most often occur in the documents judged relevant.
Averaged over many queries, it turns out
        Effectiveness = E' + c log (X)
where E' and c are constants and X ranges between, say, 2 and 300.  
(And by a linear relationship between effectiveness and log (X)
I don't mean you can draw a line among 7 or 8 points and half
of them are above and half below.  I mean you can play connect
the dots between those points and end up with a straight line.)
The fact that such a strong mathematical relationship exists
between effectiveness and number of added terms is very intriguing.
This is in Buckley, Salton, Allan,  SIGIR 94, "The Effect of Adding
Relevance Information in a Relevance Feedback Environment".
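The expansion step described above can be sketched in a few lines.
This is only a toy version (the judged-relevant documents are
hypothetical, and real systems like SMART use weighted term-selection
formulas rather than raw document counts):

```python
# Toy sketch of relevance-feedback query expansion: add to the query
# the X terms that occur in the most judged-relevant documents.
from collections import Counter

relevant_docs = [       # hypothetical documents judged relevant
    "feedback improves retrieval effectiveness".split(),
    "relevance feedback adds query terms".split(),
    "adding terms improves effectiveness".split(),
]

query = {"retrieval", "feedback"}

def expand(query, relevant_docs, X):
    counts = Counter()
    for d in relevant_docs:
        counts.update(set(d))   # count documents a term appears in
    # rank candidate terms by document count, skipping existing query terms
    candidates = [t for t, _ in counts.most_common() if t not in query]
    return query | set(candidates[:X])

expanded = expand(query, relevant_docs, 3)
```

Running the effectiveness measurement for increasing X and plotting it
against log(X) is how the straight-line relationship above shows up.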

There is no question that a lot of information is carried by
representing text as a weighted list of terms.  And the above
shows that we can explore how that information is carried.  If we
can explain the above result, we'll be much farther along the way
to understanding how a natural language text conveys information.

                                    ChrisB
-- 
Chris Buckley                   Dept of Computer Science
chrisb@cs.cornell.edu           Upson Hall, Cornell University
(609)  275-4691                 Ithaca, NY   14852
