Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!news.duke.edu!news-feed-1.peachnet.edu!gatech!newsxfer.itd.umich.edu!nntp.cs.ubc.ca!fornax!jamie
From: jamie@cs.sfu.ca (Jamie Andrews)
Subject: Information retrieval, statistics- vs. AI-based
Message-ID: <1994Sep29.165631.10000@cs.sfu.ca>
Organization: Faculty of Applied Science, Simon Fraser University
References: <3608t2$eo9@news.cais.com> <1994Sep25.174751.28787@cs.cornell.edu> <366k2s$500@redwood.cs.scarolina.edu> <36ci4i$if6@narnia.ccs.neu.edu>
Date: Thu, 29 Sep 1994 16:56:31 GMT
Lines: 53

     (I changed the Subject line to reflect what most of the
discussion is about now.  At least I assume that this new
acronym IR means Information Retrieval!)

In article <36ci4i$if6@narnia.ccs.neu.edu>,
Carole Hafner <hafner@ccs.neu.edu> wrote:
>I expect that the near future will see targeted
>NL systems in restricted domains, attached to a statistical IR "front
>end" to select which documents the NL systems should be applied to.
>This type of organization has already been reported in several
>systems described in MUC Proceedings.

     The discussion here suggests another line of inquiry
integrating statistical with AI methods:  applying statistical
methods, but to richer data derived from the text -- data
possibly extracted with traditional AI NLU techniques.
One can imagine a spectrum of possible systems, from those that
rely heavily on statistics and not very much on AI to those that
rely heavily on AI and not very much on statistics:

- Systems which do no NLU analysis (e.g. counting "set",
  "sets", and "setting" as different words in lists of word
  frequencies);

- Systems which do some lexical analysis (e.g. counting "set",
  "sets", and "setting" as different forms of the same word,
  but not figuring out whether they are used as nouns, verbs,
  or adjectives);

- Systems which do some (robust) parsing, guessing how likely it
  is that a particular word in a particular sentence is being
  used as a noun, verb, etc., and using this information to
  produce finer statistics;

- Systems which incorporate more details in their lexicons to
  try to guess how likely a particular sense of a word or phrase
  is intended (e.g. the many different meanings of "set",
  possibly with particles as in "set up", "set off");

- Systems which attempt to parse entire sentences wherever
  possible and try to guess how likely it is that a particular
  concept is the topic of a particular sentence; and finally

- Systems which do fully-automated high-quality NLU and list the
  most prominent concepts discussed in the document, derived
  from knowledge-based heuristics alone, with no statistical
  component.
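     The low end of this spectrum is easy to make concrete.
Here is a minimal sketch (in Python) of the first three levels:
raw surface-form counting, counting by lemma, and weighting each
occurrence by a guessed part-of-speech probability.  The lemma
table and noun probabilities are invented toy data -- a real
system would use a morphological lexicon and a robust parser
that conditions on sentence context.

```python
from collections import Counter

# Hypothetical mini-lexicon mapping surface forms to a lemma.
# A real system would use a full morphological lexicon.
LEMMAS = {"set": "set", "sets": "set", "setting": "set"}

# Invented, context-free guesses at P(noun) for each form.  A
# robust parser would condition these on the surrounding words.
POS_NOUN_PROB = {"set": 0.4, "sets": 0.3, "setting": 0.6}

def surface_counts(tokens):
    """Level 1: every distinct surface form is a distinct word."""
    return Counter(tokens)

def lemma_counts(tokens):
    """Level 2: fold inflected forms into one lemma."""
    return Counter(LEMMAS.get(t, t) for t in tokens)

def noun_weighted_counts(tokens):
    """Level 3: weight each occurrence by its guessed P(noun)."""
    counts = Counter()
    for t in tokens:
        counts[LEMMAS.get(t, t)] += POS_NOUN_PROB.get(t, 0.0)
    return counts

text = "the setting sets the tone and the set is ready".split()
print(surface_counts(text)["set"])        # 1 -- forms counted apart
print(lemma_counts(text)["set"])          # 3 -- all forms folded in
print(noun_weighted_counts(text)["set"])  # about 1.3 (floating point)
```

The point of the toy example is just that the statistics get
"finer" as more NLU is applied: the same three occurrences of
"set" yield different numbers at each level.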

     I guess there must have been some work on this kind of
thing -- any references?

--Jamie.
  jamie@cs.sfu.ca
"Make sure Reality is not twisted after insertion"
