Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!howland.reston.ans.net!math.ohio-state.edu!jussieu.fr!univ-lyon1.fr!news.imag.fr!pinea.xerox.fr!news
From: copperma@grenoble.rxrc.xerox.com (Max Copperman)
Subject: Re: Word root extraction in English (q)
Message-ID: <D58GnE.GJB@xerox.fr>
Sender: news@xerox.fr
Nntp-Posting-Host: grand-van.grenoble.rxrc.xerox.com
Reply-To: copperma@grenoble.rxrc.xerox.com
Organization: Rank Xerox Research Centre - Grenoble Laboratory
References: <asorli-0703951729090001@hfmac395.uio.no>
Date: Fri, 10 Mar 1995 16:33:14 GMT
Lines: 39

In article <asorli-0703951729090001@hfmac395.uio.no>, asorli@ilf.uio.no (Are Sørli) writes:
> We are graduate students in AI, currently implementing an automatical text
> summarisation system for electronical texts in English.
> 
> We are in need of an algorithm for approximate root extraction of
> arbitrary English words. We do not use full parsing algorithms or a
> lexicon. Therefore, the part of speech of the word will never be known. We
> do not require that the algorithm be complete or accurate for all cases,
> it should suffice to deal with regular syntactic suffixes etc.
> 
> Any pointer or view would be helpful. Thanks in advance.
> 
> asorli@ilf.uio.no
> holger@hedda.uio.no

There are various stemmers for English.  The Porter stemmer is probably the
best known; it works mostly by chopping off suffixes.  Its output is not
necessarily a word: "comparative" and "comparable" both stem to "compar".
I believe it is in the public domain (it has been reproduced in an IR book
by Frakes).
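
To give a feel for the suffix-chopping idea, here is a toy sketch in Python.
It is NOT the Porter algorithm (which applies several ordered rule steps with
conditions on the remaining stem); the suffix list, the function name
toy_stem, and the minimum stem length are all assumptions made up for this
illustration.

```python
# Toy suffix stripper: an illustration of "chop off suffixes", not Porter.
# The suffix list and min_stem threshold are assumptions for this example.
SUFFIXES = ["ative", "able", "ing", "ed", "es", "s"]


def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    # Try longer suffixes first so "ative" is preferred over shorter matches.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no suffix applied; return the word unchanged


if __name__ == "__main__":
    for w in ["comparative", "comparable", "cats", "sing"]:
        print(w, "->", toy_stem(w))
```

With this suffix list, "comparative" and "comparable" both come out as
"compar", which is not a word, while "sing" is left alone because stripping
"ing" would leave too short a stem.  A real stemmer needs far more careful
rules than this.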

The Lovins stemmer was designed for information retrieval, so it stems where
doing so improves precision and recall (on the corpus for which it was
designed).  It is more sophisticated than the Porter stemmer.  Like the
Porter stemmer, it does not necessarily produce words as output.  It is part
of the SMART IR system, which is not public domain but is freely available
for research purposes.

I don't know of any ftp sites for these; perhaps someone will post them.
Otherwise, email me and I'll try to locate some.

XEROX DDS sells morphological analyzers that do stemming, among other
things.  Unlike the stemmers above, they do produce root words, and they
handle virtually all words correctly.  They can do derivational and/or
inflectional morphology.  They are high-end tools that cost an arm and a leg.

There is another commercial morphological analyzer from InfoSoft that I know
little about other than that it is a competitor of DDS.

Max Copperman

