Newsgroups: comp.ai.nat-lang,sci.lang
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!nntp.sei.cmu.edu!news.psc.edu!hudson.lm.com!news.math.psu.edu!news.cac.psu.edu!newsserver.jvnc.net!newsserver2.jvnc.net!howland.reston.ans.net!tank.news.pipex.net!pipex!uknet!newsfeed.ed.ac.uk!edcogsci!steve
From: steve@cogsci.ed.ac.uk (Steve Finch)
Subject: Re: How to "parse" e-mail messages for use in statistical NLP?
Message-ID: <DDIL78.2o6@cogsci.ed.ac.uk>
Organization: Centre for Cognitive Science, Edinburgh, UK
References: <40vraj$kus@uhura.kurz-ai.com> <41017c$6u4@age.cs.columbia.edu>
Date: Fri, 18 Aug 1995 16:16:18 GMT
Lines: 42
Xref: glinda.oz.cs.cmu.edu comp.ai.nat-lang:3739 sci.lang:42192

radev@news.cs.columbia.edu (Dragomir R. Radev) writes:

>some suggestions:

>1. strip the headers using "formail" (use archie to find where it is
>archived).

For a quick fix, just look for the first blank line; everything above
it is header.  Works on all my email files and all newsgroup messages
I've ever seen.
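
A minimal sketch of the blank-line approach in Python.  The function
name strip_headers is just for illustration, and it assumes one
message per string:

```python
def strip_headers(message: str) -> str:
    """Return the body: everything after the first blank line.

    If there is no blank line at all, the message is returned unchanged.
    """
    lines = message.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "":          # first blank line ends the headers
            return "\n".join(lines[i + 1:])
    return message


if __name__ == "__main__":
    print(strip_headers("From: a@b\nSubject: test\n\nHello\nWorld"))
```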

>2. remove all lines with no spaces (uuencoded stuff)

[but only those of at least 50 characters]
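
Roughly, in Python (a sketch only; drop_encoded_lines and the
min_length parameter are made-up names, with 50 taken from the
threshold above):

```python
def drop_encoded_lines(lines, min_length=50):
    """Remove lines that contain no spaces and are at least min_length long."""
    return [
        ln for ln in lines
        if not (len(ln) >= min_length and " " not in ln)
    ]


if __name__ == "__main__":
    sample = ["M" + "X" * 60, "OK", "an ordinary line of text"]
    print(drop_encoded_lines(sample))
```

Short space-free lines such as "OK" survive because they fall under
the length threshold.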

>3. remove all lines starting with a quotation symbol (">",":",etc.)

Unfortunately there are a great many ways of quoting text on Usenet.
I guess you have to accept that you can get 95% very easily and the
remaining 5% more slowly.  Some editors indent quoted stuff (with or
without an initial telltale character), so it's slightly more
difficult in this case.
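
A sketch of the easy 95% in Python.  The set of quote characters is an
assumption (">" and ":" come from the suggestion above, "|" is my own
addition), and text that is merely indented with no marker character
is not caught:

```python
import re

# Optional leading whitespace, then one quote character.
# The character set is illustrative, not exhaustive.
QUOTE_RE = re.compile(r"^\s*[>:|]")


def drop_quoted_lines(lines):
    """Remove lines that look like quoted text."""
    return [ln for ln in lines if not QUOTE_RE.match(ln)]


if __name__ == "__main__":
    msg = ["> quoted", "  > indented quote", ": other style", "my reply"]
    print(drop_quoted_lines(msg))
```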

You probably also want to recognise footers and get rid of them.  I do
this by looking for lines starting with "--", "**" or "==" and cutting
off everything below.  Doesn't work in every case (it also cuts short
many automated postings such as FAQs etc., but you don't usually want
to consider these in any case).  Alternatively, look for the last
paragraph in the document and cut it off; you'd usually only lose a
name even if you got it wrong.
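
Both footer heuristics, sketched in Python.  The function names are
hypothetical, and I assume the marker line itself goes along with
everything below it:

```python
FOOTER_PREFIXES = ("--", "**", "==")


def cut_footer(lines):
    """Cut at the first line starting with a footer marker."""
    for i, ln in enumerate(lines):
        if ln.startswith(FOOTER_PREFIXES):
            return lines[:i]
    return lines


def drop_last_paragraph(lines):
    """Alternative: remove the final blank-line-separated paragraph."""
    end = len(lines)
    while end > 0 and not lines[end - 1].strip():   # skip trailing blanks
        end -= 1
    while end > 0 and lines[end - 1].strip():       # remove last paragraph
        end -= 1
    return lines[:end]


if __name__ == "__main__":
    body = ["Some text.", "", "Cheers,", "", "Steve.", "-- ", "sig here"]
    print(cut_footer(body))
    print(drop_last_paragraph(cut_footer(body)))
```

Note that drop_last_paragraph returns an empty list for a message that
is only one paragraph long, so you may want to guard against that.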

Also, note well that in our system at least
/usr/spool/news/*.. contains lots of repeat files from crossposted
articles, so searching through all the files gives many repeats which
are either wasteful or undesirable.  Hash the Xref: field to get rid
of these.
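
One way to do the hashing, sketched in Python with a set keyed on the
Xref: header.  It assumes crossposted copies carry identical Xref:
lines; articles with no Xref: header are all kept.  The names xref_of
and unique_articles are made up for the example:

```python
def xref_of(message):
    """Return the value of the Xref: header, or None if there isn't one."""
    for line in message.splitlines():
        if not line.strip():
            break                        # blank line: end of headers
        if line.lower().startswith("xref:"):
            return line.split(":", 1)[1].strip()
    return None


def unique_articles(messages):
    """Yield each article once, keyed on its Xref: header."""
    seen = set()
    for msg in messages:
        key = xref_of(msg)
        if key is not None:
            if key in seen:
                continue                 # crossposted copy already seen
            seen.add(key)
        yield msg


if __name__ == "__main__":
    a = "Xref: host comp.ai.nat-lang:1 sci.lang:2\n\nbody"
    print(len(list(unique_articles([a, a]))))
```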

Also I get rid of every line containing `@' (since these are usually
things like "x@y (XY) writes:" or other bits of noise).
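
In Python this is a one-liner (drop_address_lines is just an
illustrative name):

```python
def drop_address_lines(lines):
    """Remove every line containing an '@' character."""
    return [ln for ln in lines if "@" not in ln]


if __name__ == "__main__":
    text = ["x@y (XY) writes:", "Actual content here."]
    print(drop_address_lines(text))
```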

Cheers,

Steve.
