Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!cam-news-feed3.bbnplanet.com!cam-news-hub1.bbnplanet.com!news.bbnplanet.com!cpk-news-hub1.bbnplanet.com!feed1.news.erols.com!howland.erols.net!rill.news.pipex.net!pipex!uknet!usenet1.news.uk.psi.net!uknet!uknet!newsfeed.ed.ac.uk!edcogsci!cnews
From: Chris Brew <chrisbr@cogsci.ed.ac.uk>
Subject: Re: segmentation of text into sentences
X-Nntp-Posting-Host: brodie110
Message-ID: <f3tk9okzkgb.fsf@cogsci.ed.ac.uk>
Sender: chrisbr@brodie
Organization: Centre for Cognitive Science, University of Edinburgh
X-Newsreader: Gnus v5.3/Emacs 19.34
References: <5c5cqs$150@hera.cs.kun.nl> <32e7ad48.0@news.pins.co.uk>
	<slrn5ehik0.7k.jandac@nephilim.eti.pg.gda.pl>
	<32F5F277.5681@harlequin.co.uk> <f3td8ue9sd1.fsf@cogsci.ed.ac.uk>
Date: Fri, 7 Feb 1997 15:48:04 GMT
Lines: 52


David Palmer has kindly made the Satz software available via his 
old home page at Berkeley. He writes:


The forthcoming Computational Linguistics article (329k):
http://http.cs.berkeley.edu/~dpalmer/cl.ps
 
The source code, including some training/crossvalidation data and
other miscellany (for English, but usable for German too) (1MB):
http://http.cs.berkeley.edu/~dpalmer/satzeng.tar
 
The English dictionary (94k):
http://http.cs.berkeley.edu/~dpalmer/diction.eng
 
The English test corpus:
 
http://http.cs.berkeley.edu/~dpalmer/ltest.t (349k)
http://http.cs.berkeley.edu/~dpalmer/wsj2 (837k) [see comment later CB]
http://http.cs.berkeley.edu/~dpalmer/wsj30 (761k) [see comment later CB]
http://http.cs.berkeley.edu/~dpalmer/wsj41 (843k) [see comment later CB]
 
My German News test corpus (589k):
http://http.cs.berkeley.edu/~dpalmer/germnews.test
 
The German word lists I used, obtained from CLR at NMSU (268k):
http://http.cs.berkeley.edu/~dpalmer/diction.deu



and, in response to my question about whether it is safe
to post them, further writes:

Well, technically, the test data is from the ACL/DCI collection and
should only be available to researchers who have that CD-ROM.  But
those same WSJ files have appeared in various forms in other
collections, including TREC.  I (that's David)
can make the files wsj2, wsj30, and
wsj41 available to anyone wishing to obtain and report a direct
comparison of their system against our results (and another sentence
boundary algorithm which will be presented by Jeffrey Reynar and
Adwait Ratnaparkhi at ANLP in March).


All the links except the wsj ones (see the comment above) seem to work OK for me.

Chris
-- 
Email: Chris.Brew@edinburgh.ac.uk
Address:  Language Technology Group, HCRC,
          2 Buccleuch Place,  Edinburgh EH8 9LW, Scotland
Telephone: +44 131 650 4631  Fax: +44 131 650 4587
