Newsgroups: sci.lang
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!udel!rochester!galileo.cc.rochester.edu!prodigal.psych.rochester.edu!roberto
From: roberto@prodigal.psych.rochester.edu (Roberto Zamparelli)
Subject: Looking for texts in many different languages
Message-ID: <1995Mar19.202336.20045@galileo.cc.rochester.edu>
Summary: Looking for samples of texts in many different languages
Keywords: corpus-based-text-analysis 
Sender: news@galileo.cc.rochester.edu
Nntp-Posting-Host: prodigal.psych.rochester.edu
Organization: University of Rochester - Rochester, New York
Date: Sun, 19 Mar 95 20:23:36 GMT
Lines: 31

Hello,

I need help collecting a wide range of small language corpora.

I am a linguistics student and the instructor for an introductory
programming class. As a final project, I am thinking of assigning an
automatic ``language recognizer''.

This is a simple program that looks at short texts (2-3 pages)
written in as many different languages as possible, extracts
statistical patterns from these texts, and is then (hopefully) able to
categorize a new text as belonging to one of the languages it has
seen.
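
To make the idea concrete, here is a minimal sketch in Python of one
way such a recognizer could work. It assumes character-trigram
frequencies as the "statistical patterns" and cosine similarity as the
comparison; both are only one possible choice. The function names and
the tiny sample sentences are made up for illustration, and real
training texts would be the 2-3 page samples described above.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Return the relative frequency of each character trigram in text."""
    # Lowercase and collapse whitespace so layout does not affect counts.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def train(samples):
    """Build one trigram profile per language from {language: text}."""
    return {lang: trigram_profile(text) for lang, text in samples.items()}


def classify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    target = trigram_profile(text)
    return max(profiles,
               key=lambda lang: cosine_similarity(target, profiles[lang]))


# Toy training data (hypothetical; far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sits on the mat while the children are playing in the garden",
    "german": "der schnelle braune fuchs springt ueber den faulen hund und "
              "die katze sitzt auf der matte waehrend die kinder im garten "
              "spielen",
    "italian": "la volpe veloce salta sopra il cane pigro e il gatto siede "
               "sul tappeto mentre i bambini giocano nel giardino",
}

if __name__ == "__main__":
    profiles = train(SAMPLES)
    print(classify("die kinder und der hund sind im garten", profiles))
```

With samples this small the result is fragile; longer training texts
and more languages are exactly what I am hoping to collect.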

To do this, I am looking for pointers to Internet sites where I
could find short samples of as many languages as possible.
Ideally, characters that are not in the English alphabet should be
rendered with plain ASCII substitutes rather than with control
characters (cf. the German use of "ue" to write "u" with umlaut).
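
If a sample does contain such characters, they could be converted
before use. The following is a small sketch in Python of the German
convention mentioned above; the table and the function name are my own
illustration, cover German only, and would need extending for other
languages.

```python
# Hypothetical substitution table for German; other languages would
# need their own entries.
ASCII_FALLBACKS = {
    "\u00e4": "ae",  # a with umlaut
    "\u00f6": "oe",  # o with umlaut
    "\u00fc": "ue",  # u with umlaut
    "\u00c4": "Ae",  # A with umlaut
    "\u00d6": "Oe",  # O with umlaut
    "\u00dc": "Ue",  # U with umlaut
    "\u00df": "ss",  # sharp s
}


def to_ascii(text):
    """Replace each listed character with its ASCII spelling."""
    return "".join(ASCII_FALLBACKS.get(ch, ch) for ch in text)
```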

If you know of such sites, I would appreciate it if you could send me
the address at:

roberto@ling.rochester.edu

Many thanks,

Roberto Zamparelli
Dept. of Linguistics
University of Rochester
roberto@ling.rochester.edu
