Newsgroups: sci.lang
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!udel!gatech!swrinde!cs.utexas.edu!utnut!nott!cunews!freenet.carleton.ca!FreeNet.Carleton.CA!ac355
From: ac355@FreeNet.Carleton.CA (David Solly)
Subject: Need multilingual search technique
Message-ID: <D09Awo.9rM@freenet.carleton.ca>
Sender: ac355@freenet2.carleton.ca (David Solly)
Reply-To: ac355@FreeNet.Carleton.CA (David Solly)
Organization: The National Capital FreeNet
Date: Sat, 3 Dec 1994 22:38:48 GMT
Lines: 41



	I realize that the question below deals mostly with databases and
database programming; however, I am hoping that the linguists among you
might have some suggestions on how to tackle this problem.


     I am trying to create a database of individual selections contained
within a large corpus of early music sound recordings.  What I am finding
is that the spelling of a song title changes from epoch to epoch and from
area to area so that the title of a selection may come up as:  "My Lady
Greay's Dump" or "Mi Laidie Grayes Domppe" or "My Lady Greys Doumpe" etc. 
(A "dump" was a kind of Renaissance dance).  A search for one version of
the title does not come up with the other versions of the title.  What I
would like to build into the database is a feature which would
automatically search for "variations on a theme" especially if there are
no immediate hits upon what my intended users enter as a title. 
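
     As a minimal sketch of that fallback behaviour, assuming Python and
its standard difflib and unicodedata modules (neither is part of dBase IV;
the names normalize and search_titles and the 0.6 cutoff are illustrative
assumptions, not a tested design), a search might try an exact match on a
simplified form of the title first and rank near matches only when that
fails:

```python
import difflib
import unicodedata


def normalize(title):
    """Strip accents and punctuation, fold case, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    kept = [c for c in decomposed if not unicodedata.combining(c)]
    cleaned = "".join(
        c.lower() if c.isalnum() or c.isspace() else "" for c in kept
    )
    return " ".join(cleaned.split())


def search_titles(query, titles, cutoff=0.6):
    """Return exact matches if any; otherwise near matches, best first.

    The cutoff is an assumed starting value and would need tuning
    against the real catalogue.
    """
    key = normalize(query)
    exact = [t for t in titles if normalize(t) == key]
    if exact:
        return exact
    scored = [
        (difflib.SequenceMatcher(None, key, normalize(t)).ratio(), t)
        for t in titles
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored if score >= cutoff]
```

     Called with a query such as "My Lady Grey's Dump", this finds no exact
hit and falls back to the similarity ranking.  Character-level similarity
knows nothing about pronunciation, so it is only a rough stand-in for the
"variations on a theme" search described above.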
 
     I realize that perhaps I could avoid this problem by standardizing
the spelling of each title as I enter it;  however, one of the reasons I
do not want to go the "standardize spelling" route is because many of the
titles are in languages other than English, e.g. Mediaeval and Renaissance
Italian, French, Spanish and German, in which I do not have enough
background knowledge to attempt normalization of the spelling. 
 
     I have already tried the Soundex feature in dBase IV.  Perhaps I am
not using this feature correctly but so far I have found that it only
works on very minor deviations from modern English and crashes on any text
that contains high-ASCII codes.  Furthermore, some of the earlier and
dialect versions of English deviate so far from Modern English that they
are languages unto themselves.  I would welcome any suggestions on how to
expand Soundex so it can handle this kind of situation or perhaps suggestions
for a different technique for searching this kind of database.  I thank
you all in advance. 
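
     For reference, one possible way of extending Soundex is sketched
below in Python.  It is a simplified version of the classic English
letter-to-digit coding, applied word by word after accents are stripped so
that accented characters do not cause a failure.  The function names
soundex and title_key are illustrative assumptions, this is not how the
dBase IV function is implemented, and the English consonant groups are
unlikely to suit Italian, French, Spanish or German without changes.

```python
import unicodedata

# Classic Soundex consonant groups; vowels, h, w and y carry no code.
CODES = {}
for group, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in group:
        CODES[ch] = digit


def soundex(word):
    """Four-character Soundex-style code for a single word."""
    decomposed = unicodedata.normalize("NFKD", word).lower()
    # Keep only plain a-z; this drops accents, apostrophes and digits.
    letters = [c for c in decomposed if "a" <= c <= "z"]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate same-coded consonants
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def title_key(title):
    """Join the per-word codes so whole titles can be compared."""
    return " ".join(soundex(w) for w in title.split())
```

     Under these assumptions the three spellings of the title quoted
earlier all reduce to the same key, so storing title_key alongside each
title would let a search compare keys instead of raw spellings.  Variants
that change the first letter of a word or the number of words would still
be missed.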
 
David Solly
 
--
David Solly                  ac355@freenet.carleton.ca
Ottawa, Ontario, CANADA      FidoNet:  1:163/215
Voice: (613)731-2120
