From tchrist@wraeththu.cs.colorado.edu Tue Dec 21 18:04:09 EST 1993
Article: 25630 of sci.lang
Xref: glinda.oz.cs.cmu.edu sci.lang:25630 comp.lang.perl:23420 comp.programming:7761
Newsgroups: sci.lang,comp.lang.perl,comp.programming
Path: honeydew.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!library.ucla.edu!agate!boulder!wraeththu.cs.colorado.edu!tchrist
From: tchrist@wraeththu.cs.colorado.edu (Tom Christiansen)
Subject: soundex questions
Message-ID: <CI6wA1.1nC@Colorado.EDU>
Sender: news@Colorado.EDU (USENET News System)
Organization: University of Colorado, Boulder
Date: Fri, 17 Dec 1993 17:22:49 GMT
Lines: 166

I have been thinking of soundex matching, except I know nearly nothing
about it.  I do have a function someone posted once that attempts to deal
with it.  I'm not sure whether the algorithm here is "right" or not.  The
algorithm is:

# return the Soundex value of a string using the following rules:
#
#   1) remove W and H
#   2) remove all vowels except in the first position (A E I O U Y)
#   3) recode characters per table:
#           A E I O U Y             0
#           B F P V                 1
#           C G J K Q S X Z         2
#           D T                     3
#           L                       4
#           M N                     5
#           R                       6
#
#   4) if two adjacent digits are now identical, remove one
#   5) truncate to six digits or pad out the result with zeroes to
#   make six digits  
#   6) replace the first digit with the first character from the
#   original word 

Here's what it shows for some sample misspellings:

    S52350  sunstem
    S52350  sunstom
    Z52350  zonstem
    Z52350  zonstum
    S52365  sonsterm
    S52365  sonstrom
    S52365  sonstromb
    S52365  sonstromm
    S52365  sunstorm
    S52365  sunstromb
    S52365  sunstrum
    Z52365  zonstorm
    S53236  sondstrom
    S53236  sondstrum
    S53236  soundstorm
    S53236  soundstromboner
    S53236  sundstrom

Have you ever played with soundex?   What might one do with these?
Well, you should be able to look up hits that are close to you
numerically and suggest them as possible alternatives.  It would
take a different database format of course, but that's ok.
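
As a rough illustration of that lookup idea, here is a minimal sketch in
present-day Perl.  It assumes an in-memory hash instead of a real database;
soundex6, %index and suggest are hypothetical names, soundex6 is a
re-statement of the posted routine for letters-only input, and the lookup
matches identical keys only rather than numerically nearby ones.

```perl
use strict;
use warnings;

# Re-statement of the six-character routine for letters-only input
# (digits are dropped here, which is an assumption of this sketch).
sub soundex6 {
    my ($word) = @_;
    my $p = uc $word;
    $p =~ tr/A-GI-VX-Z//cd;                 # drop H, W and non-letters
    return '' if $p eq '';
    my $first = substr($p, 0, 1);
    $p =~ tr/BFPVCGJKQSXZDTLMNR/111122222222334556/s;   # recode, squeeze runs
    my $rest = substr($p, 1);
    $rest =~ tr/AEIOUY//d;                  # vowels go after the squeeze
    return substr($first . $rest . '000000', 0, 6);
}

# Build a key -> list-of-words index from a small word list.
my %index;
for my $w (qw(sunstorm sunstrum zonstorm sundstrom soundstorm)) {
    push @{ $index{ soundex6($w) } }, $w;
}

# Suggest every known word that shares the query's key.
sub suggest {
    my ($query) = @_;
    my $hits = $index{ soundex6($query) };
    return $hits ? @$hits : ();
}

print join(' ', suggest('sonstrom')), "\n";
```

Keeping a list per key is what makes the "different database format"
necessary: one key maps to many words.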

The problem is that it's not too smart.  
Some questions/issues:

1.  Why does it only produce a six-character return key?  

2.  Why doesn't it collapse the initial character as well (S and Z, P
    and B, etc).   

3.  Some of the consonant clusters could stand being munged up a bit, like
    -mb -nd, etc.

5.  Maybe vowels should have their own series?  

    1.  Y -> I
	W -> U
    2.  Collapse duplicates
    3.  Score remaining vowel clusters into two or three sets,
	based on open/closed:
	    O U OU EU UE AU EAU
	    I E EI IE AE EA AI 
	The problem with A is "father", "cat", "cake".  I'd say 
	more often it's with the latter set than the former.

    I don't think "coil" and "cowl" should be so close, either.

6.  The liquids (L's and R's) in <VOWEL><L or R><CONSONANT> seem too 
    significant, R's perhaps more than L's.  "order" and "odor" are
    closer than it thinks.

7.  What about all the digraphs?  Do you dare think about them
    or not?  TH SH CH GN KN PH GH all come to mind.  The problem
    is that they all can give false readings in medial positions, 
    as in "cathair" versus "catheter".  Perhaps only in initial
    and final positions?  Some should know about leading 
    silent letters and throw them out (GN KN), others map
    into a single letter (PH => F), whereas others just go 
    with whatever series they would normally go in, e.g.
    TH would be in the "D T" series, SH would be in with the 
    S's, etc.  Hm... I guess that's why they throw the H's
    out?   But I don't like this:

	C30000  cot
	C23000  caught

    That might not be able to be done right, since then you 
    have to discern "draught" is closer to "raft" than it 
    is to "route", which is itself closer to "drought".

    Ug.
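
The vowel proposal in item 5 could be sketched roughly as follows.  This is
only one reading of it: the name vowel_class, the labels 'O' and 'I', and
the choice to put a lone A with the second set are assumptions made for
illustration.

```perl
use strict;
use warnings;

# The two vowel sets from item 5; the labels 'O' and 'I' are arbitrary.
my %VOWEL_SET;
$VOWEL_SET{$_} = 'O' for qw(O U OU EU UE AU EAU);
$VOWEL_SET{$_} = 'I' for qw(I E EI IE AE EA AI A);   # lone A: assumption

# Classify one vowel cluster, e.g. "OW" or "EE".
sub vowel_class {
    my ($cluster) = @_;
    my $v = uc $cluster;
    $v =~ tr/YW/IU/;       # step 1: Y -> I, W -> U
    $v =~ tr/AEIOU//s;     # step 2: collapse doubled vowels
    return exists $VOWEL_SET{$v} ? $VOWEL_SET{$v} : '?';   # '?' = not covered
}

print join(' ', map { vowel_class($_) } qw(OW EE EY OI)), "\n";
```

Under this reading the OW of "cowl" lands in the first set, while the OI of
"coil" is in neither list and comes back as '?', so the two words would no
longer share a key.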

Code follows for people wanting to sample it.

#!/usr/bin/perl

while (<>) {
    chomp;
    print &soundex($_), "\t", $_, "\n"; 
} 

# soundex.pl
# by George Armhold <armhold@dimacs.rutgers.edu> 3/22/92
# improvements by Marc Arnold <marc@mit.edu>

# return the Soundex value of a string using the following rules:
#
#   1) remove W and H
#   2) remove all vowels except in the first position (A E I O U Y)
#   3) recode characters per table:
#           A E I O U Y             0
#           B F P V                 1
#           C G J K Q S X Z         2
#           D T                     3
#           L                       4
#           M N                     5
#           R                       6
#
#   4) if two adjacent digits are now identical, remove one
#   5) truncate to six digits or pad out the result with zeroes to
#   make six digits  
#   6) replace the first digit with the first character from the
#   original word 

sub soundex {
# takes a string as an argument, and returns its soundex value

    local($pattern) = @_;

    # upper-case the pattern to normalize matches
    $pattern =~ tr/a-z/A-Z/;

    # remove everything except letters and digits; H and W are removed too
    $pattern =~ tr/A-GI-VX-Z0-9//cd;

    # remove all vowels after 1st letter
    ## substr($pattern, 1, length($pattern)) =~ tr/AEIOUY//d;

    # save first char
    local($first) = substr($pattern, 0, 1);
   
    # replace letters with digits and squish runs of identical digits
    $pattern =~ tr/BFPVCGJKQSXZDTLMNR0-9/1111222222223345560-9/ds;

    # remove all vowels after 1st letter
    substr($pattern, 1, length($pattern)) =~ tr/AEIOUY//d;

    # replace first letter
    substr($pattern, 0, 1) = $first;

    # pad on zeroes if necessary and truncate
    substr($pattern."000000", 0, 6); 
}

1;				# because this is a require'd file
-- 
    Tom Christiansen      tchrist@cs.colorado.edu       
      "Will Hack Perl for Fine Food and Fun"
	Boulder Colorado  303-444-3212


Article 25636 of sci.lang:
Xref: glinda.oz.cs.cmu.edu sci.lang:25636 comp.lang.perl:23433 comp.programming:7767
Newsgroups: sci.lang,comp.lang.perl,comp.programming
Path: honeydew.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!cs.utexas.edu!uunet!spsgate!mogate!newsgate!jims-new!markp
From: markp@jims-new.vlsi-az.sps.mot.com (Mark Pease)
Subject: Re: soundex questions
Message-ID: <1993Dec17.214518.24312@newsgate.sps.mot.com>
Sender: news@newsgate.sps.mot.com
Nntp-Posting-Host: 219.1.74.13
Organization: Motorola/Codex VLSI Design Center, Tempe AZ
References: <CI6wA1.1nC@colorado.edu>
Date: Fri, 17 Dec 1993 21:45:18 GMT
Lines: 36

In article <CI6wA1.1nC@colorado.edu>,
Tom Christiansen <tchrist@wraeththu.cs.colorado.edu> wrote:
>I have been thinking of soundex matching, except I know nearly nothing
>about it. 
....

Soundex (as related in Knuth's The Art of Computer Programming, Vol. 3: Sorting
and Searching, c. 1973, p. 391) was developed to work on people's names, in
applications such as airline reservation systems, "to transform the arguments
[i.e., the surname of someone] into some code that tends to bring together all
variants of the same name."

....
>
>The problem is that it's not too smart.  
>Some questions/issues:
>
>1.  What does it only produce with a six-characters return key? 
....

The method from Knuth does not limit you to 6 characters. In fact, in his
example, he only uses 4! I think it is more a data storage/search speed issue.
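
For comparison, the four-character form can be sketched as below.  It
follows the commonly described rules (keep the first letter, code the
remaining consonants 1 to 6, drop vowels and H, W, Y, merge adjacent letters
with the same code, pad or cut to one letter plus three digits).  The name
soundex4 is made up here, and descriptions differ on whether H and W
separate two same-coded letters, so treat it as a sketch rather than the
reference version.

```perl
use strict;
use warnings;

# Four-character Soundex sketch: one letter followed by three digits.
sub soundex4 {
    my ($name) = @_;
    my $s = uc $name;
    $s =~ tr/A-Z//cd;                       # keep letters only
    return '' if $s eq '';
    my $first = substr($s, 0, 1);
    # Code every letter; vowels, H, W and Y become 0 as placeholders.
    (my $codes = $s) =~
        tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
    $codes =~ tr/0-6//s;                    # squeeze runs of the same code
    $codes = substr($codes, 1);             # first letter stays a letter
    $codes =~ tr/0//d;                      # drop the placeholders
    return substr($first . $codes . '000', 0, 4);
}

printf "%s  %s\n", soundex4($_), $_ for qw(Knuth Kant Euler Ellery Lloyd);
```

In this sketch H and W act as separators just like vowels do; a variant that
ignores them between two same-coded consonants would give different keys for
some names.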

The rest of your points are very interesting.



-- 
Mark Pease                             markp@vlsi-az.sps.mot.com
Motorola CODEX VLSI Design Center
2710 S Roosevelt St.                   Mail Stop: AZ28 BB106
Tempe, AZ 85282         Phone:(602)784-2725    FAX:(602)784-2759


Article 25638 of sci.lang:
Xref: glinda.oz.cs.cmu.edu sci.lang:25638 comp.lang.perl:23443 comp.programming:7768
Newsgroups: sci.lang,comp.lang.perl,comp.programming
Path: honeydew.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!usc!elroy.jpl.nasa.gov!decwrl!netcomsv!netcom.com!jfh
From: jfh@netcom.com (Jack Hamilton)
Subject: Re: soundex questions
Message-ID: <jfhCI7wLu.MI1@netcom.com>
Organization: Netcom - Online Communication Services (408 241-9760 guest)
References: <CI6wA1.1nC@Colorado.EDU>
Date: Sat, 18 Dec 1993 06:27:29 GMT
Lines: 96

tchrist@wraeththu.cs.colorado.edu (Tom Christiansen) wrote:

Well, here we were talking about you on the train just the other day, and
Bang, you post about a subject I'm interested in.  (I decided you ought to
look like Larry Wall and Larry Wall ought to look like you, by the way.) 

>I have been thinking of soundex matching, except I know nearly nothing
>about it.  I do have a function someone posted once that attempts to deal
>it.  I'm not sure whether the algorithm here is "right" or not.

I don't think there is a "right" algorithm, although the one in Knuth is
probably the "standard" algorithm. 

Soundex attempts to map the sound of a name to the spelling of a name, and
how words are pronounced depends on a lot of different things.  It
certainly depends on the language (the standard algorithm wouldn't work very
well for French, for example) and on the regional and personal speech
patterns of the speaker.  Proper names tend to preserve complicated
spellings with simplified pronunciations - think of Chomondeley-Magdalen
(which I've probably misspelled) or Leichester Square.   

>1.  What does it only produce with a six-characters return key?  

That varies according to the implementation.  I'm more accustomed to
5-letter keys.  The theory is probably that if you can't remember how a name
is spelled you also won't be completely sure of how it's pronounced (if
it's long), so you'll increase the number of reasonable hits by keeping the
key short.  

>2.  Why doesn't it collapse the initial character as well (S and Z, P
>    and B, etc).   

Because people usually remember the first letter of a name or can make a
good guess at it, and having that first letter decreases the number of 
probable wrong answers. 

If I were writing a version of the algorithm, I'd convert initial Kn to
just N.  There are probably some others I'd change, but that's the
biggest one. 

>3.  Some of consonant clusters could stand being munged up a bit, like
>    -mb -nd, etc.

Yup.  I'd also convert a final -tion to sn. 
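
The two rewrites mentioned so far (initial Kn to N, final -tion to sn) could
be tried as a pre-pass before the soundex call.  A minimal sketch, assuming
upper-cased input is acceptable; pre_normalize is a hypothetical name and
these two rules are the only ones it applies.

```perl
use strict;
use warnings;

# Hypothetical pre-pass applied to a word before it is handed to soundex.
sub pre_normalize {
    my ($word) = @_;
    my $w = uc $word;
    $w =~ s/^KN/N/;        # initial KN -> N (silent K)
    $w =~ s/TION$/SN/;     # final -TION -> SN
    return $w;
}

print pre_normalize($_), "\n" for qw(Knight station Knuth);
```

Because the first letter is kept verbatim in the key, the Kn rule changes
which bucket a word lands in: "Knight" would then file under N, not K.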

>5.  Maybe vowels should have their own series?  
>
>    1.  Y -> I
>	W -> U
>    2.  Collapse duplicates
>    3.  Score remaining vowel clusters into two or three sets,
>	based on open/closed:
>	    O U OU EU UE AU EAU
>	    I E EI IE AE EA AI 
>	The problem with A is "father", "cat", "cake".  I'd say 
>	more often it's with the latter set than the former.

I don't think so.  People are more likely to get the vowels wrong than the
consonants. 

>    I don't think "coil" and "cowl" should be so close, either.

I think they're pretty close, especially in some accents.  "Cowell" and
"cowl" are very similar. 

>6.  The liquids (L's and R's) in <VOWEL><L or R><CONSONANT> seem too 
>    significant, R's perhaps more than L's.  "order" and "odor" are
>    closer than it things.

The person who thought up the algorithm probably wasn't from Boston, and
pronounced all his/her R's.  "Order" and "Odor" are very different to me, 
but some people pronounce "car" the way I'd pronounce "cah".  Soundex just
loses on those, I guess. 

>7.  What about all the digraphs?

>    But I don't like this:
>
>	C30000  cot
>	C23000  caught
>
>    That might not be able to be done right, since then you 
>    have to discern "draught" is closer to "raft" than it 
>    is to "route", which is itself closer to "drought".
>
>    Ug.

Yeah, ugh.  Perhaps there aren't many names where that difference occurs.  
That's really the type of data you need to try it on, since names are what 
the algorithm was designed to handle. 

-- 

----------------------------------------------------
Jack Hamilton            POB 281107 SF CA 94128  USA 
jfh@netcom.com           kd6ttl@w6pw.#nocal.ca.us.na 


Article 25644 of sci.lang:
Xref: glinda.oz.cs.cmu.edu sci.lang:25644 comp.lang.perl:23458 comp.programming:7772
Newsgroups: sci.lang,comp.lang.perl,comp.programming
Path: honeydew.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!spool.mu.edu!sgiblab!sgigate.sgi.com!olivea!pagesat!news.cerf.net!netlabs!lwall
From: lwall@netlabs.com (Larry Wall)
Subject: Re: soundex questions
Message-ID: <1993Dec19.041550.4229@netlabs.com>
Sender: news@netlabs.com
Nntp-Posting-Host: scalpel.netlabs.com
Organization: NetLabs, Inc.
References: <CI6wA1.1nC@Colorado.EDU> <jfhCI7wLu.MI1@netcom.com>
Date: Sun, 19 Dec 1993 04:15:50 GMT
Lines: 69

In article <jfhCI7wLu.MI1@netcom.com> jfh@netcom.com (Jack Hamilton) writes:
: tchrist@wraeththu.cs.colorado.edu (Tom Christiansen) wrote:
: 
: Well, here we were talking about you on the train just the other day, and
: Bang, you post about a subject I'm interested in.  (I decided you ought to
: look like Larry Wall and Larry Wall ought to look like you, by the way.) 

Not unless you think a cute bald viking looks like a Honda mechanic.  :-)

: >I have been thinking of soundex matching, except I know nearly nothing
: >about it.  I do have a function someone posted once that attempts to deal
: >it.  I'm not sure whether the algorithm here is "right" or not.
: 
: I don't think there is a "right" algorithm, although the one in Knuth is
: probably the "standard" algorithm. 

It's hard to claim that any algorithm is "right" for a problem in fuzzy
logic.  The basic problem with soundex is that it's trying to solve a
number of problems at once, and getting about half of the way there.
There are several sources of error in the process.

	Misperception of spoken sounds.
	Mistranscription of perceived sounds to writing.
	Inadequacy of writing to convey spoken distinctions.
	Quantization boundary effects of the algorithm.

Ideally, the computer should be taking the actual spoken sounds and
computing the distance in "speech" space to all potential matches
(I'll let the linguists argue about whether it should be etic or emic
(not to be confused with emetic :-), and if emic, how you handle
dialectal differences while doing phoneme recognition), then displaying
the list in increasing order of linguistic distance.  The soundex
algorithm has a rather crude notion of distance: it only distinguishes
"short" from "long", just like area codes in the phone system (no pun
intended) back in the days when you could get charged long distance
for calling someone across the street.

Even if you limit yourself to processing written text (this is, after
all, cross-posted to comp.lang.perl), you could probably do much better
with an approximate matching algorithm that tried not to throw so much
information away at the outset, but kept a better notion of linguistic
distance.  One thing the soundex system does do pretty well is
regularizing the dimensionality of the linguistic space.  Perhaps
if each "chunk" of soundex data that currently turns into a byte could
instead be turned into a location in a small space of its own, then
a larger space could be constructed of all the smaller spaces.  The
question then becomes how many different kinds of small spaces you
need.  Minimally, a vowel cluster space and a consonant cluster space,
but you could differentiate word initial and word final, or use alternate
spaces depending on surrounding choices.  The phonologist in me is
starting to go nuts.  How many megabytes am I allowed to use?
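
One simple, well-known stand-in for a distance over written forms is
Levenshtein edit distance.  It is not the phonological space described
above, only a plain character-level measure, but it does let candidates be
listed in increasing order of distance.  A minimal sketch; edit_distance is
a hypothetical name and the word list is taken from the sample misspellings.

```perl
use strict;
use warnings;

# Levenshtein distance: the number of single-character insertions,
# deletions and substitutions needed to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));           # distances from the empty prefix
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                       # substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;      # delete
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;   # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Rank candidates by distance from the query, ties broken alphabetically.
my $query  = 'sonstrom';
my @ranked = sort {
    edit_distance($query, $a) <=> edit_distance($query, $b) or $a cmp $b
} qw(sunstorm sundstrom zonstorm);

printf "%d  %s\n", edit_distance($query, $_), $_ for @ranked;
```

Unlike a soundex key, this keeps every letter, so nothing is thrown away at
the outset; the cost is a comparison against each candidate.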

: Soundex attempts to map the sound of a name to the spelling of a name, and
: how words are pronounced depends on a lot of different things.  It
: certainly depends on the language (the standard algorithm wouldn't work very
: well for French, for example) and on the regional and personal speech
: patterns of the speaker.  Proper names tend to preserve complicated
: spellings with simplified pronunciations - think of Chomondeley-Magdalen
: (which I've probably misspelled) or Leichester Square.   

The construction of the overall space from the small spaces could
probably make some guesses about this sort of thing.  The prototypical
pronunciation of a given name could be stored in a dictionary, and
distances compared with that.

At some point it becomes more efficient to simply ask, "How do you spell that?"

[lERiy ual]
lwall@netlabs.com


Article 25653 of sci.lang:
Xref: glinda.oz.cs.cmu.edu sci.lang:25653 comp.lang.perl:23479 comp.programming:7780
Path: honeydew.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!math.ohio-state.edu!uunet.ca!uunet.ca!ecicrl!clewis
From: clewis@ferret.ocunix.on.ca (Chris Lewis)
Newsgroups: sci.lang,comp.lang.perl,comp.programming
Subject: Re: soundex questions
Message-ID: <4899@ecicrl.ocunix.on.ca>
Date: 20 Dec 93 06:12:08 GMT
References: <CI6wA1.1nC@Colorado.EDU> <jfhCI7wLu.MI1@netcom.com>
Followup-To: sci.lang
Lines: 76

In article <jfhCI7wLu.MI1@netcom.com> jfh@netcom.com (Jack Hamilton) writes:
>tchrist@wraeththu.cs.colorado.edu (Tom Christiansen) wrote:

>>I have been thinking of soundex matching, except I know nearly nothing
>>about it.  I do have a function someone posted once that attempts to deal
>>it.  I'm not sure whether the algorithm here is "right" or not.

>I don't think there is a "right" algorithm, although the one in Knuth is
>probably the "standard" algorithm. 

>Soundex attempts to map the sound of a name to the spelling of a name, and
>how words are pronounced depends on a lot of different things.  It
>certainly depends on the language (the standard algorithm wouldn't work very
>well for French, for example) and on the regional and personal speech
>patterns of the speaker.  Proper names tend to preserve complicated
>spellings with simplified pronunciations - think of Chomondeley-Magdalen
>(which I've probably misspelled) or Leichester Square.   

Agreed.  The best way to consider soundex is as a crude and primitive hack
that does surprisingly well at the job it is intended for - finding English
names when the spelling is inexact.  Soundex wasn't created in
some analytic fashion - it simply embodies a number of desirable "features"
and compromises that work reasonably well doing the job.

In order to use it properly, you have to recognize its limitations, and
put considerably more thought into the algorithms that call the soundex
lookup than a simple 'desired_record = dbfetch("fuzzykey")'.

About 13 years ago I implemented and used 4-character soundex in a corporate
phone directory lookup system.  About 6,000 users then, which has reached about
25,000 now.  The soundex was easy.  Data structures and ancillary driving
algorithms were the hard part.  The only thing I'd change now is to use
5-character soundex.

Effective use of soundex sometimes needs help from the user too - sometimes
you intentionally misspell things, type in only the first bit etc.

If there's interest, I can outline some of the algorithms here.

>>2.  Why doesn't it collapse the initial character as well (S and Z, P
>>    and B, etc).   

>Because people usually remember the first letter of a name or can make a
>good guess at it, and having that first letter decreases the number of 
>probable wrong answers. 

>If I were writing a version of the algorithm, I'd convert initial Kn to
>just N.  There are probably some others I'd change, but that's the
>biggest one. 

Yes, it's REALLY embarrassing when your brand new snazzy algorithm doesn't
work on the algorithm author's name ;-)

>>3.  Some of consonant clusters could stand being munged up a bit, like
>>    -mb -nd, etc.

>Yup.  I'd also convert a final -tion to sn. 

Not usually a concern with names...

>>7.  What about all the digraphs?

>>    But I don't like this:

>>	C30000  cot
>>	C23000  caught

>>    That might not be able to be done right, since then you 
>>    have to discern "draught" is closer to "raft" than it 
>>    is to "route",

Depends on where you're from...

>>which is itself closer to "drought".

"ght" -> "t" would work in these cases, because the vowels are ignored.

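The "ght" -> "t" rewrite can be checked against the six-character routine
with a small sketch.  soundex6 below is a re-statement of the posted function
for letters-only input, and soundex6_ght is a hypothetical wrapper; both
names are assumptions of this example.

```perl
use strict;
use warnings;

# Re-statement of the six-character routine for letters-only input
# (digits are dropped here, which is an assumption of this sketch).
sub soundex6 {
    my ($word) = @_;
    my $p = uc $word;
    $p =~ tr/A-GI-VX-Z//cd;                 # drop H, W and non-letters
    return '' if $p eq '';
    my $first = substr($p, 0, 1);
    $p =~ tr/BFPVCGJKQSXZDTLMNR/111122222222334556/s;   # recode, squeeze runs
    my $rest = substr($p, 1);
    $rest =~ tr/AEIOUY//d;                  # vowels go after the squeeze
    return substr($first . $rest . '000000', 0, 6);
}

# Hypothetical wrapper: rewrite GHT to T before the H is thrown away.
sub soundex6_ght {
    my ($word) = @_;
    (my $w = uc $word) =~ s/GHT/T/g;
    return soundex6($w);
}

printf "%s  %s\n", soundex6_ght($_), $_ for qw(cot caught draught drought);
```

With the rewrite in place "cot" and "caught" share a key, and so do
"draught" and "drought", since only the vowels tell that pair apart.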

