Newsgroups: comp.speech
Path: pavo.csi.cam.ac.uk!warwick!uknet!pipex!howland.reston.ans.net!news.moneng.mei.com!uwm.edu!fnnews.fnal.gov!att-in!att-out!walter!din!spiegel
From: spiegel@din.bellcore.com (Murray Spiegel)
Subject: Re: phonetic info REWORDED (long)
Message-ID: <CKCGKx.Mxn@walter.bellcore.com>
Sender: news@walter.bellcore.com
Nntp-Posting-Host: din.bellcore.com
Organization: Bellcore (Bell Communications Research)
References:  <karenj.759727711@mullian.ee.Mu.OZ.AU>
Date: Fri, 28 Jan 1994 14:36:33 GMT
Lines: 116

|> .... here's literally what my friend requires:
|> 1. To store a person's full name using "phonetic spelling" ...
|> 2. Because there are no tags, ... need an algorithm ... to determine which
|> 	name is the first name and which name is the surname.

Let me address Q #2 first.

  For a problem in a completely different application context, I briefly 
investigated identifying first names vs last names (surnames).  
My comments strictly apply only to names of people living in the USA,
although I suspect the same conclusions will apply in many other countries.

The identification cannot be done with any reasonable accuracy because of the very significant
overlap between the sets of first and last names.   There are very few
names that can be _unambiguously_ identified as a first name or last name.
Thus, the only thing achievable most of the time would be 
a probabilistic answer that will have low confidence values for much
of the population.  This is unlikely to be useful.
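
A minimal sketch of what such a probabilistic answer could look like, in
Python.  It assumes both counts for a name come from the same population;
the function names and the counts in the usage note are invented for
illustration and are not taken from any database mentioned in this post.

```python
def first_name_probability(first_count, surname_count):
    """Estimate P(name is a first name) from two usage counts.

    Assumption: both counts are drawn from the same population, so they
    can be compared directly.
    """
    total = first_count + surname_count
    if total == 0:
        return 0.5  # unseen name: no evidence either way
    return first_count / total


def confidence(p):
    """Map a probability to 0..1, where 0 means a coin flip and 1 means certain."""
    return abs(p - 0.5) * 2
```

With hypothetical counts of 600 first-name uses and 400 surname uses,
first_name_probability(600, 400) gives 0.6 and the confidence is only
about 0.2.  Any name used heavily in both roles lands near that
uninformative middle.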

It isn't hard to think of several examples of famous people to make
the point:

Ray Charles, Michael Jackson, Clarence Thomas, 
Warren Christopher, Robert Morris, Dean Martin.

Every one of these names can be a first name, and every one also occurs
as a last name, most of them commonly:

RANK: NAME      # HOUSEHOLDS IN US
12:   MARTIN       194059
14:   THOMAS       182078
17:   JACKSON      159754
44:   MORRIS        90836
141:  WARREN        44111
174:  RAY           38753
217:  DEAN          32244
737:  MICHAEL       11210
829:  CHARLES       10116
1148: CHRISTOPHER    7523
2431: ROBERT         3653
49317:CLARENCE        123
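
The same point as a short Python sketch, using only the household counts
listed above.  The function name surname_hits is made up for this
illustration; a real system would need a far larger table.

```python
# Surname household counts copied from the table above.
SURNAME_HOUSEHOLDS = {
    "MARTIN": 194059, "THOMAS": 182078, "JACKSON": 159754, "MORRIS": 90836,
    "WARREN": 44111, "RAY": 38753, "DEAN": 32244, "MICHAEL": 11210,
    "CHARLES": 10116, "CHRISTOPHER": 7523, "ROBERT": 3653, "CLARENCE": 123,
}

FAMOUS = ["Ray Charles", "Michael Jackson", "Clarence Thomas",
          "Warren Christopher", "Robert Morris", "Dean Martin"]


def surname_hits(full_name):
    """Return (token, household count) for each token found in the surname table."""
    return [(token, SURNAME_HOUSEHOLDS[token])
            for token in full_name.upper().split()
            if token in SURNAME_HOUSEHOLDS]


for name in FAMOUS:
    # Both tokens of every example appear in the surname table, so a
    # surname lookup alone cannot say which token is the last name.
    print(name, surname_hits(name))
```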


Two more sets of data to emphasize the point.

a) Here are the most frequent FIRST names in my databases, 
and their corresponding SURname counts for the US:

ROBERT    3653
WILLIAM   3098
JOSEPH   13994
RICHARD  14184
THOMAS  182078
MICHAEL  11210
JAMES    57619
GEORGE   32862
CHARLES  10116
DAVID    10653
EDWARD    1497
FRANK    22314
PAUL     20627

b) And the most frequent SURnames in the US, with their corresponding
FIRST name counts in my databases (the absolute numbers are lower
because my name databases are nowhere near comprehensive with
respect to the USA):

SMITH      16
JOHNSON     6
WILLIAMS   13
BROWN       5
JONES      18
MILLER      5
DAVIS      17
ANDERSON    5
WILSON     50
MOORE       4
TAYLOR     16
MARTIN    988
THOMPSON    5
THOMAS   5830
WHITE       4
CLARK      60
JACKSON    17
HARRIS     36
LEWIS     316

	Side point:
 I must add for those of you who would like me to search these
 databases for personal interests:   These databases are proprietary to
 Bellcore's clients, the regional telephone companies, and unfortunately
 we cannot share them or their contents with anyone else - such as those
 interested in genealogy, locating friends, etc.

 However, CD-ROMs based on scanned telephone directories are now
 commonly available from some companies.  Don't ask me for addresses;
 ask a good reference librarian for pointers.

For your friend I have this suggestion:  Why can't they add a special, 
"out-of-band" character when storing the phonetic representation of the names 
to encode whether the name was a first name or last name?

Thus, my name (murray spiegel) would be stored as: /FmRi/ /Lspigxl/  
Note: I use schwa-l /xl/ rather than a syllabic /L/, so that "L"  
can represent "last name".   I know you said there are no tags, 
but perhaps given this information regarding the reliability of identifying 
first vs last names, they'll reconsider using some form of tag.
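
A minimal Python sketch of that tagging scheme.  It assumes the tag is
always the first character inside the slashes and that uppercase "F" and
"L" are reserved for it; the function names tag_name and untag_name are
hypothetical, not part of any existing system.

```python
FIRST_TAG = "F"  # out-of-band marker for a first name
LAST_TAG = "L"   # out-of-band marker for a last name


def tag_name(phonemes, is_last):
    """Prefix a phonetic string with its role tag and wrap it in slashes."""
    tag = LAST_TAG if is_last else FIRST_TAG
    return "/" + tag + phonemes + "/"


def untag_name(stored):
    """Split a stored string into ('first' or 'last', phonemes)."""
    body = stored.strip("/")
    if not body or body[0] not in (FIRST_TAG, LAST_TAG):
        raise ValueError("missing first/last tag: %r" % stored)
    role = "last" if body[0] == LAST_TAG else "first"
    return role, body[1:]
```

Here tag_name("mRi", False) gives "/FmRi/" and tag_name("spigxl", True)
gives "/Lspigxl/", matching the example above.  Because the tag is read
by position, the decoder never has to guess which name is the surname.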

As to Q #1:  For accurate name pronunciation, there are a few synthesis systems
that purport to do an adequate job.  Bellcore's ORATOR synthesizer has 
consistently been rated the best at pronouncing names (at least American ones).
(See the FAQ for licensing info, or contact alin1@panix.com for details.)

Karen: For someone local to you who has ORATOR, your friend might want 
to talk to Dr Julie Vonwiller at the EE dept in U of Sydney.

I hope this is helpful.
