Newsgroups: comp.speech
Path: lyra.csx.cam.ac.uk!warwick!slxsys!pipex!howland.reston.ans.net!news.moneng.mei.com!hookup!yeshua.marcam.com!MathWorks.Com!zombie.ncsc.mil!romulus.ncsc.mil!afterlife!jpcampb
From: jpcampb@afterlife.ncsc.mil (Joe Campbell)
Subject: Comments: Testing with the YOHO CD-ROM Voice Verification Corpus
Message-ID: <1994Aug25.231157.14095@afterlife.ncsc.mil>
Organization: The Great Beyond
Date: Thu, 25 Aug 1994 23:11:57 GMT
Lines: 354

I would be most grateful for your comments and discussion on the
attached paper.  Furthermore, I'd like to urge you to test your
voice verifiers on the YOHO corpus in the manner described below.
There are some difficult issues to resolve (especially w.r.t.
systems that use different size cohort sets).  I look forward
to your comments and the day when we can fairly compare the
performance of different voice verifiers.

Regards,
Joe
_____________________________________________________________________________
| Dr. Campbell  N3JBC  jpcampb@alpha.ncsc.mil   jpcampb@afterlife.ncsc.mil  |
| Speaking for myself     Happiness = Reality - Expectations, Click & Clack |
|___________________ Sun Mail Tool attachments welcomed ____________________|


TESTING WITH THE YOHO CD-ROM VOICE VERIFICATION CORPUS

Joseph P. Campbell, Jr. <jpcampb@alpha.ncsc.mil>


ABSTRACT

A standard database for testing voice verification systems, called
YOHO, is now available from the Linguistic Data Consortium (LDC). The
purpose of this database is to enable research, spark competition, and
provide a means for comparative performance assessments between
various voice verification systems. A test plan for the suggested use
of the LDC's YOHO CD-ROM for testing voice verification systems is
presented. This plan is based upon ITT's voice verification test
methodology as described in Higgins et al. [1], but differs slightly
in order to match the LDC's CD-ROM version of YOHO and to accommodate
different systems. Test results using YOHO are also presented.


INTRODUCTION

The YOHO voice verification corpus was collected while testing ITT's
prototype speaker verification system in an office environment. This
database is the largest supervised speaker verification database known
to the author. The number of trials and the number of test subjects
were determined to allow testing at the 75% confidence level to
determine whether a system meets 1% false rejection and 0.1% false
acceptance. The test subjects spanned a wide range of ages, job
descriptions, and educational backgrounds. Most subjects were from the
New York City area, although there were many exceptions, including
some nonnative English speakers. A high-quality telephone handset
(Shure XTH-383) was used to collect the speech, and channel
impairments were not added. When the system was used in an enrollment
or verification session, a sampled waveform file was created for each
phrase-length utterance. A subset of these waveform files comprises
the LDC's YOHO CD-ROM.

The LDC release of YOHO was designed, with regard to the quantity and
collection of data, to answer the following question: does a speaker
verification system perform at 0.01% false acceptance and 0.1% false
rejection at 75% confidence with a 50% probability of passing the
test? There are 138 speakers (106 males and 32 females*); for each
speaker, there are 4 enrollment sessions of 24 utterances each, and
10 verification sessions of 4 utterances each. In a text-dependent
speaker verification scenario, phrases are prompted and the claimant
is requested to say them. The syntax used in the YOHO database is
"combination lock" phrases. For example, the prompt might read: "Say:
twenty-six, eighty-one, fifty-seven," where the claimant is to speak
the phrase as three doublets. The LDC YOHO CD-ROM can be summarized
as

o "Combination lock" phrases
o 138 subjects: 106* males, 32* females
o Collected over 3-month period
o Approximately 3-day verification intervals
o Real-world office environment
o 4 enrollment sessions per subject
o 24 phrases per enrollment session
o 10 verification sessions per subject
o 4 phrases per verification session
o Total of 1,380* verification sessions
o 8 kHz sampling with 3.8 kHz bandwidth
o 1.2 gigabytes of data (when uncompressed)


* These counts differ from those given in the CD's 0readme.txt file.
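The summary figures above can be cross-checked with a little
arithmetic. This sketch (plain Python, using only the counts listed)
tallies the sessions and utterances the summary implies:

```python
# Cross-check the corpus summary figures listed above.
speakers = 138
enroll_sessions = 4      # per subject
enroll_phrases = 24      # per enrollment session
verify_sessions = 10     # per subject
verify_phrases = 4       # per verification session

total_verify_sessions = speakers * verify_sessions
total_enroll_utts = speakers * enroll_sessions * enroll_phrases
total_verify_utts = speakers * verify_sessions * verify_phrases

print(total_verify_sessions)                  # 1,380 sessions, as listed
print(total_enroll_utts + total_verify_utts)  # 18,768 utterances in all
```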


ENROLLMENT

Speaker enrollment models should be constructed from enrollment
sessions 1 through 3. Session 4 can be used to determine cohort (also
known as ratio or likelihood [1]) set speakers and for building a
speech segmenter.

Unlike some text-dependent speaker verification systems, not all
possible verification phrases are available from enrollment (this
would lead to excessive enrollment time). Enrollment does, however,
cover the acoustic space of all possible speech that could be prompted
during verification. For example, during enrollment, models for a
given speaker's "fif"-"tee"-"three" can be obtained without actually
collecting "Fifty-Three" by using subwords from other prompts; e.g.,
51, 52, 63, and 73 (minus coarticulation effects). Because of the
difficulty this may cause for some systems, text-independent test
results can also be reported using YOHO.
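As an illustration of that subword coverage, here is a minimal sketch
of decomposing a two-digit doublet into three units like those in the
example above. The unit names and the decomposition itself are
hypothetical (the actual ITT subword inventory is not specified here);
the doublets shown in YOHO prompts have a tens digit of 2-9 and a ones
digit of 1-9.

```python
# Hypothetical decomposition of a two-digit doublet into subword units,
# illustrating how prompts such as 51, 52, 63, and 73 can supply the
# units needed for an unseen "fifty-three". Unit names are illustrative.
TENS = {2: "twen", 3: "thir", 4: "for", 5: "fif",
        6: "six", 7: "seven", 8: "eigh", 9: "nine"}
ONES = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
        6: "six", 7: "seven", 8: "eight", 9: "nine"}

def subwords(doublet):
    """Split a two-digit number (tens digit 2-9, ones digit 1-9)
    into three subword units."""
    tens, ones = divmod(doublet, 10)
    return [TENS[tens], "tee", ONES[ones]]
```

Under this sketch, subwords(53) gives ["fif", "tee", "three"]: the
first two units occur in 51 and 52, and the last occurs in 63 and 73.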

The enrollment file structure of the disc is 
enroll/speaker#/session#/prompted_phrase.wav. For 
example, speaker 101's enrollment session 1 
phrase "26_81_57" file is

enroll/101/1/26_81_57.wav

Each session of each speaker's enrollment 
directory contains 24 *.wav files. 

There is a total of 138 speakers, numbered 
from 101 to 277 (there are gaps in the 
sequence).
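A sketch of how these paths might be assembled and scanned follows.
The mount point and helper names are hypothetical; the directory
layout is as described above, with sessions 1-3 used for model
building and session 4 reserved for cohort selection.

```python
import os

def enroll_path(root, speaker, session, phrase):
    """Build the path of one enrollment utterance,
    e.g. enroll/101/1/26_81_57.wav under the given root."""
    return os.path.join(root, "enroll", str(speaker), str(session),
                        phrase + ".wav")

def enrollment_files(root, speaker):
    """Yield the .wav files of a speaker's enrollment sessions 1-3.
    Note the prompted text is embedded in each file name."""
    for session in (1, 2, 3):
        session_dir = os.path.join(root, "enroll", str(speaker),
                                   str(session))
        for name in sorted(os.listdir(session_dir)):
            if name.endswith(".wav"):
                yield os.path.join(session_dir, name)
```

For example, enroll_path("/cdrom/yoho", 101, 1, "26_81_57") yields the
example file named above, under an assumed /cdrom/yoho mount point.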


VERIFICATION

A single trial can use all the speech in a given speaker's
verification session (i.e., up to four phrases). Each speaker can have
10 verification tests against him/herself. (If the four phrases were
used for separate verification tests, the independence of the tests
would be weak.)

The verification file structure of the disc is 
verify/speaker#/session#/prompted_phrase.wav. For 
example, speaker 101's verification session 
1320 consists of the following set of 4 speech 
files:

verify/101/1320/41_34_23.wav
verify/101/1320/57_92_26.wav
verify/101/1320/73_61_31.wav
verify/101/1320/86_79_65.wav

There is a total of 1,380* sessions, numbered from 528 to 2527 (there
are gaps in the sequence).

False-rejection measurements are based on 
the 1,380 valid session trials. Impostor trials are 
simulated by presenting the system with one 
subject's speech and prompted text (embedded 
in the file name) under a different subject's 
hypothesized identity.


Impostor Selection

Each of the 138 subjects is treated in turn as a claimant. For each
claimant, sessions spoken by subjects of the same gender other than
the claimant and his/her cohorts are selected as impostors, with no
more than one session per subject (for independent tests). The
sessions are processed using the normal verification procedure,
resulting in accept/reject decisions. If 13,862 simulated impostor
trials are performed, the most stringent test below can be evaluated.

Males should be compared only with other males (see the speaker.doc
file for speaker genders). There are not enough females for large
same-gender female impostor trials, so female impostor results can be
reported separately.

To be consistent with ITT, speaker-dependent cohort sets can be used,
consisting of the five "closest" speakers as determined from the
enrollment data. If cohort scoring is used, cohort set speakers should
be excluded as impostors (cohort set speakers are usually closest to
their targets, but would likely be rejected, thus optimistically
biasing the results). Determining a fair way to compare systems that
use different size cohort sets is a difficult problem.
Cross-validation could be used to iteratively partition impostor and
cohort sets, but this may reduce the statistical confidence of the
tests.
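The selection bookkeeping described above can be sketched as follows.
The speaker genders, cohort sets, and per-speaker session lists are
assumed to come from elsewhere (speaker.doc and the enrollment data);
the function and argument names are hypothetical.

```python
def impostor_trials(speakers, gender, sessions, cohorts):
    """Enumerate simulated impostor trials.

    speakers: list of speaker ids
    gender:   dict speaker id -> 'm' or 'f' (from speaker.doc)
    sessions: dict speaker id -> list of verification session ids
    cohorts:  dict claimant id -> set of cohort speaker ids (may be empty)

    For each claimant, every same-gender subject other than the
    claimant and his/her cohorts contributes at most one session
    (for independent tests).
    """
    trials = []
    for claimant in speakers:
        for imp in speakers:
            if imp == claimant or gender[imp] != gender[claimant]:
                continue
            if imp in cohorts.get(claimant, set()):
                continue
            if sessions[imp]:
                # no more than one session per impostor subject
                trials.append((claimant, imp, sessions[imp][0]))
    return trials
```

Note that with 106 males, one session per (claimant, impostor) pair
yields at most 106 x 105 = 11,130 male trials (106 x 100 = 10,600 with
five cohorts excluded per claimant), short of the 13,862 needed for
the most stringent test; how to make up the difference is one of the
unresolved issues in comparing systems under this plan.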


Critical Number of Errors

To test the hypothesis that the actual false rejection (FR) rate is
less than or equal to 1% at 75% confidence requires 8 or fewer errors
in 1,080 tests (for a 70% probability of passing the test if the ratio
of the true system error rate to the target error rate (e) is 2/3)
[1]. Likewise, as shown in Table 1, to test the hypothesis that the
actual false acceptance (FA) rate is less than or equal to 0.1% at 75%
confidence requires 8 or fewer errors in 10,802 tests. These tests are
based upon the independence assumptions used in the collection and
proper use of the YOHO database, Poisson's approximation to the
binomial, error rates less than 5%, and sample sizes greater than 100.

Table 1: Critical Number of Errors

Mode	Conf	Target	P(pass)	 e	Size	Critical Errors
FR	75%	1.0%	0.7	2/3	 1,080	  8
FA	75%	0.1%	0.7	2/3	10,802	  8
FR	75%	0.1%	0.5	1/2	 1,386	  0
FA	75%	0.01%	0.5	1/2	13,862	  0
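Table 1 can be checked directly against the Poisson approximation the
text invokes: a system passes if it makes no more than the critical
number of errors, the pass probability for a system exactly at the
target rate should be about 1 - confidence = 0.25, and the pass
probability at e times the target should match the P(pass) column. A
small sketch in plain Python:

```python
import math

def poisson_cdf(c, lam):
    """P(X <= c) for X ~ Poisson(lam)."""
    return math.exp(-lam) * sum(lam ** k / math.factorial(k)
                                for k in range(c + 1))

def pass_prob(n_tests, true_rate, critical):
    """Probability of making <= critical errors in n_tests trials
    at the given true error rate (Poisson approximation)."""
    return poisson_cdf(critical, n_tests * true_rate)

# Rows of Table 1: (target rate, e, size, critical errors)
rows = [(0.010,  2/3,  1080, 8),
        (0.001,  2/3, 10802, 8),
        (0.001,  1/2,  1386, 0),
        (0.0001, 1/2, 13862, 0)]
for target, e, size, crit in rows:
    at_target = pass_prob(size, target, crit)      # ~0.25 (75% conf)
    at_e = pass_prob(size, e * target, crit)       # ~P(pass) column
    print(f"size={size:6d}  P(pass|target)={at_target:.3f}"
          f"  P(pass|e*target)={at_e:.3f}")
```

Running this reproduces the table to rounding: each row gives about
0.25 at the target rate and about 0.70 or 0.50 at e times the target.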


REPORTING RESULTS

In addition to the critical number of errors test above, raw error
rates (relative frequencies), receiver operating curves (especially
when bracketed by error bars), and a histogram and average of
identification rank are all of interest. Fine-grain results on problem
speakers can be informative (e.g., for false acceptance errors, plots
of an attacker's identification number vs. an attackee's
identification number vs. frequency). For text-dependent verification,
errors due solely to speech misrecognition should be reported.
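For the raw rates and the equal-error point, here is a minimal sketch
of how they might be computed from per-trial scores. The score lists,
the convention that higher scores favor acceptance, and the threshold
sweep are all assumptions of this sketch, not part of the test plan.

```python
def error_rates(genuine, impostor, threshold):
    """Raw FR/FA relative frequencies at one decision threshold.
    genuine/impostor are score lists; higher scores favor acceptance."""
    fr = sum(s < threshold for s in genuine) / len(genuine)
    fa = sum(s >= threshold for s in impostor) / len(impostor)
    return fr, fa

def equal_error_rate(genuine, impostor):
    """Sweep candidate thresholds and return the operating point where
    FR and FA are closest -- a simple empirical EER estimate."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        fr, fa = error_rates(genuine, impostor, t)
        if best is None or abs(fr - fa) < abs(best[0] - best[1]):
            best = (fr, fa)
    return (best[0] + best[1]) / 2
```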

In order for the community to make comparative assessments, please
explicitly state any variations on this suggested test plan used to
obtain your results. Researchers are requested to submit results to
the voice verification community in accordance with these guidelines,
but not to any specific party.


TEST RESULTS

The author knows of three tests using the YOHO database. ITT's results
[1] are for the full 186-speaker YOHO database, MIT Lincoln Lab's
results [2] are for the LDC's YOHO CD-ROM, and Campbell's results [3]
are for an 87-speaker subset of the YOHO database. Equal-error rate
verification and closed-set speaker identification error rates are
given in Table 2.

Table 2: YOHO Speaker Recognition Error Rates

		Verification		Speaker Id
		EER			closed-set
ITT		1.7%			-
MIT/LL		0.2% males		0.8%
		2.2% females
Campbell	-			0.05%

Since these tests were not performed under identical conditions, they
cannot be compared directly with each other. They are presented to
show a variety of test scenarios and corresponding performance.


PROBLEMS WITH YOHO

The following files were not compressed and contain empty headers (the
speech data is intact); thus, w_decode is not needed for these files:

verify/277/538/29_51_23.wav
verify/277/538/65_56_74.wav
verify/277/538/74_31_67.wav
verify/277/538/96_85_43.wav

The LDC promises to provide a script to solve this problem. It should
be available via anonymous ftp to ftp.cis.upenn.edu as
/pub/ldc/yohosphr.prl.

Speaker 240 used a falsetto voice in test session 969.


LDC INFORMATION

For information about the LDC, including obtaining copies of YOHO,
please contact the Linguistic Data Consortium, 441 Williams Hall,
University of Pennsylvania, Philadelphia, PA 19104-6305, USA.
Information about the LDC is also available via anonymous ftp from
ftp.cis.upenn.edu in the /pub/ldc directory, by telephone from Jack
Godfrey at 215-573-3595, or by e-mail to jgodfrey@unagi.cis.upenn.edu.


CONCLUSIONS

This test plan will hopefully unify the reporting of the performance
of speaker verification systems.


ACKNOWLEDGMENTS

The assistance and contributions of Alan Higgins, Jack Porter, Doug
Reynolds, Scott Reider, Tom Crystal, and David Graff are gratefully
acknowledged.


REFERENCES

[1] Higgins, A., L. Bahler, and J. Porter. "Speaker Verification Using
Randomized Phrase Prompting." Digital Signal Processing 1, no. 2
(1991): 89-106.

[2] Reynolds, D. A. Private communication, July 1994.

[3] Campbell, J. P., Jr. "Features and Measures for Speaker
Recognition." Ph.D. Dissertation, Oklahoma State University, 1992.