Newsgroups: comp.speech
Path: pavo.csi.cam.ac.uk!pipex!uknet!yorkohm!mike
From: mike@ohm.york.ac.uk (Mike D Edgington)
Subject: Re: Algorithm for pitch detection
Message-ID: <1993May19.090541.11471@ohm.york.ac.uk>
Organization: Electronics Department, University of York, UK
References: <1993May17.200530.16738@en.ecn.purdue.edu> <1993May18.162011.3214@Princeton.EDU>
Date: Wed, 19 May 93 09:05:41 GMT
Lines: 50

In <1993May18.162011.3214@Princeton.EDU> lseltzer@phoenix.Princeton.EDU (L. Seltzer) writes:

>The book entitled Pitch Determination of Speech Signals reports that
>you can get 3% accuracy with an autocorrelation pitch detector.

>It would be interesting to hear other people's experiences, but I
>think really accurate pitch determination is quite difficult.

>I had mentioned 3% accuracy with autocorrelation.  That is not
>good enough for applications such as ethnomusicology.  In spite of
>the claims people make in ethnomusicology articles about the
>results from commercial pitch trackers (they don't even know what's
>inside the box, in terms of smoothers, etc.), I don't know whether
>anyone has a system up and running that can determine f0 with
>a half a percent in accuracy, which is more like what you need for
>really accurate work.

	The question of accuracy is of course relative to the `signal
sink', which in my line of work is a human listener (but could be a
speech recogniser, etc.). One of my colleagues has done some work on
trying to specify the accuracy with which a human can detect changes
in F0, since he was designing a real-time singing-to-MIDI converter as
part of his PhD work. It turns out that while humans can detect very
small changes in pitch for steady F0 values, we are insensitive to
large changes if they happen quickly enough. Basically, for a pitch
glide lasting less than 10 cycles of F0, you can get away with about
10% accuracy (from memory, I can check and quote references if
necessary). As the glide gets longer, humans can detect smaller
changes, until above 30 cycles we are at the 0.5% to 1% accuracy
limit. This means that if an F0 contour is moving about, as it does in
speech, you can have quite a few glitches of 5% to 10% without a human
ever perceiving them.
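
	To make those figures concrete, here is a rough sketch in
Python of how the tolerance might be tabulated against glide length.
Only the two end points (about 10% below 10 cycles, roughly 1% above
30 cycles) come from the numbers quoted above, and those are from
memory. The straight-line interpolation in between, the choice of 1%
rather than 0.5% at the long end, and the function names are all my
own assumptions for illustration, not measured results.

```python
# Rough sketch: tolerable F0 error as a function of glide length.
# End points are the from-memory figures quoted in the post; the
# linear interpolation between them is an assumption, not data.

SHORT_GLIDE_CYCLES = 10.0   # below this, about 10% error goes unnoticed
LONG_GLIDE_CYCLES = 30.0    # above this, the 0.5% to 1% limit applies
SHORT_GLIDE_TOLERANCE = 0.10
LONG_GLIDE_TOLERANCE = 0.01  # assumed: upper end of the 0.5% to 1% range


def cycles_in_glide(duration_s, f0_hz):
    """Number of F0 cycles spanned by a glide of the given duration."""
    return duration_s * f0_hz


def tolerable_f0_error(glide_cycles):
    """Approximate fractional F0 error a listener is unlikely to hear."""
    if glide_cycles <= SHORT_GLIDE_CYCLES:
        return SHORT_GLIDE_TOLERANCE
    if glide_cycles >= LONG_GLIDE_CYCLES:
        return LONG_GLIDE_TOLERANCE
    # Assumed straight-line interpolation between the two end points.
    frac = (glide_cycles - SHORT_GLIDE_CYCLES) / (
        LONG_GLIDE_CYCLES - SHORT_GLIDE_CYCLES
    )
    return SHORT_GLIDE_TOLERANCE + frac * (
        LONG_GLIDE_TOLERANCE - SHORT_GLIDE_TOLERANCE
    )


if __name__ == "__main__":
    # A 50 ms glide at 200 Hz spans about 10 cycles.
    n = cycles_in_glide(0.05, 200.0)
    print(n, tolerable_f0_error(n))
```

On these assumptions a 50 ms glide at 200 Hz sits right at the
10 cycle mark, so a 3% autocorrelation tracker would be comfortably
inside the tolerance there, while a sustained note would not be.
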
	Of course this is from experiment. When you start trying it
in the real world you find that people detect amplitude changes in
higher harmonics due to the acoustic characteristics of their listening
environment, and interactions with other sources in ensemble
playing, and everything becomes complicated and scary.
	The moral of the story is: by all means develop
algorithms to be as accurate as possible, but what's the point of
being an order of magnitude more accurate than a human can detect?


Oops, I should have said that my colleague is Chris Barnes, same
address as me.

      _/_/  _/_/  _/_/_/  _/    _/  _/_/_/_/  % Mike Edgington      %
     _/ _/_/ _/    _/    _/  _/    _/ 	      % Dept of Electronics %	
    _/  _/  _/    _/    _/_/      _/_/_/      % University of York  %
   _/      _/    _/    _/  _/    _/           % YORK   U.K.         %
  _/      _/  _/_/_/  _/    _/  _/_/_/_/      % Voice (0904) 432418 %

