Newsgroups: comp.speech
Path: cantaloupe.srv.cs.cmu.edu!europa.chnt.gtegsc.com!gatech!bloom-beacon.mit.edu!news.kei.com!news.ssd.intel.com!ornews.intel.com!chnews!ennews!trcsun3!tucker
From: tucker@trcsun3.eas.asu.edu (Greg Tucker)
Subject: Re: Any Suggestions?
Message-ID: <DAzyG3.EEz@ennews.eas.asu.edu>
Sender: news@ennews.eas.asu.edu (USENET News System)
Summary: HMM-based endpoint detector
Organization: Arizona State University
Date: Fri, 30 Jun 1995 17:40:51 GMT
References: <8jwD6R600iVCA_MnZ6@andrew.cmu.edu>
X-Nntp-Posting-Host: enws125.eas.asu.edu
Lines: 87

> Does anyone out there have any good techniques in isolating speech?  By
> this, I mean cutting out the silence from the front and end of the
> speech, but at the same time, not cutting off the signal.

> For instance, I used to use threshold values, if the signal goes above
> the threshold value, the sound starts.  But, I am having trouble with
> the ending of the letter "F".  The ffffff sound looks a lot like noise,
> with its low amplitude and random shape.

> Anyway, back to the question, what sort of techniques are being used by
> you all?  Any help would be greatly appreciated.


I group the methods of endpoint detection into three general
categories: energy-based methods, pattern recognition, and
implicit/hybrid methods.

Energy-based methods usually calculate a small number of features and
apply thresholds and heuristic rules to make decisions.  The
short-time energy is usually the main feature, and additional features
such as the zero-crossing rate are often added to refine the decision.
This technique can work well enough in low-noise environments but has
a tendency to cut off low-energy unvoiced speech (fricatives such as
ff, thh, sss and shh).
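
As a rough illustration of the energy-based approach, here is a
minimal sketch in Python.  Everything specific in it is my assumption
for the example, not a recommended setting: the 80-sample frames, the
10 dB margin over a noise floor estimated from the first few frames,
the 0.2 zero-crossing threshold, and the helper names
(frame_features, detect_endpoints, synthetic_signal).  The
zero-crossing step only extends the energy endpoints outward, which
helps when the background noise has a low crossing rate.

```python
import math
import random

def frame_features(samples, frame_len):
    """Return (log energy in dB, zero-crossing rate) per non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        log_e = 10.0 * math.log10(energy + 1e-12)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append((log_e, crossings / (frame_len - 1)))
    return feats

def detect_endpoints(samples, frame_len=80, noise_frames=10, margin_db=10.0,
                     zcr_thresh=0.2, max_extend=25, use_zcr=True):
    """Return (first, last) speech frame index, or None if nothing is found.

    Assumes the first `noise_frames` frames contain background noise only.
    """
    feats = frame_features(samples, frame_len)
    head = feats[:noise_frames]
    if not head:
        return None
    floor_db = sum(e for e, _ in head) / len(head)
    active = [i for i, (e, _) in enumerate(feats) if e > floor_db + margin_db]
    if not active:
        return None
    start, end = active[0], active[-1]
    if use_zcr:
        # Extend outward while neighbouring frames look like unvoiced speech.
        steps = 0
        while start > 0 and steps < max_extend and feats[start - 1][1] > zcr_thresh:
            start -= 1
            steps += 1
        steps = 0
        while (end + 1 < len(feats) and steps < max_extend
               and feats[end + 1][1] > zcr_thresh):
            end += 1
            steps += 1
    return start, end

def synthetic_signal(frame_len=80, seed=0):
    """Toy signal: low hum, loud tone, weak noise burst ("fff"), low hum."""
    rng = random.Random(seed)
    hum = [0.01 * math.sin(2 * math.pi * n / 160) for n in range(20 * frame_len)]
    voiced = [math.sin(2 * math.pi * n / 40) for n in range(20 * frame_len)]
    fricative = [rng.uniform(-0.02, 0.02) for _ in range(10 * frame_len)]
    return hum + voiced + fricative + hum
```

On the toy signal, calling detect_endpoints(synthetic_signal(),
use_zcr=False) stops at the end of the loud tone, which is the
cut-off "F" problem from the question; with use_zcr=True the weak
noise burst is included as well.  With broadband background noise the
zero-crossing rate separates much less cleanly.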

Pattern recognition is the statistical modeling approach.  For
instance, you can hypothesize that each frame of speech was generated
from one of a number of distributions (voiced, unvoiced, or silence)
and divide the feature space up to minimize the probability of error.
Here you can take advantage of the covariance between frames.  One
disadvantage, however, is that it is difficult to make such a
classifier adaptive.
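
To make the statistical approach concrete, here is a small sketch of
a frame classifier, assuming one Gaussian per class over two features
(log energy in dB, zero-crossing rate).  Two simplifications are
mine: the covariance is diagonal, so feature correlations are
ignored, and the class priors are equal, so picking the most likely
class is the minimum-error rule only under that assumption.  The
training values are made up for illustration, not measured data.

```python
import math

def fit_gaussian(vectors):
    """Estimate per-dimension mean and variance (diagonal covariance)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # Floor the variance so a degenerate class cannot produce a zero divisor.
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def log_likelihood(x, model):
    """Log density of feature vector x under a diagonal Gaussian model."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def classify(x, models):
    """Return the label whose model gives x the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))

# Made-up labelled frames: (log energy in dB, zero-crossing rate).
TRAINING = {
    "silence":  [(-43, 0.01), (-42, 0.02), (-44, 0.00), (-41, 0.03)],
    "unvoiced": [(-38, 0.45), (-36, 0.55), (-39, 0.50), (-37, 0.48)],
    "voiced":   [(-5, 0.05), (-3, 0.06), (-8, 0.04), (-4, 0.08)],
}
MODELS = {label: fit_gaussian(frames) for label, frames in TRAINING.items()}
```

With these toy models, classify((-37.5, 0.47), MODELS) picks
"unvoiced" even though the energy is close to the silence level,
which a single energy threshold would miss.  Because the models are
fixed after training, the sketch also shows why adapting to a new
noise level is awkward.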

Implicit/hybrid methods leave the final choice to the application (or
other user) that consumes the endpoint information.  A connected-word
HMM speech recognizer does this.

*BUT* if you want really good endpoints without any prior knowledge of
what the speech is, you might want to try an explicit, HMM-based
endpoint detector (which is the subject of my thesis).  I've found a
total of two articles addressing this issue, both listed below, and
I'm writing my own for ICSPAT95.  You can construct a very simple HMM
network to model speech and silence; once you find the most likely
state and model sequence through the network, you have the endpoints.
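
To show the idea in its simplest form, here is a sketch of a
two-state (silence/speech) HMM decoded with the Viterbi algorithm.
It is not the detector from my thesis or from the articles below: the
single log-energy feature, the Gaussian parameters, and the 0.95
self-transition probability are all assumed values chosen for the
example, and a real system would train them from data.

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, states, log_init, log_trans, emit_logpdf):
    """Return the most likely state sequence for the observations."""
    delta = {s: log_init[s] + emit_logpdf(s, obs[0]) for s in states}
    backptrs = []
    for x in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            # Best predecessor of state s at this frame.
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            ptr[s] = best
            delta[s] = prev[best] + log_trans[best][s] + emit_logpdf(s, x)
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def hmm_endpoints(log_energies):
    """Return (first, last) frame decoded as speech, or None."""
    states = ("sil", "speech")
    # Assumed (mean dB, variance) per state; illustrative values only.
    params = {"sil": (-40.0, 9.0), "speech": (-15.0, 100.0)}
    stay, leave = math.log(0.95), math.log(0.05)
    log_trans = {r: {s: (stay if r == s else leave) for s in states}
                 for r in states}
    log_init = {"sil": math.log(0.9), "speech": math.log(0.1)}

    def emit(state, x):
        return gauss_logpdf(x, *params[state])

    path = viterbi(log_energies, states, log_init, log_trans, emit)
    speech = [i for i, s in enumerate(path) if s == "speech"]
    return (speech[0], speech[-1]) if speech else None
```

For a frame sequence such as five frames at -41 dB, then -10, -12,
-36, -11, -9, then five frames at -40 dB, the -36 dB dip on its own
looks more like silence, but the transition penalties keep it inside
the speech segment.  That smoothing over time is what the frame-by-
frame methods above lack.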

A good question is: what finite-state grammar best models
generalized speech and silence?  Ergodic models or some other
structure?  Also, which features work best?

Anybody have any suggestions on this?

--Greg


==========================================================================
 Gregory B. Tucker                     Telecommunications Research Center
 tucker@dspsun.eas.asu.edu             Dept. of Electrical Engineering
 http://yamuna.eas.asu.edu/~tucker/    Arizona State University
 (602) 965-0396                        Tempe, AZ 85287-7206
==========================================================================


@article{wilp87,
  author={J. G. Wilpon and L. R. Rabiner},
  title={Application of hidden Markov models to automatic speech
    endpoint detection},
  journal={Comput. Speech Language},
  volume={2},
  number={3/4},
  month={December},
  year={1987},
  pages={321--341}
}

@inproceedings{pawa94,
  author={B.I. Pawate and Eric Dowling},
  institution={TI, Japan and Dept. Elect. Eng. University of Texas at Dallas},
  title={A new method for segmenting continuous speech},
  booktitle={ICASSP-94: 1994 IEEE International Conference on
    Acoustics, Speech and Signal Processing},
  year={1994},
  pages={I-53--I-56},
  organization={IEEE},
  publisher={IEEE},
  address={Adelaide, South Australia},
  month={April}
}

