Newsgroups: comp.speech
From: deisher@trcsun3.eas.asu.edu (Michael E. Deisher)
Subject: speech distortion measure
Keywords: Itakura Saito speech distortion measure
Message-ID: <D40LD4.6K5@ennews.eas.asu.edu>
Nntp-Posting-Host: enws125.eas.asu.edu
Sender: news@ennews.eas.asu.edu (USENET News System)
Organization: Arizona State University
Date: Wed, 15 Feb 1995 00:00:40 GMT

A recent paper in the IEEE Transactions on Speech and Audio
Processing ("Dual Channel Iterative Speech Enhancement..." by
Nandkumar and Hansen, Jan. 1995) lists the average Itakura-Saito
distortion of noisy speech at 5 dB SNR as 4.11.  The speech record in
this case was a sentence from the TIMIT database, and the noise was
described as additive white Gaussian.

When I compute the modified Itakura-Saito distortion under similar
conditions, I get an average of 12739.23!  Obviously, we are computing
the distortion measure differently.  I've checked my implementation
and it seems sound (e.g., I get nice clusters when doing AR VQ).  The
definition I am using is shown below (in LaTeX).  I have seen similar
definitions of the modified I-S distortion in several references.
What might the authors of the article be doing differently?  How is
the I-S distortion usually computed?

Thanks in advance to anyone who wishes to comment on this.

--Mike
deisher@dspsun.eas.asu.edu


Note: If you don't have LaTeX and want a more readable version of this,
      I would be happy to send you a PostScript file.


\documentstyle []{article}
\begin{document}

The modified Itakura-Saito distortion measure \cite{gray81} is
\begin{equation}
\rho({\bf x}_1,{\bf x}_2) = \frac{\delta({\bf x}_1;{\bf a}_2)}{\sigma_2^2}
                      - \log \frac{\sigma_1^2}{\sigma_2^2} - 1
\end{equation}
where ${\bf x}_1, {\bf x}_2$ are blocks of speech, ${\bf a}_2$ is the
set of linear prediction coefficients computed from ${\bf x}_2$,
$\sigma_1^2$ and $\sigma_2^2$ are the linear prediction residual
energies computed from ${\bf x}_1$ and ${\bf x}_2$, respectively, and
$\delta(\cdot\,;\cdot)$ is defined as
\begin{equation}
\delta({\bf x}_1;{\bf a}_2) = r_{a_2}(0) r_{x_1}(0)
                     + 2 \sum_{i=1}^p r_{a_2}(i) r_{x_1}(i)
\end{equation}
where
\begin{equation}
r_{x_1}(i) = \frac{1}{K} \sum_{n=0}^{K-i-1} x_1(n) x_1(n+i)
\end{equation}
and
\begin{equation}
r_{a_2}(i) = \sum_{n=0}^{p-i} a_2(n) a_2(n+i)
\end{equation}
with $a_2(0) = 1$.  Here $K$ is the number of samples in a block and
$p$ is the linear prediction order.

\end{document}
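For anyone who would rather read code than LaTeX, here is a minimal
sketch of the definition above in plain Python (standard library only).
It is written directly from the equations and is not the implementation
behind either number quoted earlier.  Several things in it are
assumptions rather than part of the definition: the function names, the
default order p=10, the use of the autocorrelation method
(Levinson-Durbin) to get a_2 and the residual energies, the sign
convention A(z) = 1 + a(1) z^-1 + ... + a(p) z^-p, and reading "log" as
the natural logarithm.

```python
import math


def autocorr(x, p):
    """r_x(i) for i = 0..p, with the 1/K normalization of the definition."""
    K = len(x)
    return [sum(x[n] * x[n + i] for n in range(K - i)) / K
            for i in range(p + 1)]


def levinson(r):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns (a, sigma2) with a[0] == 1 and sigma2 the residual energy.
    Assumed convention: A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.
    """
    p = len(r) - 1
    a = [1.0] + [0.0] * p
    err = r[0]
    for m in range(1, p + 1):
        # Reflection coefficient for order m.
        k = -sum(a[j] * r[m - j] for j in range(m)) / err
        prev = a[:]
        for j in range(1, m + 1):
            a[j] = prev[j] + k * prev[m - j]
        err *= 1.0 - k * k
    return a, err


def coeff_autocorr(a):
    """r_a(i) = sum_{n=0}^{p-i} a(n) a(n+i), for i = 0..p."""
    p = len(a) - 1
    return [sum(a[n] * a[n + i] for n in range(p - i + 1))
            for i in range(p + 1)]


def modified_is(x1, x2, p=10):
    """rho(x1, x2) = delta(x1; a2)/sigma2^2 - log(sigma1^2/sigma2^2) - 1."""
    r1 = autocorr(x1, p)
    _, s1 = levinson(r1)                  # sigma_1^2
    a2, s2 = levinson(autocorr(x2, p))    # a_2 and sigma_2^2
    ra = coeff_autocorr(a2)
    delta = ra[0] * r1[0] + 2.0 * sum(ra[i] * r1[i] for i in range(1, p + 1))
    return delta / s2 - math.log(s1 / s2) - 1.0   # natural log assumed
```

Two checks that follow from the formula itself: rho(x, x) should come
out as zero to within rounding, and scaling the first argument by a gain
c while leaving the second alone should give c^2 - log(c^2) - 1, since
both delta and sigma_1^2 scale by c^2.  The measure is also not
symmetric in its two arguments.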


