# swrnn

SNR Wiener filtering using Recurrent Neural Network
Author: Raymond Xia (yangyanx@andrew.cmu.edu)

# Version History

- 12/20/2017 - Version 3.
  - SWRNN added.
  - Classic Wiener filtering by a priori SNR estimation added.
- 10/15/2017 - Version 2.
  - Wiener filtering structure added.
  - More training options added.
- 09/15/2017 - Version 1.
  - Original MRNN.

# Requirements

- Python 2.7
- audiolib 2.0 (included, but needs to be added to PYTHONPATH)
- PyTorch (**required** for new SWRNN)
- CUDA nvcc 9.0 (optional)

Theano is no longer required for the new SWRNN.
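
A quick way to confirm that the requirements above are importable is a small check like the one below. This is only a sketch: `missing_modules` is a hypothetical helper, not part of this repository, and `audiolib` is assumed to be the import name of the bundled package once it is on PYTHONPATH (`torch` is PyTorch's import name).

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Example (module names are assumptions, see the note above):
#   missing_modules(["torch", "audiolib"])
# An empty list means both packages were found.
```

If `audiolib` shows up as missing, its parent directory most likely still needs to be added to PYTHONPATH.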

# Updates in Version 3 (as of 12/20/2017)

Version 3 introduces a new recurrent neural network for speech de-noising. We
give it the preliminary name SWRNN, for SNR-based Wiener filtering using
Recurrent Neural Network. This new system is intended to supersede the old MRNN for
de-noising a noisy magnitude spectrogram. Apart from this, the overall
de-noising workflow (SSB demodulation applied to channel D in particular)
remains unchanged.

The SWRNN is designed to have three major advantages over the old MRNN. They
are:

1. Exploiting voice presence/absence regions. The frame-level voice
presence/absence labels (easily obtained from the metadata in LDC2015S02) play
an important role in SWRNN. Specifically, the cost function of SWRNN is a
weighted combination of the reconstruction error of the magnitude spectrogram,
and the cross-entropy between the predicted frame-level VAD labels and the
ground truth labels. Adding this VAD-dependent term to the cost function has
been shown experimentally to control the trade-off between the suppression of
musical artifacts and the smearing of speech.
2. Robustness. SWRNN is believed to be more robust than MRNN because it does
not learn specific noise or speech patterns, but rather learns the pattern of
the a priori SNR (as well as noise power) ratio of adjacent frames. We believe
that this makes it more robust than MRNN when encountering new noise types.
This, however, still needs to be verified experimentally.
3. Comprehensibility. The biggest drawback of MRNN is the incomprehensible
intermediate variables before the output of G3 (output of G1 and G2 in
particular). The system is not guaranteed to do spectral subtraction because
of lack of constraints. SWRNN has no such issue because it is derived from the
classic a priori SNR estimation approach. With a specific set of frozen neural
network parameters, the entire system can be easily proved to be equivalent to
the classic approach. Thus the neural network is an extension to the classic
approach, and the intermediate variables can be viewed the same way as in the
classic approach.
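
The weighted cost described in item 1 can be sketched as follows. This is a minimal NumPy illustration, not the code in this repository: the function name and arguments are hypothetical, and it is an assumption that the weight multiplies the reconstruction term while `1 - weight` multiplies the cross-entropy term.

```python
import numpy as np


def combined_cost(mag_est, mag_ref, vad_prob, vad_label, weight=0.2, eps=1e-8):
    """Weighted sum of spectrogram MSE and frame-level VAD cross-entropy.

    mag_est, mag_ref : (frames, bins) estimated / reference magnitude spectrograms
    vad_prob         : (frames,) predicted speech-presence probabilities
    vad_label        : (frames,) ground-truth labels in {0, 1}
    weight           : in [0, 1]; ASSUMED to scale the MSE term, with
                       (1 - weight) scaling the cross-entropy term
    """
    mag_est = np.asarray(mag_est, dtype=float)
    mag_ref = np.asarray(mag_ref, dtype=float)
    vad_label = np.asarray(vad_label, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    prob = np.clip(np.asarray(vad_prob, dtype=float), eps, 1.0 - eps)

    mse = np.mean((mag_est - mag_ref) ** 2)
    cross_entropy = -np.mean(
        vad_label * np.log(prob) + (1.0 - vad_label) * np.log(1.0 - prob)
    )
    return weight * mse + (1.0 - weight) * cross_entropy
```

Under this assumed convention, a weight of 1 reduces the cost to pure reconstruction error, and a weight of 0 trains only the VAD prediction.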

In addition to the advantages above, SWRNN is implemented in PyTorch, a new
deep learning software package that runs faster than Theano and is overall
easier to use. PyTorch is easy to install on unix systems with Anaconda. Check
the website [http://pytorch.org/](http://pytorch.org/) for more information.

Classic Wiener filtering with a priori SNR estimation is also implemented and
is available as functions in `wiener.py` in audiolib. The implementation is
based on Loizou's description in his book Speech Enhancement.
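
For readers unfamiliar with the classic approach, the sketch below shows one common form of decision-directed a priori SNR estimation with a Wiener gain and a VAD-gated noise update. It is not the `wiener.py` implementation: the function name, argument layout, and the default values of `alpha` and `eta` are assumptions chosen for illustration.

```python
import numpy as np


def wiener_asnr(noisy_pow, noise_init, alpha=0.98, mu=0.98, eta=0.15):
    """Return Wiener gains for a noisy power spectrogram.

    noisy_pow  : (frames, bins) noisy power spectrogram
    noise_init : (bins,) initial noise power estimate
    alpha      : decision-directed smoothing factor (assumed default)
    mu         : smoothing factor for the noise estimate
    eta        : VAD threshold on the mean log likelihood ratio (assumed default)
    """
    noisy_pow = np.asarray(noisy_pow, dtype=float)
    noise = np.maximum(np.array(noise_init, dtype=float), 1e-12)
    gains = np.empty_like(noisy_pow)
    prev_clean = None  # clean power estimate of the previous frame

    for t, frame in enumerate(noisy_pow):
        gamma = frame / noise                # a posteriori SNR
        ml = np.maximum(gamma - 1.0, 0.0)    # instantaneous SNR estimate
        if prev_clean is None:
            xi = alpha + (1.0 - alpha) * ml  # first-frame initialisation
        else:
            xi = alpha * prev_clean / noise + (1.0 - alpha) * ml

        # Frame-level log likelihood ratio, averaged over frequency bins.
        llr = np.mean(gamma * xi / (1.0 + xi) - np.log1p(xi))
        if llr < eta:
            # Frame judged noise-only: smooth the noise estimate toward it.
            noise = mu * noise + (1.0 - mu) * frame

        gain = xi / (1.0 + xi)               # Wiener gain
        gains[t] = gain
        prev_clean = gain ** 2 * frame

    return gains
```

Multiplying the noisy magnitude spectrogram by the returned gains gives the enhanced magnitudes; the noisy phase is typically reused for resynthesis.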

# Usage

To train a SWRNN system from scratch, assuming the training data is in the
correct path, run
 `python train.py -o path/to/output`

To train a SWRNN system from an existing set of NN weights, run
 `python train.py -m path/to/existing/weights -o path/to/output`

To enable CUDA on a GPU machine, add the `-gpu` flag:
  `python train.py -gpu -o path/to/output`

To denoise an audio signal using a learned SWRNN system, run
 `python denoise.py -m path/to/existing/weights.npz -i path/to/noisy/audio.wav -o path/to/output.wav`

To denoise an audio signal using classic a priori SNR estimation with Wiener
filtering, enable the `-w` flag and run

  `python denoise.py -w -i path/to/noisy/audio.wav -o path/to/output.wav`

It is recommended to use GPU mode for training (because of batch processing),
and CPU for denoising.

# Configurations & Hyper-parameters

The classic Wiener filtering method has a few important parameters:

- `MODE`
  - `asnr` - classic a priori SNR estimation method.
  - `activate` - classic a priori SNR estimation method with noise update
  interpreted as a nonlinear activation function.
  - `recurrent` - same as `activate` with logistic activation, but written in a
  recurrent form so that it can be directly translated into a recurrent neural
  network version with minor modifications.
  - These three modes are the only ones relevant to SWRNN, and could be used to
  directly compare with SWRNN.
- `ASNR_MU` - smoothing factor for noise estimation. A good number is 0.98
according to Loizou.
- `ASNR_VAD` - decision threshold for voice presence based on log likelihood
ratio.

SWRNN has the following hyper-parameters at the beginning of the file
`mrnn_asnr_pytorch.py`:

- `ASNR_MU` - smoothing factor for noise estimation. This has the exact same
meaning as in the classic approach, and should be set to the same value for
direct comparison.
- `ASNR_ETA` - soft decision threshold for voice presence based on log
likelihood ratio. This has the exact same meaning as `ASNR_VAD` in the classic
approach, and should be set to the same value for direct comparison.
- `ASNR_DELTA` - Only necessary if the nonlinear activation function in noise
update step is switched from logistic to piecewise linear. This parameter
controls the linear region, which in turn controls the slope of the linear
function.
- `COST_WEIGHT` - Weighting factor in range [0,1] that balances the
reconstruction MSE against the VAD cross-entropy cost. A value of 0.2 was found
empirically to give a good trade-off. The smaller the value, the stronger the
suppression of musical artifacts, but the more the speech is smeared.
- `LEARNING_RATE` - Learning rate for stochastic gradient descent.
- `MOMENTUM` - Learning update momentum.

In general, `ASNR_MU` and `ASNR_ETA` should be set the same as the classic
approach, and `COST_WEIGHT` is the essential parameter to control voice quality
and the level of noise suppression.
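
One way to read the soft decision controlled by `ASNR_ETA` and `ASNR_DELTA` is sketched below: the hard "update the noise estimate only if the log likelihood ratio is below the threshold" rule is replaced by an activation that blends between updating and not updating. This is an interpretation for illustration only. All function names are hypothetical, and the exact form used in `mrnn_asnr_pytorch.py` may differ.

```python
import numpy as np


def logistic(x):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def piecewise_linear(x, delta):
    """Ramp from 0 to 1 over [-delta, delta]; the slope is 1 / (2 * delta)."""
    return np.clip(0.5 + x / (2.0 * delta), 0.0, 1.0)


def soft_noise_update(noise, frame_pow, llr, mu=0.98, eta=0.15,
                      activation=logistic):
    """Blend between keeping and updating the noise power estimate.

    speech_prob near 0 -> classic update mu * noise + (1 - mu) * frame_pow
    speech_prob near 1 -> noise estimate is left unchanged
    (eta default is an assumed value for illustration)
    """
    speech_prob = activation(llr - eta)
    updated = mu * noise + (1.0 - mu) * frame_pow
    return speech_prob * noise + (1.0 - speech_prob) * updated
```

To try the piecewise linear variant, pass for example `activation=lambda x: piecewise_linear(x, 0.1)`; a smaller `delta` gives a steeper ramp and therefore a decision closer to the hard threshold.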

# Trained Models

One model, trained using the default setup in `mrnn_asnr_pytorch.py`, is
provided in `model/`. Two sample audio files in `result/` were generated
using the following commands:

 `python denoise.py -m model/cost_.2_lr_1e-4.dat.npz -i path/to/noisy/10013_A.wav -o result/10013_A_swrnn.wav`

and

 `python denoise.py -w -i path/to/noisy/10013_A.wav -o result/10013_A_classic.wav`

A good way to check that all packages are installed correctly is to replicate
the two audio files by running the two commands above.
