
Pallavi Baljekar

Speech Synthesis, Recognition, Machine Learning and Signal Processing

Who?

I graduated in May 2018. I am currently a software engineer on the Google Brain team in Cambridge, MA, working on ML Fairness -- making services and products offered by Google more inclusive and less biased.

PhD, Language Technologies Institute (F'14 - S'18)
Advisor: Prof. Alan Black

In my PhD I looked into acoustic modelling for speech synthesis using "found data" -- speech data available in the wild -- with an emphasis on low-resource languages. A short three-minute description of it can be found here, and a longer version can be read here.

Previously, Master's student, Language Technologies Institute (F'12 - F'14), working on keyword spotting in children's speech using LSTMs.
Advisor: Dr. Rita Singh

Summer internships at Google Machine Perception, Mountain View, CA (SU'17); Google Research, London (SU'16); Sony PlayStation R&D (SCEA), San Mateo, CA (SU'15); and Disney Research, Pittsburgh, PA (SU'13, '14).

TA for Speech Processing (F'15) and Machine Learning for Signal Processing (S'15).

Why?

My thesis topic was motivated by the fact that current TTS pipelines are designed to use clean, phonetically balanced data, recorded from a single speaker specifically for speech synthesis, and they are not very robust when modelling noisy data with a lot of variation. However, it is difficult to obtain such clean data for low-resource languages, or for languages where there is no access to a native speaker for recording. On the other hand, there is an ever-increasing pool of data available online in the form of podcasts, YouTube data and audiobooks. The question I tried to address through my research is how we can best make use of this found data to build TTS systems that are understandable. In my thesis I addressed three issues:

  1. Data selection: How do we select good data for synthesis using machine learning techniques? (A toy illustration follows this list.)
  2. Cross-lingual data augmentation: How can we leverage external data, either from the same language or from a higher-resource language, to augment the training data? Specifically, I looked at seq2seq attention-based models for unsupervised speech synthesis in a target language, as well as generative models for cross-lingual acoustic modelling in speech synthesis.
  3. Prosody modelling of long-form audio: Found data, typically audiobooks and podcasts, is prosodically very rich and contains very long utterances; both properties make it hard to build TTS models with current systems. Here I looked at how more structured approaches from metrical phonology can be used to learn a metrical tree over the utterance, exploring supervised and unsupervised grammar induction methods to learn this metrical grammar.
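
To give a flavour of the data-selection problem in (1), here is a minimal sketch in Python. It is not the method from the thesis: the SNR proxy, thresholds, and function names are all illustrative assumptions. It simply keeps found-data utterances that are short and reasonably clean.

```python
# Toy sketch (not the thesis method): rank found-data utterances by a crude
# quality score and keep the cleanest subset for TTS training.
# Assumes 16-bit mono WAV files; thresholds are illustrative only.
import wave
import numpy as np

FRAME = 400  # 25 ms frames at 16 kHz (an assumed sample rate)

def estimate_snr_db(samples: np.ndarray) -> float:
    """Crude SNR proxy: energy of the loudest frames vs. the quietest."""
    n = (len(samples) // FRAME) * FRAME
    frames = samples[:n].astype(np.float64).reshape(-1, FRAME)
    energy = (frames ** 2).mean(axis=1) + 1e-10
    speech = np.percentile(energy, 90)  # assume loud frames carry speech
    noise = np.percentile(energy, 10)   # assume quiet frames are noise
    return 10.0 * np.log10(speech / noise)

def select_utterances(paths, min_snr_db=15.0, max_seconds=12.0):
    """Keep utterances that are reasonably clean and not too long."""
    kept = []
    for path in paths:
        with wave.open(path) as w:
            rate = w.getframerate()
            samples = np.frombuffer(w.readframes(w.getnframes()),
                                    dtype=np.int16)
        if len(samples) < FRAME:
            continue  # too short to score
        duration = len(samples) / rate
        if duration <= max_seconds and estimate_snr_db(samples) >= min_snr_db:
            kept.append(path)
    return kept
```

A real selection pipeline would use learned features (e.g. alignment confidence from an ASR system) rather than a fixed energy heuristic, but the shape of the problem is the same: score every utterance, then train only on the subset that scores well.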

Where?

Office Address
5401 Gates Hillman Complex
5000 Forbes Avenue
Pittsburgh, PA 15213, USA