An English corpus for the medical domain had already been elicited as part of our team's BABYLON effort. A selection of the English sentences was first translated into Arabic by native Arabic speakers. This Arabic data was then expanded: without looking at the source English sentences, the Arabic speakers were asked to provide up to ten possible rephrasings of each Arabic sentence in the target Egyptian dialect. The rephrasings were generated in a verbal brainstorming session, with one speaker transcribing the sentences as they were spoken, so as to capture the naturalness of spoken language.
This process yielded approximately 5000 sentences. A subset was selected to maximize phonetic and prosodic coverage, and these sentences were recorded by our model speaker for the unit database.
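The text does not say how the subset was chosen. One common way to approach coverage-driven selection is a greedy set-cover pass, sketched below under stated assumptions: each sentence is represented by a set of units (here adjacent phone pairs, standing in for whatever phonetic and prosodic units are actually targeted), and the names `diphones` and `greedy_select`, along with the toy phone symbols, are hypothetical and not taken from the original system.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def diphones(phones: Sequence[str]) -> Set[Tuple[str, str]]:
    """Return the set of adjacent phone pairs in one sentence (illustrative unit type)."""
    return set(zip(phones, phones[1:]))


def greedy_select(sentence_units: Dict[str, Set[Hashable]],
                  max_sentences: int) -> List[str]:
    """Greedy set cover: repeatedly pick the sentence adding the most uncovered units."""
    covered: Set[Hashable] = set()
    chosen: List[str] = []
    remaining = dict(sentence_units)
    while remaining and len(chosen) < max_sentences:
        # Iterate in sorted order so ties resolve deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining sentence contributes a new unit
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen


# Toy input with placeholder phone symbols, not real transcriptions.
toy_phones = {
    "s1": ["a", "b", "c"],
    "s2": ["b", "c", "d", "a"],
    "s3": ["a", "b"],
}
toy_units = {sid: diphones(p) for sid, p in toy_phones.items()}
selected = greedy_select(toy_units, max_sentences=2)
```

In this sketch prosodic coverage could be handled by enriching the unit keys, for example pairing each diphone with a stress or phrase-position tag, so that the same greedy loop applies unchanged. Greedy selection does not guarantee a minimal subset, but it is simple and usually adequate for a corpus of a few thousand sentences.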