The aim of this work was to deliver two-way speech-to-speech translation on a handheld device. The original intention of the Babylon project was to use an updated version of the platform currently used in the one-way Phraselator system. As that platform was not yet ready, we decided to target a consumer off-the-shelf (COTS) PDA.
From our experience in other projects delivering on limited hardware platforms, we have found it better to target the intended hardware platform from the start rather than assume the platform will improve over the length of the project.
Our first intention was to design an architecture that would allow components to reside either on the device itself or on external servers accessed through a wireless link. Although the final system should not rely on wireless links to servers, this architecture would allow us to develop and test the components before they were ported to the PDA itself.
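The following is a minimal sketch of how such a location-transparent architecture might look, not the interface actually used in the system. The names `Component`, `LocalComponent`, `RemoteComponent` and `run_pipeline` are hypothetical, and the wireless link is replaced by a plain function call so that the example runs without a network.

```python
from typing import Callable, Protocol


class Component(Protocol):
    """Interface shared by on-device and server-hosted components (hypothetical)."""

    def process(self, data: str) -> str: ...


class LocalComponent:
    """Runs the component's function directly on the device."""

    def __init__(self, func: Callable[[str], str]) -> None:
        self._func = func

    def process(self, data: str) -> str:
        return self._func(data)


class RemoteComponent:
    """Forwards the request over a transport standing in for the wireless link."""

    def __init__(self, send: Callable[[str], str]) -> None:
        self._send = send  # in a real system this would be a network round trip

    def process(self, data: str) -> str:
        return self._send(data)


def run_pipeline(components: list[Component], data: str) -> str:
    """Pass data through each stage; the caller never sees where a stage runs."""
    for component in components:
        data = component.process(data)
    return data


# Toy stages: names and behaviour are placeholders, not the real engines.
def toy_recognizer(audio: str) -> str:
    return audio.lower()


def toy_generator_on_server(interlingua: str) -> str:
    return f"<generated from: {interlingua}>"


pipeline = [LocalComponent(toy_recognizer), RemoteComponent(toy_generator_on_server)]
result = run_pipeline(pipeline, "HELLO")
```

Because both classes expose the same `process` method, a stage can be moved from the server to the device by swapping `RemoteComponent` for `LocalComponent` without touching the rest of the pipeline.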
Cepstral and Multimodal Technologies had already spent significant effort in producing speech synthesizers and speech recognizers, respectively, that were optimized for the StrongARM platform. The processor available on most PDAs today is the StrongARM SA1110 (206MHz) or the XScale, used directly as a StrongARM replacement. Neither of these processors is particularly fast and neither offers floating point. Although floating point instructions can be emulated, this is far too slow for core functions. Although the Arabic speech models were built just for this project, and the English speech models were adapted, it was necessary to have already developed the core engines and related model building systems beforehand in order to deliver a full two-way system in a new language in such a short time.
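A common way to avoid emulated floating point on such processors is fixed-point arithmetic, where fractional values are stored as scaled integers. The text does not state which representation the engines use, so the Q15 format and the helper names below are assumptions chosen purely for illustration.

```python
Q = 15          # number of fractional bits (Q15 is an assumed format)
ONE = 1 << Q    # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to Q15 by scaling and rounding (done offline, not on device)."""
    return int(round(x * ONE))


def to_float(a: int) -> float:
    """Convert a Q15 value back to a float, for inspection only."""
    return a / ONE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q15 values using only an integer multiply and a shift."""
    return (a * b) >> Q


def fixed_dot(xs: list[int], ws: list[int]) -> int:
    """Integer-only dot product: accumulate at full precision, rescale once."""
    acc = 0
    for x, w in zip(xs, ws):
        acc += x * w
    return acc >> Q


half = to_fixed(0.5)
quarter = to_fixed(0.25)
product = fixed_mul(half, quarter)          # represents 0.125
dot = fixed_dot([half, half], [half, half])  # represents 0.5
```

Model parameters would be converted once when the models are built, so that the device itself only ever executes integer multiplies, adds and shifts.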
The engines used for interlingua translation had not yet been ported to Windows CE. Thus at first we used a wireless connection between the PDA and a Linux server to provide the interlingua-to-text part of both the English and Arabic generation. We later replaced this with a statistical generator that was computationally light enough to run on the device itself. That this statistical generator could run on the device was not because statistical generation is inherently less computationally intensive than rule-based generation, but because the statistical generator was developed with a port to Windows CE in mind.
The parser part of the system was moved into the recognizer so that the parsing restrictions could better constrain recognition. On such a limited device efficient recognition is important, and coupling the ASR decoder tightly with a strong, appropriate language model helps by ruling out hypotheses that the parser would reject anyway.
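The effect of folding parsing restrictions into recognition can be illustrated with a toy decoder. This is only a sketch under strong simplifications: the grammar, the scores and the function `decode` are invented for the example, and a greedy word-by-word search stands in for the search a real ASR decoder would perform.

```python
# Hypothetical toy grammar: for each previous word, the words allowed next.
GRAMMAR = {
    "<s>": {"where", "i"},
    "where": {"is"},
    "is": {"the"},
    "the": {"doctor", "hospital"},
    "i": {"need"},
    "need": {"the"},
}


def decode(frames: list[dict[str, float]], grammar: dict[str, set[str]]) -> list[str]:
    """Greedy decode: at each step pick the best-scoring word the grammar allows."""
    previous = "<s>"
    output = []
    for scores in frames:
        allowed = grammar.get(previous, set())
        # Words outside the grammar are never considered, however well they score.
        candidates = {w: s for w, s in scores.items() if w in allowed}
        if not candidates:
            break  # no grammatical continuation; stop decoding
        previous = max(candidates, key=candidates.get)
        output.append(previous)
    return output


# Invented acoustic scores in which a confusable word scores highest at some steps.
frames = [
    {"where": 0.6, "wear": 0.7},
    {"is": 0.5, "his": 0.6},
    {"the": 0.9},
    {"doctor": 0.8, "daughter": 0.85},
]
hypothesis = decode(frames, GRAMMAR)
```

Here the acoustically stronger but out-of-grammar words ("wear", "his", "daughter") are never expanded, so the decoder both does less work and returns a sentence the parser can accept.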
The whole system was built as a single binary, as this makes the best use of the process model under Windows CE. Each module maps in its appropriate language data files. Although at present we only have examples in English and Arabic, there is nothing language-specific in the basic engines.
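Assuming that "maps in" refers to memory-mapping the data files, the idea can be sketched as follows. The example uses Python's standard `mmap` module on a desktop system rather than the Windows CE API, and the file contents and the helper `map_data_file` are placeholders invented for illustration.

```python
import mmap
import os
import tempfile


def map_data_file(path: str) -> mmap.mmap:
    """Map a read-only data file into memory; pages are loaded on demand."""
    with open(path, "rb") as f:
        # The mapping remains valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Stand-in for a language data file; a real model file would be far larger.
with tempfile.NamedTemporaryFile(delete=False, suffix=".dat") as tmp:
    tmp.write(b"ENGLISH-MODEL-DATA")
    model_path = tmp.name

model = map_data_file(model_path)
header = model[:7]  # slicing reads bytes straight from the mapping
model.close()
os.remove(model_path)
```

With this arrangement the engine code stays the same for every language, and supporting a new language amounts to mapping a different set of data files.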
At run time the system uses around 28MB of memory and hence can comfortably run on a 64MB PDA. However, as run-time memory and storage memory are distinct on most PDAs, we also need a separate storage card installed to hold the system (about 30MB). We have been using Compaq (HP) iPaq 3800 series machines (StrongARM 206MHz) and 3900 series machines (XScale 400MHz) for basic development, and have also ported the system to the Dell Axim and the one-way Phraselator hardware. The Dell Axim (XScale 300MHz) has only 32MB, but we found the system ran well, though slower than on 64MB machines.