As we were to build these systems in a short period and on a small budget, data-driven approaches were the only reasonable option. Such techniques must be used for each of the three core components: the machine translation, speech recognition, and speech synthesis engines.
Thus at the very start we arranged to record a number of chaplains in role-playing conversations of the type we expected the device to encounter. Fortunately, the chaplains were familiar with role-playing exercises, and all had relevant field experiences to re-enact. Both sides of the conversations were in English. These were digitally recorded with head-mounted microphones at 16 kHz in stereo (one speaker on each channel), as this was closest to the intended audio channel characteristics of the eventual system. In all, we recorded 46 conversations, ranging from a few minutes to 20 minutes in length, amounting to 4.25 hours of actual speech.
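Because each speaker occupies one channel of the stereo recording, the two sides of a conversation can be separated mechanically before transcription. The following is a minimal sketch of such channel splitting, assuming 16-bit PCM stereo WAV files; the function name and file paths are illustrative, not part of the original toolchain.

```python
import wave

def split_stereo(path, left_out, right_out):
    """Write each channel of a 16-bit stereo WAV to its own mono file."""
    with wave.open(path, "rb") as src:
        assert src.getnchannels() == 2 and src.getsampwidth() == 2
        framerate = src.getframerate()
        frames = src.readframes(src.getnframes())
    # Samples are interleaved L0 R0 L1 R1 ..., 2 bytes per sample.
    left, right = bytearray(), bytearray()
    for i in range(0, len(frames), 4):
        left += frames[i:i + 2]
        right += frames[i + 2:i + 4]
    for out_path, data in ((left_out, left), (right_out, right)):
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(framerate)
            dst.writeframes(bytes(data))
```

Separating the channels this way yields one clean mono file per speaker, which simplifies both transcription and later acoustic model training.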
These conversations were then hand-transcribed at the word level, identifying false starts, filled pauses, and complete words.
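Transcripts annotated this way can be normalized before being used for translation or language modeling, keeping only the complete words. The sketch below illustrates one such cleanup pass; the marker conventions assumed here ({um} for filled pauses, a trailing hyphen for false starts) are hypothetical, not the project's actual annotation scheme.

```python
import re

def normalize(transcript):
    """Strip filled-pause and false-start tokens, keeping complete words."""
    kept = []
    for tok in transcript.split():
        if re.fullmatch(r"\{.+\}", tok):  # filled pause, e.g. {um}
            continue
        if tok.endswith("-"):             # false start, e.g. "wher-"
            continue
        kept.append(tok)
    return " ".join(kept)

normalize("{um} wher- where is the doctor")
# returns "where is the doctor"
```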
Next, the transcriptions of these English-English conversations were hand-translated into Croatian by native Croatian speakers.
This data provided the basic information from which we could bootstrap the rest of the speech-to-speech translation system.