In April 2001, a group organized by the US Army Chaplain School took two versions of the device to Zagreb, where it was tested with non-English-speaking Croatians. A number of scenarios were prepared in English and Croatian and given to each participant to act out using the translation device. The scenarios were within the intended domain, involving refugees, medical supplies, and asking for general directions.
In all, 21 dialogs took place between different Croatian speakers and one of five chaplains. After the test, the Croatian participants were given a questionnaire to fill out. Their overall impressions were as follows:

  Overall impression
  Good    5
  OK     11
  Bad     3

Our own observation of the basic system was that it did work at a level that was useful about half of the time (it was not clear in advance that this would necessarily be the case). The participants were able to communicate through the system, and real information was transferred.
However, as expected, there were a number of specific problems. One that we noted immediately was a frustrating slowness of communication, caused by the required user clarifications, though the device was still much faster than communicating with only a bilingual dictionary.
When asked to identify the most difficult problems, the participants replied as follows:

  User difficulties
  grammar/case    5
  loudspeakers    4
  translation     3
  recognition     2
  synthesis       2
  speed           1

Hardware issues with the volume of the built-in speakers were a clear (and easily solvable) problem, but other aspects of the core technology were both harder to identify and harder to fix. The system includes a facility that allows the user (typically the chaplain) to explicitly add new words and phrases, so that common errors can be reduced over time. Although this facility was not used often, it is clear that supporting a greater level of adaptation would allow the device to become more useful over time.
Unlike the English recognizer, the Croatian recognizer did not model filled pauses and hesitations. As a result, extra short words (typically function words) were often erroneously hypothesized by the system. Because the system displayed what was being recognized, it was easy for the speaker to delete these extra words by hand, which they often did. However, the speakers also learned to speak more fluently and less conversationally as they used the system, improving recognition accuracy.
Similarly, we asked what they found easy:

  What works?
  short sentences   10
  nothing            4

Most participants quickly discovered that the system did not translate long, rambling sentences well; short, direct sentences were much more likely to produce good translations. This was not surprising, given the limitations of the platform and the deliberately limited development time, which tested whether a useful translation device could be built under such constraints. We were pleased to see that the system provided adequate coverage for successful translation of unrehearsed, naive dialogues.
We also noted some more specific problems. The first was that users could not easily identify where in the system a problem lay. For example, if speech recognition produced and displayed a correct transcript but translation then produced an unacceptable result, they would usually respeak the same utterance using the same words. Thus even if we provided separate methods for adding words to the recognizer, the language model, and the translation engine, it is clear that users would not be able to identify which part (or parts) needed to be updated. As we feel that such systems need to provide methods of adaptation in the field, the interface through which that adaptation is offered clearly needs more work.
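The attribution problem can be illustrated with a hypothetical sketch (the class and names below are ours, not from the fielded system): each pipeline stage keeps its own vocabulary and exposes its own adaptation hook, so knowing which stage fails on a word is exactly the judgement a user would need to make, and the one our participants could not.

```python
# Hypothetical sketch of per-stage adaptation hooks in a
# speech-to-speech pipeline; names and vocabularies are illustrative.

class Stage:
    """A pipeline stage with its own vocabulary that can be extended."""
    def __init__(self, name, vocab):
        self.name = name
        self.vocab = set(vocab)

    def add_entry(self, word):
        # Field adaptation: the user teaches this one stage a new word.
        self.vocab.add(word)

    def handles(self, word):
        return word in self.vocab

def diagnose(stages, word):
    """Return the stages that would fail on `word` -- the judgement
    users in the test could not make from the final output alone."""
    return [s.name for s in stages if not s.handles(word)]

recognizer = Stage("recognizer", {"medical", "supplies", "refugee"})
lang_model = Stage("language model", {"medical", "supplies"})
translator = Stage("translation engine", {"medical", "refugee"})
pipeline = [recognizer, lang_model, translator]

# "supplies" is known to the recognizer and language model but not to
# the translator, so updating the visible front end would not help.
print(diagnose(pipeline, "supplies"))
```

A unified interface would instead have to diagnose the failing stage itself and route the user's new word to every stage that lacks it.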
Beyond the widely noted problems with the volume of the output through the device's small built-in speakers, mistakes made by the synthesizer were often erroneously attributed to the translator (and vice versa).
A second observation was that the participants continued to use speech and turned to the alternative typing interface, of which they were clearly aware, only as a last resort. This may have been because the participants were told to use the speech-to-speech translation device, rather than being given the more abstract goal of communicating successfully by whatever means worked best. The very small keyboard on the (necessarily) small device may also have been a factor.
Further details of the evaluation are described in .