Next: Conclusions Up: LET'S GO: Improving Spoken Previous: Discussion

Evaluation

So far, we have carried out only informal, empirical evaluations of the system.

An initial experiment aimed to elicit how users might speak to a bus information system, that is, how they would formulate their queries in specific situations. We designed five scenarios in which the user needed to get specific information about a bus (e.g. the line number between a start point and an end point, or the time of the next bus at a given stop). We set up a dedicated phone line in our office and asked people in the Language Technologies Institute to pick one or two scenarios and call us. We did not try to emulate human-machine conversations; instead, we acted as if we were operators from the Port Authority. In all, we recorded 28 phone calls from 17 different callers (7 native and 10 non-native speakers of English). These data were used to manually extend the initial set of grammar rules for parsing and to refine our dialog model.

The information gathered from this experiment was used in designing the input language for the system.

Since our initial telephone-based system has only recently become operational, we have not yet carried out any formal tests. However, we have made the following observations.

The system works well for simple requests. When some information is missing, it is able to request it explicitly from the user. Hence, the dialog can be very short when the user expresses a complete query in one sentence (e.g. "When is the next bus leaving X going to Y?"). It can be longer and more system-directed if part of the request is missing or not recognized (see Figure 2 for an example of such a dialog). Systematic explicit confirmation from the system can be annoying for some users, but we found that, given the current number of speech recognition errors, it is important for the user to be able to monitor what the system has understood.
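The behavior described above (ask for whatever is missing, then confirm each item explicitly) can be illustrated with a minimal slot-filling sketch. This is not the system's dialog manager: the slot names, the prompts, and the function `next_system_turn` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Set

# Hypothetical slots for a "next bus" query; the real dialog model may differ.
REQUIRED_SLOTS = ("origin", "destination")

# Hypothetical prompts used when a slot is missing.
PROMPTS = {
    "origin": "Where are you leaving from?",
    "destination": "Where are you going?",
}


def next_system_turn(slots: Dict[str, Optional[str]], confirmed: Set[str]) -> str:
    """Return the next system utterance for a partially filled query."""
    for name in REQUIRED_SLOTS:
        value = slots.get(name)
        if value is None:
            # Missing information: request it explicitly.
            return PROMPTS[name]
        if name not in confirmed:
            # Explicit confirmation lets the user catch recognition errors.
            return f"I heard {name} {value}. Is that correct?"
    # Every slot is filled and confirmed: the query can be answered.
    return f"Looking up the next bus from {slots['origin']} to {slots['destination']}."
```

With both slots filled and confirmed the sketch answers in one turn; with missing or unconfirmed slots it produces the longer, system-directed exchange.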

As noted above, speech recognition is acceptable but far from perfect. We think that this is mainly due to the limitations of the "artificial" language model, which is derived from a generative grammar rather than from real user data. As we gain more experience with the system and collect data from a wider range of users, we are adjusting the generative grammar and thus improving the language model's quality. Ultimately, we will collect enough real data to train a model directly on it.
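To make the idea of an "artificial" language model concrete, the sketch below expands a toy generative grammar into sentences and counts bigrams over them. The grammar, its symbols, and the helper names are hypothetical; the actual grammar and the toolkit used to build the model are not described in this section.

```python
from collections import Counter

# Hypothetical toy grammar: upper-case keys are non-terminals, all else are words.
GRAMMAR = {
    "QUERY": [
        ["when", "is", "the", "next", "bus", "from", "STOP", "to", "STOP"],
        ["which", "bus", "goes", "to", "STOP"],
    ],
    "STOP": [["downtown"], ["the", "airport"]],
}


def expand(symbols):
    """Yield every word sequence derivable from a list of symbols."""
    if not symbols:
        yield []
        return
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        heads = [seq for rule in GRAMMAR[head] for seq in expand(rule)]
    else:
        heads = [[head]]
    for h in heads:
        for tail in expand(rest):
            yield h + tail


def bigram_counts(sentences):
    """Count word bigrams, with sentence-boundary markers."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


sentences = list(expand(["QUERY"]))
counts = bigram_counts(sentences)
```

Such counts reflect only what the grammar can generate, which is why adjusting the grammar changes the model and why real user data would eventually be preferable.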

Our baseline synthesizer was the standard diphone synthesizer in Festival, which is not sufficient (particularly over the telephone); hence our move to a domain synthesizer. Although building a domain synthesizer is more work, it is clear that a better output voice is necessary before we can make the system available to a wider population.

Naming bus stops is a non-trivial problem, and we are looking at general techniques to be able to match what our users may say when referring to stops.
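One simple way to approach this kind of matching, shown purely as an illustration and not as the technique under investigation, is to normalize what the user said and compare it against known spoken forms using approximate string matching from Python's standard `difflib` module. The stop names, the `STOPS` table, and the functions `normalize` and `match_stop` are hypothetical.

```python
import re
from difflib import get_close_matches
from typing import Optional

# Hypothetical mapping from a spoken form to a canonical stop name.
STOPS = {
    "forbes and murray": "Forbes Ave at Murray Ave",
    "fifth and craig": "Fifth Ave at Craig St",
}


def normalize(text: str) -> str:
    """Lower-case, spell out '&', drop punctuation, collapse whitespace."""
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())


def match_stop(utterance: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the canonical stop closest to the utterance, or None."""
    hits = get_close_matches(normalize(utterance), list(STOPS), n=1, cutoff=cutoff)
    return STOPS[hits[0]] if hits else None
```

A character-level similarity like this tolerates small recognition errors but knows nothing about landmarks or local nicknames, which is part of what makes the general problem non-trivial.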


Alan W Black 2003-10-27