Sphinx II uses a statistical n-gram language model for recognition. The recognition result is then parsed by Phoenix, a robust parser based on an extended context-free grammar, which allows the system to skip unknown words and perform partial parsing.
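The idea behind robust partial parsing can be sketched as follows. This is not the Phoenix parser itself; it is a minimal illustration, with hypothetical slot names and patterns, of matching known phrase patterns for semantic slots while ignoring words that match nothing, so that an ungrammatical utterance still yields a usable partial parse.

```python
import re

# Hypothetical slot patterns; a real Phoenix grammar is far richer.
SLOT_PATTERNS = {
    "departure": re.compile(r"\bfrom\s+(\w+)"),
    "arrival":   re.compile(r"\bto\s+(\w+)"),
    "time":      re.compile(r"\b(\d{1,2}(?::\d{2})?\s*(?:am|pm))\b"),
}

def partial_parse(utterance: str) -> dict:
    """Extract whatever slots can be found; unmatched words are skipped."""
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        m = pattern.search(utterance.lower())
        if m:
            slots[name] = m.group(1)
    return slots

# Even a disfluent, ungrammatical utterance produces a partial parse:
slots = partial_parse("uh I wanna go from Forbes to Murray at 5 pm please")
```

Rather than rejecting the whole utterance when a full parse fails, the parser returns whichever slots it could fill, which is what makes this style of parsing robust to recognition errors and disfluencies.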
Ideally, we would like to train the statistical language model on a corpus of transcribed dialogs corresponding to our particular task. Since the project started relatively recently and it took time to obtain proper permission to record calls to the Port Authority, we have just begun to receive specific data for our task and have not yet had time to preprocess it. The only Port Authority data we have used in the system so far is the set of official names of the bus stops, as stored in the schedule database.
Our approach to language modeling was to first write a grammar for our parser, then generate an artificial text corpus from the parsing grammar, and finally train a statistical language model on that corpus. We wrote the grammar based on a combination of our own intuition and a small-scale Wizard-of-Oz experiment we ran. The grammar rules used to identify bus stops were generated automatically from the schedule database.
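Corpus generation from a weighted grammar can be sketched as below. The grammar fragment, nonterminal names, weights, and stop names are all hypothetical; in the actual system the stop-name rules come from the schedule database and the weights from our frequency observations.

```python
import random

# Hypothetical weighted CFG fragment for the bus-information domain.
# Each nonterminal maps to (weight, expansion) pairs; symbols absent
# from the table are terminals.
GRAMMAR = {
    "S":    [(3, ["when", "is", "the", "next", "bus", "FROM", "TO"]),
             (1, ["i", "want", "to", "go", "FROM", "TO"])],
    "FROM": [(1, ["from", "STOP"])],
    "TO":   [(1, ["to", "STOP"])],
    "STOP": [(5, ["forbes", "and", "murray"]),   # weights favor busy stops
             (1, ["east", "busway"])],
}

def generate(symbol="S"):
    """Recursively expand one symbol according to the rule weights."""
    if symbol not in GRAMMAR:
        return [symbol]
    rules = GRAMMAR[symbol]
    expansion = random.choices([r for _, r in rules],
                               weights=[w for w, _ in rules])[0]
    out = []
    for sym in expansion:
        out.extend(generate(sym))
    return out

corpus = [" ".join(generate()) for _ in range(200_000)]
```

Weighting the rules biases the sampled corpus toward frequent phrasings and busy stops, so the resulting n-gram counts better approximate what real callers say.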
In order to make the parsing grammar robust enough to parse fairly ungrammatical, yet understandable sentences, it was kept as general as possible. When used for speech generation, however, a very general grammar produces a large number of sentences that are not only ungrammatical but also unnatural. We therefore modified the grammar to make it suitable for speech generation and enhanced it by weighting the rules according to our observations of how frequently they occur in natural language. We also adjusted the weights of the bus stop names according to how frequently they are likely to appear in user requests, again based on our own observations. Using the modified grammar, we generated a 200,000-sentence corpus, which is large enough to cover most of the bus stop and time expressions in the domain. We trained a 3-gram model on the corpus using the CMU-Cambridge Statistical Language Modeling Kit.
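The core of 3-gram training can be illustrated with the sketch below. The CMU-Cambridge toolkit additionally applies vocabulary cutoffs, discounting, and backoff; this simplified version computes only maximum-likelihood trigram probabilities, and the sentence markers and example sentences are our own illustration.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams over sentences padded with <s>/</s> markers and
    return maximum-likelihood P(w3 | w1, w2).  Real toolkits add
    discounting and backoff for unseen events; this sketch omits them."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(len(words) - 2):
            tri[tuple(words[i:i + 3])] += 1
            bi[tuple(words[i:i + 2])] += 1
    return lambda w1, w2, w3: (tri[(w1, w2, w3)] / bi[(w1, w2)]
                               if bi[(w1, w2)] else 0.0)

p = train_trigram(["when is the next bus", "when is the last bus"])
```

For instance, with the two toy sentences above, the history ``is the'' is followed by ``next'' half the time, so the model assigns it probability 0.5.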
Although the resulting language model is not as good as one built from real data, it allows us to obtain a usable prototype with which we can now collect and transcribe dialogs that take place in the experiments with extreme populations, while we await preprocessed real training data.
We are approaching language modeling and dialog management with one of the main goals of the project in mind -- detecting incorrect lexical and grammatical structures in non-native speech and offering correction. On the one hand, the language model needs to be general enough to accept sentence structures and expressions that are not quite correct. For example, asking for ``the coming bus'' instead of ``the next bus'', or ``when the bus is coming'' instead of ``when is the next bus coming'', should be acceptable to our system. On the other hand, a phrase like ``when done bus come here'' would be difficult to accept. Having accepted the former examples, we then want to give the user subtle corrective help so that the next time he or she uses the word or expression, it is correct. This is not a language learning system, however: some users call just before they run out the door to catch the bus, so we have at most two short sentences in which to offer the correction. We are starting to build utterances in which we take an incorrect response such as ``the coming bus'' and respond with ``You want the next bus?'', with higher pitch and intensity on the corrected word, ``next''. This advances the dialog while giving corrective information at the same time.
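The corrective-confirmation strategy can be sketched as follows. The correction table, function name, and `<emph>` markup are all hypothetical stand-ins for whatever mapping and prosody controls the system actually uses; the sketch only shows the idea of emphasizing the word that was corrected.

```python
# Hypothetical map from non-native phrasings to their standard forms.
CORRECTIONS = {
    "the coming bus": "the next bus",
}

def corrective_confirm(user_phrase: str) -> str:
    """Build a confirmation prompt, emphasizing the corrected word.
    <emph> stands in for a higher-pitch, higher-intensity prosody tag."""
    standard = CORRECTIONS.get(user_phrase)
    if standard is None:
        return f"You want {user_phrase}?"
    user_words = user_phrase.split()
    std_words = standard.split()
    out = []
    for i, w in enumerate(std_words):
        if i >= len(user_words) or w != user_words[i]:
            # First word that differs from the user's phrasing gets emphasis.
            out.append(f"<emph>{w}</emph>")
            out.extend(std_words[i + 1:])
            break
        out.append(w)
    return "You want " + " ".join(out) + "?"
```

The prompt both confirms the request (advancing the dialog) and implicitly models the correct form, which is all the correction the two-sentence budget allows.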