
   
Architecture


  
Figure 5: The Enthusiast System

Our system was designed to be integrated into the Enthusiast system developed at Carnegie Mellon University (see [30,21,31,20]). Enthusiast is a speech-to-speech machine translation system that translates Spanish into English. The aspects of the system relevant to this paper are shown in Figure 5. The system processes all the utterances of a single speaker turn together (utterances 1 through n in the figure). Each spoken Spanish utterance is input to the speech recognizer, which produces one or more transcriptions of the utterance. The output of the speech recognition system is the input to a semantic parser [20,21], which produces a representation of the literal meaning of the sentence. This representation is called an Interlingual Text (ILT). The output of the semantic parser is ambiguous, consisting of multiple ILT representations of the input transcription. All of the ILT representations produced for an utterance are input to the discourse processor, which produces the final, unambiguous representation of that utterance. This representation is called an Augmented ILT.
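To make the data flow concrete, the following Python sketch shows one way the representations passed between these components might be organized. It is purely illustrative: the class names, fields, and the grouping of candidate ILTs by utterance are our assumptions, not the actual Enthusiast data structures.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ILT:
    # One candidate literal-meaning representation (Interlingual Text)
    # of an utterance, as produced by the semantic parser.
    frame: str                                          # e.g. a "meet" or "is busy" case frame
    temporal_info: Dict = field(default_factory=dict)   # surface temporal representation

@dataclass
class AugmentedILT:
    # The final, unambiguous representation chosen by the discourse processor.
    source: ILT
    temporal_units: List[Dict] = field(default_factory=list)
    certainty_factor: float = 0.0

@dataclass
class SpeakerTurn:
    # All utterances of a single speaker turn are processed together;
    # candidate_ilts[i] holds the alternative ILTs produced for utterance i.
    utterances: List[str]
    candidate_ilts: List[List[ILT]] = field(default_factory=list)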

The discourse processor can be configured to be our system alone, a plan-based discourse processor developed at CMU [31], or the two working together in integrated mode. The main results, presented in Tables 2 and 3 in Section 6, are for our system working alone, taking as input the ambiguous output of the semantic parser. For the CMU dialogs, the input to the semantic parser is the output of the speech recognition system. The NMSU dialogs were input to the semantic parser directly in the form of transcriptions.3

To produce one ILT, the semantic parser maps the main event and its participants into one of a small set of case frames (for example, a meet frame or an is busy frame). It also produces a surface representation of the temporal information in the utterance, which mirrors the form of the input utterance. Although the events and states discussed in the NMSU data are often outside the coverage of this parser, the temporal information generally is not. Thus, the parser provides a sufficient input representation for our purposes on both sets of data.

As the Enthusiast system is configured, the input is presented to our discourse processor in the form of alternative sequences of ILTs. Each sequence contains one ILT for each utterance. For example, using the notation in Figure 5, a sequence might consist of ILT$_{1,2,3}$, ILT$_{2,1,1}$, $\ldots$, ILT$_{n,2,1}$. Our system resolves the ambiguity in batches. Specifically, it produces a sequence of Augmented ILTs for each input sequence, and then chooses the best sequence as its final interpretation of the corresponding utterances. In this way, the input ambiguity is resolved as a function of finding the best temporal interpretations of the utterance sequences in context (as suggested by [30]). However, the number of alternative sequences of ILTs for a set of utterances can be prohibitively large for our system. The total number of sequences considered by the system is limited to the top 125, where the sequences are ordered using statistical rankings provided by the Enthusiast system.
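The limit on the search can be illustrated with a short Python sketch; the ranking scores are assumed to come from Enthusiast's statistical rankings, and the function name is ours.

from typing import List, Sequence, Tuple

MAX_SEQUENCES = 125  # cap on the number of alternative ILT sequences considered

def top_ranked_sequences(
        ranked: Sequence[Tuple[float, list]]) -> List[list]:
    # Each element pairs an Enthusiast ranking score with a sequence of ILTs
    # (one ILT per utterance).  Keep only the best-ranked MAX_SEQUENCES.
    ordered = sorted(ranked, key=lambda pair: pair[0], reverse=True)
    return [sequence for _, sequence in ordered[:MAX_SEQUENCES]]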

Our method for performing semantic disambiguation is appropriate for this project, because the focus is on temporal reference resolution and not on semantic disambiguation. However, much semantic ambiguity cannot be resolved on the basis of the temporal discourse context alone, so this represents a potential area for improvement in the system performance results presented in Section 6. In fact, the Enthusiast researchers have already developed better techniques for resolving the semantic ambiguity in these dialogs [32].

Because the ILT representation was designed to support various projects in discourse, semantic interpretation, and machine translation, the representation produced by the semantic parser is much richer than is required for our temporal reference resolution algorithm. We recommend that others who implement our algorithm for their application build an input parser to produce only the necessary temporal information. The specification of our input is available in Online Appendix 2.

As described in Section 4.3, a focus list records the Temporal Units that have been discussed so far in the dialog. After a final Augmented ILT has been created for the current utterance, the Augmented ILT and the utterance are placed together on the focus list. In the case of an utterance that specifies more than one Temporal Unit, a separate entity is added to the focus list for each, in order of mention. In other respects, the system architecture is similar to a standard production system, with one major exception: rather than choosing the results of just one of the rules that fires, multiple results can be merged. This is a flexible architecture that accommodates sets of rules targeting different aspects of the interpretation.
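A minimal sketch of this focus-list update, assuming a simple entry structure (the field names are ours, not the system's):

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FocusListEntry:
    # One Temporal Unit together with the utterance that introduced it.
    temporal_unit: Dict
    utterance: str

def update_focus_list(focus_list: List[FocusListEntry],
                      utterance: str,
                      temporal_units: List[Dict]) -> None:
    # After the final Augmented ILT is chosen, add one entry per Temporal
    # Unit it contains, in order of mention (most recent entries last).
    for tu in temporal_units:
        focus_list.append(FocusListEntry(temporal_unit=tu, utterance=utterance))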

Following are the basic steps in processing a single ILT.

Step 1. The input ILT is normalized. In producing the ILTs that serve as input to our system, the semantic parser often represents pieces of information about the same time separately, mirroring the surface form of the utterance. This is done in order to capture relationships, such as topic-comment relationships, among clauses. Our system needs to know which pieces of information are about the same time, but does not need to know about the additional relationships. Thus, the system maps the input representation into a normalized form, to shield the reasoning component from the idiosyncrasies of the input representation. A specification of the normalized form is given in Online Appendix 2.

The goal of the normalization process is to produce one Temporal Unit per distinct time specified in the utterance. The normalization program is quite detailed (since it must account for the various structures possible in the CMU input ILT), but the core strategy is straightforward: it merges information provided by separate noun phrases into one Temporal Unit, if it is consistent to do so. Thus, new Temporal Units are created only if necessary. Interestingly, few errors result from this process. Following are some examples.

``I can meet Wednesday or Thursday.'' Represented as two disjoint TUs.
``I can meet from 2:00 until 4:00 on the 14th.'' Represented as one TU.
``I can meet Thursday the 11th of August.'' Represented as one TU.
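The core merging strategy can be sketched as follows in Python; the consistent test below is a stand-in for the system's actual compatibility check over the fields of a Temporal Unit.

from typing import Dict, List

def consistent(tu: Dict, info: Dict) -> bool:
    # Stand-in check: the new information never assigns a different value
    # to a field the Temporal Unit already has.
    return all(tu[k] == v for k, v in info.items() if k in tu)

def normalize(temporal_phrases: List[Dict]) -> List[Dict]:
    # Merge information from separate phrases into an existing Temporal Unit
    # when it is consistent to do so; create a new Temporal Unit only if
    # no existing one is compatible.
    temporal_units: List[Dict] = []
    for info in temporal_phrases:
        for tu in temporal_units:
            if consistent(tu, info):
                tu.update(info)
                break
        else:
            temporal_units.append(dict(info))
    return temporal_units

# "from 2:00 until 4:00 on the 14th" -> one Temporal Unit
print(normalize([{"start": "2:00"}, {"end": "4:00"}, {"day": 14}]))
# "Wednesday or Thursday" -> two disjoint Temporal Units
print(normalize([{"weekday": "Wednesday"}, {"weekday": "Thursday"}]))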

After the normalization process, highly accurate, obvious inferences are made and added to the representation.

Step 2. All of the rules are applied to the normalized input. The result of a rule application is a Partial Augmented ILT--information this rule will contribute to the interpretation of the utterance, if it is chosen. This information includes a certainty factor representing an a priori preference for the type of anaphoric or deictic relation being established. In the case of anaphoric relations, this factor is adjusted by a term representing how far back on the focus list the antecedent is (in the anaphoric rules in Section 5.3, the adjustment is represented by distance factor in the calculation of the certainty factor CF). The result of this step is the set of Partial Augmented ILTs produced by the rules that fired (i.e., those that succeeded).
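As a rough illustration of the certainty-factor adjustment for an anaphoric rule, consider the Python sketch below. The decay used here is an assumption made for the example; the actual combination of the a priori factor and the distance factor is given with the rules in Section 5.3.

from dataclasses import dataclass
from typing import Dict

@dataclass
class PartialAugmentedILT:
    # What one rule contributes to the interpretation, if it is chosen.
    contribution: Dict
    certainty_factor: float

def anaphoric_certainty(a_priori: float, focus_list_distance: int) -> float:
    # Assumed decay: the a priori preference for this type of anaphoric
    # relation, discounted the farther back on the focus list the
    # antecedent sits (0 = most recent entry).
    distance_factor = 1.0 / (1 + focus_list_distance)
    return a_priori * distance_factor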

In the case of multiple Temporal Units in the input ILT, each rule is applied as follows. If the rule does not access the focus list, the rule is applied to each Temporal Unit. A list of Partial Augmented ILTs is produced, containing one entry for each successful match, retaining the order of the Temporal Units in the original input. If the rule does access the focus list, the process is the same, but with one important difference. The rule is applied to the first Temporal Unit. If it is successful, then the same focus list entity used to apply the rule to this Temporal Unit is used to interpret the remaining Temporal Units in the list. Thus, all the anaphoric temporal references in a single utterance are understood with respect to the same focus list element. So, for example, the anaphoric interpretations of the temporal expressions in ``I can meet Monday or Tuesday'' both have to be understood with respect to the same entity in the focus list.
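The shared-antecedent policy can be sketched as follows; the rule interface (a function returning a Partial Augmented ILT or None) and the search order over the focus list are our assumptions.

from typing import Callable, Dict, List, Optional

Rule = Callable[[Dict, Dict], Optional[Dict]]   # rule(tu, antecedent) -> partial or None

def apply_focus_list_rule(rule: Rule,
                          temporal_units: List[Dict],
                          focus_list: List[Dict]) -> List[Dict]:
    # Apply one anaphoric rule to every Temporal Unit of an utterance.
    partials: List[Dict] = []
    antecedent: Optional[Dict] = None
    for tu in temporal_units:
        if antecedent is None:
            # Search the focus list, most recent entry first, for the first TU.
            for candidate in reversed(focus_list):
                partial = rule(tu, candidate)
                if partial is not None:
                    antecedent = candidate
                    partials.append(partial)
                    break
        else:
            # Remaining TUs are interpreted against the same antecedent.
            partial = rule(tu, antecedent)
            if partial is not None:
                partials.append(partial)
    return partials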

When accessing entities on the focus list, an entry for an utterance that specifies multiple Temporal Units may be encountered. In this case, the Temporal Units are simply accessed in order of mention (from most to least recent).

Step 3. All maximal mergings of the Partial Augmented ILTs are created. Consider a graph in which the Partial Augmented ILTs are the vertices, and there is an edge between two Partial Augmented ILTs if they are compatible. Then, the maximal cliques of the graph (i.e., the maximal complete subgraphs) correspond to the maximal mergings. Each maximal merging is then merged with the normalized input ILT, resulting in a set of Augmented ILTs.
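The maximal mergings can be computed by enumerating the maximal cliques directly. The sketch below uses a plain Bron-Kerbosch enumeration; the compatible test is a placeholder for the system's actual compatibility check between Partial Augmented ILTs.

from typing import Callable, Dict, List, Set

def maximal_cliques(adjacency: Dict[int, Set[int]]) -> List[Set[int]]:
    # Bron-Kerbosch enumeration of maximal cliques (no pivoting; the graphs
    # here, one vertex per Partial Augmented ILT, are small).
    cliques: List[Set[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adjacency), set())
    return cliques

def maximal_mergings(partials: List[Dict],
                     compatible: Callable[[Dict, Dict], bool]) -> List[List[Dict]]:
    # Vertices are Partial Augmented ILTs; an edge joins each compatible pair.
    # Each maximal clique is one maximal merging (to be merged, in turn, with
    # the normalized input ILT).
    adjacency = {i: {j for j in range(len(partials))
                     if j != i and compatible(partials[i], partials[j])}
                 for i in range(len(partials))}
    return [[partials[i] for i in sorted(clique)]
            for clique in maximal_cliques(adjacency)]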

Step 4. The Augmented ILT chosen is the one with the highest certainty factor. The certainty factor of an Augmented ILT is calculated as follows. First, the certainty factors of the constituent Partial Augmented ILTs are summed. Then, critics are applied to the resulting Augmented ILT, lowering the certainty factor if the information is judged to be incompatible with the dialog state.
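The scoring in this step can be sketched like so; the critics and their penalty values are placeholders for the system's actual checks against the dialog state, and the "partial_cfs" key is an assumed bookkeeping convention.

from typing import Callable, Dict, List

Critic = Callable[[Dict], float]   # returns a penalty, 0.0 if no incompatibility is found

def augmented_ilt_certainty(partial_cfs: List[float],
                            augmented_ilt: Dict,
                            critics: List[Critic]) -> float:
    # Sum the certainty factors of the constituent Partial Augmented ILTs,
    # then let each critic lower the total if the combined information
    # clashes with the dialog state.
    cf = sum(partial_cfs)
    for critic in critics:
        cf -= critic(augmented_ilt)
    return cf

def choose_augmented_ilt(candidates: List[Dict], critics: List[Critic]) -> Dict:
    # The candidate with the highest resulting certainty factor is chosen.
    return max(candidates,
               key=lambda c: augmented_ilt_certainty(c["partial_cfs"], c, critics))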

The merging process might have yielded additional opportunities for making obvious inferences, so this process is performed again, to produce the final Augmented ILT.

To process the alternative input sequences, a separate invocation of the core system is made for each sequence, with the sequence of ILTs and the current focus list as input. The result of each call is a sequence of Augmented ILTs, which are the system's best interpretations of the input ILTs, and a new focus list, representing the updated discourse context corresponding to that sequence of interpretations. The system assigns a certainty factor to each sequence of Augmented ILTs, namely, the sum of the certainty factors of the constituents. It chooses the sequence with the highest certainty factor, and updates the focus list to the one calculated for that sequence.
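Finally, the sequence-level choice can be sketched as below; interpret_sequence stands in for the per-sequence invocation of the core system described above.

from typing import Callable, List, Tuple

# interpret_sequence(ilt_sequence, focus_list) ->
#   (augmented_ilts, their_certainty_factors, updated_focus_list)
InterpretFn = Callable[[list, list], Tuple[list, List[float], list]]

def choose_best_sequence(sequences: List[list],
                         focus_list: list,
                         interpret_sequence: InterpretFn) -> Tuple[list, list]:
    # Invoke the core system once per alternative ILT sequence, score each
    # result by the sum of its constituents' certainty factors, and keep the
    # best interpretation together with its updated focus list.
    best_score = float("-inf")
    best_interpretation, best_focus = None, focus_list
    for sequence in sequences:
        augmented, cfs, new_focus = interpret_sequence(sequence, focus_list)
        if sum(cfs) > best_score:
            best_score = sum(cfs)
            best_interpretation, best_focus = augmented, new_focus
    return best_interpretation, best_focus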

