
The Temporal Annotations and Intercoder Reliability Study

Consider the passage shown in Figure 1, which is from the CMU corpus (translated into English). An example of temporal reference resolution is that utterance (2) refers to 2-4pm, Thursday 30 September.

Figure 1: Corpus Example (dialog excerpt not preserved in the extracted text)

Because the dialogs are centrally concerned with negotiating an interval of time in which to hold a meeting, our representations are geared toward such intervals. The basic representational unit is given in Figure 2. It is referred to throughout as a Temporal Unit (TU).

Figure 2: The Temporal Unit Representation

((start-month, start-date, start-day-of-week, start-hour&minute, start-time-of-day)
 (end-month, end-date, end-day-of-week, end-hour&minute, end-time-of-day))

For example, the time specified in ``From 2 to 4, on Wednesday the 19th of August'' is represented as follows:

((August, 19, Wednesday, 2, pm)
(August, 19, Wednesday, 4, pm))

Thus, the information from multiple noun phrases is often merged into a single representation of the underlying interval specified by the utterance.
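This merging can be sketched in code. The following is an illustrative sketch, not the paper's implementation; the Endpoint field names are assumed from Figure 2, with None marking an unfilled (null) field:

```python
from typing import NamedTuple, Optional

class Endpoint(NamedTuple):
    # One end of a Temporal Unit; fields follow Figure 2.
    month: Optional[str] = None
    date: Optional[int] = None
    day_of_week: Optional[str] = None
    hour_minute: Optional[str] = None
    time_of_day: Optional[str] = None  # am/pm

class TemporalUnit(NamedTuple):
    start: Endpoint
    end: Endpoint

def merge(ep: Endpoint, other: Endpoint) -> Endpoint:
    # Fill each null field of `ep` from `other`: information from
    # multiple noun phrases is merged into one interval representation.
    return Endpoint(*(a if a is not None else b for a, b in zip(ep, other)))

# "From 2 to 4, on Wednesday the 19th of August"
date_part = Endpoint(month="August", date=19, day_of_week="Wednesday")
tu = TemporalUnit(
    start=merge(Endpoint(hour_minute="2", time_of_day="pm"), date_part),
    end=merge(Endpoint(hour_minute="4", time_of_day="pm"), date_part),
)
```

The date information from one noun phrase is shared by both endpoints, matching the representation shown above.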

Temporal references to times in utterances such as ``The meeting starts at 2'' are also represented in terms of intervals. An issue this kind of utterance raises is whether or not a speculated end time of the interval should be filled in, using knowledge of how long meetings usually last. In the CMU data, the meetings all last two hours, by design. However, our annotation instructions are conservative with respect to filling in an end time given a starting time (or vice versa), specifying that it should be left open unless something in the dialog explicitly suggests otherwise. This policy makes the instructions applicable to a wider class of dialogs.

Weeks, months, and years are represented as intervals starting with the first day of the interval (for example, the first day of the week), and ending with the last day of the interval (for example, the last day of the week).
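For example, a week mentioned in a dialog can be expanded into such an interval with the standard library (an illustrative sketch assuming Monday-first weeks; the paper does not state its convention):

```python
from datetime import date, timedelta

def week_interval(d):
    # Represent "the week containing d" as (first day, last day).
    # Assumes weeks run Monday through Sunday.
    first = d - timedelta(days=d.weekday())
    return first, first + timedelta(days=6)

# The week containing Thursday 30 September 1993
start, end = week_interval(date(1993, 9, 30))
print(start, end)  # 1993-09-27 1993-10-03
```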

Some times are treated as points in time (for example, the time specified in ``It is now 3pm''). These are represented as Temporal Units with the same starting and end times (as in [2]). If just the starting or end time is specified, all the fields of the other end of the interval are null. And, of course, all fields are null for utterances that do not contain any temporal information. In the case of an utterance that specifies multiple, distinct intervals, the representation is a list of Temporal Units (for further details of the coding scheme, see [27]).
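A point in time and an open-ended interval can be sketched with plain five-field tuples (the field order follows Figure 2; the concrete values are illustrative):

```python
# Each endpoint: (month, date, day-of-week, hour&minute, time-of-day).
# "It is now 3pm" on Thursday 30 September: start and end coincide.
point = ("September", 30, "Thursday", "3", "pm")
point_tu = (point, point)

# "The meeting starts at 2": no explicit end, so all end fields are null.
open_tu = ((None, None, None, "2", "pm"), (None, None, None, None, None))

assert point_tu[0] == point_tu[1]           # a point in time
assert all(f is None for f in open_tu[1])   # open-ended interval
```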

Temporal Units are also the representations used in the evaluation of the system. That is, the system's answers are mapped from its more complex internal representation (an ILT, see Section 5.2) into this simpler vector representation before evaluation is performed.

The evaluation Temporal Units used to assess the system's performance were annotated by personnel working on the project. The training data were annotated by the second author of this paper, who also worked on developing the rules and other knowledge used in the system. However, the test data were annotated by another project member, Karen Payne, who contributed to the annotation instructions and to the integration of the system with the Enthusiast system (see below in Section 5.2), but did not contribute to developing the rules and other knowledge used in the system.

As in much recent empirical work in discourse processing (see, for example, [4,16,22,25,13]), we performed an intercoder reliability study investigating agreement in annotating the times. The main goal in developing annotation instructions is to make them precise but intuitive, so that they can be used reliably by non-experts after a reasonable amount of training (see [28,8,13]). Reliability is measured in terms of the amount of agreement among annotators; high reliability indicates that the encoding scheme is reproducible given multiple annotators. The instructions also serve to document the annotations.

The subjects were three people with no previous involvement in the project. They were given the original Spanish and the English translations. However, as they have limited knowledge of Spanish, in essence they annotated the English translations.

The subjects annotated two training dialogs according to the instructions. After receiving feedback, they annotated four unseen test dialogs. Intercoder reliability was assessed using Cohen's Kappa statistic ($\kappa$) [35,6]. Agreement for each Temporal Unit field (for example, start-month) was assessed independently.

$\kappa$ is calculated as follows:

\begin{displaymath}\kappa = \frac{Pa - Pe}{1 - Pe} \end{displaymath}

The numerator is the average observed agreement among the annotators (Pa) less the agreement expected by chance (Pe), and the denominator is perfect agreement (1) less the same chance term (Pe).
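In code, with Pa and Pe expressed as proportions (using the Month row's rounded values from Table 1, so the result differs from the published .93 only through rounding):

```python
def kappa(pa, pe):
    # Chance-corrected agreement: observed agreement minus chance agreement,
    # normalized by the maximum possible improvement over chance.
    return (pa - pe) / (1 - pe)

print(round(kappa(0.96, 0.51), 2))  # 0.92
```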

Pa and Pe are calculated as follows [35]. Suppose that there are N objects, M classes, and K taggers. Let $n_{ij}$ be the number of taggers who assign the ith object to the jth class, and let $p_{j}$ be the proportion of all assignments that fall in the jth class, i.e., $p_{j} = \frac{1}{NK}\sum_{i=1}^{N} n_{ij}$.

We can now define Pe:

\begin{displaymath}Pe = \sum_{j=1}^{M} p^{2}_{j} \end{displaymath}

The extent of agreement among the taggers concerning the ith object is Si, defined as follows. It is the total number of actual agreements for object i, over the maximum possible agreement for one object:

\begin{displaymath}S_i = \frac{\sum_{j=1}^{M} \left( \begin{array}{c} n_{ij} \\ 2 \end{array} \right)}{\left( \begin{array}{c} K \\ 2 \end{array} \right)} \end{displaymath}

Finally, Pa is the average agreement over objects:

\begin{displaymath}Pa = \frac{1}{N} \sum_{i=1}^{N} S_i \end{displaymath}
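The three definitions above can be assembled into a short computation (a sketch, assuming n[i][j] holds the number of the K taggers who assigned object i to class j):

```python
from math import comb

def fleiss_kappa(n):
    # n[i][j]: number of the K taggers assigning object i to class j.
    N = len(n)
    K = sum(n[0])  # each row sums to the number of taggers
    M = len(n[0])
    # p_j: overall proportion of assignments falling in class j
    p = [sum(row[j] for row in n) / (N * K) for j in range(M)]
    Pe = sum(pj ** 2 for pj in p)
    # S_i: actual pairwise agreements for object i over the maximum C(K, 2)
    S = [sum(comb(nij, 2) for nij in row) / comb(K, 2) for row in n]
    Pa = sum(S) / N
    return (Pa - Pe) / (1 - Pe)

# Three taggers, two classes, perfect agreement on both objects
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```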

$\kappa$ is 0.0 when the agreement is what one would expect under independence, and it is 1.0 when the agreement is exact [11]. A $\kappa$ value of 0.8 or greater indicates a high level of reliability among raters, with values between 0.67 and 0.8 indicating only moderate agreement [13,6].

In addition to measuring intercoder reliability, we compared each coder's annotations to the gold standard annotations used to assess the system's performance. Results for both types of agreement are shown in Table 1. The agreement among coders is shown in the column labeled $\kappa$, and the average pairwise $\kappa$ values for the coders and the expert who performed the gold standard annotations are shown in the column labeled $\kappa_{avg}$. This was calculated by averaging the individual $\kappa$ scores (which are not shown).

Table 1: Agreement Among Coders

Field        Pa    Pe    $\kappa$   $\kappa_{avg}$
Start fields:
  Month      .96   .51   .93        .94
  Date       .95   .50   .91        .93
  DayofWeek  .96   .52   .91        .92
  HourMin    .98   .82   .89        .92
  TimeDay    .97   .74   .87        .74
End fields:
  Month      .97   .51   .93        .94
  Date       .96   .50   .92        .94
  DayofWeek  .96   .52   .92        .92
  HourMin    .99   .89   .90        .88
  TimeDay    .95   .85   .65        .52

There is a high level of agreement among annotators in all cases except the end time-of-day field, a weakness we are investigating. There is also good agreement between the gold standard annotations and those of the naive coders: with the exception of the time-of-day fields, $\kappa_{avg}$ indicates high average pairwise agreement between the expert and the naive subjects.

Busemann et al. [5] also annotate temporal information in a corpus of scheduling dialogs. However, their annotations are at the level of individual expressions rather than at the level of Temporal Units, and they do not present the results of an intercoder reliability study.
