We had two aims in using the training corpus: (1) to estimate the importance of the structural anaphoric accessibility space, and (2) to define an adequate set of constraints and preferences (experiments 0, 1, 2, and 3). The test corpus was reserved to obtain the final evaluation.
In addition, the entire corpus was manually annotated with two different goals: (1) to identify further discourse structural properties such as adjacency pairs and topics, and (2) to identify anaphors and antecedents. Although we annotated the corpus manually, there are at present some automatic systems for performing adjacency pair tagging (the Basurde Project [Basurde Project 1998], for example), as well as for automatic topic tagging [Reynar 1999] or automatic topic extraction (see the method for anaphora resolution described in Section 4).
The annotation of conversational structure was carried out as described in the next paragraph. An important aspect of dialogue structure annotation is the training phase, which ensures reliability among annotators.
The annotation phase was accomplished as follows: (1) two annotators were selected; (2) using a training corpus, the two annotators reached an agreement with regard to the annotation scheme; (3) both annotators then annotated the test corpus in parallel; and (4) a reliability study of the annotation was carried out [Carletta et al. 1997]. The reliability study used the kappa statistic, which measures agreement between the two annotators' category judgments beyond what would be expected by chance. For the computation of the kappa (k) statistic, see Siegel and Castellan [1988].
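The kappa computation referenced above can be sketched as follows. This is a minimal illustration of Cohen's kappa for two annotators; the turn-type labels in the example are invented, not taken from the corpus:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' category judgments."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical turn-type judgments by the two annotators:
a = ["I", "R", "I", "R", "C", "I", "R", "C"]
b = ["I", "R", "I", "R", "C", "R", "R", "C"]
print(round(cohen_kappa(a, b), 2))  # → 0.81
```

With one disagreement in eight judgments, the observed agreement is 0.875, the chance agreement 0.344, and kappa about 0.81.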
Because turns are marked during the transcription phase, the annotator merely classifies turns according to the turn types described in Section 3 and then relates each initiative intervention ITI to its reaction intervention ITR, thereby defining adjacency pairs. Since this task simply requires classification, it is easily measured using the kappa statistic.
Concurrently, topics were identified. This task was also simple, since the corpus used for these experiments is organized into short dialogues and each dialogue has only one main topic or theme, and since these are introduced clearly by means of the client's intervention at the beginning of each dialogue. As a result, we detected no discrepancies between annotators with regard to the topic identification. Therefore, there was no need to measure this task using the kappa statistic.
According to Carletta et al., a k value between 0.68 and 0.80 permits positive conclusions to be drawn, and a k greater than 0.80 indicates full reliability between the annotators' results.
In those cases where a discrepancy was found between the annotators, the following criterion was applied: each dialogue was assigned a main annotator whose annotation was considered definitive in the event that there were discrepancies between the two accounts. In order to guarantee balance, each annotator was the main annotator for exactly 50% of the dialogues.
Once both annotators had finished the annotation, the reliability study was carried out, with a resultant kappa measurement of 0.91. We therefore consider the annotation obtained for the evaluation to be totally reliable.
Since the annotated texts would be processed by an anaphora resolution system, we developed an SGML tagging format.
Generally, this SGML markup will have the following form:
<ELEMENT-NAME ATTRIBUTE-NAME="VALUE" ...> text-string </ELEMENT-NAME>
Thus, the following notations are used in each case:
<TOPIC> Topic-entity </TOPIC>
<AP ID="number"> Adjacency-pair </AP>
where ID is an identification number used to arrange the adjacency pairs in sequential order
<IT TYPE="R|I" SPEAKER="speaker"> Intervention-turn </IT>
where TYPE may be "R" or "I" (Reaction or Initiative) and SPEAKER identifies the speaker whose turn it is
<CT SPEAKER="speaker"> Continuing-turn </CT>
The format is exemplified in Figure 4.
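A minimal sketch of how annotations in this format might be read programmatically. The dialogue snippet and the regular expression are illustrative assumptions, not part of the original system:

```python
import re

# Invented sample dialogue marked up with the notation described above.
sample = '''<AP ID="1">
<IT TYPE="I" SPEAKER="Client"> I need the schedule for Alicante. </IT>
<IT TYPE="R" SPEAKER="Operator"> The train leaves at nine. </IT>
</AP>'''

# Pattern for intervention turns: TYPE is "R" or "I", SPEAKER names the speaker.
it_pattern = re.compile(
    r'<IT TYPE="(?P<type>[RI])" SPEAKER="(?P<speaker>[^"]+)">'
    r'(?P<text>.*?)</IT>',
    re.DOTALL)

for m in it_pattern.finditer(sample):
    print(m.group("type"), m.group("speaker"), m.group("text").strip())
```

Running this on the sample prints one line per intervention turn, pairing the initiative ("I") and reaction ("R") members of the adjacency pair.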
In addition, the corpus was tagged with the POS tagger described by Pla and Prieto [1998], from which we obtained morphological and lexical information. The corpus was then parsed with the SUPP partial parser proposed by Martínez-Barco et al. [1998] in order to obtain syntactic information. Finally, the proposed anaphora resolution algorithm was applied.
Several studies were then carried out in order to identify the importance of defining an adequate anaphoric accessibility space and of defining a constraint and preference system based on this space. In these studies, we compared the output of the anaphora resolution system with the manual annotation and generated several statistical results.
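The comparison step can be sketched as follows. This is a hedged illustration only: the function name, the antecedent identifiers, and the data are hypothetical, and the actual evaluation in the paper produces more detailed statistics:

```python
def success_rate(system, gold):
    """Fraction of anaphors whose system-chosen antecedent
    matches the manually annotated antecedent."""
    correct = sum(1 for anaphor, antecedent in system.items()
                  if gold.get(anaphor) == antecedent)
    return correct / len(gold)

# Hypothetical anaphor -> antecedent mappings.
gold = {"it_3": "train_1", "she_7": "client_1", "it_9": "ticket_2"}
system = {"it_3": "train_1", "she_7": "client_1", "it_9": "schedule_1"}
print(round(success_rate(system, gold), 2))  # → 0.67
```

Here the system resolves two of three anaphors to the annotated antecedent, giving a success rate of about 0.67.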