
Corpora, tools, and description of experiments

In order to evaluate the anaphora resolution algorithm proposed in this paper, the general process outlined in Figure 3 was followed.
Figure 3: Full evaluation process
Data for the evaluation were taken from the Corpus InfoTren: Person, a corpus of 204 transcribed spoken Spanish dialogues provided by the Basurde Project [Basurde Project 1998]. These dialogues are conversations between a railway company employee and a client. The transcription tool used in the Basurde Project provides turn and speaker markup. Of the 204 dialogues, 40 were selected for training (the training corpus) and the remaining 164 were reserved for the final evaluation (the test corpus). These 204 dialogues contain 345 pronominal anaphors and 257 adjectival anaphors.

We had two aims in using the training corpus: (1) to estimate the importance of the structural anaphoric accessibility space, and (2) to define an adequate set of constraints and preferences (experiments 0, 1, 2, and 3). The test corpus was reserved for the final evaluation.

In addition, the entire corpus was manually annotated with two different goals: (1) to identify further discourse structural properties such as adjacency pairs and topics, and (2) to identify anaphors and antecedents. Although we annotated the corpus manually, there are at present some automatic systems for performing adjacency pair tagging (the Basurde Project [Basurde Project 1998], for example), as well as for automatic topic tagging [Reynar 1999] or automatic topic extraction (see the method for anaphora resolution described in Section 4).

The annotation of conversational structure was carried out as described in the next paragraph. An important aspect of dialogue structure annotation is the training phase, which ensures reliability among annotators.

The annotation phase was accomplished as follows: (1) two annotators were selected; (2) the two annotators reached an agreement on the annotation scheme using a training corpus; (3) both annotators then carried out the annotation in parallel over the test corpus; and (4) a reliability study of the annotation was carried out [Carletta et al. 1997]. The reliability study used the kappa (k) statistic, which measures the agreement between the two annotators' category judgments. For the computation of the kappa statistic, see Siegel and Castellan 1988.
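The kappa statistic used in the reliability study can be sketched as follows. This is a minimal illustration of Cohen's kappa for two annotators labelling the same items (following the standard formulation in Siegel and Castellan), not the authors' actual evaluation code; the function name is ours.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items:
    k = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement
    and P_e is the agreement expected by chance from the marginals."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For example, two annotators who agree on three of four turn-type labels, with balanced marginals, obtain k = 0.5; perfect agreement yields k = 1.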

Because turns are marked during the transcription phase, the annotator merely classifies turns according to the turn types described in Section 3 and then relates each initiative intervention (ITI) to its reaction intervention (ITR), thereby defining adjacency pairs. Since this task simply requires classification, it is easily measured using the kappa statistic.
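The linking of initiative to reaction interventions described above can be sketched as a simple data structure. The class and function names here are hypothetical, and the greedy linking is a simplification: it attaches reactions only to the most recent initiative, so it does not capture nested clarification pairs such as the one in Figure 4.

```python
from dataclasses import dataclass, field

@dataclass
class Intervention:
    speaker: str   # e.g. "CL" (client) or "OP" (railway company employee)
    itype: str     # "I" (initiative) or "R" (reaction)
    text: str

@dataclass
class AdjacencyPair:
    pair_id: int
    initiative: Intervention
    reactions: list = field(default_factory=list)

def build_adjacency_pairs(interventions):
    """Link each initiative intervention (ITI) to the reaction
    interventions (ITR) that follow it, yielding adjacency pairs."""
    pairs, current = [], None
    for iv in interventions:
        if iv.itype == "I":
            current = AdjacencyPair(pair_id=len(pairs) + 1, initiative=iv)
            pairs.append(current)
        elif current is not None:
            current.reactions.append(iv)
    return pairs
```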

Concurrently, topics were identified. This task was also simple, since the corpus used for these experiments is organized into short dialogues and each dialogue has only one main topic or theme, and since these are introduced clearly by means of the client's intervention at the beginning of each dialogue. As a result, we detected no discrepancies between annotators with regard to the topic identification. Therefore, there was no need to measure this task using the kappa statistic.

According to Carletta et al., a k value between 0.68 and 0.80 allows us to draw positive conclusions, and a k greater than 0.80 indicates total reliability between the annotators' results.
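The thresholds above can be captured in a small helper; the band labels follow the text and the function name is illustrative.

```python
def interpret_kappa(k):
    """Map a kappa value to the reliability bands used in the text
    (after Carletta et al.)."""
    if k > 0.80:
        return "total reliability"
    if 0.68 <= k <= 0.80:
        return "positive conclusions"
    return "insufficient reliability"
```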

In those cases where a discrepancy was found between the annotators, the following criterion was applied: each dialogue was assigned a main annotator whose annotation was considered definitive in the event that there were discrepancies between the two accounts. In order to guarantee balance, each annotator was the main annotator for exactly 50% of the dialogues.
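One simple way to realize the 50% balance constraint is to alternate the main annotator across dialogues; the following sketch (hypothetical names, assuming an even number of dialogues) illustrates this.

```python
def assign_main_annotators(dialogue_ids, annotators=("A", "B")):
    """Alternate the main annotator over the dialogues so that each
    annotator is main annotator for exactly half of them (assuming an
    even dialogue count). The main annotator's decision is definitive
    when the two annotations disagree."""
    return {d: annotators[i % len(annotators)]
            for i, d in enumerate(dialogue_ids)}
```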

Once both annotators had finished the annotation, the reliability study was carried out, with a resultant kappa measurement of 0.91. We therefore consider the annotation obtained for the evaluation to be totally reliable.

Since the annotated texts would be processed by an anaphora resolution system, we developed an SGML tagging format.

Generally, this SGML markup takes the following form: a <TOPIC> tag marks the main topic of the dialogue; an <AP ID="n"> tag opens each adjacency pair, identified by a number; and an <IT TYPE="t" SPEAKER="s"> tag marks each intervention, where TYPE is either I (initiative) or R (reaction) and SPEAKER identifies the speaker (CL for the client, OP for the railway company employee). The format is exemplified in Figure 4.
<TOPIC> tren (train)
<AP ID="4">
<IT TYPE="I" SPEAKER="CL"> el de las seis y media ¿llega a Monzón?
(the one at half-past six, does it arrive in Monzón?)
<AP ID="5">
<IT TYPE="I" SPEAKER="OP"> a ver. el de las seis y media me ha preguntado ¿verdad?
(let me see. you've asked about the one at half-past six, right?)
<IT TYPE="R" SPEAKER="CL"> sí
(yes)
<IT TYPE="R" SPEAKER="OP"> a las nueve y veinticinco.
(at twenty-five past nine.)
Figure 4: Example of SGML annotation
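Because the annotation uses flat, unclosed SGML tags, a strict XML parser will not accept it. A line-oriented sketch like the following (hypothetical names, regex-based, assuming double-quoted attribute values) can recover the tag, attributes, and trailing text of each annotated line.

```python
import re

# Attribute values are assumed to be double-quoted, e.g. TYPE="I".
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_line(line):
    """Parse one line of the flat SGML annotation into its tag,
    attributes, and trailing text; returns None for untagged lines
    (such as the English translations in Figure 4)."""
    m = re.match(r'<(TOPIC|AP|IT)([^>]*)>\s*(.*)', line.strip())
    if m is None:
        return None
    tag, attr_str, text = m.groups()
    return {"tag": tag, "attrs": dict(ATTR_RE.findall(attr_str)), "text": text}
```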
As explained above, in addition to this structural annotation, the corpus was anaphorically annotated by marking up the anaphoric relations between all pronominal and adjectival anaphors and their correct antecedents. To guarantee the reliability of the results, this annotation was also performed by two annotators in parallel, and a reliability study of the resulting annotation was then carried out. Once again, the annotation was treated as a classification task consisting of selecting the appropriate elements from the candidate list (we estimated an average of 6.5 possible antecedents per anaphor after applying constraints). The reliability study of the manual anaphoric annotation yielded a kappa value of 0.87.

In addition, the corpus was part-of-speech (POS) tagged using the tagger described by Pla and Prieto 1998, which provided morphological and lexical information. The corpus was then parsed with the SUPP partial parser proposed by Martínez-Barco et al. 1998 in order to obtain syntactic information. Finally, the proposed anaphora resolution algorithm was applied.

Several studies were then carried out in order to assess the importance of defining an adequate anaphoric accessibility space and of defining a constraint and preference system based on this space. In these studies, we compared the output of the anaphora resolution system with the manual annotation and computed several statistical results.
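The comparison of system output against the manual annotation can be sketched as a per-anaphor success rate. The representation here (dictionaries mapping an anaphor identifier to the identifier of its chosen antecedent) is our assumption for illustration, not the authors' actual scoring code.

```python
def success_rate(system, gold):
    """Fraction of anaphors for which the system selected the same
    antecedent as the manual (gold) annotation. Both arguments map an
    anaphor identifier to the identifier of its chosen antecedent."""
    correct = sum(1 for anaphor, antecedent in gold.items()
                  if system.get(anaphor) == antecedent)
    return correct / len(gold)
```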

patricio 2001-10-17