
Performance Criteria

Performance is assessed with reference to N, the total number of junctures in the test set, and B, the total number of junctures that are breaks. A deletion error (D) occurs when a break is marked in the reference sentence but not in the test sentence. An insertion error (I) occurs when a break is marked in the test sentence but not in the reference. A substitution error (S) occurs when a break is in the right place but is of the wrong type; this type of error is only relevant when more than one type of break is being recognised. There is no single best way to measure the performance of a phrase break assignment algorithm, and a variety of approaches have been proposed in the literature. The measures we consider are defined below.


\begin{displaymath}
\mbox{Breaks-correct} = \frac{B -D - S}{B} \times \mbox{100\%}
\end{displaymath}


\begin{displaymath}
\mbox{Non-breaks-correct} = \frac{N -I - S}{N} \times \mbox{100\%}
\end{displaymath}


\begin{displaymath}
\mbox{Junctures-correct} = \frac{N -D -S -I}{N} \times \mbox{100\%}
\end{displaymath}


\begin{displaymath}
\mbox{False insertions w.r.t junctures} = \frac{I}{N} \times \mbox{100\%}
\end{displaymath}


\begin{displaymath}
\mbox{False insertions w.r.t breaks} = \frac{I}{B} \times \mbox{100\%}
\end{displaymath}
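
As an illustration, the error counts and scores defined above could be computed along the following lines. This is a minimal sketch rather than the implementation used in this work: it assumes that the reference and test sentences are given as equal-length lists with one label per juncture, and the names `NON_BREAK`, `count_errors` and `scores` are hypothetical. It covers the breaks-correct, junctures-correct and the two insertion measures.

```python
# Hypothetical label for a juncture with no break; any other label
# is treated as a break type (an assumption of this sketch).
NON_BREAK = "NB"


def count_errors(reference, test):
    """Return (N, B, D, I, S) for two aligned juncture label sequences."""
    if len(reference) != len(test):
        raise ValueError("reference and test must have the same length")
    n = len(reference)
    b = d = i = s = 0
    for ref, hyp in zip(reference, test):
        if ref != NON_BREAK:
            b += 1                 # a break in the reference
            if hyp == NON_BREAK:
                d += 1             # deletion: break missing in test
            elif hyp != ref:
                s += 1             # substitution: wrong break type
        elif hyp != NON_BREAK:
            i += 1                 # insertion: break only in test
    return n, b, d, i, s


def scores(n, b, d, i, s):
    """Percentage scores following the formulas given above."""
    if n == 0 or b == 0:
        raise ValueError("need at least one juncture and one break")
    return {
        "breaks_correct": 100.0 * (b - d - s) / b,
        "junctures_correct": 100.0 * (n - d - s - i) / n,
        "juncture_insertions": 100.0 * i / n,
        "break_insertions": 100.0 * i / b,
    }


if __name__ == "__main__":
    # Made-up data: ten junctures with a single break type "B".
    ref = ["B", "NB", "NB", "B", "NB", "NB", "NB", "B", "NB", "NB"]
    hyp = ["B", "NB", "B", "NB", "NB", "NB", "NB", "B", "NB", "NB"]
    print(scores(*count_errors(ref, hyp)))
```

With a single break type, S is always zero; substitutions only arise when the label set distinguishes several kinds of break.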

The difference between breaks-correct and junctures-correct lies in whether non-breaks are included in the calculation. The junctures-correct score gives credit when both the test and reference sentences have a non-break at the same juncture, while the breaks-correct score only looks at junctures with breaks. In our data, non-breaks outnumber breaks by a ratio of about 4:1, and hence an algorithm which marks everything as non-break will score about 80% junctures-correct, but 0% breaks-correct. (Wang and Hirschberg hirschberg:92 give nearly identical figures for the relative number of non-breaks to breaks.) Because the breaks-correct score is not dependent on the relative distributions of breaks and non-breaks, we regard it as a better indicator of algorithm performance. The assessment of insertions is more troublesome: one can calculate them either as a percentage of the number of breaks in the test set or as a percentage of the number of junctures, as in Ostendorf and Veilleux ostendorf:94. For reasons of readability and succinctness we use only the breaks-correct, junctures-correct and juncture-insertion scores in this paper.
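
To make the effect of the 4:1 ratio concrete, consider a purely illustrative test set (the figures are assumed for the example, not taken from our data) with N = 100 junctures of which B = 20 are breaks. An algorithm that marks every juncture as a non-break deletes all 20 breaks and inserts none, so D = 20, I = 0 and S = 0, giving:

\begin{displaymath}
\mbox{Breaks-correct} = \frac{20 - 20 - 0}{20} \times \mbox{100\%} = \mbox{0\%}
\end{displaymath}

\begin{displaymath}
\mbox{Junctures-correct} = \frac{100 - 20 - 0 - 0}{100} \times \mbox{100\%} = \mbox{80\%}
\end{displaymath}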


Alan W Black
1999-03-20