Unfortunately, it is not easy to judge the success of a phrase break assignment algorithm. As there are typically more non-breaks than breaks (about 4:1 in our data), an algorithm that fails to predict breaks can score better than one that over-predicts them if a simple overall percentage-correct score is used; predicting no breaks at all would already score about 80% on our data. Counting only the correct breaks is useful, but only if some measure of over-prediction is also included, since massive over-prediction yields a high breaks-correct score.
Another, more serious, problem is that there can be different but equally valid ways for a speaker to phrase an utterance. Because the assigned breaks are compared against actual spoken examples, they may differ from those examples in acceptable as well as unacceptable ways, and there is no easy way to tell which type of difference has occurred. Ostendorf and Veilleux deal with this problem by having five different speakers read each test utterance; an assignment is considered correct if the whole utterance matches any of the five readings. Unfortunately, we did not have the resources to re-record our database examples and hence could only do a direct match against one example. However, the results in  indicate that the algorithms that score best when compared with a single speaker are likely still to score best when compared with multiple examples, even though some acceptable assignments are judged incorrect by the single-speaker measurement.
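The multiple-reading criterion described above can be sketched as follows. This is only an illustration under our own assumptions: each utterance is represented as a list of booleans, one per word juncture, with True marking a break, and the function names `matches_any` and `utterance_accuracy` are hypothetical rather than taken from any published implementation.

```python
def matches_any(predicted, references):
    """Return True if the predicted break sequence for one utterance is
    identical to at least one of the reference readings.

    predicted  -- list of booleans, one per juncture (True = break)
    references -- list of such lists, one per speaker's reading
    """
    return any(list(predicted) == list(ref) for ref in references)


def utterance_accuracy(predictions, reference_sets):
    """Percentage of utterances whose predicted breaks match some reading.

    predictions    -- one predicted break sequence per utterance
    reference_sets -- for each utterance, the list of reference readings
    """
    if len(predictions) != len(reference_sets):
        raise ValueError("need one set of references per utterance")
    if not predictions:
        raise ValueError("no utterances to score")
    hits = sum(
        1 for pred, refs in zip(predictions, reference_sets)
        if matches_any(pred, refs)
    )
    return 100.0 * hits / len(predictions)
```

With a single reading per utterance, as in our data, each reference set has one element and the criterion reduces to an exact match against that reading.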
Here we present results using three figures: the percentage of breaks correct, the percentage overall (breaks and non-breaks) correct, and the percentage of non-breaks incorrect (a measure of break over-prediction).
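As a minimal sketch of how these three figures relate, the following computes them from a reference and a predicted break sequence. The representation (one boolean per juncture, True for a break) and the names `break_scores` and `_pct` are assumptions made for illustration, not a description of the scoring tool actually used.

```python
def _pct(numerator, denominator):
    """Percentage, or None when the denominator is zero (figure undefined)."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator


def break_scores(reference, predicted):
    """Compute the three evaluation figures for aligned juncture sequences.

    reference, predicted -- equal-length sequences of booleans,
                            True = break, False = non-break
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same number of junctures")
    pairs = list(zip(reference, predicted))
    n_breaks = sum(1 for ref, _ in pairs if ref)
    n_non_breaks = len(pairs) - n_breaks
    # Reference breaks that were also predicted as breaks.
    breaks_hit = sum(1 for ref, pred in pairs if ref and pred)
    # Reference non-breaks wrongly predicted as breaks (over-prediction).
    false_breaks = sum(1 for ref, pred in pairs if not ref and pred)
    # Junctures of either kind where prediction agrees with reference.
    agree = sum(1 for ref, pred in pairs if ref == pred)
    return {
        "breaks_correct": _pct(breaks_hit, n_breaks),
        "overall_correct": _pct(agree, len(pairs)),
        "non_breaks_incorrect": _pct(false_breaks, n_non_breaks),
    }


# Toy reference with the 4:1 non-break to break ratio mentioned in the text.
reference = [False, False, False, False, True] * 2
never = break_scores(reference, [False] * len(reference))
always = break_scores(reference, [True] * len(reference))
```

On this toy sequence, never predicting a break gives a high overall score with no breaks correct, while always predicting a break gives every break correct at the cost of every non-break being wrong, which is why the three figures are reported together.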