
Using a Decision Tree For Quoted-Speech Type Identification

We trained a decision tree (CART) to identify the aforementioned types of quoted speech using local feature information in the story text. The training data consisted of 16 children's stories taken from works by Hans Christian Andersen and Lewis Carroll, containing a total of 1198 pieces of quoted speech. To label the training data, we first approximated the quoted-speech types with a naive rule: if the first word in the quoted speech is not capitalized, the quote is classified as type "CONT"; otherwise it is classified as type "NEW". The output of this initial pass was then hand-corrected to eliminate any incorrect type assignments produced by the rule. From this training data, we extracted a number of features for each piece of quoted speech in order to train the decision tree. These features include:

The performance of the decision tree after training was as follows:


Table 2: Performance of Decision Tree on Identifying Type of Quoted Speech

    New      Cont.
    98.8%    82.6%

Examining the tree shows that the most reliable predictor of quoted-speech type is the capitalization feature. Other features, such as the punctuation of the token immediately preceding the quote, might intuitively seem like equally good predictors, but statistically they proved to be less reliable.
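
As an illustration of the two features named above, the following Python sketch computes the capitalization feature, applies the naive first-pass labeling rule to it, and extracts the punctuation of the token preceding the quote. It is not the original implementation: the function names are hypothetical, and the handling of quotes that contain no letters is an assumption made only for this example.

```python
import string


def first_word_capitalized(quote: str) -> bool:
    """Return True if the first alphabetic character of the quote is uppercase.

    Leading quotation marks or spaces are skipped. A quote with no letters
    is treated as not capitalized (an assumption of this sketch).
    """
    for ch in quote:
        if ch.isalpha():
            return ch.isupper()
    return False


def naive_quote_type(quote: str) -> str:
    """Naive rule from the text: uncapitalized first word -> CONT, else NEW."""
    return "NEW" if first_word_capitalized(quote) else "CONT"


def previous_token_punctuation(text_before: str) -> str:
    """Return the punctuation mark ending the token before the quote, or ''."""
    stripped = text_before.rstrip()
    if stripped and stripped[-1] in string.punctuation:
        return stripped[-1]
    return ""


if __name__ == "__main__":
    print(naive_quote_type("Oh dear! I shall be late!"))
    print(naive_quote_type("and what is the use of a book,"))
    print(repr(previous_token_punctuation("thought Alice, ")))
```

In this sketch the naive rule is simply the capitalization feature read off as a label, which is consistent with that feature emerging as the strongest predictor in the trained tree.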


Alan W Black 2003-10-20