This note refers to the following paper:

M. Heilman, K. Collins-Thompson, and M. Eskenazi. 2008. An analysis of statistical models and features for reading difficulty prediction. In Proc. of The 3rd Workshop on Innovative Use of NLP for Building Educational Applications.

In Section 4.1, the paper states that "The corpus consisted of approximately 150,000 words, distributed among 289 texts." That statement applies only to the training set. The entire corpus consisted of 373 texts totaling roughly 185,000 words: the 289 training texts (roughly 150,000 words) plus a held-out test set of 84 texts (roughly 35,000 words).
