Language Technologies Institute
Student Research Symposium 2006

Filtering Web Documents for the Classroom in the REAP Tutor

Michael Heilman

The REAP tutor is an intelligent tutoring system that provides individualized, authentic vocabulary practice for English as a Second Language (ESL) students. To build a corpus of reading material large enough to cover the variety of vocabulary words that students may need to learn, REAP must crawl the World Wide Web for documents. Most of the retrieved pages, however, are not suitable reading material for ESL students, even though they contain target vocabulary words. The REAP system must therefore apply a set of filters to retrieved documents in order to maintain a corpus of high-quality reading materials. These filters are effective enough that REAP maintains a corpus of over 50,000 readings without teachers or researchers reviewing them. In fact, the student is the first human to see a text in REAP.

In this talk, I will first discuss some of the filters that are applied to documents in order to identify good reading materials. A major problem we encountered is that the text of many Web pages is ill-structured in various ways: some pages are primarily lists of products and features or of proper names; some contain most of their text in menu structures; some are casually written blogs or discussion board postings; and some are technical documents that do not follow normal writing styles. I will discuss three approaches to identifying and filtering out such documents, in which the text is not in the cohesive paragraph form that is valuable for learners.
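To make the idea of filtering for cohesive paragraph form concrete, here is a minimal sketch of one possible heuristic. It is not one of the three REAP approaches, which this abstract does not specify. The function names, the blank-line block splitting, and the min_words and threshold values are all illustrative assumptions.

```python
import re


def prose_ratio(text, min_words=20):
    """Return the fraction of words that sit in paragraph-like blocks.

    Hypothetical heuristic: a block (text separated by blank lines) counts
    as paragraph-like if it has at least `min_words` words and ends with
    sentence-final punctuation. Menus, product lists, and name lists tend
    to fail both conditions.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    total_words = 0
    prose_words = 0
    for block in blocks:
        words = block.split()
        total_words += len(words)
        if len(words) >= min_words and block[-1] in ".!?":
            prose_words += len(words)
    return prose_words / total_words if total_words else 0.0


def looks_like_prose(text, threshold=0.7):
    """Accept a document only if most of its words are in paragraph-like blocks.

    The 0.7 threshold is an arbitrary illustrative value, not a REAP setting.
    """
    return prose_ratio(text) >= threshold
```

Under these assumptions, a page consisting only of short menu entries scores 0.0 and is rejected, while a page made of full paragraphs scores close to 1.0. A real filter would need thresholds tuned on labeled pages.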

Second, I will discuss the use of Support Vector Machines for topic detection of reading materials in REAP. A classifier was trained on the top level of the Open Directory Project, from which ten general topic categories were chosen (e.g., Arts, Business, Science). The classifier achieved a macro-averaged F1-measure of 0.76 in leave-one-out cross-validation tests, and a multi-way classification accuracy of 0.79 on held-out test data. The classifier was used to categorize readings in REAP in an attempt to increase student motivation. A task-oriented evaluation of the topic detection system is under way in a current study, which examines its effectiveness at increasing motivation and its effects on learning outcomes.
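For readers unfamiliar with the two metrics reported above, the following sketch shows how a macro-averaged F1-measure and a multi-way accuracy are conventionally computed from gold and predicted labels. The function names and the toy labels are illustrative assumptions, not data or code from the REAP evaluation.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Compute F1 for each class separately, then take the unweighted mean."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1  # predicted class p, but it was wrong
            false_neg[g] += 1  # missed an instance of the gold class g
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * true_pos[label] + false_pos[label] + false_neg[label]
        scores.append(2 * true_pos[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(gold, predicted):
    """Fraction of documents assigned exactly the right category."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy example with three of the topic categories named in the abstract.
gold = ["Arts", "Arts", "Business", "Science"]
pred = ["Arts", "Business", "Business", "Science"]
print(macro_f1(gold, pred), accuracy(gold, pred))
```

Because macro-averaging weights every category equally, a small category that is classified poorly lowers the score as much as a large one would. This is one reason the two reported figures can differ.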

Finally, I will discuss the "Lois Measure," a quantitative measure of teacher acceptance that illustrates the improvement of the REAP system as these filters were implemented and deployed in the classroom.