Filtering Web Documents for the Classroom in the REAP Tutor
The REAP tutor is an intelligent tutoring system that provides individualized
and authentic vocabulary practice for English as a Second Language (ESL)
students. In order to create a sufficiently large corpus of reading material to
cover the variety of vocabulary words that students may need to learn, REAP must
crawl for documents on the World Wide Web. The problem is that most of the
retrieved pages, though they contain target vocabulary words, are not suitable
reading materials for ESL students. Therefore, the REAP system must apply a set
of filters to retrieved documents in order to maintain a corpus of high-quality
reading materials. These filters are effective enough that REAP is able to
maintain a corpus of over 50,000 readings without teachers or researchers
reviewing them. In fact, the student is the first human to see the text in
REAP.
In this talk, I will discuss some of the filters that are applied to documents
in order to identify good reading materials. A major problem we encountered is
that in many Web pages, the text is ill-structured in various ways: some pages
are primarily lists of products and features or proper names; some contain most
of the text in menu structure; some are part of casually written blogs or
discussion board postings; some are technical documents that do not follow
normal writing styles; etc. First, I will discuss three approaches to
identifying and filtering out such documents, in which the text is not contained
in the cohesive paragraph form that is valuable for learners.
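As a rough illustration of what a filter of this kind might look like, the
following Python sketch scores a page by the share of its words that fall in
paragraph-like blocks. It is not one of the three approaches covered in the
talk, whose details are not given here; the function names, the 20-word minimum,
and the 0.6 cutoff are all hypothetical choices made for the example.

```python
import re


def paragraph_word_share(text, min_words=20):
    """Return the fraction of words that sit in prose-like blocks.

    Assumption for this sketch: a block (text separated by blank lines) counts
    as prose if it has at least `min_words` words and ends with sentence
    punctuation, optionally followed by a closing quote or parenthesis.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    total_words = 0
    prose_words = 0
    for block in blocks:
        words = block.split()
        total_words += len(words)
        ends_like_sentence = block.rstrip('"\')')[-1:] in (".", "!", "?")
        if len(words) >= min_words and ends_like_sentence:
            prose_words += len(words)
    return prose_words / total_words if total_words else 0.0


def looks_like_cohesive_text(text, min_share=0.6):
    """Hypothetical accept/reject rule built on the share computed above."""
    return paragraph_word_share(text) >= min_share
```

Under these assumptions, a page made up of short menu entries or product names
scores near zero, while a page dominated by full paragraphs scores near one. The
thresholds would need tuning against documents judged by teachers.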
Second, I will discuss the use of Support Vector Machines for topic detection of
reading materials in REAP. A classifier was trained on the top level of the
Open Directory Project, from which ten general topic categories were chosen
(e.g., Arts, Business, Science). The classifier achieved a macro-averaged
F1-measure of 0.76 in leave-one-out cross-validation tests, and a multi-way
classification accuracy of 0.79 on held-out test data. The classifier was used
to categorize readings in REAP in an attempt to increase student motivation.
A current task-oriented study is evaluating the topic detection system's
effectiveness at increasing motivation and its effects on learning outcomes.
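For readers unfamiliar with the two metrics reported above, the sketch below
shows how macro-averaged F1 and multi-way accuracy are conventionally computed
from gold and predicted topic labels. It is a minimal pure-Python illustration
of the standard definitions, not the evaluation code used for REAP, and the
example labels are made up.

```python
def macro_f1(gold, predicted):
    """Average the per-category F1 scores, weighting every category equally."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0


def accuracy(gold, predicted):
    """Fraction of documents whose predicted category matches the gold one."""
    if not gold:
        return 0.0
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


# Made-up labels, for illustration only.
gold = ["Arts", "Arts", "Business", "Science"]
predicted = ["Arts", "Business", "Business", "Science"]
print(macro_f1(gold, predicted), accuracy(gold, predicted))
```

Because macro-averaging gives each category the same weight regardless of how
many documents it contains, it can differ from accuracy when categories are
unevenly sized.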
Finally, I will discuss the "Lois Measure," a quantitative measure of teacher
acceptance that illustrates the improvement of the REAP system as these filters
were implemented and deployed in the classroom.