Project LISTEN
A Reading Tutor that Listens
Last updated: July 1, 2008

 


Project LISTEN Publications

[Note:  Links to full text are included when possible, e.g. after publication or conference presentation.

* marks publications by others.
See In the News for articles by others in newspapers, magazines, etc.
See Research Basis for a brief summary of published intervention studies and research underlying the Reading Tutor.
Most of these conferences and workshops involve two or more stringent peer reviews of the full paper (not just the abstract), including suggested revisions.  Publication in these proceedings is considered archival:  ITS2004 accepted only 73 of over 180 submissions as full papers; UM2003 26 of 105; ICMI2002 87 of 165; ITS2002 93 of 167; AIED2001 45 of 112; and AAAI2000 143 of 432.]


 


 

[ITS 2008 help] Beck, J. E., Chang, K.-m., Mostow, J., & Corbett, A. (2008, June 23-27). Does help help?  Introducing the Bayesian Evaluation and Assessment methodology. 9th International Conference on Intelligent Tutoring Systems, Montreal, 383-394.  ITS2008 Best Paper Award.  Click here for .pdf file.

 

Abstract:  Most ITS have a means of providing assistance to the student, either on student request or when the tutor determines it would be effective.  Presumably, such assistance is included by the ITS designers since they feel it benefits the students.  However, whether, and how, help helps students has not been a well-studied problem in the ITS community.  In this paper we present three approaches for evaluating the efficacy of the Reading Tutor's help:  creating experimental trials from data, learning decomposition, and Bayesian Evaluation and Assessment, an approach that uses dynamic Bayesian networks.  We have found that experimental trials and learning decomposition both find a negative benefit for help--that is, help hurts!  However, the Bayesian Evaluation and Assessment framework finds that help both promotes student long-term learning and provides additional scaffolding on the current problem.  We discuss why these approaches give divergent results, and suggest that the Bayesian Evaluation and Assessment framework is the strongest of the three.  In addition to introducing Bayesian Evaluation and Assessment, a method for simultaneously assessing students and evaluating tutorial interventions, this paper describes how help can both scaffold the current problem attempt and teach the student knowledge that will transfer to later problems.

 


 

[ITS 2008 LD] Beck, J. E., & Mostow, J. (2008, June 23-27). How who should practice:  Using learning decomposition to evaluate the efficacy of different types of practice for different types of students. 9th International Conference on Intelligent Tutoring Systems, Montreal, 353-362.  Nominated for ITS2008 Best Paper.  Click here for .pdf file.

 

Abstract:  A basic question of instruction is how much students will actually learn from it.  This paper presents an approach called learning decomposition, which determines the relative efficacy of different types of learning opportunities.  This approach is a generalization of learning curve analysis, and uses non-linear regression to determine how to weight different types of practice opportunities relative to each other.  We analyze 346 students reading 6.9 million words and show that different types of practice differ reliably in how efficiently students acquire the skill of reading words quickly and accurately.  Specifically, massed practice is generally not effective for helping students learn words, and rereading the same stories is not as effective as reading a variety of stories.  However, we were able to analyze data for individual students' learning and use bottom-up processing to detect small subgroups of students who did benefit from rereading (11 students) and from massed practice (5 students).  The existence of these subgroups has two implications:  1) one-size-fits-all instruction is adequate for perhaps 95% of the student population using computer tutors, but as a community we can do better and 2) the ITS community is well poised to study what type of instruction is optimal for the individual.
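
Illustrative sketch (not from the paper):  The idea of weighting one type of practice relative to another can be shown with a toy example.  The code below assumes an exponential learning curve of the form a * exp(-b * (n_ref + beta * n_other)), where beta is the relative weight of the second practice type; the abstract itself does not give the functional form, and all names and numbers here (predicted_time, fit_beta, the synthetic data, the grid) are hypothetical.  A real analysis would estimate a, b, and beta jointly by non-linear regression; this sketch holds a and b fixed and grid-searches beta only.

```python
import math


def predicted_time(a, b, beta, n_ref, n_other):
    """Assumed exponential learning curve over weighted practice counts.

    n_ref   : prior encounters of the reference practice type (weight 1)
    n_other : prior encounters of the other practice type (weight beta)
    """
    return a * math.exp(-b * (n_ref + beta * n_other))


def fit_beta(observations, a, b, candidates):
    """Return the candidate beta with the smallest sum of squared errors.

    observations: list of (n_ref, n_other, observed_time) tuples.
    a and b are held fixed for brevity (an assumption of this sketch).
    """
    def sse(beta):
        return sum((predicted_time(a, b, beta, n1, n2) - y) ** 2
                   for n1, n2, y in observations)
    return min(candidates, key=sse)


# Synthetic, noise-free data generated with a made-up "true" weight of 0.5,
# i.e. one encounter of the other type is worth half a reference encounter.
TRUE_BETA = 0.5
data = [(n1, n2, predicted_time(2.0, 0.1, TRUE_BETA, n1, n2))
        for n1 in range(5) for n2 in range(5)]

grid = [i / 10 for i in range(0, 21)]  # 0.0, 0.1, ..., 2.0
best = fit_beta(data, a=2.0, b=0.1, candidates=grid)
```

A fitted weight below 1 would mean the second practice type is worth less per encounter than the reference type, and a weight above 1 would mean it is worth more.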

 


 

[ITS 2008 compare] Zhang, X., Mostow, J., & Beck, J. E. (2008). A Case Study Empirical Comparison of Three Methods to Evaluate Tutorial Behaviors. 9th International Conference on Intelligent Tutoring Systems, Montreal, 122-131.  Click here for .pdf file.

 

Abstract:  Researchers have used various methods to evaluate the fine-grained interactions of intelligent tutors with their students.  We present a case study comparing three such methods on the same data set, logged by Project LISTEN's Reading Tutor from usage by 174 children in grades 2-4 (typically 7-10 years) over the course of the 2005-2006 school year.  The Reading Tutor chooses randomly between two different types of reading practice.  In assisted oral reading, the child reads aloud and the tutor helps.  In "Word Swap," the tutor reads aloud and the child identifies misread words.  One method we use here to evaluate reading practice is conventional analysis of randomized controlled trials (RCTs), where the outcome is performance on the same words when encountered again later.  The second method is learning decomposition, which estimates the impact of each practice type as a parameter in an exponential learning curve.  The third method is knowledge tracing, which estimates the impact of practice as a probability in a dynamic Bayes net.  The comparison shows qualitative agreement among the three methods, which is evidence for their validity.

 


 

[EDM 2008 freeform] Zhang, X., Mostow, J., Duke, N. K., Trotochaud, C., Valeri, J., & Corbett, A. (2008, June 20-21). Mining Free-form Spoken Responses to Tutor Prompts. Proceedings of the First International Conference on Educational Data Mining, Montreal, 234-241.  Click here for .pdf file.

 

Abstract:  How can an automated tutor assess children's spoken responses despite imperfect speech recognition?  We address this challenge in the context of tutoring children in explicit strategies for reading comprehension.  We report initial progress on collecting, annotating, and mining their spoken responses. Collection and annotation yield authentic but sparse data, which we use to synthesize additional realistic data.  We train and evaluate a classifier to estimate the probability that a response mentions a given target.

 


 

[EDM 2008 analytic] Mostow, J., & Zhang, X. (2008, June 20-21). Analytic Comparison of Three Methods to Evaluate Tutorial Behaviors. Proceedings of the First International Conference on Educational Data Mining, Montreal, 28-37.  Click here for .pdf file.

 

Abstract:  We compare the purposes, inputs, representations, and assumptions of three methods to evaluate the fine-grained interactions of intelligent tutors with their students.  One method is conventional analysis of randomized controlled trials (RCTs).  The second method is learning decomposition, which estimates the impact of each practice type as a parameter in an exponential learning curve.  The third method is knowledge tracing, which estimates the impact of practice as a probability in a dynamic Bayes net.  The comparison leads to a generalization of learning decomposition to account for slips and guesses.
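
Illustrative sketch (not from the paper):  For readers unfamiliar with knowledge tracing, the standard textbook update with guess, slip, and learning probabilities can be written in a few lines.  This is the commonly described form of the technique, not the Reading Tutor's implementation; the function names and parameter values below are hypothetical.

```python
def kt_update(p_known, correct, guess, slip, learn):
    """One knowledge-tracing step.

    First condition the probability that the skill is known on the
    observed response (Bayes' rule with guess and slip), then apply the
    chance of learning from this practice opportunity.
    """
    if correct:
        num = p_known * (1 - slip)
        den = num + (1 - p_known) * guess
    else:
        num = p_known * slip
        den = num + (1 - p_known) * (1 - guess)
    posterior = num / den
    return posterior + (1 - posterior) * learn


def trace(responses, p_init, guess, slip, learn):
    """Run kt_update over a sequence of True/False responses."""
    p = p_init
    history = []
    for r in responses:
        p = kt_update(p, r, guess, slip, learn)
        history.append(p)
    return history


# Made-up parameters and responses, for illustration only.
estimates = trace([True, True, False, True],
                  p_init=0.3, guess=0.2, slip=0.1, learn=0.15)
```

In this formulation the learn parameter plays the role the abstract describes: the impact of a practice opportunity expressed as a probability.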

 


 

[IES 2008] Mostow, J., Corbett, A., Valeri, J., Bey, J., Duke, N. K., & Trotochaud, C. (2008, June 10-12). Explicit Comprehension Instruction in an Automated Reading Tutor that Listens:  Year 1 [poster and handout]. IES Third Annual Research Conference, Washington, DC.

 


 

[FLET 2008] Mostow, J. (2008). Experience from a Reading Tutor that listens:  Evaluation purposes, excuses, and methods. In C. K. Kinzer & L. Verhoeven (Eds.), Interactive Literacy Education:  Facilitating Literacy Environments Through Technology, pp. 117-148. New York: Lawrence Erlbaum Associates, Taylor & Francis Group.  Click here to order book from Amazon.com.

Abstract:  This chapter gives three good reasons to evaluate reading software, identifies three methods for doing so, and refutes three excuses for not evaluating – namely, that evaluation is premature, unnecessary, or will be done by others:

(1) Wizard of Oz experiments help test whether (and clarify how) a proposed approach might work, and refute the excuse that evaluation is premature because the approach has not yet been implemented in a proposed system that may take years to develop.

(2) Conventional controlled studies help determine whether an implemented system helps children gain more in reading than they would otherwise.  This criterion is necessary to improve on the status quo, but the difficulty of meeting it refutes the excuse that evaluation is unnecessary due to the supposedly innate superiority of learning on computers, or of a proposed way to use them.

(3) Experiments embedded in an automated tutor help analyze which tutorial actions help which students and words, thereby guiding improvement of the tutor in ways that third party evaluation cannot, thus refuting the excuse that evaluation can be left to others. 

The chapter details some practical lessons learned from designing, performing, and analyzing experiments embedded in Project LISTEN’s school-deployed Reading Tutor, which uses speech recognition to listen to children read aloud, and is helping hundreds of children learn to read. 


[STLL 2008 SC]  Aist, G., & Mostow, J. (2008). Faster, better task choice in a reading tutor that listens. In V. M. Holland & F. P. Fisher (Eds.), The Path of Speech Technologies in Computer Assisted Language Learning:  From Research Toward Practice (pp. 220-240). New York: Routledge.

Abstract:  We analyze the efficiency and effectiveness of task choice in the context of a reading tutor that listens to children read aloud.  We define efficiency as the time to pick a story, and effectiveness in terms of exposing students to new material.  We describe design features we added to improve the Reading Tutor’s efficiency and effectiveness, and evaluate the resulting systems quantitatively, as follows. First, we made the story menu child-friendlier by incorporating two improvements: (a) to support use by nonreaders, the new menu spoke all items on the list; (b) to speed up choice, the new menu required just one click to select an item. Second, we instituted a mixed-initiative story choice policy where the Reading Tutor and the student took turns choosing stories. These improvements made story choice measurably more efficient and effective. 


[STLL 2008 S98]  Mostow, J., Aist, G., Huang, C., Junker, B., Kennedy, R., Lan, H., Latimer, D., O'Connor, R., Tassone, R., Tobin, B., & Wierman, A. (2008). 4-Month evaluation of a learner-controlled Reading Tutor that listens. In V. M. Holland & F. P. Fisher (Eds.), The Path of Speech Technologies in Computer Assisted Language Learning:  From Research Toward Practice (pp. 201-219). New York: Routledge.

 

Abstract:  We evaluated an automated Reading Tutor that let children pick stories to read, and listened to them read aloud. All 72 children in three classrooms (grades 2, 4, 5) were independently tested on the nationally normed Word Attack, Word Identification, and Passage Comprehension subtests of the Woodcock Reading Mastery Test (where they averaged nearly 2 standard deviations below national norms), and on oral reading fluency.  We split each class into 3 matched treatment groups:  Reading Tutor, commercial reading software, or other activities.  In 4 months, the Reading Tutor group gained significantly more in Passage Comprehension than the control group (effect size = 1.2, p=.002) - even though actual usage was a fraction of the planned daily 20-25 minutes.  To help explain these results, we analyzed relationships among gains in Word Attack, Word Identification, Passage Comprehension, and fluency by 108 additional children who used the Reading Tutor in 7 other classrooms (grades 1-4). Gains in Word Identification predicted Passage Comprehension gains only for Reading Tutor users, both in the controlled study (n=21, p=.042, regression coefficient B=.495± s.e. .227) and in the other classrooms (n=108, p=.005, B=.331±.115), where grade was also a significant predictor (p=.024, B=2.575±1.127). 


* [JECR 2007] Poulsen, R., Wiemer-Hastings, P., & Allbritton, D. (2007). Tutoring Bilingual Students with an Automated Reading Tutor That Listens. Journal of Educational Computing Research, 36(2), 191-221.  Click here for .pdf file.

 

Abstract:  Children from non-English-speaking homes are doubly disadvantaged when learning English in school. They enter school with less prior knowledge of English sounds, word meanings, and sentence structure, and they get little or no reinforcement of their learning outside of the classroom. This article compares the classroom standard practice of sustained silent reading with the Project LISTEN Reading Tutor, which uses automated speech recognition to "listen" to children read aloud, providing both spoken and graphical feedback. Previous research with the Reading Tutor has focused primarily on native-speaking populations. In this study 34 Hispanic students spent one month in the classroom and one month using the Reading Tutor for 25 minutes per day. The Reading Tutor condition produced significant learning gains in several measures of fluency. Effect sizes ranged from 0.55 to 1.27. These dramatic results from a one-month treatment indicate this technology may have much to offer English language learners.

 


[SLaTE 2007 ASL] Xu, L., Varadharajan, V., Maravich, J., Tongia, R., & Mostow, J. (2007, October 1-3). DeSIGN: An Intelligent Tutor to Teach American Sign Language. SLaTE workshop on Speech and Language Technology for Education, ISCA Tutorial and Research Workshop, The Summit Inn, Farmington, Pennsylvania.  Click here for .pdf file.

 

Abstract:  This paper presents the development of DeSIGN, an educational software application for those deaf students who are taught to communicate using American Sign Language (ASL). The software reinforces English vocabulary and ASL signs by providing two essential components of a tutor, lessons and tests. The current version was designed for 5th and 6th graders, whose literacy skills lag by a grade or more on average. In addition, a game that allows the students to be creative has been integrated into the tests.  Another feature of DeSIGN is its ability to intelligently adapt its tests to the changing knowledge of the student as determined by a knowledge tracing algorithm. A separate interface for the teacher enables additions and modifications to the content of the tutor and provides progress monitoring. These dynamic aspects help motivate the students to use the software repeatedly. This software prototype aims at a feasible and sustainable approach to increase the participation of deaf people in society. DeSIGN has undergone an iteration of testing and is currently in use at a school for the deaf in Pittsburgh.

 


[AIED 2007 motivation] Beck, J. E. (2007, July 9-13). Does learner control affect learning? Proceedings of the 13th International Conference on Artificial Intelligence in Education, Los Angeles, CA, 135-142.  Click here for .pdf file.

 

Abstract:  Many intelligent tutoring systems permit some degree of learner control. A natural question is whether the increased student engagement and motivation such control provides results in additional student learning. This paper uses a novel approach, learning decomposition, to investigate whether students do in fact learn more from a story they select to read than from a story the tutor selects for them. By analyzing 346 students reading approximately 6.9 million words, we have found that students learn approximately 25% more in stories they choose to read, even though from a purely pedagogical standpoint such stories may not be as appropriate as those chosen by the computer. Furthermore, we found that (for our instantiation of learner control) younger students may derive less benefit from learner control than older students, and girls derive less benefit than boys.

 


[AIED 2007 comprehension] Zhang, X., Mostow, J., & Beck, J. E. (2007, July 9-13). Can a Computer Listen for Fluctuations in Reading Comprehension? Proceedings of the 13th International Conference on Artificial Intelligence in Education, Los Angeles, CA, 495-502.  Click here for .pdf file.

 

Abstract:  The ability to detect fluctuation in students' comprehension of text would be very useful for many intelligent tutoring systems. The obvious solution of inserting comprehension questions is limited in its application because it interrupts the flow of reading. To investigate whether we can detect comprehension fluctuations simply by observing the reading process itself, we developed a statistical model of 7805 responses by 289 children in grades 1-4 to multiple-choice comprehension questions in Project LISTEN's Reading Tutor, which listens to children read aloud and helps them learn to read.  Machine-observable features of students' reading behavior turned out to be statistically significant predictors of their performance on individual questions.

 


[EDM 2007 LFA transfer] Leszczenski, J. M., & Beck, J. E. (2007, July 9). What’s in a word? Extending learning factors analysis to modeling reading transfer. Proceedings of the AIED2007 Workshop on Educational Data Mining, Marina del Rey, CA, 31-39.  Click here for .pdf file.

 

Abstract:  Learning Factors Analysis (LFA) has been proposed as a generic solution to evaluate and compare cognitive models of learning [1]. By performing a heuristic search over a space of statistical models, the researcher may evaluate different cognitive representations of a set of skills. We introduce a scalable application of this framework in the context of transfer in reading and demonstrate it upon Reading Tutor data. Using an assumption of a word-level model of learning as a baseline, we apply LFA to determine whether a representation with fewer word independencies will produce a better fit for student learning data. Specifically, we show that representing some groups of words as their common root leads to a better fitting model of student knowledge, indicating that this representation offers more information than merely viewing words as independent, atomic skills. In addition, we demonstrate an approximation to LFA which allows it to scale tractably to large datasets. We find that using a word root-based model of learning leads to an improved model fit, suggesting students make use of this information in their representation of words. Additionally, we present evidence based on both model fit and learning rate relationships that low proficiency students tend to exhibit a lesser degree of transfer through the word root representation than higher proficiency students.

 


[EDM 2007 LD transfer] Zhang, X., Mostow, J., & Beck, J. E. (2007, July 9). All in the (word) family:  Using learning decomposition to estimate transfer between skills in a Reading Tutor that listens. AIED2007 Educational Data Mining Workshop, Marina del Rey, CA.  Click here for .pdf file.

 

Abstract:  In this paper, we use the method of learning decomposition to study students’ mental representations of English words. Specifically, we investigate whether practice on a word transfers to similar words. We focus on the case where similar words share the same root (e.g., “dog” and “dogs”). Our data comes from Project LISTEN’s Reading Tutor during the 2003-2004 school year, and includes 6,213,289 words read by 650 students. We analyze the distribution of transfer effects across students, and identify factors that predict the amount of transfer. The results support some of our hypotheses about learning, e.g., the transfer effect from practice on similar words is greater for proficient readers than for poor readers. More significant than these empirical findings, however, is the novel analytic approach to measure transfer effects.

 


[EDM 2007 Dirichlet] Beck, J. E. (2007, July 9). Difficulties in inferring student knowledge from observations (and why you should care). Proceedings of the AIED2007 Workshop on Educational Data Mining, Marina del Rey, CA, 21-30.  Click here for .pdf file.

 

Abstract:  Student modeling has a long history in the field of intelligent educational software and is the basis for many tutorial decisions. Furthermore, the task of assessing a student’s level of knowledge is a basic building block in the educational data mining process. If we cannot estimate what students know, it is difficult to perform fine-grained analyses to see if a system’s teaching actions are having a positive effect. In this paper, we demonstrate that there are several unaddressed problems with student model construction that negatively affect the inferences we can make. We present two partial solutions to these problems, using Expectation Maximization to estimate parameters and using Dirichlet priors to bias the model fit procedure. Aside from reliably improving model fit in predictive accuracy, these approaches might result in model parameters that are more plausible. Although parameter plausibility is difficult to quantify, we discuss some guidelines and propose a derived measure of predicted number of trials until mastery as a method for evaluating model parameters.

 


[UM 2007] Beck, J. E., & Chang, K.-m. (2007, June 25-29). Identifiability: A Fundamental Problem of Student Modeling.  Proceedings of the 11th International Conference on User Modeling (UM 2007), Corfu, Greece.  Click here for .pdf file.

 

Abstract:  In this paper we show how model identifiability is an issue for student modeling: observed student performance corresponds to an infinite family of possible model parameter estimates, all of which make identical predictions about student performance. However, these parameter estimates make different claims, some of which are clearly incorrect, about the student’s unobservable internal knowledge. We propose methods for evaluating these models to find ones that are more plausible. Specifically, we present an approach using Dirichlet priors to bias model search that results in a statistically reliable improvement in predictive accuracy (AUC of 0.620 ± 0.002 vs. 0.614 ± 0.002). Furthermore, the parameters associated with this model provide more plausible estimates of student learning, and better track with known properties of students’ background knowledge. The main conclusion is that prior beliefs are necessary to bias the student modeling search, and even large quantities of performance data alone are insufficient to properly estimate the model.
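
Illustrative sketch (not from the paper):  The identifiability problem can be illustrated with the textbook knowledge-tracing model.  Before conditioning on any responses, the expected probability of a correct answer at opportunity t is (1 - slip) - (1 - slip - guess) * (1 - l0) * (1 - learn)^t, so any two parameter sets that agree on slip, on learn, and on the product (1 - slip - guess) * (1 - l0) predict the same performance curve while making different claims about initial knowledge.  The parameter values and names below are made up for illustration and are not claimed to be the paper's own construction.

```python
def expected_correct(t, l0, learn, guess, slip):
    """Expected P(correct) at opportunity t, not conditioned on responses."""
    p_known = 1 - (1 - l0) * (1 - learn) ** t
    return p_known * (1 - slip) + (1 - p_known) * guess


# Two hypothetical students-as-models: A "knows" more at the start,
# B "guesses" more.  Both share (1 - slip - guess) * (1 - l0) = 0.35.
model_a = dict(l0=0.5, learn=0.1, guess=0.2, slip=0.1)
model_b = dict(l0=0.3, learn=0.1, guess=0.4, slip=0.1)

curve_a = [expected_correct(t, **model_a) for t in range(20)]
curve_b = [expected_correct(t, **model_b) for t in range(20)]

# Largest difference between the two predicted performance curves.
max_gap = max(abs(a - b) for a, b in zip(curve_a, curve_b))
```

Because the two curves coincide, aggregate performance data alone cannot choose between these parameter sets, which is the kind of situation where a prior over parameters is needed to break the tie.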

 


[ICASSP 2007] Anumanchipalli, G. K., Ravishankar, M., & Reddy, R. (2007, April 15-20). Improving Pronunciation Inference Using N-Best List, Acoustics and Orthography. Proc.  32nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, Paper 4151.  Click here for .pdf file.

 

Abstract:  In this paper, we tackle the problem of pronunciation inference and Out-of-Vocabulary (OOV) enrollment in Automatic Speech Recognition (ASR) applications. We combine linguistic and acoustic information of the OOV word using its spelling and a single instance of its utterance to derive an appropriate phonetic baseform. The novelty of the approach is in its employment of an orthography-driven n-best hypothesis and rescoring strategy of the pronunciation alternatives. We make use of decision trees and heuristic tree search to construct and score the n-best hypotheses space. We use acoustic alignment likelihood and phone transition cost to leverage the empirical evidence and phonotactic priors to rescore the hypotheses and refine the baseforms.

 


[IERI 2007] Mostow, J., & Beck, J. (2007). When the Rubber Meets the Road:  Lessons from the In-School Adventures of an Automated Reading Tutor that Listens. In B. Schneider & S.-K. McDonald (Eds.), Scale-Up in Education (Vol. 2, pp. 183-200).  © Rowman & Littlefield Publishers, Lanham, MD.  Click here for .pdf file.

 

Abstract:  Project LISTEN's Reading Tutor (www.cs.cmu.edu/~listen) uses automatic speech recognition to listen to children read aloud, and helps them learn to read.  Its experimental deployment in schools has expanded from a single computer used by eight third graders in one school in 1996 to two hundred computers used by children in grades 1-3 in nine schools in 2003.  This project illustrates how technology can not just scale up an intervention, but instrument its implementation.  For example, analysis of 2002-2003 usage showed that session frequency and duration averaged significantly higher in lab settings than in classrooms.

 


[ICSLP2006] Mostow, J. (2006, September 17-21). Is ASR accurate enough for automated reading tutors, and how can we tell? Ninth International Conference on Spoken Language Processing (Interspeech 2006 - ICSLP), Pittsburgh, PA, 837-840.  Click here for .pdf file.

 

Abstract:  We discuss pros and cons of several ways to evaluate ASR accuracy in automated tutors that listen to students read aloud.  Whether ASR is accurate enough for a particular reading tutor function depends on what ASR-based judgment it requires, the visibility of that judgment to students and teachers, and the amount of input speech on which it is based.  How to tell depends on the purpose, criterion, and space of the evaluation.

 


[AAAI2006 help] Chang, K., Beck, J. E., Mostow, J., & Corbett, A. (2006, July 17). Does Help Help?  A Bayes Net Approach to Modeling Tutor Interventions. AAAI2006 Workshop on Educational Data Mining, Boston, MA.  Click here for .pdf file.

 

Abstract:  This paper describes an effort to measure the effectiveness of tutor help in an intelligent tutoring system. Conventional pre- and post-test experimental methods can determine whether help is effective but are expensive to conduct.  Furthermore, a pre- and post-test methodology ignores a source of information: students request help about words they do not know. Therefore, we propose a dynamic Bayes net (which we call the help model) that models tutor help and student knowledge in one coherent framework. The help model distinguishes two different effects of help:  scaffolding immediate performance vs. teaching persistent knowledge that improves long term performance. We train the help model to fit the student performance data gathered from usage of the Reading Tutor. The parameters of the trained model suggest that students benefit from both the scaffolding and teaching effects of help. Thus, our framework is able to distinguish two types of influence that help has on the student, and can determine whether help helps learning without an explicit controlled study.

 


[SSSR2006 cloze] Hensler, B. S., & Beck, J. (2006, July 6-8). Are all questions created equal?  Factors that influence cloze question difficulty. Thirteenth Annual Meeting of the Society for the Scientific Study of Reading, Vancouver, BC, Canada.  Click here for .ppt file.

 

Abstract:  The multiple choice cloze (MCC) assessment methodology is widely used in assessing reading comprehension; therefore an improved scoring methodology would have broad impact within the reading research community.  We have constructed an MCC question model that simultaneously estimates the student's comprehension proficiency and the impact of various terms on MCC difficulty. To build the model, we analyzed 16,161 MCC question responses that were administered by a computer reading tutor over the course of a school year.  Participants were 373 students in grades 1 through 6 (ages 5-12) in urban and suburban public schools in Pennsylvania.  Students reading stories on the Reading Tutor were presented with cloze questions with the goal of assessing reading comprehension.  MCC questions were generated randomly by the computer without using a fixed deletion ratio.  A maximum of one word was deleted per sentence, and the distractors were selected from the story being read and were of similar frequency as the deleted target word.  MCC questions and the response choices were read aloud by the computer to the students. 

 

To develop our model of MCC difficulty, we used multinomial logistic regression to calculate the relative impact of a number of factors.  Our model includes the location of the deleted target word within the sentence and question length as covariates.  As factors, we used student identity, reaction time (rounded to the nearest second) and level of difficulty of the target word.  We hypothesized that more proficient readers would use syntactic cues while less proficient readers would not.  To add syntax to the model, we used the TreeTagger part of speech tagger to annotate the part of speech of the correct answer for each cloze question.  We then computed how many of the distractors could have the same part of speech as the answer.  Presumably questions with many distractors able to take on the same part of speech as the answer would be harder.

 

After training the model on our 16,161 MCC questions, there were two main findings.  First, our model found that students who had a second grade reading proficiency (as measured by Woodcock Reading Comprehension Cluster) or higher were sensitive to how many of the possible responses could take on the same part of speech as the correct answer (p= 0.002) for the cloze sentence, while students below second grade proficiency were insensitive to this term (p=0.467).  This result suggests that students' syntactic awareness, at least within the context of MCC questions, begins at around the second grade.  The second main finding was the degree of correlation of each student's Beta parameter, the model's estimate of her ability to answer MCC questions, with her associated Woodcock test score.  The mean within-grade correlation between Beta and the Reading Comprehension Cluster score was 0.69, a very strong fit.

 


[SSSR2006 fluency] Mostow, J., & Beck, J. (2006, July 6-8). Refined micro-analysis of fluency gains in a Reading Tutor that listens. Thirteenth Annual Meeting of the Society for the Scientific Study of Reading, Vancouver, BC, Canada.  Click here for .ppt file.

 

Abstract:  Our SSSR2005 talk presented a linear model of speedup in word reading between successive encounters in connected text, based on a quarter of a million such encounters.  The model indicated that reading a word in a new context contributed more to speedup than re-encountering it in an old context, implying that wide reading builds fluency more than rereading.  Our new, improved model uses a growth curve to model word reading time as a function of the number and types of encounters of the word.  This approach lets us  estimate -- both overall and at different reading levels -- the relative value of encountering a word in a new context versus an old one, and for the first time on a given day versus subsequently.

 


[ITS2006 gaming] Baker, R. S. J. d., Corbett, A. T., Koedinger, K. R., Evenson, S., Roll, I., Wagner, A. Z., Naim, M., Raspat, J., Baker, D. J., & Beck, J. E. (2006, June 26-30). Adapting to When Students Game an Intelligent Tutoring System [Best Paper]. Proceedings of the 8th International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan, 392-401.  Click here for .pdf file.

 

Abstract:  It has been found in recent years that many students who use intelligent tutoring systems game the system, attempting to succeed in the educational environment by exploiting properties of the system rather than by learning the material and trying to use that knowledge to answer correctly. In this paper, we introduce a system which gives a gaming student supplementary exercises focused on exactly the material the student bypassed by gaming, and which also expresses negative emotion to gaming students through an animated agent. Students using this system engage in less gaming, and students who receive many supplemental exercises have considerably better learning than is associated with gaming in the control condition or prior studies.

 


[ITS2006 BNT-SM] Chang, K., Beck, J., Mostow, J., & Corbett, A. (2006, June 26-30). A Bayes Net Toolkit for Student Modeling in Intelligent Tutoring Systems. Proceedings of the 8th International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan.  Click here for .pdf file.

 

Abstract:  This paper describes an effort to model a student’s changing knowledge state during skill acquisition. Dynamic Bayes Nets (DBNs) provide a powerful way to represent and reason about uncertainty in time series data, and are therefore well-suited to model student knowledge.  Many general-purpose Bayes net packages have been implemented and distributed; however, constructing DBNs often involves complicated coding effort. To address this problem, we introduce a tool called BNT-SM.  BNT-SM inputs a data set and a compact XML specification of a Bayes net model hypothesized by a researcher to describe causal relationships among student knowledge and observed behavior. BNT-SM generates and executes the code to train and test the model using the Bayes Net Toolbox [1]. Compared to the BNT code it outputs, BNT-SM reduces the number of lines of code required to use a DBN by a factor of 5. In addition to supporting more flexible models, we illustrate how to use BNT-SM to simulate Knowledge Tracing (KT) [2], an established technique for student modeling. The trained DBN does a better job of modeling and predicting student performance than the original KT code (Area Under Curve = 0.610 > 0.568), due to differences in how it estimates parameters.
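
Illustration (not from the paper):  For readers unfamiliar with Knowledge Tracing, the following is a minimal Python sketch of the standard KT update, assuming the usual four parameters (initial knowledge, guess, slip, and learning rate).  The function names and the parameter values are hypothetical and chosen only for illustration; this is not BNT-SM code and does not reproduce the paper's trained model.

```python
def kt_predict_correct(p_know, guess, slip):
    """Probability of a correct response given the current knowledge estimate."""
    return p_know * (1 - slip) + (1 - p_know) * guess


def kt_update(p_know, correct, guess, slip, learn):
    """One Knowledge Tracing step: Bayes update on the observation,
    followed by the chance of learning the skill at this opportunity."""
    if correct:
        numerator = p_know * (1 - slip)
        denominator = numerator + (1 - p_know) * guess
    else:
        numerator = p_know * slip
        denominator = numerator + (1 - p_know) * (1 - guess)
    posterior = numerator / denominator
    return posterior + (1 - posterior) * learn


def trace(observations, p_init=0.3, guess=0.2, slip=0.1, learn=0.15):
    """Run KT over a sequence of True/False responses.
    All default parameter values are illustrative assumptions."""
    p_know = p_init
    predictions = []
    for correct in observations:
        predictions.append(kt_predict_correct(p_know, guess, slip))
        p_know = kt_update(p_know, correct, guess, slip, learn)
    return p_know, predictions


# Hypothetical response sequence for one student on one skill.
final_estimate, predicted = trace([False, True, True, True])
```

In this sketch a correct response raises the knowledge estimate and an incorrect one lowers it, while the learning term lets the estimate grow with practice.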

 


[ITS2006 cloze] Hensler, B. S., & Beck, J. (2006, June 26-30). Better student assessing by finding difficulty factors in a fully automated comprehension measure. Proceedings of the 8th International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan. Click here for .pdf file.

 

Abstract:  The multiple choice cloze (MCC) question format is commonly used to assess students' comprehension. It is an especially useful format for ITS because it is fully automatable and can be used on any text.  Unfortunately, very little is known about the factors that influence MCC question difficulty and student performance on such questions. In order to better understand student performance on MCC questions, we developed a model of MCC questions. Our model shows that the difficulty of the answer and the student’s response time are the most important predictors of student performance. In addition to showing the relative impact of the terms in our model, our model provides evidence of a developmental trend in syntactic awareness beginning around the 2nd grade. Our model also accounts for 10% more variance in students’ external test scores compared to the standard scoring method for MCC questions.

 


[ITS2006 vocabulary] Heiner, C., Beck, J., & Mostow, J. (2006, June 26-30). Automated Vocabulary Instruction in a Reading Tutor. Proceedings of the 8th International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan.  Click here for .pdf file.

 

Abstract:  This paper presents a within-subject, randomized experiment to compare automated interventions for teaching vocabulary to young readers using Project LISTEN's Reading Tutor. The experiment compared three conditions: no explicit instruction, a quick definition, and a quick definition plus a post-story battery of extended instruction based on a published instructional sequence for human teachers. A month-long study with elementary school children indicates that the quick instruction, which lasts about seven seconds, has immediate effects on learning gains that did not persist. Extended instruction, which lasted about thirty seconds longer than the quick instruction, had a persistent effect and produced gains on a posttest one week later.

 


[ITS2006 decomposition] Beck, J. (2006, June 26). Using learning decomposition to analyze student fluency development. ITS2006 Educational Data Mining Workshop, Jhongli, Taiwan.  Click here for .pdf file.

 

Abstract:  This paper introduces an approach called learning decomposition to analyze what types of practice are most effective for helping students learn a skill. The approach is a generalization of learning curve analysis, and uses non-linear regression to determine how to weight different types of practice opportunities relative to each other. We are able to show that different types of practice differ reliably in how quickly students acquire the skill of reading words quickly and accurately. Specifically, massed practice is generally not effective for helping students learn words, but may be acceptable for less proficient readers. Rereading the same story is generally not as effective as reading a variety of stories, but might be beneficial for more proficient readers.
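
Illustration (not from the paper):  The idea of weighting one type of practice relative to another can be sketched in Python as follows.  The sketch assumes an exponential learning curve, performance = A * exp(-b * (t1 + B * t2)), where t1 and t2 count two hypothetical types of prior practice and B is the relative weight of the second type.  The model form, the function names, the synthetic data, and the grid search (substituted here for a general non-linear regression routine, and fitting in log space) are all assumptions made for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns SSE too."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse


def fit_decomposition(t1, t2, performance, candidates):
    """Estimate A, b, and B in performance = A * exp(-b * (t1 + B * t2)).
    For each candidate B the model is linear in log space, so the remaining
    parameters are fit in closed form; the B with the lowest error wins."""
    log_perf = [math.log(v) for v in performance]
    best = None
    for weight in candidates:
        effective_practice = [a + weight * c for a, c in zip(t1, t2)]
        intercept, slope, sse = fit_line(effective_practice, log_perf)
        if best is None or sse < best[0]:
            best = (sse, math.exp(intercept), -slope, weight)
    _, a_hat, b_hat, weight_hat = best
    return a_hat, b_hat, weight_hat


# Synthetic, noise-free data generated with A = 2.0, b = 0.1, B = 0.5.
t1 = [i for i in range(6) for j in range(6)]
t2 = [j for i in range(6) for j in range(6)]
perf = [2.0 * math.exp(-0.1 * (a + 0.5 * c)) for a, c in zip(t1, t2)]
grid = [k / 20 for k in range(0, 61)]
a_hat, b_hat, weight_hat = fit_decomposition(t1, t2, perf, grid)
```

Under these assumptions, an estimated B below 1 would mean that the second type of practice counts for less than the first, and B above 1 that it counts for more.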

 


[JNLE2006] Mostow, J. and J. Beck (2006). Some useful tactics to modify, map, and mine data from intelligent tutors. Natural Language Engineering (Special Issue on Educational Applications) 12(2), 195-208.  © 2006 Cambridge University Press.  Click here for .pdf file.

 

Abstract:  Mining data logged by intelligent tutoring systems has the potential to discover information of value to students, teachers, authors, developers, researchers, and the tutors themselves -- information that could make education dramatically more efficient, effective, and responsive to individual needs. We factor this discovery process into tactics to modify tutors, map heterogeneous event streams into tabular data sets, and mine them. This model and the tactics identified mark out a roadmap for the emerging area of tutorial data mining, and may provide a useful vocabulary and framework for characterizing past, current, and future work in this area. We illustrate this framework using experiments that tested interventions by an automated reading tutor to help children decode words and comprehend stories.


[IJAIED2006] Beck, J. E., & Sison, J. (2006). Using knowledge tracing in a noisy environment to measure student reading proficiencies. International Journal of Artificial Intelligence in Education, 16, 129-143.  (In Special “Best of ITS 2004” Issue.)  Click here for .pdf file.

Abstract:  Constructing a student model for language tutors is a challenging task. This paper describes using knowledge tracing to construct a student model of reading proficiency and validates the model. We use speech recognition to assess a student’s reading proficiency at a subword level, even though the speech recognizer output is at the level of words and is statistically noisy. Specifically, we estimate the student’s knowledge of 80 letter to sound mappings, such as ch making the sound /K/ in “chemistry.” At a coarse level, the student model did a better job at estimating reading proficiency for 47.2% of the students than did a standardized test designed for the task. Although not quite as strong as the standardized test, our assessment method can provide a report on the student at any time during the year and requires no break from reading to administer. Our model’s estimate of the student’s knowledge on individual letter to sound mappings is a significant predictor of whether he will ask for help on a particular word. Thus, our student model is able to describe student performance both at a coarse- and at a fine-grain size.


[AIED2005 event] Mostow, J., Beck, J., Cen, H., Gouvea, E., & Heiner, C. (2005, July). Interactive Demonstration of a Generic Tool to Browse Tutor-Student Interactions. Interactive Events Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED 2005), Amsterdam, 29-32.  Click here for .pdf file.

 

Abstract:  Project LISTEN's Session Browser is a generic tool to browse a database of students' interactions with an automated tutor.  Using databases logged by Project LISTEN's Reading Tutor, we illustrate how to specify phenomena to investigate, explore events and the context where they occurred, dynamically drill down and adjust which details to display, and summarize events in human-understandable form.   The tool should apply to MySQL databases from other tutors as well.


[AIED2005 browser] Mostow, J., Beck, J., Cuneo, A., Gouvea, E., & Heiner, C. (2005, July 18-22). A Generic Tool to Browse Tutor-Student Interactions:  Time Will Tell! Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED 2005), Amsterdam, 884-886.  Click here for .pdf file.

 

Abstract:  A basic question in mining data from an intelligent tutoring system is, "What happened when…?"  A generic tool to answer such questions should let the user specify which phenomenon to explore; explore selected events and the context in which they occurred; and require minimal effort to adapt the tool to new versions, to new users, or to other tutors.  We describe an implemented tool and how it meets these requirements. The tool applies to MySQL databases whose representation of tutorial events includes student, computer, start time, and end time.  It infers the implicit hierarchical structure of tutorial interaction so humans can browse it. A companion paper [1] illustrates the use of this tool to explore data from Project LISTEN's automated Reading Tutor.


[AIED2005 interruption] Heiner, C., Beck, J., & Mostow, J. (2005, July 18-22). When do students interrupt help?  Effects of individual differences.  Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED 2005), Amsterdam, 819-826.  Note:  This paper was accepted as a poster, but due to a publishing error, the printed proceedings include the original submitted version instead of the 3-page revised version.  Click here for 3-page accepted version.  Click here for 8-page published version.

 

Abstract:  When do students interrupt help to request different help? To study this question, we analyze a within-subject experiment in the 2003-2004 version of Project LISTEN's Reading Tutor. From 168,983 trials of this experiment, we report patterns in when students choose to interrupt help. To improve model fit for individual data, we adjust our model to account for individual differences. We report small but significant correlations between a student parameter in our model and gender as well as external measures of motivation and academic performance.

 


[AIED2005 engagement] Beck, J. (2005, July 18-22). Engagement tracing:  using response times to model student disengagement. Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED 2005), Amsterdam, 88-95.  Click here for .pdf file.

 

Abstract:  Time on task is an important predictor for how much students learn.  However, students must be focused on the learning for the time invested to be productive.  Unfortunately, students do not always try their hardest to solve problems presented by computer tutors.  This paper explores student disengagement and proposes an approach, engagement tracing, for detecting whether a student is engaged in answering questions.  This model is based on item response theory, and uses as input the difficulty of the question, how long the student took to respond, and whether the response was correct.  From these data, the model determines the probability a student was actively engaged in trying to answer the question.  The model has a reliability of 0.95, and its estimate of student engagement correlates at 0.25 with student gains on external tests.  Finally, the model is sensitive enough to detect variations in student engagement within a single tutoring session.  The novel aspect of this work is that it requires only data normally collected by a computer tutor, and the affective model is validated against student performance on an external measure.  


[AIED2005 ASR] Beck, J. E., Chang, K., Mostow, J., & Corbett, A. (2005, July 19). Using a student model to improve a computer tutor's speech recognition. Proceedings of the AIED 05 Workshop on Student Modeling for Language Tutors, 12th International Conference on Artificial Intelligence in Education, Amsterdam, 2-11.  Click here for .pdf file.

 

Abstract:  Intelligent computer tutors can derive much of their power from having a student model that describes the learner’s competencies.  However, constructing a student model is challenging for computer tutors that use automated speech recognition (ASR) as input.  This paper reports using ASR output from a computer tutor for reading to compare two models of how students learn to read words:  a model that assumes students learn words as whole-unit chunks, and a model that assumes students learn the individual letter→sound mappings that make up words.  We use the data collected by the ASR to show that a model of letter→sound mappings better describes student performance.  We then compare using the student model and the ASR, both alone and in combination, to predict which words the student will read correctly, as scored by a human transcriber.  Surprisingly, majority class has a higher classification accuracy than the ASR.  However, we demonstrate that the ASR output still has useful information, that classification accuracy is not a good metric for this task, and that the Area Under Curve (AUC) of ROC curves is a superior scoring method.  The AUC of the student model is statistically reliably better (0.670 vs. 0.550) than that of the ASR, which in turn is reliably better than majority class.  These results show that ASR can be used to compare theories of how students learn to read words, and modeling individual learners’ proficiencies may enable improved speech recognition.
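
Illustration (not from the paper):  The contrast between classification accuracy and AUC can be sketched in Python with a small, made-up data set in which most words are read correctly.  The labels, scores, threshold, and function names below are hypothetical; AUC is computed as the fraction of (positive, negative) pairs ranked correctly, with ties counting one half.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up data: 9 words read correctly (1) and 1 misread (0).
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Majority class: always predict "correct", with one constant score.
majority_predictions = [1] * len(labels)
majority_scores = [1.0] * len(labels)

# A noisy scorer (hypothetical), thresholded at 0.5 for classification.
noisy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.9, 0.8, 0.3, 0.7, 0.5]
noisy_predictions = [1 if s > 0.5 else 0 for s in noisy_scores]

majority_accuracy = accuracy(labels, majority_predictions)
noisy_accuracy = accuracy(labels, noisy_predictions)
majority_auc = auc(labels, majority_scores)
noisy_auc = auc(labels, noisy_scores)
```

On this made-up data the majority-class baseline has the higher accuracy, yet its constant score carries no ranking information, whereas the noisy scorer ranks most correct readings above the misreading and so has the higher AUC.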


[AIED 2005 model] Chang, K., Beck, J. E., Mostow, J., & Corbett, A. (2005, July 19). Using speech recognition to evaluate two student models for a reading tutor. Proceedings of the AIED 05 Workshop on Student Modeling for Language Tutors, 12th International Conference on Artificial Intelligence in Education, Amsterdam, 12-21.  Click here for .pdf file.

 

Abstract:  Intelligent Tutoring Systems derive much of their power from having a student model that describes the learner's competencies. However, constructing a student model is challenging for computer tutors that use automated speech recognition (ASR) as input, due to inherent inaccuracies in ASR. We describe two extremely simplified models of developing word decoding skills and explore whether there is sufficient information in ASR output to determine which model fits student performance better, and under what circumstances one model is preferable to another.

 

The two models that we describe are a lexical model that assumes students learn words as whole-unit chunks, and a grapheme-to-phoneme (G-to-P) model that assumes students learn the individual letter-to-sound mappings that compose the words. We use the data collected by the ASR to show that the G-to-P model better describes student performance than the lexical model. We then determine which model performs better under what conditions. On one hand, the G-to-P model better correlates with student performance data when the student is older or when the word is more difficult to read or spell. On the other hand, the lexical model better correlates with student performance data when the student has seen the word more times.


[AAAI 2005 workshop] Beck, J. (Ed.). (2005, July 10). Proceedings of the AAAI2005 Workshop on Educational Data Mining. Pittsburgh, PA.


[AAAI2005 browser] Mostow, J., Beck, J., Cen, H., Cuneo, A., Gouvea, E., & Heiner, C. (2005, July 10). An Educational Data Mining Tool to Browse Tutor-Student Interactions:  Time Will Tell! Proceedings of the Workshop on Educational Data Mining, National Conference on Artificial Intelligence, Pittsburgh, 15-22.  Click here for .pdf file.

Abstract:  A basic question in mining data from an intelligent tutoring system is, "What happened when…?"  We identify requirements for a tool to help answer such questions by finding occurrences of specified phenomena and browsing them in human-understandable form.  We describe an implemented tool and how it meets the requirements.  The tool applies to MySQL databases whose representation of tutorial events includes student, computer, start time, and end time.  It automatically computes and displays the temporal hierarchy implicit in this representation.  We illustrate the use of this tool to mine data from Project LISTEN's automated Reading Tutor.


[AAAI2005 usage]  Arnold, A., Scheines, R., Beck, J. E., & Jerome, B. (2005, July 10). Time and attention:  students, sessions, and tasks. Proceedings of the AAAI2005 Workshop on Educational Data Mining, Pittsburgh, PA, 62-66.  Click here for .pdf file.

Abstract:  Students in two classes in the fall of 2004 making extensive use of online courseware were logged as they visited over 500 different “learning pages” which varied in length and in difficulty.  We computed the time spent on each page by each student during each session they were logged in.  We then modeled the time spent for a particular visit as a function of the page itself, the session, and the student. Surprisingly, the average time a student spent on learning pages (over their whole course experience) was of almost no value in predicting how long they would spend on a given page, even controlling for the session and page difficulty.  The page itself was highly predictive, but so was the average time spent on learning pages in a given session.  This indicates that local considerations, e.g., mood, deadline proximity, etc., play a much greater role in determining student pace and attention than do intrinsic student traits.  We also consider the average time spent on learning pages as a function of the time of semester.  Students spent less time on pages later in the semester, even for more demanding material.


[SSSR 2005] Mostow, J., & Beck, J. (2005). Micro-analysis of fluency gains in a Reading Tutor that listens:  Wide vs. repeated guided oral reading.  Talk at Twelfth Annual Meeting of the Society for the Scientific Study of Reading. Toronto.  Click here to download PowerPoint presentation.

Abstract:  Fluency growth is essential but imperfectly understood.  By using automatic speech recognition to listen to children read aloud, Project LISTEN's Reading Tutor provides a novel instrument to study fluency development.  During the 2002-2003 school year, hundreds of children in grades 1-4 used the Reading Tutor, which recorded them reading millions of words of text.  The latency preceding each word reflects the reader’s cognitive effort to identify the word.  Using automatic speech recognition to analyze latency changes between successive encounters of words in the same or different contexts provides new data about how fluency grows.


* [Toronto 2005] Cunningham, T., & Geva, E. (2005, June 24). The effects of reading technologies on literacy development of ESL students [poster presentation]. Twelfth Annual Meeting of the Society for the Scientific Study of Reading, Toronto.

 


* [UBC 2005] Reeder, K., Early, M., Kendrick, M., Shapiro, J., & Wakefield, J. (2005, April). The Role of L1 in Young Multilingual Readers' Success With a Computer-Based Reading Tutor. Fifth International Symposium on Bilingualism, Barcelona, Spain.

 


[AERA 2005] Beck, J. E., & Mostow, J. (2005). Mining Data from Randomized Within-Subject Experiments in an Automated Reading Tutor (poster in session 34.080, "Logging Students' Learning in Complex Domains:  Empirical Considerations and Technological Solutions"). American Educational Research Association 2005 Annual Meeting:  Demography and Democracy in the Era of Accountability, Montreal, Canada.  Click here to download PowerPoint poster.

Abstract:  Experiments embedded in the Reading Tutor help evaluate its decisions in tutoring decoding, vocabulary, and comprehension.



[Kant masters thesis] Kant, P. M. (2004). The Influence of Teachers' Perceptions on Usage of an Educational Technology:  A study of Project LISTEN's Reading Tutor. Unpublished Master's Thesis, University of Pittsburgh, Pittsburgh, PA.

Abstract:  This study looked at factors influencing teachers’ perception and usage of Project LISTEN’s Reading Tutor, a computerized tutor used with elementary students in 9 classroom-based, 10 computer lab-based, and 3 specialist-room school settings.  Thirteen interviews and 22 survey responses (of a possible 28 teachers) examined teachers’ perception of the Reading Tutor and suggested that teachers’ belief in the Tutor influenced their usage of it (r = .46, p < .03).  Three factors seemed to influence teacher belief: 1) perceived ease of use (r = .52, p < .01), 2) teachers’ reported experience with computers (r = .41, p < .04) and instructional technology (r = .48, p < .03), and 3) perceived technical problems such as frequency of technical problems (r = -.44, p < .04) and speed with which problems were fixed (r = .49, p < .02).  Analysis of these factors suggested four themes that cut across factors and seem to influence the way teachers evaluate and use the Reading Tutor – the technology’s degree of convenience, competition from other educational priorities and practices, teacher experience and/or interest with technology, and data available to teachers and the way teachers prioritize that data.  These results suggest that improving convenience of the Reading Tutor, instituting specialized training programs, and improving feedback mechanisms for teachers by providing relevant, situated data may influence teacher belief in the Reading Tutor and thereby increase teacher usage.  This study contributes to current literature on educational technology usage by supporting previous literature suggesting that teacher belief in the importance of a technology influences their use of it.  One unique feature of this study is that it uses both quantitative and qualitative methods to look at the research questions from two different research perspectives.



* [ESL 2004] Poulsen, R. (2004). Tutoring Bilingual Students With an Automated Reading Tutor That Listens:  Results of a Two-Month Pilot Study. Unpublished Masters Thesis, DePaul University, Chicago, IL.  Click here to download .pdf file.

Abstract:  A two-month pilot study comprised of 34 second through fourth grade Hispanic students from four bilingual education classrooms was conducted to compare the efficacy of the 2004 version of the Project LISTEN Reading Tutor against the standard practice of sustained silent reading (SSR).  The Reading Tutor uses automated speech recognition to listen to children read aloud.  It provides both spoken and graphical feedback in order to assist the children with the oral reading task.  Prior research with this software has demonstrated its efficacy within populations of native English speakers.  This study was undertaken to obtain some initial indication as to whether the tutor would also be effective within a population of English language learners. 

The study employed a crossover design where each participant spent one month in each of the treatment conditions.  The experimental treatment consisted of 25 minutes per day using the Reading Tutor within a small pullout lab setting.  Control treatment consisted of the students who remained in the classroom where they participated in established reading instruction activities.  Dependent variables consisted of the school district’s curriculum-based measures for fluency, sight word recognition and comprehension.

The Reading Tutor group out-gained the control group in every measure during both halves of the crossover experiment.  Within-subject results from a paired t-test indicate these gains were significant for one sight word measure (p = .056) and both fluency measures (p < .001).  Effect sizes were 0.55 for timed sight words, a robust 1.16 for total fluency and an even larger 1.27 for fluency controlled for word accuracy.  These dramatic results observed during a one-month treatment indicate this technology may have much to offer English language learners.



[TICL questions] Mostow, J., Beck, J., Bey, J., Cuneo, A., Sison