Project LISTEN Publications
[Note: Links to full text are included when possible, e.g. after publication or conference presentation.
* marks publications by others.]
The research reported in each publication listed here was supported under the grant(s) it acknowledges from the Institute of Education Sciences, U.S. Department of Education, by the National Science Foundation, and by other sources. Any opinions, findings, and conclusions or recommendations expressed in these publications are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the Institute, the U.S. Department of Education, the National Science Foundation, or the United States Government. Thank you to the educators and students who helped generate our data, and the many LISTENers over the years who contributed to the work cited here.
[AETLI 2015] Mostow, J. (to appear). Project LISTEN’s Reading Tutor. In S. A. Crossley & D. S. McNamara (Eds.), Adaptive Educational Technologies for Literacy Instruction. NY: Taylor & Francis, Routledge.
[EDM 2015] Mostow, J., Gates, D., Ellison, R., & Goutam, R. (2015, June 26-29). Automatic Identification of Nutritious Contexts for Learning Vocabulary Words. In Proceedings of the 8th International Conference on Educational Data Mining, Madrid, Spain.
Abstract: Vocabulary knowledge is crucial to literacy development and academic success. Previous research has shown that learning the meaning of a word requires encountering it in diverse informative contexts. In this work, we try to identify “nutritious” contexts for a word – contexts that help students build a rich mental representation of the word’s meaning. Using crowdsourced ratings of vocabulary contexts retrieved from the web, AVER learns models to score unseen contexts for unseen words. We specify the features used in the models, measure their individual informativeness, evaluate AVER’s cross-validated accuracy in scoring contexts for unseen words, and compare its agreement with the human ratings against the humans’ agreement with each other. The automated scores are not good enough to replace human ratings, but should reduce human effort by identifying contexts likely to be worth rating by hand, subject to a tradeoff between the number of contexts inspected by hand and how many of them a human judge will consider nutritious.
[AIED 2015] Huang, Y.-T. & Mostow, J. (2015, June 22-26). Evaluating Human and Automated Generation of Distractors for Diagnostic Multiple-Choice Cloze Questions to Assess Children’s Reading Comprehension. In Proceedings of the 17th International Conference on Artificial Intelligence in Education, Madrid, Spain.
Abstract: We report an experiment to evaluate DQGen’s performance in generating three types of distractors for diagnostic multiple-choice cloze (fill-in-the-blank) questions to assess children’s reading comprehension processes. Ungrammatical distractors test syntax, nonsensical distractors test semantics, and locally plausible distractors test inter-sentential processing. 27 knowledgeable humans rated candidate answers as correct, plausible, nonsensical, or ungrammatical without knowing their intended type or whether they were generated by DQGen, written by other humans, or correct. Surprisingly, DQGen did significantly better than humans at generating ungrammatical distractors and slightly better than them at generating nonsensical distractors, albeit worse at generating plausible distractors. Vetting its output and writing distractors only when necessary would take half as long as writing them all, and improve their quality.
[EDM 2014] Xu, Y., Chang, K.-m., Yuan, Y., & Mostow, J. (2014, July 4-7). Using EEG in Knowledge Tracing. In Proceedings of the 7th International Conference on Educational Data Mining, 361-362. Institute of Education, London, UK.
Abstract: Knowledge tracing (KT) is widely used in Intelligent Tutoring Systems (ITS) to measure student learning. Inexpensive portable electroencephalography (EEG) devices are viable as a way to help detect a number of student mental states relevant to learning, e.g. engagement or attention. This paper reports a first attempt to improve KT estimates of the student’s hidden knowledge state by adding EEG-measured mental states as inputs. Values of learn, forget, guess and slip differ significantly for different EEG states.
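The modification described above can be sketched as standard Bayesian Knowledge Tracing whose learn/guess/slip parameters are indexed by an observed EEG-derived state. The sketch below is illustrative only, not the paper's implementation; the state labels and all parameter values are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's implementation): Bayesian Knowledge
# Tracing whose learn/guess/slip parameters are selected at each step by an
# observed, discretized EEG state. All parameter values are made-up placeholders.

def bkt_with_eeg(responses, eeg_states, p_init=0.3, params=None):
    """responses: list of 0/1 correctness; eeg_states: parallel list of
    EEG state labels (e.g. 'engaged', 'distracted')."""
    if params is None:
        params = {  # hypothetical per-state parameters
            'engaged':    {'learn': 0.20, 'guess': 0.15, 'slip': 0.05},
            'distracted': {'learn': 0.05, 'guess': 0.25, 'slip': 0.20},
        }
    p_know = p_init
    for correct, state in zip(responses, eeg_states):
        p = params[state]
        # Bayes update of the hidden knowledge estimate given the response
        if correct:
            evidence = p_know * (1 - p['slip']) + (1 - p_know) * p['guess']
            posterior = p_know * (1 - p['slip']) / evidence
        else:
            evidence = p_know * p['slip'] + (1 - p_know) * (1 - p['guess'])
            posterior = p_know * p['slip'] / evidence
        # Transition: chance of learning the skill after this step
        p_know = posterior + (1 - posterior) * p['learn']
    return p_know
```

With per-state parameters fitted from data rather than invented, steps taken in states with higher learn and lower slip propagate to faster growth in the estimated probability of knowing the skill.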
[BKT20y 2014] Xu, Y. & Mostow, J. (2014, July 4). A Unified 5-Dimensional Framework for Student Models. In Proceedings of the EDM2014 Workshop on Approaching Twenty Years of Knowledge Tracing: Lessons Learned, Open Challenges, and Promising Developments, 122-129. Institute of Education, London, UK.
Abstract: This paper defines 5 key dimensions of student models: whether and how they model time, skill, noise, latent traits, and multiple influences on student performance. We use this framework to characterize and compare previous student models, analyze their relative accuracy, and propose novel models suggested by gaps in the multi-dimensional space. To illustrate the generative power of this framework, we derive one such model, called HOT-DINA (Higher Order Temporal, Deterministic Input, Noisy-And) and evaluate it on synthetic and real data. We show it predicts student performance better than previous methods, when, and why.
[WSEEG 2014 KT] Xu, Y., Chang, K.-m., Yuan, Y., & Mostow, J. (2014, June 5-6). EEG Helps Knowledge Tracing! In Proceedings of the ITS2014 Workshop on Utilizing EEG Input in Intelligent Tutoring Systems, 43-48. Honolulu.
Abstract: Knowledge tracing (KT) is widely used in Intelligent Tutoring Systems (ITS) to measure student learning. Inexpensive portable electroencephalography (EEG) devices are viable as a way to help detect a number of student mental states relevant to learning, e.g. engagement or attention. In this paper, we combine such EEG measures with KT to improve estimates of the students' hidden knowledge state. We propose two approaches to insert the EEG measured mental states into KT as a way of fitting parameters learn, forget, guess and slip specifically for the different mental states. Both approaches improve the original KT prediction, and one of them outperforms KT significantly.
[WSEEG 2014 toolkit] Yuan, Y., Chang, K.-m., Xu, Y., & Mostow, J. (2014, June 5-6). A Public Toolkit and ITS Dataset for EEG. In Proceedings of the ITS2014 Workshop on Utilizing EEG Input in Intelligent Tutoring Systems, 49-54. Honolulu.
Abstract: We present a data set collected since 2012 containing children’s EEG signals logged during their usage of Project LISTEN’s Reading Tutor. We also present EEG-ML, an integrated machine learning toolkit to preprocess EEG data, extract and select features, train and cross-validate classifiers to predict behavioral labels, and analyze their statistical reliability. To illustrate, we describe and evaluate a classifier to estimate a student’s amount of prior exposure to a given word. We make this dataset and toolkit publicly available to help researchers explore how EEG might improve intelligent tutoring systems.
[LAK 2014] Yuan, Y., Chang, K.-m., Taylor, J. N., & Mostow, J. (2014, March 24-28). Toward Unobtrusive Measurement of Reading Comprehension Using Low-Cost EEG. In Proceedings of the 4th International Conference on Learning Analytics and Knowledge, 54-58. Indianapolis, IN, USA.
Abstract: Assessment of reading comprehension can be costly and obtrusive. In this paper, we use inexpensive EEG to detect reading comprehension of readers in a school environment. We use EEG signals to produce above-chance predictors of student performance on end-of-sentence cloze questions. We also attempt (unsuccessfully) to distinguish among student mental states evoked by distracters that violate either syntactic, semantic, or contextual constraints. In total, this work investigates the practicality of classroom use of inexpensive EEG devices as an unobtrusive measure of reading comprehension.
[IJAIED 2013] Chang, K.-m., Nelson, J., Pant, U., & Mostow, J. (2013). Toward Exploiting EEG Input in a Reading Tutor. International Journal of Artificial Intelligence in Education 22 (1, Special “Best of AIED2011” Issue), 29-41.
Abstract: A new type of sensor for students’ mental states is a single-channel portable EEG headset simple enough to use in schools. To gauge its potential, we recorded its signal from children and adults reading text and isolated words, both aloud and silently. We used this data to train and test classifiers to detect a) when reading is difficult, b) when comprehension is lacking, and c) lexical status and word difficulty. To avoid exploiting the confound of word and sentence difficulty with length, we truncated signals to a uniform duration. The EEG data discriminated reliably better than chance between reading easy and difficult sentences. We found weak but above-chance performance for using EEG to distinguish among easy words, difficult words, pseudo-words, and unpronounceable strings, or to predict correct versus incorrect responses to a comprehension question about the read text. We also identified which EEG components appear sensitive to which lexical features. We found a strong relationship in children between a word’s age-of-acquisition and activity in the Gamma frequency band (30-100 Hz). This pilot study gives hope that a school-deployable EEG device can capture information that might be useful to an intelligent tutor.
[JECR 2013] Mostow, J., Nelson, J., & Beck, J. E. (2013). Computer-Guided Oral Reading versus Independent Practice: Comparison of Sustained Silent Reading to an Automated Reading Tutor that Listens. Journal of Educational Computing Research, 49(2): 249-276.
Abstract: A 7-month study of 178 students in grades 1-4 at two Blue Ribbon schools compared two daily 20-minute treatments. 88 students used the 2000-2001 version of Project LISTEN’s Reading Tutor (www.cs.cmu.edu/~listen) in 10-computer labs, averaging 19 hours over the course of the year. The Reading Tutor served as a computerized implementation of the National Reading Panel’s recommended guided oral reading instruction (NRP, 2000). The Reading Tutor listened to students read aloud, giving spoken and graphical help when it noticed them click for help, make a mistake, or get stuck. Students using the Reading Tutor averaged significantly higher gains across measures of reading ability, especially those involving word level skills (word identification, blending words, and spelling) than their matched classmates who spent that time doing Sustained Silent Reading (SSR) in their classrooms. Additionally, these students trended towards higher gains in fluency and reading comprehension. Overall, use of the Reading Tutor resulted in the types of improvement that would be expected from guided oral reading, but with the benefit of scalability, a problem for human-guided oral reading practice.
[AIED2013 keynote] Mostow, J. (2013). Lessons from Project LISTEN: What Have We Learned from a Reading Tutor that Listens? (Keynote). In Proceedings of the 16th International Conference on Artificial Intelligence in Education, Memphis, TN, Springer-Verlag: 557-558.
Abstract: For 20+ years, Project LISTEN (www.cs.cmu.edu/~listen) has made computers listen to children read aloud, and help them learn to read. Along the way we have learned lessons about children, reading, speech technology, intelligent tutors, educational data mining, and doing AIED research in schools.
[AIED2013 models] Lallé, S., Mostow, J., Luengo, V., & Guin, N. (2013). Comparing Student Models in Different Formalisms by Predicting their Impact on Help Success [Finalist for Best Paper Award]. In Proceedings of the 16th International Conference on Artificial Intelligence in Education, Memphis, TN, Springer-Verlag: 161-170.
Abstract: We describe a method to evaluate how student models affect ITS decision quality – their raison d’être. Given logs of randomized tutorial decisions and ensuing student performance, we train a classifier to predict tutor decision outcomes (success or failure) based on situation features, such as student and task. We define a decision policy that selects whichever tutor action the trained classifier predicts in the current situation is likeliest to lead to a successful outcome. The ideal but costly way to evaluate such a policy is to implement it in the tutor and collect new data, which may require months of tutor use by hundreds of students. Instead, we use historical data to simulate a policy by extrapolating its effects from the subset of randomized decisions that happened to follow the policy. We then compare policies based on alternative student models by their simulated impact on the success rate of tutorial decisions. We test the method on data logged by Project LISTEN’s Reading Tutor, which chooses randomly which type of help to give on a word. We report the cross-validated accuracy of predictions based on four types of student models, and compare the resulting policies’ expected success and coverage. The method provides a utility-relevant metric to compare student models expressed in different formalisms.
[EDM2013 joint] González-Brenes, J. P. & Mostow, J. (2013). What and When do Students Learn? Fully Data-Driven Joint Estimation of Cognitive and Student Models. In Proceedings of the 6th International Conference on Educational Data Mining, S. K. D’Mello, R. A. Calvo and A. Olney, Eds. Memphis, TN, International Educational Data Mining Society: 236-239.
Abstract: We present the Topical Hidden Markov Model method, which infers jointly a cognitive and student model from longitudinal observations of student performance. Its cognitive diagnostic component specifies which items use which skills. Its knowledge tracing component specifies how to infer students' knowledge of these skills from their observed performance. Unlike prior work, it uses no expert-engineered domain knowledge --- yet predicts future student performance in an algebra tutor as accurately as a published expert model.
Abstract: Free-form spoken input would be the easiest and most natural way for young children to communicate to an intelligent tutoring system. However, achieving such a capability poses a challenge both to instruction design and to automatic speech recognition. To address the difficulties of accepting such input, we adopt the framework of predictable response training, which aims at simultaneously achieving linguistic predictability and educational utility. We design instruction in this framework to teach children the reading comprehension strategy of self-questioning. To filter out some misrecognized speech, we combine acoustic confidence with language modeling techniques that exploit the predictability of the elicited responses. Compared to a baseline that does neither, this approach performs significantly better in concept recall (47% vs. 28%) and precision (61% vs. 39%) on 250 unseen utterances from 34 previously unseen speakers. We conclude with some design implications for future speech enabled tutoring systems.
[EDM2013 IRT] Xu, Y. & Mostow, J. (2013). Using Item Response Theory to Refine Knowledge Tracing. In Proceedings of the 6th International Conference on Educational Data Mining, S. K. D’Mello, R. A. Calvo and A. Olney, Eds. Memphis, TN, International Educational Data Mining Society: 356-357.
Abstract: Previous work on knowledge tracing has fit parameters per skill (ignoring differences between students), per student (ignoring differences between skills), or independently for each <student, skill> pair (risking sparse training data and overfitting, and under-generalizing by ignoring overlap of students or skills across pairs). To address these limitations, we first use a higher order Item Response Theory (IRT) model that approximates students’ initial knowledge as their one-dimensional (or low-dimensional) overall proficiency, and combines it with the estimated difficulty and discrimination of each skill to estimate the probability knew of knowing a skill before practicing it. We then fit skill-specific knowledge tracing probabilities for learn, guess, and slip. Using synthetic data, we show that Markov Chain Monte Carlo (MCMC) model-fitting can recover the parameters of this Higher-Order Knowledge Tracing (HO-KT) model. Using real data, we show that HO-KT predicts performance in an algebra tutor better than fitting knowledge tracing parameters per student or per skill.
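The higher-order prior described in the abstract, combining proficiency with skill difficulty and discrimination, has the standard two-parameter logistic (2PL) IRT form; the paper's exact parameterization may differ:

```latex
P(\mathit{knew}_{ik}) \;=\; \sigma\bigl(a_k(\theta_i - b_k)\bigr)
\;=\; \frac{1}{1 + e^{-a_k(\theta_i - b_k)}}
```

where \theta_i is student i's overall proficiency and b_k, a_k are the estimated difficulty and discrimination of skill k. Knowledge tracing then starts each skill from this prior probability and fits skill-specific learn, guess, and slip parameters.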
[JNLE 2013] Liu, L., Mostow, J., & Aist, G. S. (2013). Generating Example Contexts to Help Children Learn Word Meaning. Journal of Natural Language Engineering, 19(2), 187-212. doi: 10.1017/S1351324911000374
Abstract: This article addresses the problem of generating good example contexts to help children learn vocabulary. We describe VEGEMATIC, a system that constructs such contexts by concatenating overlapping five-grams from Google’s N-gram corpus. We propose and operationalize a set of constraints to identify good contexts. VEGEMATIC uses these constraints to filter, cluster, score, and select example contexts. An evaluation experiment compared the resulting contexts against human authored example contexts (e.g., from children’s dictionaries and children’s stories). Based on ratings by an expert blind to source, their average quality was comparable to story sentences, though not as good as dictionary examples. A second experiment measured the percentage of generated contexts rated by lay judges as acceptable, and how long it took to rate them. They accepted only 28% of the examples, but averaged only 27 seconds to find the first acceptable example for each target word. This result suggests that hand-vetting VEGEMATIC’s output may supply example contexts faster than creating them by hand.
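The concatenation mechanism the abstract describes can be illustrated with a toy sketch, not VEGEMATIC itself: a 5-gram whose first four tokens overlap the tail of the current context extends that context by one word. The function name and example data below are invented for illustration.

```python
def extend_context(seed, fivegrams):
    """Toy illustration (not VEGEMATIC itself) of growing an example
    context by concatenating overlapping 5-grams: any 5-gram whose first
    four tokens match the last four tokens of the current context
    extends the context by its fifth word."""
    context = list(seed)
    for gram in fivegrams:
        if context[-4:] == list(gram[:4]):
            context.append(gram[4])
    return ' '.join(context)
```

VEGEMATIC additionally filters, clusters, and scores the candidate contexts built this way, using the constraints described in the article.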
[Tan 2012 MS] Tan, B. H. (2012). Using a Low-cost EEG Sensor to Detect Mental States. M.S. Thesis, Carnegie Mellon University, Pittsburgh, PA.
Abstract: The ability to detect mental states, whether cognitive or affective, would be useful in intelligent tutoring and many other domains. Newly available, inexpensive, single-channel, dry-electrode devices make electroencephalography (EEG) feasible to use outside the lab, for example in schools. Mostow et al.  used such a device to record the EEG of adults and children reading various types of words and easy and hard sentences; the purpose of this experimental manipulation was to induce distinct mental states. They trained classifiers to predict from the reader‘s EEG signal the type of the text read. The classifiers achieved better than chance accuracy despite the simplicity of the device and machine learning techniques employed.
Their work serves as a pilot study for this thesis and provides the data set for all analyses in this work. This thesis further explores the properties and temporal structure of the EEG signal with the aim of improving the accuracy of detecting mental states. The EEG signals associated with the word stimuli are analyzed for the existence of event-related potentials (ERPs) that could distinguish the word type, which in turn could be exploited in classification. The EEG signals for the sentence stimuli are subjected to various feature extraction methods and temporal manipulations. This thesis demonstrates the potential of exploiting the temporal structure of EEG signals in detecting mental states with low-cost devices.
[SSSR 2012] Mostow, J., Nelson, J., Kantorzyk, M., Gates, D., & Valeri, J. (2012, July 11-14). How does the amount of context in which words are practiced affect fluency growth? Experimental results. Talk presented at the Nineteenth Annual Meeting of the Society for the Scientific Study of Reading, Montreal, Canada.
Purpose: Previous studies have shown that practice reading words in running text transfers better to reading them in new text than practicing them in isolation does. Why? I.e., which cognitive processes while reading words in connected text improve their transfer to reading them fluently in new text? What context allows these processes? Seeing a word bigram permits parafoveal lookahead at the next word. Seeing a complete phrase allows syntactic parsing. Seeing a complete sentence allows (intra-sentential) comprehension.
Method: Project LISTEN’s automated Reading Tutor administers a randomized, within-subject, within-story experiment. Before a new story, it previews the five longest story words the child has not previously encountered in the Reading Tutor. The child reads one word in isolation, one in a bigram, one in a phrase, one in a sentence, and one not at all, as a no-exposure control. The outcome variable for each of these 5 matched trials is the time to read the word at the first encounter in the story itself.
Results: Preliminary data suggested a trend favoring the sentence condition over the no-exposure control. Data now being logged should help resolve differences among the 5 types of practice.
Conclusions: The results of this experiment should clarify the relative value of practicing words in differing amounts of context, their cost effectiveness in view of the extra time for longer contexts, and the theoretical implications for the role of word context in fluency development.
[EDM 2012 LR-DBN] Xu, Y., & Mostow, J. (2012, June 19-21). Comparison of methods to trace multiple subskills: Is LR-DBN best? [Best Student Paper Award]. In Proceedings of the Fifth International Conference on Educational Data Mining, Chania, Crete, Greece.
Abstract: A long-standing challenge for knowledge tracing is how to update estimates of multiple subskills that underlie a single observable step. We characterize approaches to this problem by how they model knowledge tracing, fit its parameters, predict performance, and update subskill estimates. Previous methods allocated blame or credit among subskills in various ways based on strong assumptions about their relation to observed performance. LR-DBN relaxes these assumptions by using logistic regression in a Dynamic Bayes Net. LR-DBN significantly outperforms previous methods on data sets from reading and algebra tutors in terms of predictive accuracy on unseen data, cutting the error rate by half. An ablation experiment shows that using logistic regression to predict performance helps, but that using it to jointly estimate subskills explains most of this dramatic improvement. An implementation of LR-DBN is now publicly available in the BNT-SM student modeling toolkit.
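The core modeling idea above, logistic regression inside the observation node of a Dynamic Bayes Net, can be sketched as follows. This is a simplified illustration of the observation model only (the full LR-DBN also traces each subskill's hidden state over time), and the weights and bias are hypothetical fitted values.

```python
import math

def p_correct(subskill_states, weights, bias=0.0):
    """LR-DBN-style observation model (simplified sketch): the probability
    of a correct step is a logistic function of the hidden subskill states
    involved in the step, rather than a product of per-subskill
    guess/slip terms. weights and bias are hypothetical fitted values."""
    z = bias + sum(w * k for w, k in zip(weights, subskill_states))
    return 1.0 / (1.0 + math.exp(-z))
```

Because each subskill contributes an additive weighted term, blame and credit for an observed step are shared gracefully among subskills instead of being allocated by hard assumptions.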
[EDM 2012 DCT] González-Brenes, J. P., & Mostow, J. (2012, June 19-21). Dynamic Cognitive Tracing: Towards Unified Discovery of Student and Cognitive Models. In Proceedings of the Fifth International Conference on Educational Data Mining, Chania, Crete, Greece.
Abstract: This work describes a unified approach to two problems previously addressed separately in Intelligent Tutoring Systems: (i) Cognitive Modeling, which factorizes problem solving steps into the latent set of skills required to perform them; and (ii) Student Modeling, which infers students' learning by observing student performance. The practical importance of improving understanding of how students learn is to build better intelligent tutors. The expected advantages of our integrated approach include (i) more accurate prediction of a student's future performance, and (ii) clustering items into skills automatically, without expensive manual expert knowledge annotation. We introduce a unified model, Dynamic Cognitive Tracing, to explain student learning in terms of skill mastery over time, by learning the Cognitive Model and the Student Model jointly. We formulate our approach as a graphical model, and we validate it using sixty different synthetic datasets. Dynamic Cognitive Tracing significantly outperforms single-skill Knowledge Tracing on predicting future student performance.
[ISADEPT 2012] Mostow, J. (2012, June 6-8). Why and How Our Automated Reading Tutor Listens. In International Symposium on Automatic Detection of Errors in Pronunciation Training (ISADEPT), 43-52. KTH, Stockholm, Sweden.
Abstract: Project LISTEN’s Reading Tutor listens to children read aloud, and helps them learn to read. This paper outlines how it gives feedback, how it uses ASR, and how we measure its accuracy. It describes how we model various aspects of oral reading, some ideas we tried, and lessons we have learned about acoustic models, lexical models, confidence scores, language models, alignment methods, and prosodic models.
[BEA 2012] Mostow, J., & Jang, H. (2012, June 7). Generating Diagnostic Multiple Choice Comprehension Cloze Questions. In NAACL-HLT 7th Workshop on Innovative Use of NLP for Building Educational Applications, Montréal, Canada.
Abstract: This paper describes and evaluates DQGen, which automatically generates multiple choice cloze questions to test a child’s comprehension while reading a given text. Unlike previous methods, it generates different types of distracters designed to diagnose different types of comprehension failure, and tests comprehension not only of an individual sentence but of the context that precedes it. We evaluate the quality of the overall questions and the individual distracters, according to 8 human judges blind to the correct answers and intended distracter types. The results, errors, and judges’ comments reveal limitations and suggest how to address some of them.
[NAACL 2012 EEG] Chen, Y.-N., Chang, K.-M., & Mostow, J. (2012, June 3-8). Towards Using EEG to Improve ASR Accuracy. In The 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Montréal, Canada.
Abstract: We report on a pilot experiment to improve the performance of an automatic speech recognizer (ASR) by using a single-channel EEG signal to classify the speaker’s mental state as reading easy or hard text. We use a previously published method (Mostow et al., 2011) to train the EEG classifier. We use its probabilistic output to control weighted interpolation of separate language models for easy and difficult reading. The EEG-adapted ASR achieves higher accuracy than two baselines. We analyze how its performance depends on EEG classification accuracy. This pilot result is a step towards improving ASR more generally by using EEG to distinguish mental states.
[FLAIRS 2012 prosody] Sitaram, S., & Mostow, J. (2012, May 23-25). Mining Data from Project LISTEN’s Reading Tutor to Analyze Development of Children's Oral Reading Prosody [Best Paper Award]. In Proceedings of the 25th Florida Artificial Intelligence Research Society Conference (FLAIRS-25), 478-483. Marco Island, Florida.
Abstract: Reading tutors can provide an unprecedented opportunity to collect and analyze large amounts of data for understanding how students learn. We trained models of oral reading prosody (pitch, intensity, and duration) on a corpus of narrations of 4558 sentences by 11 fluent adults. We used these models to evaluate the oral reading prosody of 85,209 sentences read by 55 children (mostly) 7-10 years old who used Project LISTEN's Reading Tutor during the 2005-2006 school year. We mined the resulting data to pinpoint the specific common syntactic and lexical features of text that children scored best and worst on. These features predict their fluency and comprehension test scores and gains better than previous models. Focusing on these features may help human or automated tutors improve children’s fluency and comprehension more effectively.
[FLAIRS 2012 tracking] Li, Y., & Mostow, J. (2012, May 23-25). Evaluating and improving real-time tracking of children’s oral reading. In Proceedings of the 25th Florida Artificial Intelligence Research Society Conference (FLAIRS-25), 488-491. Marco Island, Florida.
Abstract: The accuracy of an automated reading tutor in tracking the reader’s position is affected by phenomena at the frontier of the speech recognizer’s output as it evolves in real time. We define metrics of real-time tracking accuracy computed from the recognizer’s successive partial hypotheses, in contrast to previous metrics computed from the final hypothesis. We analyze the resulting considerable loss in real-time accuracy, and propose and evaluate a method to address it. Our method raises real-time accuracy from 58% to 70%, which should improve the quality of the tutor’s feedback.
[FLAIRS 2012 subskills] Xu, Y., & Mostow, J. (2012, May 23-25). Extending a Dynamic Bayes Net Toolkit to Trace Multiple Subskills. In Proceedings of the 25th Florida Artificial Intelligence Research Society Conference (FLAIRS-25), 574. Marco Island, Florida.
Abstract: Dynamic Bayesian Nets (DBNs) provide a powerful representation to 1) model the relationships between students’ evolving knowledge and behavior in an intelligent tutoring system, and 2) infer changes in a student’s hidden knowledge from the student’s observed sequential steps. Chang et al. (2006) introduced a Matlab tool called BNT-SM, which inputs a concise specification of a DBN and uses the Bayes Net Toolbox (BNT) (Murphy 2001) to generate Matlab code to train and test the DBN. The input DBN specification, expressed in XML, is a fraction of the size of the generated output, thereby sparing researchers considerable coding.
However, the DBNs represented by BNT-SM did not model steps that involve multiple subskills. To overcome this limitation, LR-DBN (Xu and Mostow 2011b) uses logistic regression in DBNs to trace multiple subskills. As reported at EDM2011 (Xu and Mostow 2011b, 2011a), LR-DBN fits student performance data significantly better than previous methods, with only half as many prediction errors on unseen data.
Therefore we have extended BNT-SM to make LR-DBN available to researchers in easy-to-use form. Compared to implementing an LR-DBN model directly in BNT, implementing it in BNT-SM now requires substantially less user effort and code. For example, the simplest LR-DBN model uses logistic regression in Knowledge Tracing (Corbett and Anderson 1995). Implementing it directly in BNT required 86 lines of code. In contrast, implementing it in BNT-SM needs only half as many lines of XML to specify its structure and parameters.
[EACL 2012] Jang, H., & Mostow, J. (2012, April 23-27). Inferring Selectional Preferences from Part-of-Speech N-grams. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 377-386. Avignon, France.
Abstract: We present the PONG method to compute selectional preferences using part-of-speech (POS) N-grams. From a corpus labeled with grammatical dependencies, PONG learns the distribution of word relations for each POS N-gram. From the much larger but unlabeled Google N-grams corpus, PONG learns the distribution of POS N-grams for a given pair of words. We derive the probability that one word has a given grammatical relation to the other. PONG estimates this probability by combining both distributions, whether or not either word occurs in the labeled corpus. PONG achieves higher average precision on 16 relations than a state-of-the-art baseline in a pseudo-disambiguation task, but lower coverage and recall.
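A natural reading of how PONG combines the two distributions is marginalization over POS N-grams g. This is a sketch of that reading, not necessarily the paper's exact derivation, which may include additional smoothing or normalization terms:

```latex
P(r \mid w_1, w_2) \;\approx\; \sum_{g} P(r \mid g)\, P(g \mid w_1, w_2)
```

where r is a grammatical relation, P(r | g) is learned from the dependency-labeled corpus, and P(g | w_1, w_2) is learned from the unlabeled Google N-grams corpus, so the estimate is available even for word pairs absent from the labeled corpus.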
[Chen 2012 PhD] Chen, W. (2012). Detecting Off-task Speech. PhD thesis, Carnegie Mellon University, Pittsburgh, PA.
Abstract: Off-task speech is speech that strays away from an intended task. It occurs in many dialog applications, such as intelligent tutors, virtual games, health communication systems and human-robot cooperation. Off-task speech input to computers presents both challenges and opportunities for such dialog systems. On the one hand, off-task speech contains informal conversational style and potentially unbounded scope that hamper accurate speech recognition. On the other hand, an automated agent capable of detecting off-task speech could track users’ attention and thereby maintain the intended conversation by bringing a user back on task; also, knowledge of where off-task speech events are likely to occur can help the analysis of automatic speech recognition (ASR) errors. Related work has been done in confidence measures for dialog systems and detecting out-of-domain utterances. However, there is a lack of systematic study on the type of off-task speech being detected and generality of features capturing off-task speech. In addition, we know of no published research on detecting off-task speech in children’s interactions with an automated agent. The goal of this research is to fill in these blanks to provide a systematic study of off-task speech, with an emphasis on child-machine interactions.
To characterize off-task speech quantitatively, we used acoustic features to capture its speaking style; we used lexical features to capture its linguistic content; and we used contextual features to capture the relation of off-task speech to nearby utterances. Using these features, we trained an off-task speech detector that yielded 87% detection rate at a cost of 10% false positives on children’s oral reading. Furthermore, we studied the generality of these types of features by detecting off-task speech in data from four tutorial tasks ranging from oral reading to prompted free-form responses. In addition, we examined how the features help detect adults’ off-task speech in data from the CMU Let’s Go bus information system. We show that lexical features detect more task-related off-task speech such as complaints about the system, whereas acoustic features detect more unintelligible speech and non-speech events such as mumbling and humming. Moreover, acoustic features tend to be more robust than lexical features when switching domains. Finally, we demonstrate how off-task speech detection can improve the performance on application-relevant metrics such as predicting fluency test scores in oral reading and understanding utterances in the CMU Let’s Go bus information system.
[QG 2011 keynote] Mostow, J. (2011). Questions and answers about questions and answers: Lessons from generating, scoring, and analyzing questions in a reading tutor for children, Invited talk at AAAI Symposium on Question Generation. Arlington, VA, USA. Click here for .pdf file.
Abstract: This talk reviewed attempts over the years to automatically generate and administer questions in Project LISTEN's Reading Tutor; to log, score, and analyze children's responses; and to evaluate such questions. How can a tutor use questions? How can it generate them? How can it score children's responses? What can they tell us? How can we evaluate questions? Which questions succeeded or failed?
[QG 2011 cloze] Gates, D., Aist, G., Mostow, J., McKeown, M., & Bey, J. (2011, November 4-6). How to Generate Cloze Questions from Definitions: A Syntactic Approach. Proceedings of the AAAI Symposium on Question Generation, Arlington, VA. Click here for .pdf file.
Abstract: This paper discusses the implementation and evaluation of automatically generated cloze questions in the style of the definitions found in Collins COBUILD English language learner’s dictionary. The definitions and the cloze questions are used in an automated reading tutor to help second and third grade students learn new vocabulary. A parser provides syntactic phrase structure trees for the definitions. With these parse trees as input, a pattern matching program uses a set of syntactic patterns to extract the phrases that make up the cloze question answers and distracters.
[QG 2011 evaluate] Chen, W., Mostow, J., & Aist, G. (2011, November 4-6). Using Automatic Question Generation to Evaluate Questions Generated by Children. Proceedings of the AAAI Symposium on Question Generation, Arlington, VA. Click here for .pdf file.
Abstract: This paper shows that automatically generated questions can help classify children’s spoken responses to a reading tutor teaching them to generate their own questions. We use automatic question generation to model and classify children’s prompted spoken questions about stories. On distinguishing complete and incomplete questions from irrelevant speech and silence, a language model built from automatically generated questions out-performs a trigram language model that does not exploit the structure of questions.
[Interspeech 2011] Chen, W., & Mostow, J. (2011, August 28-31). A Tale of Two Tasks: Detecting Children’s Off-Task Speech in a Reading Tutor. Interspeech: Proceedings of the 12th Annual Conference of the International Speech Communication Association, Florence, Italy. Click here for .pdf file.
Abstract: How can an automated tutor detect children’s off-task utterances? To answer this question, we trained SVM classifiers on a corpus of 495 children’s 36,492 computer-assisted oral reading utterances. On a test set of 651 utterances by 10 held-out readers, the classifier correctly detected 88% of off-task utterances and misclassified 17% of on-task utterances as off-task. As a test of generality, we applied the same classifier to 20 children’s 410 responses to vocabulary questions. The classifier detected 84% of off-task utterances but misclassified 57% of on-task utterances. Acoustic and lexical features helped detect off-task speech in both tasks.
[SLaTE 2011 tracking] Rasmussen, M. H., Mostow, J., Tan, Z.-H., Lindberg, B., & Li, Y. (2011, August 24-26). Evaluating Tracking Accuracy of an Automatic Reading Tutor, SLaTE: ISCA (International Speech Communication Association) Special Interest Group (SIG) Workshop on Speech and Language Technology in Education. Venice, Italy. Click here for .pdf file.
Abstract: In automatic reading tutoring, tracking is the process of automatically following a reader through a given target text. When developing tracking algorithms, a measure of the tracking accuracy – how often a spoken word is aligned to the right target text word position – is needed in order to evaluate performance and compare different algorithms. This paper presents a framework for determining the perceived tracking error rate. The proposed framework is used to evaluate three tracking strategies: A) follow the reader to whichever word he/she jumps to in the text, B) follow the reader monotonically from left to right ignoring word skips and regressions (going back to a previous text word), and C) the same as B but allowing isolated word skips. Perceived tracking error rate for each of the three tracking strategies is: A: 53%, B: 56%, and C: 47%, on 1883 utterances from 25 children.
[SLaTE 2011 prosody] Sitaram, S., Mostow, J., Li, Y., Weinstein, A., Yen, D., & Valeri, J. (2011, August 24-26). What visual feedback should a reading tutor give children on their oral reading prosody?, SLaTE: ISCA (International Speech Communication Association) Special Interest Group (SIG) Workshop on Speech and Language Technology in Education. Venice, Italy. Click here for .pdf file.
Abstract: An automated reading tutor that models and evaluates children's oral reading prosody should also be able to respond dynamically with feedback they like, understand, and benefit from. We describe visual feedback that Project LISTEN's Reading Tutor generates in realtime by mapping prosodic features of children's oral reading to dynamic graphical features of displayed text. We present results from preliminary usability studies of 20 children aged 7-10. We also describe an experiment to test whether such visual feedback elicits oral reading that more closely matches the prosodic contours of adult narrations. Effective feedback on prosody could help children become fluent, expressive readers.
[TSLP 2011 prosody] Duong, M., Mostow, J., & Sitaram, S. (2011). Two Methods for Assessing Oral Reading Prosody. ACM Transactions on Speech and Language Processing (Special Issue on Speech and Language Processing of Children’s Speech for Child-machine Interaction Applications), 7(4), 14:11-22. Click here for .pdf file.
Abstract: We compare two types of models to assess the prosody of children’s oral reading. Template models measure how well the child’s prosodic contour in reading a given sentence correlates in pitch, intensity, pauses, or word reading times with an adult narration of the same sentence. We evaluate template models directly against a common rubric used to assess fluency by hand, and indirectly by their ability to predict fluency and comprehension test scores and gains of 10 children who used Project LISTEN’s Reading Tutor; the template models outpredict the human assessment.
We also use the same set of adult narrations to train generalized models for mapping text to prosody, and use them to evaluate children’s prosody. Using only durational features for both types of models, the generalized models perform better at predicting fluency and comprehension posttest scores of 55 children ages 7-10, with adjusted R2 of 0.6. Such models could help teachers identify which students are making adequate progress. The generalized models have the additional advantage of not requiring an adult narration of every sentence.
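The template-model idea above — scoring a child's reading of a sentence by how well its prosodic contour correlates with an adult narration of the same sentence — might be sketched like this, using word durations only. This is a minimal illustration under stated assumptions; the function names and the duration values are hypothetical, not the paper's implementation.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

def template_score(child_durations, adult_durations):
    """Score a child's reading of one sentence by correlating per-word
    reading times against an adult narration of the same sentence."""
    return pearson(child_durations, adult_durations)

# Hypothetical seconds-per-word for a five-word sentence:
adult = [0.30, 0.25, 0.60, 0.28, 0.75]
fluent_child = [0.35, 0.30, 0.70, 0.33, 0.90]   # tracks the adult contour
choppy_child = [0.80, 0.85, 0.78, 0.82, 0.80]   # flat, word-by-word pacing
```

A reading whose timing tracks the adult's contour scores near 1; flat word-by-word pacing scores much lower. The paper's template models extend this idea to pitch, intensity, and pauses, and aggregate scores over many sentences.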
[TSLP 2011 dialogue] González-Brenes, J. P., & Mostow, J. (2011). Classifying Dialogue in High-Dimensional Space. ACM Transactions on Speech and Language Processing (Special Issue on Machine Learning for Adaptivity in Dialogue Systems), 7(3), 8:1-15. Click here for .pdf file.
Abstract: The richness of multimodal dialogue makes the space of possible features required to describe it very large relative to the amount of training data. However, conventional classifier learners require large amounts of data to avoid over-fitting, or do not generalize well to unseen examples. To learn dialogue classifiers using a rich feature set and fewer data points than features, we apply a recent technique, L1-regularized logistic regression. We demonstrate this approach empirically on real data from Project LISTEN's Reading Tutor, which displays a story on a computer screen and listens to a child read aloud. We train a classifier to predict task completion (i.e., whether the student will finish reading the story) with 71% accuracy on a balanced, unseen test set. To characterize differences in the behavior of children when they choose the story they read, we likewise train and test a classifier that with 73.6% accuracy infers who chose the story based on the ensuing dialogue. Both classifiers significantly outperform baselines and reveal relevant features of the dialogue.
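The core technique here — L1-regularized logistic regression, which fits a sparse classifier even when features outnumber training examples — can be sketched with a tiny proximal-gradient implementation. This is purely illustrative, on synthetic data; it does not reproduce the paper's features, data, or solver.

```python
import math
import random

def train_l1_logreg(X, y, lam=0.1, lr=0.1, epochs=1500):
    """L1-regularized logistic regression via proximal gradient descent.
    The soft-thresholding step drives the weights of uninformative
    features to exactly zero, yielding a sparse, interpretable model."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        # Gradient of the average logistic loss.
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        # Gradient step, then soft-threshold (proximal operator of lam*||w||_1).
        for j in range(d):
            wj = w[j] - lr * grad[j]
            w[j] = math.copysign(max(abs(wj) - lr * lam, 0.0), wj)
    return w

# Synthetic data: only feature 0 predicts the label; the other 8 are noise.
random.seed(0)
X, y = [], []
for _ in range(40):
    x0 = random.choice([-1.0, 1.0])
    X.append([x0] + [random.uniform(-1, 1) for _ in range(8)])
    y.append(1 if x0 > 0 else 0)
w = train_l1_logreg(X, y)
```

On this data the learned weight vector concentrates on feature 0 and zeroes out most of the noise features, which is how the paper's classifiers can "reveal relevant features of the dialogue" despite having fewer data points than features.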
[SSSR 2011 prosody] Mostow, J., & Sitaram, S. (2011, July 13-16). Mining data from Project LISTEN's Reading Tutor to analyze development of children’s oral reading prosody. In D. Compton (Ed.), Eighteenth Annual Meeting of the Society for the Scientific Study of Reading. St. Pete Beach, Florida.
Abstract: Purpose: How does children's oral reading prosody develop over time? Miller & Schwanenflugel (2008) investigated this question by meticulously analyzing children’s readings of a short text passage in grades 1 and 2 and comparing their prosody to fluent adult readings of the same passage, based on the observation that similarity to adult prosody is a good measure of children’s oral reading expressiveness.
Method: We implemented an automated, scaled-up version of this approach by training models of oral reading pitch, intensity, and duration on a corpus of narrations of 12408 sentences by 11 fluent adult narrators. We used these models to evaluate the oral reading prosody of 77693 sentences read by 205 children (mostly) 7-10 years old who used Project LISTEN's Reading Tutor during the 2005-2006 school year.
Results: The models assess a child’s prosodic contour for a given sentence by estimating the probability that an adult would generate it, based on syntactic and other features of the words. Aggregating such evaluations over multiple children and sentences makes it possible to pinpoint specific common prosodic deficits in children’s oral reading.
Conclusions: Mining this corpus provides an unprecedented opportunity to explore, at a large scale and fine grain size, which detailed prosodic characteristics of fluent adult reading children lack most, which they learn fastest, and in what order.
[SSSR 2011 gaze] Nelson, J. (2011). Individual differences in reading skills and experience are reflected in eye movements during reading. In D. Compton (Ed.), Eighteenth Annual Meeting of the Society for the Scientific Study of Reading. St. Pete Beach, Florida.
Abstract: Purpose: The goal of this study was to characterize how individual differences in reading skills and experience are reflected in patterns of reading behavior, as measured by the eye movement record.
Method: A factor analysis of a large database of questionnaire and reading test scores enabled identification of five major dimensions of individual variability: expertise (speed and experience), sublexical skills, accuracy focus (reading at a pace that enables high comprehension), learning/memory, and amount of casual reading done. Thirty-five adult participants with scores along each of these dimensions read paragraphs while their eye movements were monitored.
Results: A mixed-effects linear regression analysis revealed that experienced readers demonstrated more efficient reading behavior than less-experienced readers, especially for low frequency words and words with high frequency neighbors. This pattern reduced the word frequency effect and increased the neighborhood frequency effect for these readers. Readers with good sublexical skills also showed more efficient reading behaviors than those with poorer sublexical skills, including shorter first fixation durations for high frequency words. Main effects of accuracy focus and interactions between expertise and accuracy focus were also found.
Conclusions: The pattern of findings suggests that individual differences in skills, knowledge, and strategy are evident in the eye movement record. There are not only predictable main effects reflecting reading speed differences, but also interactions with lexical properties: Reading expertise especially strengthens knowledge of low frequency words, whereas sublexical skill benefits the rapid identification of high frequency words. The importance of reading practice and sublexical skills persists into adult reading.
[EDM 2011 dialogue] González-Brenes, J., Duan, W., & Mostow, J. (2011, July 6-8). How to Classify Tutorial Dialogue? Comparing Feature Vectors vs. Sequences. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, & J. Stamper (Eds.), Proceedings of the 4th International Conference on Educational Data Mining (pp. 169-178). Eindhoven, Netherlands. Click here for .pdf file.
Abstract: A key issue in using machine learning to classify tutorial dialogues is how to represent time-varying data. Standard classifiers input a feature vector and output its predicted label. It is possible to formulate tutorial dialogue classification problems in this way. However, a feature vector representation requires mapping a dialogue onto a fixed number of features, and does not innately exploit its sequential nature. In contrast, this paper explores a recent method that classifies sequences, using a technique new to the Educational Data Mining community – Hidden Conditional Random Fields (Quattoni, Wang et al. 2007). We illustrate its application to a data set from Project LISTEN's Reading Tutor, and compare it to three baselines using the same data, cross-validation splits, and feature set. Our technique produces state-of-the-art classification accuracy in predicting reading task completion. We consider the contributions of this paper to be (i) introducing HCRFs to the EDM community, (ii) formulating tutorial dialogue classification as a sequence classification problem, and (iii) evaluating and comparing dialogue classification.
[EDM 2011 AutoCord] Mostow, J., González-Brenes, J., & Tan, B. H. (2011, July 6-8). Learning Classifiers from a Relational Database of Tutor Logs. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, & J. Stamper (Eds.), Proceedings of the 4th International Conference on Educational Data Mining (pp. 149-158). Eindhoven, Netherlands. Click here for .pdf file.
Abstract: A bottleneck in mining tutor data is mapping heterogeneous event streams to feature vectors with which to train and test classifiers. To bypass the labor-intensive process of feature engineering, AutoCord learns classifiers directly from a relational database of events logged by a tutor. It searches through a space of classifiers represented as database queries, using a small set of heuristic operators. We show how AutoCord learns a classifier to predict whether a child will finish reading a story in Project LISTEN’s Reading Tutor. We compare it to a previously reported classifier that uses hand-engineered features. AutoCord has the potential to learn classifiers with less effort and greater accuracy.
[EDM 2011 parameterization] Mostow, J., Xu, Y., & Munna, M. (2011, July 6-8). Desperately Seeking Subscripts: Towards Automated Model Parameterization. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, & J. Stamper (Eds.), Proceedings of the 4th International Conference on Educational Data Mining (pp. 283-287). Eindhoven, Netherlands. Click here for .pdf file.
Abstract: This paper addresses the laborious task of specifying parameters within a given model of student learning. For example, should the model treat the probability of forgetting a skill as a theory-determined constant? As a single empirical parameter to fit to data? As a separate parameter for each student, or for each skill? We propose a generic framework to represent and mechanize this decision process as a heuristic search through a space of alternative parameterizations. Even partial automation of this search could ease researchers’ burden of developing models by hand. To test the framework’s generality, we apply it to two modeling formalisms – a dynamic Bayes net and learning decomposition – and compare how well they model the growth of children’s oral reading fluency.
[EDM 2011 compare] Xu, Y., & Mostow, J. (2011, July 6-8). Logistic Regression in a Dynamic Bayes Net Models Multiple Subskills Better! [Best Poster Nominee]. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, & J. Stamper (Eds.), Proceedings of the 4th International Conference on Educational Data Mining (pp. 337-338). Eindhoven, Netherlands. Click here for .pdf file.
Abstract: A single student step in an intelligent tutor may involve multiple subskills. Conventional approaches sidestep this problem, model the step as using only its least known subskill, or treat the subskills as necessary and probabilistically independent. In contrast, we use logistic regression in a Dynamic Bayes Net (LR-DBN) to trace the multiple subskills. We compare these three types of models on a published data set from a cognitive tutor. LR-DBN fits the data significantly better, with only half as many prediction errors on unseen data.
[EDM 2011 subskills] Xu, Y., & Mostow, J. (2011, July 6-8). Using Logistic Regression to Trace Multiple Subskills in a Dynamic Bayes Net. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, & J. Stamper (Eds.), Proceedings of the 4th International Conference on Educational Data Mining (pp. 241-245). Eindhoven, Netherlands. Click here for .pdf file.
Abstract: A challenge in estimating students' changing knowledge from sequential observations of their performance arises when each observed step involves multiple subskills. To overcome this mismatch in grain size between modelled skills and observed actions, we use logistic regression over each step's subskills to model transition probabilities for the overall knowledge the step requires. We are evaluating how well such models fit tutor data compared to alternative approaches. Unlike representing each combination of subskills as an independent skill, this approach can trace knowledge of the individual subskills.
[AIED 2011] Mostow, J., Chang, K.-m., & Nelson, J. (2011, June 28 - July 2). Toward Exploiting EEG Input in a Reading Tutor [Best Paper Nominee]. Proceedings of the 15th International Conference on Artificial Intelligence in Education, Auckland, NZ, 230-237. Click here for .pdf file. Click here for Powerpoint presentation (.pptx file).
Abstract: A new type of sensor for students’ mental states is a single-channel EEG headset simple enough to use in schools. Using its signal from adults and children reading text and isolated words, both aloud and silently, we train and test classifiers to tell easy from hard sentences, and to distinguish among easy words, hard words, pseudo-words, and unpronounceable strings. We also identify which EEG components appear sensitive to which lexical features. Better-than-chance performance shows promise for tutors to use EEG at school.
[BEA 2011] Mostow, J., & Duan, W. (2011, June 24). Generating Example Contexts to Illustrate a Target Word Sense. Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, Portland, OR, 105-110. Click here for .pdf file.
Abstract: Learning a vocabulary word requires seeing it in multiple informative contexts. We describe a system to generate such contexts for a given word sense. Rather than attempt to do word sense disambiguation on example contexts already generated or selected from a corpus, we compile information about the word sense into the context generation process. To evaluate the sense-appropriateness of the generated contexts compared to WordNet examples, three human judges chose which word sense(s) fit each example, blind to its source and intended sense. On average, one judge rated the generated examples as sense-appropriate, compared to two judges for the WordNet examples. Although the system’s precision was only half of WordNet’s, its recall was actually higher than WordNet’s, thanks to covering many senses for which WordNet lacks examples.
[SIGDial 2011] González-Brenes, J. P., & Mostow, J. (2011, June 17-18). What System Differences Matter? Using L1/L2 Regularization to Compare Dialogue Systems [Best Paper Nominee]. SIGDial: 12th annual SIGdial Meeting on Discourse and Dialogue, Portland, OR. Click here for .pdf file.
Abstract: We investigate how to jointly explain the performance and behavioral differences of two spoken dialogue systems. We propose two algorithms; the simpler one is just a scaffold for the Joint Evaluation and Differences Identification (JEDI) algorithm. JEDI finds differences between systems that relate to performance by formulating the problem as a multi-task feature selection question. JEDI provides evidence on the usefulness of a recent method based on L1 regularization (Obozinski et al., 2007). We evaluate our approaches against manually annotated success criteria from real users interacting with five different spoken user interfaces that give bus schedule information.
[YRRSDS 2011] González-Brenes, J. P. (2011, June 15-16). Position paper. YRRSDS: 7th Young Researchers' Roundtable on Spoken Dialogue Systems, Portland, OR. Click here for .pdf file.
* [DEV 2010] Weber, F., & Bali, K. (2010). Enhancing ESL Education in India with a Reading Tutor that Listens, Proceedings of the First ACM Symposium on Computing for Development (pp. 20:21-29). London, United Kingdom: ACM. Click here for .pdf file at portal.acm.org.
Abstract: We report results of a 2 ½-month pilot study of Project LISTEN’s PC-based Reading Tutor program for enhancing English education in India. Our focus was on low-income elementary school students, a population that has little or no exposure to English outside of school. The students showed measurable improvement on quantitative tests of reading fluency while using the tutor. Post-pilot interviews explored the students’ experience of the reading tutor. Further, a survey of educational programs gives a picture of the wide range of institutions providing training in English in and around Bangalore to low-income populations. Each has associated infrastructure, personnel, and curricular constraints that would be faced by interventions like the reading tutor, even if it can be shown to be effective. The perceived advantages of literacy software and associated measures of success also vary by program.
[EDM 2010 handbook] Mostow, J., Beck, J., Cuneo, A., Gouvea, E., Heiner, C., & Juarez, O. (2010). Lessons from Project LISTEN's Session Browser. In C. Romero, S. Ventura, S. R. Viola, M. Pechenizkiy, & R. S. J. d. Baker (Eds.), Handbook of Educational Data Mining, 389-416: Taylor & Francis Group. Click here for .pdf file.
Abstract: A basic question in mining data from an intelligent tutoring system is, “What happened when…?” A tool to answer such questions should let the user specify which phenomena to explore; find instances of them; summarize them in human-understandable form; explore the context where they occurred; dynamically drill down and adjust which details to display; support manual annotation; and require minimal effort to adapt to new tutor versions, new users, new phenomena, or other tutors.
This chapter describes the Session Browser, an educational data mining tool that supports such case analysis by exploiting three simple but powerful ideas. First, logging tutorial interaction directly to a suitably designed and indexed database instead of to log files eliminates the need to parse them and supports immediate efficient access. Second, a student, computer, and time interval together suffice to identify a tutorial event. Third, a containment relation between time intervals defines a hierarchical structure of tutorial interactions. Together, these ideas make it possible to implement a flexible, efficient tool to browse tutor data in understandable form yet with minimal dependency on tutor-specific details.
We illustrate how we have used the Session Browser with MySQL databases of millions of events logged by successive versions of Project LISTEN’s Reading Tutor. We describe tasks we have used it for, improvements made, and lessons learned in the years since the first version of the Session Browser [1-3].
[ITID 2010] Korsah, G. A., Mostow, J., Dias, M. B., Sweet, T. M., Belousov, S. M., Dias, M. F., & Gong, H. (2010). Improving Child Literacy in Africa: Experiments with an Automated Reading Tutor. Information Technologies and International Development, 6(2), 1-19. Click here for .pdf file.
Abstract: This paper describes Project Kané, a research endeavor aimed at exploring the role that technology can play in improving child literacy in developing communities. An initial pilot study and a subsequent four-month-long controlled field study in Ghana investigated the viability and effectiveness of an automated reading tutor in helping urban children enhance their reading skills in English. In addition to quantitative data suggesting that automated tutoring can be useful for some children in this setting, these studies and an additional preliminary pilot study in Zambia yielded useful qualitative observations regarding the feasibility of applying technology solutions to the challenge of enhancing child literacy in developing communities. This paper presents the findings, observations, and lessons learned from the field studies.
[ITS 2010 ASR] Chen, W., Mostow, J., & Aist, G. (2010, June 14-18). Exploiting Predictable Response Training to Improve Automatic Recognition of Children's Spoken Questions. Proceedings of the Tenth International Conference on Intelligent Tutoring Systems (ITS2010), Pittsburgh, PA, 55-64. © Springer-Verlag. Click here for .pdf file. The original publication is available at www.springerlink.com.
Abstract: The unpredictability of spoken responses by young children (6-7 years old) makes them problematic for automatic speech recognizers. Aist and Mostow proposed predictable response training to improve automatic recognition of children’s free-form spoken responses. We apply this approach in the context of Project LISTEN’s Reading Tutor to the task of teaching children an important reading comprehension strategy, namely to make up their own questions about text while reading it. We show how to use knowledge about strategy instruction and the story text to generate a language model that predicts questions spoken by children during comprehension instruction. We evaluated this model on a previously unseen test set of 18 utterances totaling 137 words spoken by 11 second grade children in response to prompts the Reading Tutor inserted as they read. Compared to using a baseline trigram language model that does not incorporate this knowledge, speech recognition using the generated language model achieved concept recall 5 times higher – so much that the difference was statistically significant despite small sample size.
[ITS 2010 IE] Mostow, J., Aist, G., Bey, J., Chen, W., Corbett, A., Duan, W., Duke, N., Duong, M., Gates, D., Gonzalez, J. P., Juarez, O., Kantorzyk, M., Li, Y., Liu, L., McKeown, M., Trotochaud, C., Valeri, J., Weinstein, A., & Yen, D. (2010, June 14-18). A Better Reading Tutor That Listens [Interactive Event]. Proceedings of the Tenth International Conference on Intelligent Tutoring Systems (ITS2010), Pittsburgh, PA, 451.
Abstract: Project LISTEN’s Reading Tutor listens to children read aloud, and helps them learn to read, as illustrated on the Videos page of our website. This Interactive Event encompasses both this basic interaction and new extensions we are developing.
To accelerate fluency development, we are generating real-time visual feedback on children’s oral reading expressiveness by mapping prosodic features such as timing, pitch, and intensity to graphical features such as position, shape, and color. To design more effective practice on individual words, we are conducting an experiment to investigate whether and how the amount of context in which the student practices a word – in a sentence, in a phrase, in a bigram, in isolation, or not at all – affects the time to read the word subsequently in connected text.
To accelerate vocabulary development, we are augmenting children’s encounters of words in stories with additional instruction and encounters in multiple contexts required to acquire word meaning. To foster active processing required for successful learning, these encounters challenge the child to think about how words relate to context and to other words. We are developing automated methods to help generate effective contexts for learning word meaning, to generate useful challenges, to compute their answers, and to provide informative feedback to children’s responses.
To teach explicit reading comprehension strategies, we are adapting expert human instruction into scripted scenarios for Reading Tutor dialogue. The strategies include activating background knowledge, visualizing, asking questions, and summarizing. We are working to automate the scripting process of generating comprehension instruction, for example by generating good questions about a story and scaffolding children to make up their own. As Chen, Mostow, and Aist’s ITS2010 paper reports, we are attacking the problem of recognizing children’s free-form spoken responses to tutor prompts by training them to respond more predictably, and by exploiting this predictability to improve speech recognition. This work aims to enable the Reading Tutor, and perhaps other tutors some day, to listen to children not just read but talk.
[EDM 2010 predicting] González-Brenes, J. P., & Mostow, J. (2010, June 11-13). Predicting Task Completion from Rich but Scarce Data. Proceedings of the 3rd International Conference on Educational Data Mining, Pittsburgh, PA, 291-292. Click here for .pdf file.
[EDM 2010 AutoJoin] Mostow, J., & Tan, B. H. L. (2010, June 11-13). AutoJoin: Generalizing an Example into an EDM query. The Third International Conference on Educational Data Mining, Pittsburgh, PA, 307-308. Click here for .pdf file.
Abstract: This paper describes an implemented method to generalize an example tutor interaction into a query to retrieve similarly related sets of events. It infers WHERE clauses to equate repeated values unlikely to match by accident.
[LSA 2010] Aist, G., Gates, D., McKeown, M., & Mostow, J. (2010, January 8). Derivational morphology affects children's word reading in English earlier than previously thought. Presented at Linguistic Society of America (LSA) Annual Meeting, Baltimore, MD.
Abstract: What use do children make of prefixes while reading? Initial letters of nine common prefixes are surprisingly unreliable semantically: 37% (un-) to 5% (dis-, en-) reliable in the ANC (displeased, encouraged) vs. unreliable (mister, enough). Such low reliability might support the conventional delay of morphology instruction in reading until perhaps grade four. However, reading times from 212 children were 19% slower for reliable words than for unreliable words. This effect appears to result from reliable head, reliable tail (repaint) and letter bigram frequency, and holds across reading levels. Thus, children may be sensitive to morphology in printed English earlier than traditionally thought.
[NAACL 2010] Duan, W., & Yates, A. (2010, June 1-6). Extracting Glosses to Disambiguate Word Senses. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, 627–635. Click here for .pdf file.
Abstract: Like most natural language disambiguation tasks, word sense disambiguation (WSD) requires world knowledge for accurate predictions. Several proxies for this knowledge have been investigated, including labeled corpora, user-contributed knowledge, and machine-readable dictionaries, but each of these proxies requires significant manual effort to create, and they do not cover all of the ambiguous terms in a language. We investigate the task of automatically extracting world knowledge, in the form of glosses, from an unlabeled corpus. We demonstrate how to use these glosses to automatically label a training corpus to build a statistical WSD system that uses no manually labeled data, with experimental results approaching those of a supervised SVM-based classifier.
[Interspeech 2009 predictable] Aist, G., & Mostow, J. (2009, September 6-10). Designing Spoken Tutorial Dialogue with Children to Elicit Predictable but Educationally Valuable Responses. 10th Annual Conference of the International Speech Communication Association (Interspeech), Brighton, UK. Click here for .pdf file.
Abstract: How can we construct spoken dialogue interactions with children that are educationally effective and technically feasible? To address this challenge, we propose a design principle that constructs short dialogues in which (a) the user’s utterances are the external evidence of task performance or learning in the domain, and (b) the target utterances can be expressed as a well-defined set, in some cases even as a finite language (up to a small set of variables, which may change from exercise to exercise). The key approach is to teach the human learner a parameterized process that maps input to response. We describe how the discovery of this design principle came out of analyzing the processes of automated tutoring for reading and pronunciation and designing dialogues to address vocabulary and comprehension, show how it also accurately describes the design of several other language tutoring interactions, and discuss how it could extend to non-language tutoring tasks.
[SLaTE 2009 predictable] Aist, G., & Mostow, J. (2009, September 3-5). Predictable and Educational Spoken Dialogues: Pilot Results. Second ISCA Workshop on Speech and Language Technology in Education (SLaTE), Wroxall Abbey Estate, Warwickshire, England. Click here for .pdf file.
Abstract: This paper addresses the challenge of designing spoken dialogues that are of educational benefit within the context of an intelligent tutoring system, yet predictable enough to facilitate automatic speech recognition and subsequent processing. We introduce a design principle to meet this goal: construct short dialogues in which the desired student utterances are external evidence of performance or learning in the domain, and in which those target utterances can be expressed as a well-defined set. The key to this principle is to teach the human learner a process that maps inputs to responses. Pilot results in two domains - self-generated questions and morphology exercises - indicate that the approach is promising in terms of its habitability and the predictability of the utterances elicited. We describe the results and sketch a brief taxonomy classifying the elicited utterances according to whether they evidence student performance or learning, whether they are amenable to automatic processing, and whether they support or call into question the hypothesis that such dialogues can elicit spoken utterances that are both educational and predictable.
[SLaTE 2009 prosody] Duong, M., & Mostow, J. (2009, September 3-5). Detecting Prosody Improvement in Oral Rereading. Second ISCA Workshop on Speech and Language Technology in Education (SLaTE), Wroxall Abbey Estate, Warwickshire, England. Click here for .ppt file. Click here for .pdf file.
Abstract: A reading tutor that listens to children read aloud should be able to detect fluency growth - not only in oral reading rate, but also in prosody. How sensitive can such detection be? We present an approach to detecting improved oral reading prosody in rereading a given text. We evaluate our method on data from 133 students ages 7-10 who used Project LISTEN's Reading Tutor. We compare the sensitivity of our extracted features in detecting improvements. We use them to compare the magnitude of recency and learning effects. We find that features computed by correlating the student's prosodic contours with those of an adult narration of the same text are generally not as sensitive to gains as features based solely on the student's speech. We also find that rereadings on the same day show greater improvement than those on later days: statistically reliable recency effects are almost twice as strong as learning effects for the same features.
[SLaTE 2009 contexts] Liu, L., Mostow, J., & Aist, G. (2009, September 3-5). Automated Generation of Example Contexts for Helping Children Learn Vocabulary. Second ISCA Workshop on Speech and Language Technology in Education (SLaTE), Wroxall Abbey Estate, Warwickshire, England. Click here for .pdf file.
Abstract: This paper addresses the problem of generating good example contexts to help children learn vocabulary. We construct candidate contexts from the Google N-gram corpus. We propose a set of constraints on good contexts, and use them to filter candidate example contexts. We evaluate the automatically generated contexts by comparison to example contexts from children’s dictionaries and from children’s stories.
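As a rough illustration of constraint-based filtering (the constraints and word list below are invented, not the paper's actual set), a candidate context might be kept only if it contains the target word, falls within a readable length, and otherwise uses words a child already knows:

```python
# Illustrative sketch of filtering candidate example contexts for a
# vocabulary word. The vocabulary and thresholds are toy assumptions,
# standing in for the paper's constraints on Google N-gram candidates.

CHILD_VOCAB = {"the", "a", "an", "was", "is", "very", "big", "ran",
               "to", "school", "elephant", "dog", "fast", "happy"}

def is_good_context(context, target, vocab=CHILD_VOCAB, min_len=4, max_len=10):
    words = context.lower().split()
    if target not in words:
        return False                        # must actually contain the word
    if not (min_len <= len(words) <= max_len):
        return False                        # long enough to inform, short enough to read
    # Every *other* word should already be familiar to the child.
    return all(w in vocab for w in words if w != target)

candidates = [
    "an enormous elephant ran to school",
    "enormous",                             # too short to be informative
    "the quetzalcoatlus was enormous",      # contains an unfamiliar word
]
print([c for c in candidates if is_good_context(c, "enormous")])
```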
* [IDEC 2009] Reeder, K., Shapiro, J., & Wakefield, J. (2009, July 19-22). A computer based reading tutor for young English language learners: recent research on proficiency gains and affective response. 16th European Conference on Reading and 1st Ibero-American Forum on Literacies, University of Minho, Campus de Gualtar, Braga, Portugal.
[AIED 2009 prosody] Mostow, J., & Duong, M. (2009, July 6-10). Automated Assessment of Oral Reading Prosody. Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED2009), Brighton, UK, 189-196. Click here for .pdf file.
Abstract: We describe an automated method to assess the expressiveness of children's oral reading by measuring how well its prosodic contours correlate in pitch, intensity, pauses, and word reading times with adult narrations of the same sentences. We evaluate the method directly against a common rubric used to assess fluency by hand. We also compare it against manual and automated baselines by its ability to predict fluency and comprehension test scores and gains of 55 children ages 7-10 who used Project LISTEN's Reading Tutor. It outperforms the human-scored rubric, predicts gains, and could help teachers identify which students are making adequate progress.
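A minimal sketch of the correlation idea, assuming prosody is summarized as one value per word (here, hypothetical word reading times in milliseconds); the actual method also correlates pitch, intensity, and pause contours:

```python
# Minimal sketch: score a child's oral reading by correlating a per-word
# prosodic contour (hypothetical reading times, in ms) against an adult
# narration of the same sentence. Higher correlation = more adult-like.

from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-word times for "the enormous elephant ran".
adult = [180, 420, 380, 250]
fluent_child = [200, 470, 400, 280]     # tracks the adult contour
disfluent_child = [400, 390, 410, 405]  # flat, word-by-word reading

print(pearson(adult, fluent_child))     # near 1: expressive rendering
print(pearson(adult, disfluent_child))  # near 0 or negative: monotone
```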
[AIED 2009 questioning] Mostow, J., & Chen, W. (2009, July 6-10). Generating Instruction Automatically for the Reading Strategy of Self-Questioning. Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED2009), Brighton, UK, 465-472. Click here for .pdf file.
Abstract: Self-questioning is an important reading comprehension strategy, so it would be useful for an intelligent tutor to help students apply it to any given text. Our goal is to help children generate questions that make them think about the text in ways that improve their comprehension and retention. However, teaching and scaffolding self-questioning involve analyzing both the text and the students’ responses. This requirement poses a tricky challenge to generating such instruction automatically, especially for children too young to respond by typing. This paper describes how to generate self-questioning instruction for an automated reading tutor. Following expert pedagogy, we decompose strategy instruction into describing, modeling, scaffolding, and prompting the strategy. We present a working example to illustrate how we generate each of these four phases of instruction for a given text. We identify some relevant criteria and use them to evaluate the generated instruction on a corpus of 513 children’s stories.
[QG 2009 informational] Chen, W., Aist, G., & Mostow, J. (2009, July 6). Generating Questions Automatically from Informational Text. Proceedings of AIED 2009 Workshop on Question Generation, Brighton, UK, 17-24. Click here for .pdf file.
Abstract: Good readers ask themselves questions during reading. Our goal is to scaffold this self-questioning strategy automatically to help children in grades 1-3 understand informational text. In previous work, we showed that instruction for self-questioning can be generated for narrative text. This paper tests the generality of that approach by applying it to informational text. We describe the modifications required, and evaluate the approach on informational texts from Project LISTEN's Reading Tutor.
[EDM 2009 logging] Mostow, J., & Beck, J. E. (2009, July 1-3). Why, What, and How to Log? Lessons from LISTEN. Proceedings of the Second International Conference on Educational Data Mining, Córdoba, Spain, 269-278. Click here for paper as .pdf file. Click here for poster as .pptx file.
Abstract: The ability to log tutorial interactions in comprehensive, longitudinal, fine-grained detail offers great potential for educational data mining – but what data is logged, and how, can facilitate or impede the realization of that potential. We propose guidelines gleaned over 15 years of logging, exploring, and analyzing millions of events from Project LISTEN’s Reading Tutor and its predecessors.
[SSSR 2009 prefixes] Mostow, J., Gates, D., McKeown, M., & Aist, G. (2009). How often are prefixes useful cues to word meaning? Less than you might think! Sixteenth Annual Meeting of the Society for the Scientific Study of Reading, Boston. Click here for .ppt file.
Abstract: We report the frequency and cue validity in WordNet and some large text corpora of several common prefixes often advocated as worth teaching in early grades. To estimate the cue validity of a prefix, e.g. “un-,” to the meaning of over 10,000 distinct words, e.g. “undo” and “uncle,” we computed what percentage of their WordNet definitions contain keywords for the meaning of the prefix, e.g. “cancel,” “lack,” “no,” “not,” “opposite,” “reverse,” etc. We analyze the cue validity of each prefix, both overall and how it varies by corpus and by lexical properties such as word frequency, length, part of speech, and whether the remainder of the word is also a word. This analysis revealed that prefixes’ utility in deciphering word meaning varies considerably, and is surprisingly poor for some prefixes. We discuss the implications of these findings for vocabulary instruction in different grades, and for readers at varying levels of sophistication with respect to word structure and word meaning.
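The cue-validity computation can be sketched on a toy lexicon (the keyword list and definitions below are illustrative; the study used WordNet definitions for over 10,000 words):

```python
# Hedged sketch of the cue-validity computation: for a prefix, what
# fraction of words starting with it have definitions containing
# keywords for the prefix's meaning? Toy data, not WordNet.

UN_KEYWORDS = {"not", "opposite", "reverse", "cancel", "lack", "no"}

def cue_validity(prefix, lexicon, keywords):
    """lexicon: {word: definition}. Returns the fraction of prefix-initial
    words whose definition mentions a keyword for the prefix's meaning."""
    starts = {w: d for w, d in lexicon.items() if w.startswith(prefix)}
    if not starts:
        return 0.0
    reliable = sum(1 for d in starts.values()
                   if keywords & set(d.lower().split()))
    return reliable / len(starts)

toy_lexicon = {
    "undo":   "cancel or reverse an action",
    "unfair": "not fair or just",
    "uncle":  "the brother of a parent",       # "un-" is misleading here
    "under":  "in a position below something", # and here
}
print(cue_validity("un", toy_lexicon, UN_KEYWORDS))
```

On this toy lexicon, only half of the "un-" words have definitions signaling the prefix's meaning, illustrating how a prefix can be a poor cue.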
[ICTD 2009 Ghana] Mills-Tettey, A., Mostow, J., Dias, M. B., Sweet, T. M., Belousov, S. M., Dias, M. F., & Gong, H. (2009, April 17-19). Improving Child Literacy in Africa: Experiments with an Automated Reading Tutor. 3rd IEEE/ACM International Conference on Information and Communication Technologies and Development (ICTD2009), 129-138. Carnegie Mellon, Doha, Qatar. Honorable Mention Student Paper Award. Click here for .pdf file.
Abstract: This paper describes a research endeavor aimed at exploring the role that technology can play in improving child literacy in developing communities. An initial pilot study and subsequent four-month-long controlled field study in Ghana investigated the viability and effectiveness of an automated reading tutor in helping urban children enhance their reading skills in English. In addition to quantitative data suggesting that automated tutoring can be useful for some children in this setting, these studies and an additional preliminary pilot study in Zambia yielded useful qualitative observations regarding the feasibility of applying technology solutions to the challenge of enhancing child literacy in developing communities. This paper presents the findings, observations and lessons learned from the field studies.
[IWCS 2009 mental] Chen, W. (2009). Understanding Mental States in Natural Language. Proceedings of the 8th International Workshop on Computational Semantics, Tilburg, Netherlands, 61-72. Click here for .pdf file.
Abstract: Understanding mental states in narratives is an important aspect of human language comprehension. By “mental states” we refer to beliefs, states of knowledge, points of view, and suppositions, all of which may change over time. In this paper, we propose an approach for automatically extracting and understanding multiple mental states in stories. Our model consists of two parts: (1) a parser that takes an English sentence and translates it into semantic operations; (2) a mental-state inference engine that reads in the semantic operations and produces a situation model that represents the meaning of the sentence. We present the performance of the system on a corpus of children’s stories containing both fictional and non-fictional texts.
[QG 2008 lessons] Corbett, A. & Mostow, J. (2008, September 25-26). Automating Comprehension Questions: Lessons from a Reading Tutor. In Proceedings of the 1st Workshop on Question Generation, NSF, Arlington, VA. Click here for .pdf file.
Abstract: How can intelligent tutors generate, answer, and score text comprehension questions? This paper proposes desiderata for such questions, illustrates what is already possible, discusses challenges for automated questions in Project LISTEN’s Reading Tutor, and proposes a framework for evaluating generated questions.
[ITS 2008 help] Beck, J. E., Chang, K.-m., Mostow, J., & Corbett, A. (2008, June 23-27). Does help help? Introducing the Bayesian Evaluation and Assessment methodology. 9th International Conference on Intelligent Tutoring Systems, Montreal, 383-394. ITS2008 Best Paper Award. Click here for .pdf file.
Abstract: Most ITS have a means of providing assistance to the student, either on student request or when the tutor determines it would be effective. Presumably, such assistance is included by the ITS designers since they feel it benefits the students. However, whether (and how) help helps students has not been a well-studied problem in the ITS community. In this paper we present three approaches for evaluating the efficacy of the Reading Tutor's help: creating experimental trials from data, learning decomposition, and Bayesian Evaluation and Assessment, an approach that uses dynamic Bayesian networks. We have found that experimental trials and learning decomposition both find a negative benefit for help - that is, help hurts! However, the Bayesian Evaluation and Assessment framework finds that help both promotes student long-term learning and provides additional scaffolding on the current problem. We discuss why these approaches give divergent results, and suggest that the Bayesian Evaluation and Assessment framework is the strongest of the three. In addition to introducing Bayesian Evaluation and Assessment, a method for simultaneously assessing students and evaluating tutorial interventions, this paper describes how help can both scaffold the current problem attempt and teach the student knowledge that will transfer to later problems.
[ITS 2008 LD] Beck, J. E., & Mostow, J. (2008, June 23-27). How who should practice: Using learning decomposition to evaluate the efficacy of different types of practice for different types of students. 9th International Conference on Intelligent Tutoring Systems, Montreal, 353-362. Nominated for Best Paper. Click here for .pdf file.
Abstract: A basic question of instruction is how much students will actually learn from it. This paper presents an approach called learning decomposition, which determines the relative efficacy of different types of learning opportunities. This approach is a generalization of learning curve analysis, and uses non-linear regression to determine how to weight different types of practice opportunities relative to each other. We analyze 346 students reading 6.9 million words and show that different types of practice differ reliably in how efficiently students acquire the skill of reading words quickly and accurately. Specifically, massed practice is generally not effective for helping students learn words, and rereading the same stories is not as effective as reading a variety of stories. However, we were able to analyze data for individual students' learning and use bottom-up processing to detect small subgroups of students who did benefit from rereading (11 students) and from massed practice (5 students). The existence of these subgroups has two implications: 1) one-size-fits-all instruction is adequate for perhaps 95% of the student population using computer tutors, but as a community we can do better; and 2) the ITS community is well poised to study what type of instruction is optimal for the individual.
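Learning decomposition can be illustrated with a small self-contained sketch: weight each practice type in an exponential learning curve and pick the weight that best fits observed performance. The data here are synthetic, and the grid search stands in for the paper's nonlinear regression:

```python
# Hedged sketch of learning decomposition. Model (illustrative):
#   latency = A * exp(-b * (n_new + w * n_reread))
# where w < 1 would mean rereading is less effective practice than
# reading new stories. Data and parameters are synthetic.

from math import exp, log

def fit_weight(observations, weights):
    """observations: list of (n_new, n_reread, latency).
    For each candidate w, regress log-latency on effective practice
    n_new + w * n_reread; keep the w with the smallest squared error."""
    best_w, best_sse = None, float("inf")
    for w in weights:
        xs = [n1 + w * n2 for n1, n2, _ in observations]
        ys = [log(t) for _, _, t in observations]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        sse = sum((y - (my + slope * (x - mx))) ** 2
                  for x, y in zip(xs, ys))
        if sse < best_sse:
            best_w, best_sse = w, sse
    return best_w

# Synthetic reader for whom rereading is worth 0.4 of a new-story encounter.
data = [(n1, n2, 1000 * exp(-0.2 * (n1 + 0.4 * n2)))
        for n1 in range(5) for n2 in range(5)]
w_hat = fit_weight(data, [i / 10 for i in range(11)])
print(w_hat)
```

With noiseless synthetic data the grid search recovers the true weight exactly; on real data the fitted weight quantifies the relative value of each practice type.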
[ITS 2008 compare] Zhang, X., Mostow, J., & Beck, J. E. (2008). A Case Study Empirical Comparison of Three Methods to Evaluate Tutorial Behaviors. 9th International Conference on Intelligent Tutoring Systems, Montreal, 122-131. Click here for .pdf file.
Abstract: Researchers have used various methods to evaluate the fine-grained interactions of intelligent tutors with their students. We present a case study comparing three such methods on the same data set, logged by Project LISTEN's Reading Tutor from usage by 174 children in grades 2-4 (typically 7-10 years) over the course of the 2005-2006 school year. The Reading Tutor chooses randomly between two different types of reading practice. In assisted oral reading, the child reads aloud and the tutor helps. In "Word Swap," the tutor reads aloud and the child identifies misread words. One method we use here to evaluate reading practice is conventional analysis of randomized controlled trials (RCTs), where the outcome is performance on the same words when encountered again later. The second method is learning decomposition, which estimates the impact of each practice type as a parameter in an exponential learning curve. The third method is knowledge tracing, which estimates the impact of practice as a probability in a dynamic Bayes net. The comparison shows qualitative agreement among the three methods, which is evidence for their validity.
[EDM 2008 freeform] Zhang, X., Mostow, J., Duke, N. K., Trotochaud, C., Valeri, J., & Corbett, A. (2008, June 20-21). Mining Free-form Spoken Responses to Tutor Prompts. Proceedings of the First International Conference on Educational Data Mining, Montreal, 234-241. Click here for .pdf file.
Abstract: How can an automated tutor assess children's spoken responses despite imperfect speech recognition? We address this challenge in the context of tutoring children in explicit strategies for reading comprehension. We report initial progress on collecting, annotating, and mining their spoken responses. Collection and annotation yield authentic but sparse data, which we use to synthesize additional realistic data. We train and evaluate a classifier to estimate the probability that a response mentions a given target.
[EDM 2008 analytic] Mostow, J., & Zhang, X. (2008, June 20-21). Analytic Comparison of Three Methods to Evaluate Tutorial Behaviors. Proceedings of the First International Conference on Educational Data Mining, Montreal, 28-37. Click here for .pdf file.
Abstract: We compare the purposes, inputs, representations, and assumptions of three methods to evaluate the fine-grained interactions of intelligent tutors with their students. One method is conventional analysis of randomized controlled trials (RCTs). The second method is learning decomposition, which estimates the impact of each practice type as a parameter in an exponential learning curve. The third method is knowledge tracing, which estimates the impact of practice as a probability in a dynamic Bayes net. The comparison leads to a generalization of learning decomposition to account for slips and guesses.
[IES 2008] Mostow, J., Corbett, A., Valeri, J., Bey, J., Duke, N. K., & Trotochaud, C. (2008, June 10-12). Explicit Comprehension Instruction in an Automated Reading Tutor that Listens: Year 1 [poster and handout]. IES Third Annual Research Conference, Washington, DC.
[FLET 2008] Mostow, J. (2008). Experience from a Reading Tutor that listens: Evaluation purposes, excuses, and methods. In C. K. Kinzer & L. Verhoeven (Eds.), Interactive Literacy Education: Facilitating Literacy Environments Through Technology, pp. 117-148. New York: Lawrence Erlbaum Associates, Taylor & Francis Group. Click here to order book from Amazon.com.
Abstract: This chapter gives three good reasons to evaluate reading software, identifies three methods for doing so, and refutes three excuses for not evaluating – namely, that evaluation is premature, unnecessary, or will be done by others:
(1) Wizard of Oz experiments help test whether (and clarify how) a proposed approach might work, and refute the excuse that evaluation is premature because the approach has not yet been implemented in a proposed system that may take years to develop.
(2) Conventional controlled studies help determine whether an implemented system helps children gain more in reading than they would otherwise. This criterion is necessary to improve on the status quo, but the difficulty of meeting it refutes the excuse that evaluation is unnecessary due to the supposedly innate superiority of learning on computers, or of a proposed way to use them.
(3) Experiments embedded in an automated tutor help analyze which tutorial actions help which students and words, thereby guiding improvement of the tutor in ways that third party evaluation cannot, thus refuting the excuse that evaluation can be left to others.
The chapter details some practical lessons learned from designing, performing, and analyzing experiments embedded in Project LISTEN’s school-deployed Reading Tutor, which uses speech recognition to listen to children read aloud, and is helping hundreds of children learn to read.
[STLL 2008 SC] Aist, G., & Mostow, J. (2008). Faster, better task choice in a reading tutor that listens. In V. M. Holland & F. P. Fisher (Eds.), The Path of Speech Technologies in Computer Assisted Language Learning: From Research Toward Practice (pp. 220-240). New York: Routledge.
Abstract: We analyze the efficiency and effectiveness of task choice in the context of a reading tutor that listens to children read aloud. We define efficiency as the time to pick a story, and effectiveness in terms of exposing students to new material. We describe design features we added to improve the Reading Tutor’s efficiency and effectiveness, and evaluate the resulting systems quantitatively, as follows. First, we made the story menu child-friendlier by incorporating two improvements: (a) to support use by nonreaders, the new menu spoke all items on the list; (b) to speed up choice, the new menu required just one click to select an item. Second, we instituted a mixed-initiative story choice policy where the Reading Tutor and the student took turns choosing stories. These improvements made story choice measurably more efficient and effective.
[STLL 2008 S98] Mostow, J., Aist, G., Huang, C., Junker, B., Kennedy, R., Lan, H., Latimer, D., O'Connor, R., Tassone, R., Tobin, B., & Wierman, A. (2008). 4-Month evaluation of a learner-controlled Reading Tutor that listens. In V. M. Holland & F. P. Fisher (Eds.), The Path of Speech Technologies in Computer Assisted Language Learning: From Research Toward Practice (pp. 201-219). New York: Routledge.
Abstract: We evaluated an automated Reading Tutor that let children pick stories to read, and listened to them read aloud. All 72 children in three classrooms (grades 2, 4, 5) were independently tested on the nationally normed Word Attack, Word Identification, and Passage Comprehension subtests of the Woodcock Reading Mastery Test (where they averaged nearly 2 standard deviations below national norms), and on oral reading fluency. We split each class into 3 matched treatment groups: Reading Tutor, commercial reading software, or other activities. In 4 months, the Reading Tutor group gained significantly more in Passage Comprehension than the control group (effect size = 1.2, p=.002) - even though actual usage was a fraction of the planned daily 20-25 minutes. To help explain these results, we analyzed relationships among gains in Word Attack, Word Identification, Passage Comprehension, and fluency by 108 additional children who used the Reading Tutor in 7 other classrooms (grades 1-4). Gains in Word Identification predicted Passage Comprehension gains only for Reading Tutor users, both in the controlled study (n=21, p=.042, regression coefficient B=.495± s.e. .227) and in the other classrooms (n=108, p=.005, B=.331±.115), where grade was also a significant predictor (p=.024, B=2.575±1.127).
* [IDEC 2007] Reeder, K., Shapiro, J., & Wakefield, J. (2007, August 5-8). The effectiveness of speech recognition technology in promoting reading proficiency and attitudes for Canadian immigrant children. 15th European Conference on Reading, Humboldt University, Berlin. Click here for .ppsx Powerpoint presentation.
Abstract: This paper reports on recently completed Canadian trials of the Reading Tutor, a prototype program that uses advanced speech recognition technology to listen to children read aloud in English. When the program hears the reader experiencing difficulty, it offers help with the goal of enhancing reading fluency and, in turn, comprehension. We followed 62 Canadian immigrant children in grades 2-7, ages 8-13, in three multicultural western Canadian urban elementary schools for 4 to 7 months of daily, 20-minute sessions on the Reading Tutor. Our first goal was to determine the role of English language (L2) proficiency in any reading gains achieved, while controlling for participants’ differing amounts of practice with the software. Our second goal was to describe participants’ attitudes toward, and perceptions of, the experience of using the Reading Tutor software.
Participants were pre-tested for English language proficiency level and for reading proficiency. At the end of each school’s trial, children were post-tested for reading proficiency, including word recognition, word attack, and word and passage comprehension. The lowest of the three English language proficiency groups showed the strongest reading gains, and did so in ways that reflected specific features of their language development. To assess the attitudinal dimension, we administered a clinical interview to all participants at the conclusion of the trial. We describe children’s perceptions of how the program assisted them in their literate development.
* [JECR 2007] Poulsen, R., Wiemer-Hastings, P., & Allbritton, D. (2007). Tutoring Bilingual Students with an Automated Reading Tutor That Listens. Journal of Educational Computing Research.
Abstract: Children from non-English-speaking homes are doubly disadvantaged when learning English in school. They enter school with less prior knowledge of English sounds, word meanings, and sentence structure, and they get little or no reinforcement of their learning outside of the classroom. This article compares the classroom standard practice of sustained silent reading with the Project LISTEN Reading Tutor, which uses automated speech recognition to "listen" to children read aloud, providing both spoken and graphical feedback. Previous research with the Reading Tutor has focused primarily on native speaking populations. In this study 34 Hispanic students spent one month in the classroom and one month using the Reading Tutor for 25 minutes per day. The Reading Tutor condition produced significant learning gains in several measures of fluency. Effect sizes ranged from 0.55 to 1.27. These dramatic results from a one-month treatment indicate this technology may have much to offer English language learners.
[SLaTE 2007 ASL] Xu, L., Varadharajan, V., Maravich, J., Tongia, R., & Mostow, J. (2007, October 1-3). DeSIGN: An Intelligent Tutor to Teach American Sign Language. SLaTE Workshop on Speech and Language Technology for Education, ISCA Tutorial and Research Workshop, The Summit Inn, Farmington, PA.
Abstract: This paper presents the development of DeSIGN, an educational software application for deaf students who are taught to communicate using American Sign Language (ASL). The software reinforces English vocabulary and ASL signs by providing two essential components of a tutor: lessons and tests. The current version was designed for 5th and 6th graders, whose literacy skills lag by a grade or more on average. In addition, a game that allows the students to be creative has been integrated into the tests. Another feature of DeSIGN is its ability to intelligently adapt its tests to the changing knowledge of the student, as determined by a knowledge tracing algorithm. A separate interface for the teacher enables additions and modifications to the content of the tutor and provides progress monitoring. These dynamic aspects help motivate the students to use the software repeatedly. This software prototype aims at a feasible and sustainable approach to increase the participation of deaf people in society. DeSIGN has undergone an iteration of testing and is currently in use at a school for the deaf.
[AIED 2007 motivation] Beck, J. E. (2007, July 9-13). Does learner control affect learning? Proceedings of the 13th International Conference on Artificial Intelligence in Education, Marina del Rey, CA.
Abstract: Many intelligent tutoring systems permit some degree of learner control. A natural question is whether the increased student engagement and motivation such control provides results in additional student learning. This paper uses a novel approach, learning decomposition, to investigate whether students do in fact learn more from a story they select to read than from a story the tutor selects for them. By analyzing 346 students reading approximately 6.9 million words, we have found that students learn approximately 25% more in stories they choose to read, even though from a purely pedagogical standpoint such stories may not be as appropriate as those chosen by the computer. Furthermore, we found that (for our instantiation of learner control) younger students may derive less benefit from learner control than older students, and girls derive less benefit than boys.
[AIED 2007 comprehension] Zhang, X., Mostow, J., & Beck, J. E. (2007, July 9-13). Can a Computer Listen for Fluctuations in Reading Comprehension? Proceedings of the 13th International Conference on Artificial Intelligence in Education, Marina del Rey, CA.
Abstract: The ability to detect fluctuation in students' comprehension of text would be very useful for many intelligent tutoring systems. The obvious solution of inserting comprehension questions is limited in its application because it interrupts the flow of reading. To investigate whether we can detect comprehension fluctuations simply by observing the reading process itself, we developed a statistical model of 7805 responses by 289 children in grades 1-4 to multiple-choice comprehension questions in Project LISTEN's Reading Tutor, which listens to children read aloud and helps them learn to read. Machine-observable features of students' reading behavior turned out to be statistically significant predictors of their performance on individual questions.
[EDM 2007 LFA transfer] Leszczenski, J. M., & Beck, J. E. (2007, July 9). What’s in a word? Extending learning factors analysis to modeling reading transfer. Proceedings of the AIED2007 Workshop on Educational Data Mining, Marina del Rey, CA, 31-39. Click here for .pdf file.
Abstract: Learning Factors Analysis (LFA) has been proposed as a generic solution to evaluate and compare cognitive models of learning. By performing a heuristic search over a space of statistical models, the researcher may evaluate different cognitive representations of a set of skills. We introduce a scalable application of this framework in the context of transfer in reading and demonstrate it upon Reading Tutor data. Using an assumption of a word-level model of learning as a baseline, we apply LFA to determine whether a representation with fewer word independencies will produce a better fit for student learning data. Specifically, we show that representing some groups of words as their common root leads to a better-fitting model of student knowledge, indicating that this representation offers more information than merely viewing words as independent, atomic skills. In addition, we demonstrate an approximation to LFA, which allows it to scale tractably to large datasets. We find that using a word root-based model of learning leads to an improved model fit, suggesting students make use of this information in their representation of words. Additionally, we present evidence, based on both model fit and learning rate relationships, that low proficiency students tend to exhibit a lesser degree of transfer through the word root representation than higher proficiency students.
[EDM 2007 LD transfer] Zhang, X., Mostow, J., & Beck, J. E. (2007, July 9). All in the (word) family: Using learning decomposition to estimate transfer between skills in a Reading Tutor that listens.
Abstract: In this paper, we use the method of learning decomposition to study students’ mental representations of English words. Specifically, we investigate whether practice on a word transfers to similar words. We focus on the case where similar words share the same root (e.g., “dog” and “dogs”). Our data comes from Project LISTEN’s Reading Tutor during the 2003-2004 school year, and includes 6,213,289 words read by 650 students. We analyze the distribution of transfer effects across students, and identify factors that predict the amount of transfer. The results support some of our hypotheses about learning, e.g., the transfer effect from practice on similar words is greater for proficient readers than for poor readers. More significant than these empirical findings, however, is the novel analytic approach to measure transfer effects.
[EDM 2007 Dirichlet] Beck, J. E. (2007, July 9). Difficulties in inferring student knowledge from observations (and why you should care). Proceedings of the AIED2007 Workshop on Educational Data Mining, Marina del Rey, CA, 21-30. Click here for .pdf file.
Abstract: Student modeling has a long history in the field of intelligent educational software and is the basis for many tutorial decisions. Furthermore, the task of assessing a student’s level of knowledge is a basic building block in the educational data mining process. If we cannot estimate what students know, it is difficult to perform fine-grained analyses to see if a system’s teaching actions are having a positive effect. In this paper, we demonstrate that there are several unaddressed problems with student model construction that negatively affect the inferences we can make. We present two partial solutions to these problems, using Expectation Maximization to estimate parameters and using Dirichlet priors to bias the model fit procedure. Aside from reliably improving model fit as measured by predictive accuracy, these approaches might result in model parameters that are more plausible. Although parameter plausibility is difficult to quantify, we discuss some guidelines and propose a derived measure of predicted number of trials until mastery as a method for evaluating model parameters.
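The core of the Dirichlet-prior idea can be shown with a toy calculation (ours, not code from the paper). For a two-outcome parameter such as a slip probability, a Dirichlet prior reduces to a Beta prior, and the MAP estimate keeps the parameter plausible when data are sparse; the prior strength Beta(3, 21) below is an arbitrary illustrative choice.

```python
def mle(successes, trials):
    """Maximum-likelihood estimate of a probability parameter."""
    return successes / trials

def map_beta(successes, trials, a, b):
    """MAP estimate under a Beta(a, b) prior (a Dirichlet over two
    outcomes): the mode of Beta(successes + a, trials - successes + b)."""
    return (successes + a - 1) / (trials + a + b - 2)

# Suppose a student "slipped" (erred on a known skill) on 4 of 10 trials.
slips, trials = 4, 10
print(mle(slips, trials))            # 0.4 -- an implausibly high slip rate
# A Beta(3, 21) prior has mode (3-1)/(3+21-2) ~= 0.09, encoding the belief
# that slips are rare; with little data, the MAP estimate stays plausible.
print(round(map_beta(slips, trials, 3, 21), 3))
```

With more data the likelihood dominates and the two estimates converge; the prior only biases the fit where evidence is thin.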
[UM 2007] Beck, J. E., & Chang, K.-m. (2007, June 25-29). Identifiability: A Fundamental Problem of Student Modeling. Proceedings of the 11th International Conference on User Modeling (UM 2007).
Abstract: In this paper we show how model identifiability is an issue for student modeling: observed student performance corresponds to an infinite family of possible model parameter estimates, all of which make identical predictions about student performance. However, these parameter estimates make different claims, some of which are clearly incorrect, about the student’s unobservable internal knowledge. We propose methods for evaluating these models to find ones that are more plausible. Specifically, we present an approach using Dirichlet priors to bias model search that results in a statistically reliable improvement in predictive accuracy (AUC of 0.620 ± 0.002 vs. 0.614 ± 0.002). Furthermore, the parameters associated with this model provide more plausible estimates of student learning, and better track with known properties of students’ background knowledge. The main conclusion is that prior beliefs are necessary to bias the student modeling search, and even large quantities of performance data alone are insufficient to properly estimate the model.
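The identifiability problem can be made concrete with a small sketch (our simplified construction, not from the paper): two different knowledge-tracing parameter sets that produce exactly the same expected performance curve, here using the marginal (unconditioned) form of the model.

```python
def marginal_correct_curve(L0, t, g, s, n_trials):
    """Expected P(correct) on each trial under knowledge tracing,
    marginalized over the unobserved knowledge state:
      P(L_n) = 1 - (1 - L0) * (1 - t)**n
      P(correct_n) = P(L_n) * (1 - s) + (1 - P(L_n)) * g
    """
    curve = []
    for n in range(n_trials):
        pL = 1 - (1 - L0) * (1 - t) ** n
        curve.append(pL * (1 - s) + (1 - pL) * g)
    return curve

s, t = 0.10, 0.20
A = dict(L0=0.50, t=t, g=0.30, s=s)
# Pick a different guess rate for model B, then solve for the L0 that keeps
# the curve's amplitude (1 - L0) * (1 - s - g) unchanged.
gB = 0.20
amplitude = (1 - A["L0"]) * (1 - s - A["g"])
B = dict(L0=1 - amplitude / (1 - s - gB), t=t, g=gB, s=s)

curveA = marginal_correct_curve(**A, n_trials=10)
curveB = marginal_correct_curve(**B, n_trials=10)
gap = max(abs(a - b) for a, b in zip(curveA, curveB))
print(round(B["L0"], 4), gap < 1e-12)
```

Both parameter sets predict identical observable performance, yet they make different claims about what the student initially knew (L0 of 0.50 vs. roughly 0.57), which is why outside constraints such as priors are needed to pick the plausible one.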
[ICASSP 2007] Anumanchipalli, G. K., Ravishankar, M., & Reddy, R. (2007, April 15-20). Improving Pronunciation Inference Using N-Best List, Acoustics and Orthography. Proc. 32nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
Abstract: In this paper, we tackle the problem of pronunciation inference and Out-of-Vocabulary (OOV) enrollment in Automatic Speech Recognition (ASR) applications. We combine linguistic and acoustic information of the OOV word using its spelling and a single instance of its utterance to derive an appropriate phonetic baseform. The novelty of the approach is in its employment of an orthography-driven n-best hypothesis and rescoring strategy of the pronunciation alternatives. We make use of decision trees and heuristic tree search to construct and score the n-best hypotheses space. We use acoustic alignment likelihood and phone transition cost to leverage the empirical evidence and phonotactic priors to rescore the hypotheses and refine the baseforms.
2007] Mostow, J., & Beck, J. (2007). When the Rubber Meets the Road: Lessons from the In-School Adventures of an Automated Reading Tutor.
Abstract: Project LISTEN's Reading Tutor (www.cs.cmu.edu/~listen) uses automatic speech recognition to listen to children read aloud, and helps them learn to read. Its experimental deployment in schools has expanded from a single computer used by eight third graders in one school in 1996 to two hundred computers used by children in grades 1-3 in nine schools in 2003. This project illustrates how technology can not just scale up an intervention, but instrument its implementation. For example, analysis of 2002-2003 usage showed that session frequency and duration averaged significantly higher in lab settings than in classrooms.
[ICSLP2006] Mostow, J. (2006, September 17-21). Is ASR accurate enough for automated reading tutors, and how can we tell? Ninth International Conference on Spoken Language Processing (Interspeech 2006 — ICSLP), Pittsburgh, PA, 837-840. Click here for .pdf file.
Abstract: We discuss pros and cons of several ways to evaluate ASR accuracy in automated tutors that listen to students read aloud. Whether ASR is accurate enough for a particular reading tutor function depends on what ASR-based judgment it requires, the visibility of that judgment to students and teachers, and the amount of input speech on which it is based. How to tell depends on the purpose, criterion, and space of the evaluation.
help] Chang, K., Beck, J. E., Mostow, J., & Corbett, A. (2006, July 17). Does Help Help? A Bayes Net Approach to Modeling Tutor Interventions. AAAI2006 Workshop on Educational Data Mining.
Abstract: This paper describes an effort to measure the effectiveness of tutor help in an intelligent tutoring system. Conventional pre- and post-test experimental methods can determine whether help is effective but are expensive to conduct. Furthermore, a pre- and post-test methodology ignores a source of information: students request help about words they do not know. Therefore, we propose a dynamic Bayes net (which we call the help model) that models tutor help and student knowledge in one coherent framework. The help model distinguishes two different effects of help: scaffolding immediate performance vs. teaching persistent knowledge that improves long term performance. We train the help model to fit the student performance data gathered from usage of the Reading Tutor. The parameters of the trained model suggest that students benefit from both the scaffolding and teaching effects of help. Thus, our framework is able to distinguish two types of influence that help has on the student, and can determine whether help helps learning without an explicit controlled study.
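A minimal sketch of the scaffolding-vs-teaching distinction the help model draws (our illustrative structure and made-up parameter values, not the trained model from the paper):

```python
def respond_and_update(p_know, help_given,
                       guess=0.2, slip=0.1,
                       scaffold=0.3, learn=0.1, teach=0.2):
    """One trial of a simplified help model (hypothetical parameters).

    Scaffolding: help boosts the chance of answering THIS trial correctly
    even when the skill is unknown.
    Teaching: help raises the chance the skill is known afterwards.
    """
    p_guess = min(1.0, guess + (scaffold if help_given else 0.0))
    p_correct = p_know * (1 - slip) + (1 - p_know) * p_guess
    rate = learn + (teach if help_given else 0.0)
    p_know_next = p_know + (1 - p_know) * rate
    return p_correct, p_know_next

c_help, k_help = respond_and_update(0.4, help_given=True)
c_none, k_none = respond_and_update(0.4, help_given=False)
print(round(c_help, 2), round(c_none, 2), round(k_help, 2), round(k_none, 2))
```

Because the two effects enter the model in different places (the emission vs. the knowledge transition), fitting such a model to performance data can in principle estimate them separately, which is the paper's point.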
cloze] Hensler, B. S., & Beck, J. (2006, July 6-8). Are all questions created equal? Factors that influence cloze question difficulty. Thirteenth Annual Meeting of the Society for the Scientific Study of Reading.
Abstract: The multiple choice cloze (MCC) assessment methodology is widely used in assessing reading comprehension; therefore an improved scoring methodology would have broad impact within the reading research community. We have constructed an MCC question model that simultaneously estimates the student's comprehension proficiency and the impact of various terms on MCC difficulty. To build the model, we analyzed 16,161 MCC question responses that were administered by a computer reading tutor over the course of a school year. Participants were 373 students in grades 1 through 6 (ages 5-12) in urban and suburban public schools.
To develop our model of MCC difficulty, we used multinomial logistic regression to calculate the relative impact of a number of factors. Our model includes the location of the deleted target word within the sentence and question length as covariates. As factors, we used student identity, reaction time (rounded to the nearest second) and level of difficulty of the target word. We hypothesized that more proficient readers would use syntactic cues while less proficient readers would not. To add syntax to the model, we used the TreeTagger part of speech tagger to annotate the part of speech of the correct answer for each cloze question. We then computed how many of the distractors could have the same part of speech as the answer. Presumably questions with many distractors able to take on the same part of speech as the answer would be harder.
After training the model on our 16,161 MCC questions, there were two main findings. First, our model found that students who had a second grade reading proficiency (as measured by Woodcock Reading Comprehension Cluster) or higher were sensitive to how many of the possible responses could take on the same part of speech as the correct answer (p= 0.002) for the cloze sentence, while students below second grade proficiency were insensitive to this term (p=0.467). This result suggests that students' syntactic awareness, at least within the context of MCC questions, begins at around the second grade. The second main finding was the degree of correlation of each student's Beta parameter, the model's estimate of her ability to answer MCC questions, with her associated Woodcock test score. The mean within-grade correlation between Beta and the Reading Comprehension Cluster score was 0.69, a very strong fit.
fluency] Mostow, J., & Beck, J. (2006, July 6-8). Refined micro-analysis of fluency gains in a Reading Tutor that listens.
Abstract: Our SSSR2005 talk presented a linear model of speedup in word reading between successive encounters in connected text, based on a quarter of a million such encounters. The model indicated that reading a word in a new context contributed more to speedup than re-encountering it in an old context, implying that wide reading builds fluency more than rereading. Our new, improved model uses a growth curve to model word reading time as a function of the number and types of encounters of the word. This approach lets us estimate -- both overall and at different reading levels -- the relative value of encountering a word in a new context versus an old one, and for the first time on a given day versus subsequently.
gaming] Baker, R. S. J. d., Corbett, A. T., Koedinger, K. R., Evenson, S., Roll, I., Wagner, A. Z., Naim, M., Raspat, J., Baker, D. J., & Beck, J. E. (2006, June 26-30). Adapting to When Students Game an Intelligent Tutoring System. Proceedings of the 8th International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan.
Abstract: It has been found in recent years that many students who use intelligent tutoring systems game the system, attempting to succeed in the educational environment by exploiting properties of the system rather than by learning the material and trying to use that knowledge to answer correctly. In this paper, we introduce a system which gives a gaming student supplementary exercises focused on exactly the material the student bypassed by gaming, and which also expresses negative emotion to gaming students through an animated agent. Students using this system engage in less gaming, and students who receive many supplemental exercises have considerably better learning than is associated with gaming in the control condition or prior studies.
[ITS2006 BNT-SM] Chang, K., Beck, J., Mostow, J., & Corbett, A. (2006, June 26-30). A Bayes Net Toolkit for Student Modeling in Intelligent Tutoring Systems. Proceedings of the 8th International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan, 104-113. Click here for .pdf file.
Abstract: This paper describes an effort to model a student’s changing knowledge state during skill acquisition. Dynamic Bayes Nets (DBNs) provide a powerful way to represent and reason about uncertainty in time series data, and are therefore well-suited to model student knowledge. Many general-purpose Bayes net packages have been implemented and distributed; however, constructing DBNs often involves complicated coding effort. To address this problem, we introduce a tool called BNT-SM. BNT-SM inputs a data set and a compact XML specification of a Bayes net model hypothesized by a researcher to describe causal relationships among student knowledge and observed behavior. BNT-SM generates and executes the code to train and test the model using the Bayes Net Toolbox. Compared to the BNT code it outputs, BNT-SM reduces the number of lines of code required to use a DBN by a factor of 5. In addition to supporting more flexible models, we illustrate how to use BNT-SM to simulate Knowledge Tracing (KT), an established technique for student modeling. The trained DBN does a better job of modeling and predicting student performance than the original KT code (Area Under Curve = 0.610 > 0.568), due to differences in how it estimates parameters.
[ITS2006 cloze] Hensler, B. S., & Beck, J. (2006, June 26-30). Better student assessing by finding difficulty factors in a fully automated comprehension measure. Proceedings of the 8th International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan, 21-30. Nominated for Best Paper. Click here for .pdf file.
Abstract: The multiple choice cloze (MCC) question format is commonly used to assess students' comprehension. It is an especially useful format for ITS because it is fully automatable and can be used on any text. Unfortunately, very little is known about the factors that influence MCC question difficulty and student performance on such questions. In order to better understand student performance on MCC questions, we developed a model of MCC questions. Our model shows that the difficulty of the answer and the student’s response time are the most important predictors of student performance. In addition to showing the relative impact of the terms in our model, our model provides evidence of a developmental trend in syntactic awareness beginning around the 2nd grade. Our model also accounts for 10% more variance in students’ external test scores compared to the standard scoring method for MCC questions.
vocabulary] Heiner, C., Beck, J., & Mostow, J. (2006, June 26-30). Automated Vocabulary Instruction in a Reading Tutor. Proceedings of the 8th International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan.
Abstract: This paper presents a within-subject, randomized experiment to compare automated interventions for teaching vocabulary to young readers using Project LISTEN's Reading Tutor. The experiment compared three conditions: no explicit instruction, a quick definition, and a quick definition plus a post-story battery of extended instruction based on a published instructional sequence for human teachers. A month-long study with elementary school children indicates that the quick instruction, which lasted about seven seconds, had immediate effects on learning gains that did not persist. Extended instruction, which lasted about thirty seconds longer than the quick instruction, had a persistent effect and produced gains on a posttest one week later.
decomposition] Beck, J. (2006, June 26). Using learning decomposition to analyze student fluency development. ITS2006 Educational Data Mining Workshop.
Abstract: This paper introduces an approach called learning decomposition to analyze what types of practice are most effective for helping students learn a skill. The approach is a generalization of learning curve analysis, and uses non-linear regression to determine how to weight different types of practice opportunities relative to each other. We are able to show that different types of practice differ reliably in how quickly students acquire the skill of reading words quickly and accurately. Specifically, massed practice is generally not effective for helping students learn words, but may be acceptable for less proficient readers. Rereading the same story is generally not as effective as reading a variety of stories, but might be beneficial for more proficient readers.
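The idea of weighting different types of practice relative to each other in a nonlinear regression can be sketched as follows (a synthetic-data illustration with hypothetical parameter values, not the paper's model of real reading data; for brevity only the relative weight beta is searched, with the curve's other parameters held at their true values):

```python
import math, random

def predicted_time(A, b, beta, n_varied, n_massed):
    """Learning-decomposition curve: massed encounters count beta times
    as much as encounters spread across different stories."""
    return A * math.exp(-b * (n_varied + beta * n_massed))

# Synthetic word-reading times with a true beta of 0.4 (a massed encounter
# worth 40% of a varied-practice encounter) plus a little noise.
random.seed(0)
data = []
for n_varied in range(6):
    for n_massed in range(6):
        t = predicted_time(1.2, 0.25, 0.4, n_varied, n_massed)
        data.append((n_varied, n_massed, t + random.gauss(0, 0.01)))

def sse(A, b, beta):
    """Sum of squared errors of the curve against the synthetic data."""
    return sum((t - predicted_time(A, b, beta, nv, nm)) ** 2
               for nv, nm, t in data)

# Brute-force search over the relative weight beta in [0, 2].
best_beta = min((round(x * 0.05, 2) for x in range(41)),
                key=lambda beta: sse(1.2, 0.25, beta))
print(best_beta)
```

A fitted beta well below 1 is the kind of evidence the paper describes: the weighted count that best explains the data says a massed or repeated encounter contributes less than a fresh one.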
[JNLE2006] Mostow, J., & Beck, J. (2006). Some useful tactics to modify, map, and mine data from intelligent tutors. Natural Language Engineering (Special Issue on Educational Applications) 12(2), 195-208. © 2006 Cambridge University Press. Click here for .pdf file.
Abstract: Mining data logged by intelligent tutoring systems has the potential to discover information of value to students, teachers, authors, developers, researchers, and the tutors themselves -- information that could make education dramatically more efficient, effective, and responsive to individual needs. We factor this discovery process into tactics to modify tutors, map heterogeneous event streams into tabular data sets, and mine them. This model and the tactics identified mark out a roadmap for the emerging area of tutorial data mining, and may provide a useful vocabulary and framework for characterizing past, current, and future work in this area. We illustrate this framework using experiments that tested interventions by an automated reading tutor to help children decode words and comprehend stories.
[IJAIED2006] Beck, J. E., & Sison, J. (2006). Using knowledge tracing in a noisy environment to measure student reading proficiencies. International Journal of Artificial Intelligence in Education, 16, 129-143. (In Special “Best of ITS 2004” Issue.) Click here for .pdf file.
Abstract: Constructing a student model for language tutors is a challenging task. This paper describes using knowledge tracing to construct a student model of reading proficiency and validates the model. We use speech recognition to assess a student’s reading proficiency at a subword level, even though the speech recognizer output is at the level of words and is statistically noisy. Specifically, we estimate the student’s knowledge of 80 letter to sound mappings, such as ch making the sound /K/ in “chemistry.” At a coarse level, the student model did a better job at estimating reading proficiency for 47.2% of the students than did a standardized test designed for the task. Although not quite as strong as the standardized test, our assessment method can provide a report on the student at any time during the year and requires no break from reading to administer. Our model’s estimate of the student’s knowledge on individual letter to sound mappings is a significant predictor of whether he will ask for help on a particular word. Thus, our student model is able to describe student performance both at a coarse- and at a fine-grain size.
event] Mostow, J., Beck, J., Cen, H., Gouvea, E., & Heiner, C. (2005, July). Interactive Demonstration of a Generic Tool to Browse Tutor-Student Interactions. Interactive Events Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED 2005).
Abstract: Project LISTEN's Session Browser is a generic tool to browse a database of students' interactions with an automated tutor. Using databases logged by Project LISTEN's Reading Tutor, we illustrate how to specify phenomena to investigate, explore events and the context where they occurred, dynamically drill down and adjust which details to display, and summarize events in human-understandable form. The tool should apply to MySQL databases from other tutors as well.
browser] Mostow, J., Beck, J.,
Abstract: A basic question in mining data from an intelligent tutoring system is, "What happened when…?" A generic tool to answer such questions should let the user specify which phenomenon to explore; explore selected events and the context in which they occurred; and require minimal effort to adapt the tool to new versions, to new users, or to other tutors. We describe an implemented tool and how it meets these requirements. The tool applies to MySQL databases whose representation of tutorial events includes student, computer, start time, and end time. It infers the implicit hierarchical structure of tutorial interaction so humans can browse it. A companion paper illustrates the use of this tool to explore data from Project LISTEN's automated Reading Tutor.
interruption] Heiner, C., Beck, J., & Mostow, J. (2005, July 18-22). When do students interrupt help? Effects of individual differences. Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED 2005).
Abstract: When do students interrupt help to request different help? To study this question, we analyze a within-subject experiment in the 2003-2004 version of Project LISTEN's Reading Tutor. From 168,983 trials of this experiment, we report patterns in when students choose to interrupt help. To improve model fit for individual data, we adjust our model to account for individual differences. We report small but significant correlations between a student parameter in our model and gender as well as external measures of motivation and academic performance.
engagement] Beck, J. (2005, July 18-22). Engagement tracing: using response times to model student disengagement. Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED 2005).
Abstract: Time on task is an important predictor for how much students learn. However, students must be focused on the learning for the time invested to be productive. Unfortunately, students do not always try their hardest to solve problems presented by computer tutors. This paper explores student disengagement and proposes an approach, engagement tracing, for detecting whether a student is engaged in answering questions. This model is based on item response theory, and uses as input the difficulty of the question, how long the student took to respond, and whether the response was correct. From these data, the model determines the probability a student was actively engaged in trying to answer the question. The model has a reliability of 0.95, and its estimate of student engagement correlates at 0.25 with student gains on external tests. Finally, the model is sensitive enough to detect variations in student engagement within a single tutoring session. The novel aspect of this work is that it requires only data normally collected by a computer tutor, and the affective model is validated against student performance on an external measure.
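Engagement tracing can be sketched roughly as follows (our simplified single-trial version with made-up parameter values; the paper's model is fit to full response data). The inputs match the abstract: question difficulty, response time, and correctness; the output is a posterior probability the student was actively trying.

```python
import math

def p_correct_given_engaged(ability, difficulty):
    """Item response theory (1PL/Rasch) success probability."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def p_engaged_posterior(correct, response_time,
                        ability=0.0, difficulty=0.0,
                        n_choices=4, prior_engaged=0.9,
                        min_read_time=3.0):
    """Posterior probability the student was genuinely trying.

    Assumption (ours, for illustration): responses faster than the time
    needed to read the question are likely blind guesses, so the prior on
    engagement drops sharply for very fast responses.
    """
    if response_time < min_read_time:
        prior_engaged = 0.2
    p_eng = p_correct_given_engaged(ability, difficulty)
    p_dis = 1.0 / n_choices          # blind guess among the choices
    like_eng = p_eng if correct else 1 - p_eng
    like_dis = p_dis if correct else 1 - p_dis
    num = prior_engaged * like_eng
    return num / (num + (1 - prior_engaged) * like_dis)

fast_wrong = p_engaged_posterior(False, 1.0)   # 1-second wrong answer
slow_right = p_engaged_posterior(True, 8.0)    # 8-second correct answer
print(round(fast_wrong, 3), round(slow_right, 3))
```

Averaging such posteriors over a window of trials gives a running engagement estimate, which is how a model of this shape can track variation within a single session.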
ASR] Beck, J. E., Chang, K., Mostow, J., & Corbett, A. (2005, July 19). Using a student model to improve a computer tutor's speech recognition. Proceedings of the AIED 05 Workshop on Student Modeling for Language Tutors, 12th International Conference on Artificial Intelligence in Education.
Abstract: Intelligent computer tutors can derive much of their power from having a student model that describes the learner’s competencies. However, constructing a student model is challenging for computer tutors that use automated speech recognition (ASR) as input. This paper reports using ASR output from a computer tutor for reading to compare two models of how students learn to read words: a model that assumes students learn words as whole-unit chunks, and a model that assumes students learn the individual letter-to-sound mappings that make up words. We use the data collected by the ASR to show that a model of letter-to-sound mappings better describes student performance. We then compare using the student model and the ASR, both alone and in combination, to predict which words the student will read correctly, as scored by a human transcriber. Surprisingly, majority class has a higher classification accuracy than the ASR. However, we demonstrate that the ASR output still has useful information, that classification accuracy is not a good metric for this task, and that the Area Under Curve (AUC) of ROC curves is a superior scoring method. The AUC of the student model is statistically reliably better (0.670 vs. 0.550) than that of the ASR, which in turn is reliably better than majority class. These results show that ASR can be used to compare theories of how students learn to read words, and modeling individual learners' proficiencies may enable improved speech recognition.
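The paper's point about metrics is easy to demonstrate with toy numbers (ours, not the paper's data): when one class dominates, a constant predictor gets high accuracy yet carries no information, while AUC exposes the difference.

```python
def accuracy(labels, preds):
    """Fraction of exact label matches."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking definition:
    the probability a randomly chosen positive outscores a randomly
    chosen negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 80% of words are read correctly, so always predicting "correct"
# scores 80% accuracy while carrying no information (AUC = 0.5).
labels = [1] * 8 + [0] * 2
majority = [1] * 10
print(accuracy(labels, majority))      # 0.8
print(auc(labels, [1.0] * 10))         # 0.5

# A noisy but informative confidence score ranks most positives above
# the negatives, so its AUC is far better than chance.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.45, 0.3, 0.4, 0.2]
print(auc(labels, scores))
```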
2005 model] Chang, K., Beck, J. E., Mostow, J., & Corbett, A. (2005, July 19). Using speech recognition to evaluate two student models for a reading tutor. Proceedings of the AIED 05 Workshop on Student Modeling for Language Tutors, 12th International Conference on Artificial Intelligence in Education.
Abstract: Intelligent Tutoring Systems derive much of their power from having a student model that describes the learner's competencies. However, constructing a student model is challenging for computer tutors that use automated speech recognition (ASR) as input, due to inherent inaccuracies in ASR. We describe two extremely simplified models of developing word decoding skills and explore whether there is sufficient information in ASR output to determine which model fits student performance better, and under what circumstances one model is preferable to another.
The two models that we describe are a lexical model that assumes students learn words as whole-unit chunks, and a grapheme-to-phoneme (G-to-P) model that assumes students learn the individual letter-to-sound mappings that compose the words. We use the data collected by the ASR to show that the G-to-P model better describes student performance than the lexical model. We then determine which model performs better under what conditions. On one hand, the G-to-P model better correlates with student performance data when the student is older or when the word is more difficult to read or spell. On the other hand, the lexical model better correlates with student performance data when the student has seen the word more times.
[AAAI 2005 workshop] Beck, J. (Ed.). (2005, July 10). Proceedings of the AAAI2005 Workshop on Educational Data Mining. Pittsburgh, PA.
browser] Mostow, J., Beck, J., Cen, H.,
Abstract: A basic question in mining data from an intelligent tutoring system is, "What happened when…?" We identify requirements for a tool to help answer such questions by finding occurrences of specified phenomena and browsing them in human-understandable form. We describe an implemented tool and how it meets the requirements. The tool applies to MySQL databases whose representation of tutorial events includes student, computer, start time, and end time. It automatically computes and displays the temporal hierarchy implicit in this representation. We illustrate the use of this tool to mine data from Project LISTEN's automated Reading Tutor.
Abstract: Students in two classes in the fall of 2004 making extensive use of online courseware were logged as they visited over 500 different “learning pages” which varied in length and in difficulty. We computed the time spent on each page by each student during each session they were logged in. We then modeled the time spent for a particular visit as a function of the page itself, the session, and the student. Surprisingly, the average time a student spent on learning pages (over their whole course experience) was of almost no value in predicting how long they would spend on a given page, even controlling for the session and page difficulty. The page itself was highly predictive, but so was the average time spent on learning pages in a given session. This indicates that local considerations, e.g., mood, deadline proximity, etc., play a much greater role in determining student pace and attention than do intrinsic student traits. We also consider the average time spent on learning pages as a function of the time of semester. Students spent less time on pages later in the semester, even for more demanding material.
[SSSR 2005] Mostow, J., & Beck, J. (2005). Micro-analysis of fluency gains in a Reading Tutor that listens: Wide vs. repeated guided oral reading. Talk at Twelfth Annual Meeting of the Society for the Scientific Study of Reading.
Abstract: Fluency growth is essential but imperfectly understood. By using automatic speech recognition to listen to children read aloud, Project LISTEN's Reading Tutor provides a novel instrument to study fluency development. During the 2002-2003 school year, hundreds of children in grades 1-4 used the Reading Tutor, which recorded them reading millions of words of text. The latency preceding each word reflects the reader’s cognitive effort to identify the word. Using automatic speech recognition to analyze latency changes between successive encounters of words in the same or different contexts provides new data about how fluency grows.
[Toronto 2005] Cunningham, T., & Geva, E. (2005, June 24). The effects of reading technologies on literacy development of ESL students [poster presentation]. Twelfth Annual Meeting of the Society for the Scientific Study of Reading.
2005] Reeder, K., Early, M., Kendrick, M., Shapiro, J., &
2005] Beck, J. E., & Mostow, J. (2005). Mining Data from Randomized Within-Subject Experiments in an Automated Reading Tutor (poster in session 34.080, "Logging Students' Learning in Complex Domains: Empirical Considerations and Technological Solutions"). American Educational Research Association 2005 Annual Meeting: Demography and Democracy in the Era of Accountability.
Abstract: Experiments embedded in the Reading Tutor help evaluate its decisions in tutoring decoding, vocabulary, and comprehension.
Abstract: This study looked at factors influencing teachers’ perception and usage of Project LISTEN’s Reading Tutor, a computerized tutor used with elementary students in 9 classroom-based, 10 computer lab-based, and 3 specialist-room school settings. Thirteen interviews and 22 survey responses (of a possible 28 teachers) examined teachers’ perception of the Reading Tutor and suggested that teachers’ belief in the Tutor influenced their usage of it (r = .46, p < .03). Three factors seemed to influence teacher belief: 1) perceived ease of use (r = .52, p < .01), 2) teachers’ reported experience with computers (r = .41, p < .04) and instructional technology (r = .48, p < .03), and 3) perceived technical problems such as frequency of technical problems (r = -.44, p < .04) and speed with which problems were fixed (r = .49, p < .02). Analysis of these factors suggested four themes that cut across factors and seem to influence the way teachers evaluate and use the Reading Tutor – the technology’s degree of convenience, competition from other educational priorities and practices, teacher experience and/or interest with technology, and data available to teachers and the way teachers prioritize that data. These results suggest that improving convenience of the Reading Tutor, instituting specialized training programs, and improving feedback mechanisms for teachers by providing relevant, situated data may influence teacher belief in the Reading Tutor and thereby increase teacher usage. This study contributes to current literature on educational technology usage by supporting previous literature suggesting that teacher belief in the importance of a technology influences their use of it. One unique feature of this study is that it uses both quantitative and qualitative methods to look at the research questions from two different research perspectives.
Abstract: A two-month pilot study of 34 second- through fourth-grade Hispanic students from four bilingual education classrooms was conducted to compare the efficacy of the 2004 version of the Project LISTEN Reading Tutor against the standard practice of sustained silent reading (SSR). The Reading Tutor uses automated speech recognition to listen to children read aloud. It provides both spoken and graphical feedback in order to assist the children with the oral reading task. Prior research with this software has demonstrated its efficacy within populations of native English speakers. This study was undertaken to obtain some initial indication as to whether the tutor would also be effective within a population of English language learners.
The study employed a crossover design in which each participant spent one month in each of the treatment conditions. The experimental treatment consisted of 25 minutes per day using the Reading Tutor in a small pullout lab setting. The control treatment consisted of remaining in the classroom and participating in established reading instruction activities. Dependent variables consisted of the school district’s curriculum-based measures for fluency, sight-word recognition, and comprehension.
The Reading Tutor group out-gained the control group on every measure during both halves of the crossover experiment. Within-subject results from paired t-tests indicate these gains were significant for both fluency measures (p < .001) and marginally significant for one sight-word measure (p = .056). Effect sizes were 0.55 for timed sight words, a robust 1.16 for total fluency, and an even larger 1.27 for fluency controlled for word accuracy. These dramatic results observed during a one-month treatment indicate this technology may have much to offer English language learners.
Abstract: We describe the automated generation and use of 69,326 comprehension cloze questions and 5,668 vocabulary matching questions in the 2001-2002 version of Project LISTEN's Reading Tutor used by 364 students in grades 1-9 at seven schools. To validate our methods, we used students' performance on these multiple-choice questions to predict their scores on the Woodcock Reading Mastery Test. A model based on students' cloze performance predicted their Passage Comprehension scores with correlation R=.85. The percentage of vocabulary words that students matched correctly to their definitions predicted their Word Comprehension scores with correlation R=.61.
We used both types of questions in a within-subject automated experiment to compare four ways to preview new vocabulary before a story - defining the word, giving a synonym, asking about the word, and doing nothing. Outcomes included comprehension as measured by performance on multiple-choice cloze questions during the story, and vocabulary as measured by matching words to their definitions in a posttest after the story. A synonym or short definition significantly improved posttest performance compared to just encountering the word in the story - but only for words students didn't already know, and only if they had a grade 4 or better vocabulary. Such a preview significantly improved performance during the story on cloze questions involving the previewed word - but only for students with a grade 1-3 vocabulary.
[TICL fluency] Beck, J. E., Jia, P., & Mostow, J. (2004). Automatically assessing oral reading fluency in a computer tutor that listens. Technology, Instruction, Cognition and Learning, 2, 61-81. Click here to download .pdf file.
Abstract: Much of the power of a computer tutor comes from its ability to assess students. In some domains, including oral reading, assessing the proficiency of a student is a challenging task for a computer. Our approach for assessing student reading proficiency is to use data that a computer tutor collects through its interactions with a student to estimate his performance on a human-administered test of oral reading fluency. A model with data collected from the tutor's speech recognizer output correlated, within-grade, at 0.78 on average with student performance on the fluency test. For assessing students, data from the speech recognizer were more useful than student help-seeking behavior. However, adding help-seeking behavior increased the average within-grade correlation to 0.83. These results show that speech recognition is a powerful source of data about student performance, particularly for reading.
[ITS 2004 tracing] Beck, J. E., & Sison, J. (2004, September 1-3). Using knowledge tracing to measure student reading proficiencies. Proceedings of the 7th International Conference on Intelligent Tutoring Systems, 624-634. Maceio, Brazil. (c) Springer-Verlag at http://www.springer.de/comp/lncs/index.html. Click here to download .pdf file.
Abstract: Constructing a student model for language tutors is a challenging task. This paper describes using knowledge tracing to construct a student model of reading proficiency and validates the model. We use speech recognition to assess a student’s reading proficiency at a subword level, even though the speech recognizer output is at the level of words. Specifically, we estimate the student’s knowledge of 80 letter-to-sound mappings, such as ch making the sound /K/ in “chemistry.” At a coarse level, the student model did a better job of estimating reading proficiency for 47.2% of the students than did a standardized test designed for the task. Our model’s estimate of the student’s knowledge of individual letter-to-sound mappings is a significant predictor of whether he will ask for help on a particular word. Thus, our student model is able to describe student performance at both a coarse and a fine grain size.
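The knowledge-tracing approach this abstract describes can be illustrated with a minimal sketch. The update rule below is standard Bayesian knowledge tracing applied to one skill, such as a single letter-to-sound mapping; the parameter values are illustrative assumptions, not fitted values from the paper.

```python
# Minimal Bayesian knowledge tracing (BKT) sketch for one skill,
# e.g. the letter-to-sound mapping "ch" -> /K/ from the abstract.
# All parameter values below are assumptions for illustration.

def bkt_update(p_known, correct, p_learn=0.1, p_slip=0.1, p_guess=0.2):
    """Return the updated P(skill known) after observing one response."""
    if correct:
        # Bayes rule: P(known | correct response)
        num = p_known * (1 - p_slip)
        den = num + (1 - p_known) * p_guess
    else:
        # Bayes rule: P(known | incorrect response)
        num = p_known * p_slip
        den = num + (1 - p_known) * (1 - p_guess)
    posterior = num / den
    # Allow for learning the skill on this practice opportunity
    return posterior + (1 - posterior) * p_learn

# Trace the estimate over a short sequence of observed responses
p = 0.3
for obs in [True, True, False, True]:
    p = bkt_update(p, obs)
```

Run over many words, such per-skill estimates give the kind of fine-grained proficiency profile the paper validates against help requests and test scores.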
[ITS 2004 questions] Beck, J. E., Mostow, J., & Bey, J. (2004, September 1-3). Can automated questions scaffold children's reading comprehension? Proceedings of the 7th International Conference on Intelligent Tutoring Systems, 478-490. Maceio, Brazil. (c) Springer-Verlag at http://www.springer.de/comp/lncs/index.html. Click here to download .pdf file.
Abstract: Can automatically generated questions scaffold reading comprehension? We automated three kinds of multiple-choice questions in children’s assisted reading. A within-subject experiment in the spring 2003 version of Project LISTEN’s Reading Tutor randomly inserted all three kinds of questions during stories as it helped children read them. To compare their effects on story-specific comprehension, we analyzed 15,196 subsequent cloze test responses by 404 children in grades
[ITS 2004 disengagement] Beck, J. E. (2004, August 31). Using response times to model student disengagement. Proceedings of the ITS2004 Workshop on Social and Emotional Intelligence in Learning Environments, Maceió, Brazil.
Abstract: Time on task is an important variable for learning a skill. However, learners must be focused on the learning for the time invested to be productive. Unfortunately, students do not always try their hardest to solve problems presented by computer tutors. This paper explores student disengagement and proposes a model for detecting whether a student is engaged in answering questions. This model is based on item response theory, and uses as input the difficulty of the question, how long the student took to respond, and whether the response was correct. From these data, the model determines the probability a student was actively engaged in trying to answer the question. To validate our model, we analyze 231 students’ interactions with the 2002-2003 version of the Reading Tutor. We show that disengagement is better modeled by simultaneously estimating student proficiency and disengagement than just estimating disengagement alone. Our best model of disengagement has a correlation of -0.25 with student learning gains. The novel aspect of this work is that it requires only data normally collected by a computer tutor, and the affective model is validated against student performance on an external measure.
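A minimal sketch of the kind of model this abstract describes, assuming a Rasch item-response model for an engaged student and treating a disengaged student as a guesser. The constants here (guess rate, fast-response cutoff, priors) are illustrative assumptions, not the paper's fitted values.

```python
import math

# Illustrative sketch: estimate P(engaged) from question difficulty,
# response time, and correctness, in the spirit of the abstract.

def p_correct_engaged(proficiency, difficulty):
    """Rasch-style probability of a correct answer if engaged."""
    return 1.0 / (1.0 + math.exp(-(proficiency - difficulty)))

def p_engaged(proficiency, difficulty, response_secs, correct,
              prior_engaged=0.8, p_guess=0.25, fast_cutoff=2.0):
    """Posterior probability the student was actively engaged."""
    # Assumption: a very fast response lowers the prior on engagement.
    prior = prior_engaged if response_secs >= fast_cutoff else 0.3
    p_c_eng = p_correct_engaged(proficiency, difficulty)
    p_c_dis = p_guess  # a disengaged student effectively guesses
    if correct:
        num = prior * p_c_eng
        den = num + (1 - prior) * p_c_dis
    else:
        num = prior * (1 - p_c_eng)
        den = num + (1 - prior) * (1 - p_c_dis)
    return num / den
```

As in the paper, proficiency and engagement would in practice be estimated jointly from many responses; this sketch shows only the per-question posterior.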
[ITS 2004 mining] Mostow, J. (2004, August 30). Some useful design tactics for mining ITS data. Proceedings of the ITS2004 Workshop on Analyzing Student-Tutor Interaction Logs to Improve Educational Outcomes, Maceió, Brazil.
Abstract: Mining data logged by intelligent tutoring systems has the potential to reveal valuable discoveries. What characteristics make such data conducive to mining? What variables are informative to compute? Based on our experience in mining data from Project LISTEN’s Reading Tutor, we discuss how to collect machine-analyzable data and formulate it into experimental trials. The resulting concepts and tactics mark out a roadmap for the emerging area of tutorial data mining, and may provide a useful vocabulary and framework for characterizing past, current, and future work in this area.
[ITS 2004 lessons] Heiner, C., Beck, J., & Mostow, J. (2004, August 30). Lessons on using ITS data to answer educational research questions. Proceedings of the ITS2004 Workshop on Analyzing Student-Tutor Interaction Logs to Improve Educational Outcomes, Maceió, Brazil.
Abstract: Some tutoring system projects have completed empirical studies of student-tutor interaction by manually collecting data while observing fewer than a hundred students. Analyzing larger, automatically collected data sets requires new methods to address new problems. We share lessons on design, analysis, presentation, and iteration. Our lessons are based on our experience analyzing data from Project LISTEN’s Reading Tutor, which automatically collected tutorial data from hundreds of students. We hope that these lessons will help guide analysis of similar datasets from other intelligent tutoring systems.
[ACL 2004 keynote] Mostow, J. (2004, July 22). If I Have a Hammer: Computational Linguistics in a
Abstract: Project LISTEN’s Reading Tutor uses speech recognition to listen to children read aloud, and helps them learn to read, as evidenced by rigorous evaluations of pre- to posttest gains compared to various controls. In the 2003-2004 school year, children ages 5-14 used the Reading Tutor daily at school on over 200 computers, logging over 50,000 sessions, 1.5 million tutorial responses, and 10 million words.
This talk uses the Reading Tutor to illustrate the diverse roles that computational linguistics can play in an
A recurring theme is the use of “big data” to train such models automatically.
[SSSR 2004 help] Mostow, J., Beck, J. E., & Heiner, C. (2004). Which Help Helps? Effects of Various Types of Help on Word Learning in an Automated
Abstract: When a tutor gives help on a word during assisted oral reading, how does the type of help matter? We report an automated, within-subject, randomized-trial experiment embedded in Project LISTEN's Reading Tutor. Hundreds of children (mostly in grades 1-3) used the Reading Tutor in 2002-2003, reading millions of words and getting help on hundreds of thousands of them. The experimental variable was the type of help, selected randomly by the Reading Tutor whenever it gave help on a word. The outcome variable was student performance on the next encounter of the word. We compare effects of several types of help.
[SSSR 2004 interventions] Beck, J. E., Sison, J., & Mostow, J. (2004, June 27-30). Using automated speech recognition to measure scaffolding and learning effects of word identification interventions in a computer tutor that listens. Eleventh Annual Meeting of the Society for the Scientific Study of Reading, Amsterdam, The Netherlands.
Abstract: Does it help to provide brief word identification assistance to students? On words they encounter soon afterwards? Does brief assistance lead to long-term learning gains? Which types of assistance are best? We have explored these questions using automated experiments in a computer tutor for reading that listens. We examine data from 300 students, mostly in grades 1 through 3. The major result was a definite scaffolding effect on student performance on the same day that assistance was given. Although there was a slight improvement in longer-term performance, the difference was not statistically significant.
[ICALL 2004] Heiner, C., Beck, J. E., & Mostow, J. (2004, June 17-19). Improving the Help Selection Policy in a
Abstract: What type of oral reading assistance is most effective for a given student on a given word? We analyze 189,039 randomized trials of a within-subject experiment to compare the effects of several types of help in the 2002-2003 version of Project LISTEN’s Reading Tutor. The independent variable is the type of help given on a word. The outcome variable is the student’s performance at the next encounter of that word, as measured by automatic speech recognition. Training a help selection policy sensitive to student or word level improves this outcome by a projected 4% – a substantial effect for picking a single better intervention.
[CALICO 2004] Beck, J. E., & Sison, J. (2004, June 8-12). Automated student assessment in language tutors. CALICO,
Abstract: The Reading Tutor is a computer tutor that uses Automated Speech Recognition (ASR) technology to listen to children read aloud and helps them learn how to read. The research reported here uses ASR output to predict students' GORT fluency posttest scores. Using a linear regression model, we achieved correlations of over .80 for predicting first through fourth graders' performance. Our model's predictive ability is on par with standard public school reading assessment measures. This work contributes to a better understanding of automated student assessment in language tutors and introduces methods for accounting for noisy ASR output.
[IJAIE 2004] Murray, R. C., VanLehn, K., & Mostow, J. (2004). Looking Ahead to Select Tutorial Actions: A Decision-Theoretic Approach. International Journal of Artificial Intelligence in Education, 14, 235-278. Download paper as .pdf file.
Abstract: We propose and evaluate a decision-theoretic approach for selecting tutorial actions by looking ahead to anticipate their effects on the student and other aspects of the tutorial state. The approach uses a dynamic decision network to consider the tutor’s uncertain beliefs and objectives in adapting to and managing the changing tutorial state. Prototype action selection engines for diverse domains – calculus and elementary reading – illustrate the approach. These applications employ a rich model of the tutorial state, including attributes such as the student’s knowledge, focus of attention, affective state, and next action(s), along with task progress and the discourse state. Our action selection engines have not yet been integrated into complete ITSs (this is the focus of future work), so we use simulated students to evaluate their capability to select rational tutorial actions that emulate the behaviors of human tutors. We also evaluate their capability to select tutorial actions quickly enough for real-world tutoring applications.
[ICAAI 2003] Banerjee, S., Mostow, J., Beck, J., & Tam, W. (2003, December 15-16). Improving Language Models by Learning from Speech Recognition Errors in a
Abstract: Lowering the perplexity of a language model does not always translate into higher speech recognition accuracy. Our goal is to improve language models by learning from speech recognition errors. In this paper we present an algorithm that first learns to predict which n-grams are likely to increase recognition errors, and then uses that prediction to improve language models so that the errors are reduced. We show that our algorithm reduces a measure of tracking error by more than 24% on unseen test data from a Reading Tutor that listens to children read aloud.
[CSMP 2003] Mostow, J., & Beck, J. (2003, November 3-4). When the Rubber Meets the Road: Lessons from the In-School Adventures of an Automated
Abstract: Project LISTEN's Reading Tutor (www.cs.cmu.edu/~listen) uses automatic speech recognition to listen to children read aloud, and helps them learn to read. Its experimental deployment in schools has expanded from a single computer used by eight third graders in one school in 1996 to two hundred computers used by children in grades 1-3 in nine schools in 2003. This project illustrates how technology can not just scale up an intervention, but instrument its implementation. For example, analysis of 2002-2003 usage showed that session frequency and duration averaged significantly higher in lab settings than in classrooms.
Abstract: This paper extends and evaluates previously published methods for predicting likely miscues in children's oral reading in a Reading Tutor that listens. The goal is to improve the speech recognizer's ability to detect miscues but limit the number of "false alarms" (correctly read words misclassified as incorrect). The "rote" method listens for specific miscues from a training corpus. The "extrapolative" method generalizes to predict other miscues on other words. We construct and evaluate a scheme that combines our rote and extrapolative models. This combined approach reduced false alarms by 0.52% absolute (12% relative) while simultaneously improving miscue detection by 1.04% absolute (4.2% relative) over our existing miscue prediction scheme.
Abstract: One issue in a Reading Tutor that listens is determining which words the student read correctly. We describe a confidence measure that uses a variety of features to estimate the probability that a word was read correctly. We trained two decision tree classifiers. The first classifier tries to fix insertion and substitution errors made by the speech decoder, while the second classifier tries to fix deletion errors. By applying the two classifiers together, we achieved a 25.89% relative reduction in false alarm rate while holding the miscue detection rate constant.
Abstract: We present an automated method to ask children questions during assisted reading, and experimentally evaluate its effects on their comprehension. In 2002, after a randomly inserted generic multiple-choice What/Where/When question, children were likelier to correctly answer an automatically generated comprehension question on a later sentence. The positive effects of such questions vanished during the second half of the study in 2003. We hypothesize why.
Abstract: This interactive event demonstrates various aspects of Project LISTEN’s Reading Tutor, which listens to children read aloud, and helps them learn to read.
Abstract: A 2002 Wizard of Oz study showed that emotional scaffolding provided by a human significantly increased children’s persistence in an automated Reading Tutor, as measured by the number of tasks they chose to undertake. We report a 5,965-trial experiment to test a simple automated form of such scaffolding, compared to a control condition without it. 348 children in grades K-4 spent significantly longer per task in the experimental condition due to a design flaw, yet still averaged equal numbers of tasks in both conditions. We theorize that they subjectively gauged effort in terms of number of tasks rather than number or duration of solution attempts.
Abstract: This paper describes our efforts at constructing a fine-grained student model in Project LISTEN’s intelligent tutor for reading.
Abstract: This paper reports results on using data mining to extract useful variables from a database that contains interactions between the student and Project LISTEN’s Reading Tutor. Our approach is to find variables we believe to be useful in the information logged by the tutor, and then to derive models that relate those variables to students’ scores on external, paper-based tests of reading proficiency. Once the relationship between the recorded variables and the paper tests is discovered, it is possible to use information recorded by the tutor to assess the student’s current level of proficiency. The major results of this work were the discovery of useful features available to the Reading Tutor that describe students, and a strong predictive model whose predictions correlate with actual test scores at 0.88.
Abstract: A year-long study of 131 second and third graders in 12 classrooms compared three daily 20-minute treatments. (a) 58 students in 6 classrooms used the 1999-2000 version of Project LISTEN’s Reading Tutor, a computer program that uses automated speech recognition to listen to a child read aloud, and gives spoken and graphical assistance. Students took daily turns using one shared Reading Tutor in their classroom while the rest of their class received regular instruction. (b) 34 students in the other 6 classrooms were pulled out daily for one-on-one tutoring by certified teachers. To control for materials, the human tutors used the same set of stories as the Reading Tutor. (c) 39 students served as in-classroom controls, receiving regular instruction without tutoring. We compared students’ pre- to post-test gains on the Word Identification, Word Attack, Word Comprehension, and Passage Comprehension subtests of the Woodcock Reading Mastery Test, and in oral reading fluency.
Surprisingly, the human-tutored group significantly outgained the Reading Tutor group only in Word Attack (main effects p<.02, effect size .55). Third graders in both the computer- and human-tutored conditions outgained the control group significantly in Word Comprehension (p<.02, respective effect sizes .56 and .72) and suggestively in Passage Comprehension (p=.14, respective effect sizes .48 and .34). No differences between groups on gains in Word Identification or fluency were significant. These results are consistent with an earlier study in which students who used the 1998 version of the Reading Tutor outgained their matched classmates in Passage Comprehension (p=.11, effect size .60), but not in Word Attack, Word Identification, or fluency.
To shed light on outcome differences between tutoring conditions and between individual human tutors, we compared process variables. Analysis of logs from all 6,080 human and computer tutoring sessions showed that human tutors included less rereading and more frequent writing than the Reading Tutor. Micro-analysis of 40 videotaped sessions showed that students who used the Reading Tutor spent considerable time waiting for it to respond, requested help more frequently, and picked easier stories when it was their turn. Human tutors corrected more errors, focused more on individual letters, and provided assistance more interactively, for example getting students to sound out words rather than sounding out words themselves as the Reading Tutor did.
Abstract: When does taking time to preview a new word before reading a story improve vocabulary and comprehension more than encountering the word in context? To address this question, the 2001-2002 version of Project LISTEN's Reading Tutor embedded an automated experiment to compare three types of vocabulary preview -- defining the word, giving a synonym, or just asking about the word -- and a control condition. Outcomes included within-story comprehension as measured by performance on multiple-choice cloze questions, and post-story vocabulary as measured by matching words to their definitions. We analyze results based on thousands of randomized trials.
[ICMI 2002 emotional] Aist, G., Kort, B., Reilly, R., Mostow, J., & Picard, R. (2002, October 14-16). Experimentally Augmenting an Intelligent Tutoring System with Human-Supplied Capabilities: Adding Human-Provided Emotional Scaffolding to an Automated Reading Tutor that Listens. Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces (ICMI 2002), Pittsburgh, PA, 483-490. Revised version of paper first presented at ITS 2002 Workshop on Empirical Methods for Tutorial Dialogue Systems, San Sebastian, Spain. Download paper in pdf format.
Abstract: This paper presents the first statistically reliable empirical evidence from a controlled study for the effect of human-provided emotional scaffolding on student persistence in an intelligent tutoring system. We describe an experiment that added human-provided emotional scaffolding to an automated Reading Tutor that listens, and discuss the methodology we developed to conduct this experiment. Each student participated in one (experimental) session with emotional scaffolding, and in one (control) session without emotional scaffolding, counterbalanced by order of session. Each session was divided into several portions. After each portion of the session was completed, the Reading Tutor gave the student a choice: continue, or quit. We measured persistence as the number of portions the student completed. Human-provided emotional scaffolding added to the automated Reading Tutor resulted in increased student persistence, compared to the Reading Tutor alone. Increased persistence means increased time on task, which ought to lead to improved learning. If these results for reading turn out to hold for other domains too, the implication for intelligent tutoring systems is that they should respond with not just cognitive support – but emotional scaffolding as well. Furthermore, the general technique of adding human-supplied capabilities to an existing intelligent tutoring system should prove useful for studying other ITSs too.
[ICMI 2002] Mostow, J., Beck, J., Chalasani, R.,
Abstract: It is easier to record logs of multimodal human-computer tutorial dialogue than to make sense of them. In the 2000-2001 school year, we logged the interactions of approximately 400 students who used Project LISTEN’s Reading Tutor and who read aloud over 2.4 million words. This paper discusses some difficulties we encountered converting the logs into a more easily understandable database. It is faster to write SQL queries to answer research questions than to analyze complex log files each time. The database also permits us to construct a viewer to examine individual
Abstract: This paper explores the problem of predicting specific reading mistakes, called miscues, on a given word. Characterizing likely miscues tells an automated reading tutor what to anticipate, detect, and remediate. As training and test data, we use a database of over 100,000 miscues transcribed by
Abstract: This paper addresses an indispensable skill using a unique method to teach a critical component: helping children learn to read by using computer-assisted oral reading to help children learn vocabulary. We build on Project LISTEN’s Reading Tutor, a computer program that adapts automatic speech recognition to listen to children read aloud, and helps them learn to read (http://www.cs.cmu.edu/~listen). To learn a word from reading with the Reading Tutor, students must encounter the word and learn the meaning of the word in context. We modified the Reading Tutor first to help students encounter new words and then to help them learn the meanings of new words. We then compared the Reading Tutor to classroom instruction and to human-assisted oral reading as part of a yearlong study with 144 second and third graders. The result: Second graders did about the same on word comprehension in all three conditions. However, third graders who read with the 1999 Reading Tutor, modified as described in this paper, performed statistically significantly better than other third graders in a classroom control on word comprehension gains – and even comparably with other third graders who read one-on-one with human tutors.
Abstract: A 7-month study of 178 students in grades 1-4 at two schools compared two daily 20-minute treatments. 88 students did Sustained Silent Reading (SSR) in their classrooms. 90 students in 10-computer labs used the 2000-2001 version of Project LISTEN’s Reading Tutor (RT), which uses speech recognition to listen to a child read aloud, and responds with spoken and graphical assistance (www.cs.cmu.edu/~listen). The RT group significantly outgained their statistically matched SSR classmates in phonemic awareness, rapid letter naming, word identification, word comprehension, passage comprehension, fluency, and spelling – especially in grade 1, where effect sizes for these skills ranged from .20 to .72.
Abstract: Analyzing the time allocation of students’ activities in a school-deployed mixed initiative tutor can be illuminating but surprisingly tricky. We discuss some complementary methods that we have used to understand how tutoring time is spent, such as analyzing sample videotaped sessions by hand, and querying a database generated from session logs. We identify issues, methods, and lessons that may be relevant to other tutors. One theme is that iterative design of “non-tutoring” components can enhance a tutor’s effectiveness, not by improved teaching, but by reducing the time wasted on non-learning activities. Another is that it is possible to relate students’ time allocation to improvements in various outcome measures.
Abstract: Our goal is to find a methodology for directing development effort in an intelligent tutoring system (ITS). Given that ITSs have several AI reasoning components, as well as content to present, evaluating them is a challenging task. Due to these difficulties, few evaluation studies to measure the impact of individual components have been performed. Our architecture evaluates the efficacy of each component of an ITS and considers the impact of a particular teaching goal when determining whether a particular component needs improving. For our AnimalWatch tutor, we found that for certain goals the tutor itself, rather than its reasoning components, needed improvement. We have found that it is necessary to know what the system’s teaching goals are before deciding which component is the limiting factor on performance. [Based on Dr. Beck's research at
Abstract: It is easier to record logs of multimodal human-computer tutorial dialogue than to make sense of them. This paper discusses some of the problems in extracting useful information from such logs and the difficulties we encountered in converting the logs into a more easily understandable database. Once log files are parsed into a database, it is possible to write SQL queries to answer research questions faster than analyzing complex log files each time. The database permits us to construct a viewer to examine individual
Abstract: Can vocabulary and comprehension assessments be generated automatically for a given text? We describe the automated method used to generate, administer, and score multiple-choice vocabulary and comprehension questions in the 2001-2002 version of Project LISTEN’s Reading Tutor. To validate the method against the Woodcock Reading Mastery Test, we analyzed 69,326 multiple-choice cloze items generated in the course of regular Reading Tutor use by 364 students in grades 1-9 at seven schools. Correlation between predicted and actual scores reached R=.85 for Word and Passage Comprehension.
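A toy sketch of the cloze-item idea in this abstract: delete a target word from a sentence and mix it with distractors. The Reading Tutor's actual generation, distractor selection, and scoring are more sophisticated; everything below, including the sentence and word choices, is invented for illustration.

```python
import random

# Illustrative cloze-item builder: blank out a target word and
# present it among distractor choices.

def make_cloze(sentence, target, distractors, rng=random.Random(0)):
    """Return (stem, shuffled choices, index of the correct choice)."""
    stem = sentence.replace(target, "_____", 1)
    choices = [target] + list(distractors)
    rng.shuffle(choices)
    return stem, choices, choices.index(target)

stem, choices, answer = make_cloze(
    "The cat sat on the mat.", "sat", ["ran", "ate", "sang"])
```

Generating items from a story's own sentences is what let the Reading Tutor administer thousands of such questions during regular use.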
[CVDA 2002 latency] Jia, P., Beck, J. E., & Mostow, J. (2002, June 3). Can a Reading Tutor that Listens use Inter-word Latency to Assess a Student's Reading Ability? ITS 2002 Workshop on Creating Valid Diagnostic Assessments, San Sebastian, Spain, pp. 23-32. Download paper in pdf format.
Abstract: This paper describes our use of inter-word latency, the delay before a student speaks a word in the course of reading a sentence aloud, to assess oral reading automatically. The context of our study is a Reading Tutor that uses automated speech recognition to listen to children read aloud. Using data from 58 students in grades 1 through 4, we used inter-word latency to predict scores on external, individually administered, paper-based tests. Correlation between predicted and actual test scores exceeded .7 for fluency, word attack, word identification, word comprehension, and passage comprehension. Compared with paper-based tests, this evaluation method is much cheaper, based on computer-guided oral reading recorded in the course of regular tutor use, and invisible to students. It has the potential to provide continuous assessment of student progress, both to report to teachers and to guide its own tutoring.
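The prediction step this abstract describes, regressing external test scores on latency-derived features, can be sketched with ordinary least squares on a single feature. The data points below are invented purely for illustration; they are not the study's data.

```python
# Illustrative sketch: predict an external fluency score from a
# student's mean inter-word latency (seconds) with one-feature OLS.

def fit_line(xs, ys):
    """Ordinary least squares fit for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Made-up training data: longer latencies go with lower fluency.
latency = [0.2, 0.4, 0.6, 0.8, 1.0]
fluency = [95, 80, 70, 55, 40]
a, b = fit_line(latency, fluency)
predicted = a + b * 0.5  # predicted score for a new student
```

The study's models used more features and reported correlations above .7 between predicted and actual test scores; this sketch only shows the regression mechanics.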
[IRA 2002 award] Aist, G. (2002, April 29). Helping Children Learn Vocabulary during Computer-Assisted Oral Reading: A Dissertation Summary [Poster presented as a Distinguished Finalist for the Outstanding Dissertation of the Year Award]. 47th Annual Convention of the International Reading Association,
[IJAIED 2001] Aist, G. Towards automatic glossarization: automatically constructing and administering vocabulary assistance factoids and multiple-choice assessment. International Journal of Artificial Intelligence in Education (2001) 12, 212-231. Download from IJAIE website.
Abstract: We address an important problem with a novel approach: helping children learn words during computer-assisted oral reading. We build on Project LISTEN's Reading Tutor, which is a computer program that adapts automatic speech recognition to listen to children read aloud, and helps them learn to read (http://www.cs.cmu.edu/~listen). In this paper, we focus on the problem of vocabulary acquisition. To learn a word from reading with the Reading Tutor, students must first encounter the word and then learn the meaning of the word from context. This paper describes how we modified the Reading Tutor to help students learn the meanings of new words by augmenting stories with WordNet-derived comparisons to other words – "factoids". Furthermore, we report results from an embedded experiment designed to evaluate the effectiveness of including factoids in stories that children read with the Reading Tutor. Factoids helped – not for all students and all words, but for third graders seeing rare words, and for single sense rare words tested one or two days later. We also discuss further steps towards automatic construction of explanations of words.
[FF 2001] Mostow, J., and Aist, G. Evaluating tutors that listen: An overview of Project LISTEN. In (K. Forbus and P. Feltovich, Eds.) Smart Machines in Education, pp. 169-234. MIT/AAAI Press, 2001. Order book from AAAI Press.
[DYD 2001] Aist, G. Towards Worldwide Literacy: Technological Affordances, Economic Challenges, Affordable Technology. Development by Design: Workshop on Collaborative Open Source Design of Appropriate Technologies. MIT Media Lab,
[NAACL 2001] Jack Mostow, Greg Aist, Juliet Bey, Paul Burkhead, Andrew Cuneo, Susan Rossbach, Brian Tobin, Joe Valeri, and Sara Wilson. A hands-on demonstration of Project LISTEN’s Reading Tutor and its embedded experiments. Refereed demo presented at Language Technologies 2001: The Second Meeting of the North American Chapter of the Association for Computational Linguistics,
Abstract: Project LISTEN’s Reading Tutor helps children learn to read. It uses speech recognition to listen to them read aloud, and responds with spoken and graphical feedback. The demonstration lets attendees try out this interaction themselves. Besides the spoken tutorial dialog, features shown include an automated tutorial for new users, interactive activities that combine assisted reading with other types of steps, and automated field studies to evaluate the efficacy of alternative tutorial interventions by embedding experiments within the Reading Tutor.
[WTDS 2001 DT] Murray, R. Charles, Van Lehn, Kurt, and Mostow, Jack. A Decision-Theoretic Approach for Selecting Tutorial Discourse Actions. In Proceedings of the NAACL 2001 Workshop on Adaptation in Dialogue Systems,
[WTDS 2001 DTa] Murray, R. Charles, Van Lehn, Kurt, and Mostow, Jack. A Decision-Theoretic Architecture for Selecting Tutorial Discourse Actions. In Proceedings of the AIED-2001 Workshop on Tutorial Dialog Systems, San Antonio, Texas, May 2001, pp. 35-46. Download paper in pdf format.
Abstract: We propose a decision-theoretic architecture for selecting tutorial discourse actions. DT Tutor, an action selection engine which embodies our approach, uses a dynamic decision network to consider the tutor’s objectives and uncertain beliefs in adapting to the changing tutorial state. It predicts the effects of the tutor’s discourse actions on the tutorial state, including the student’s internal state, and then selects the action with maximum expected utility. We illustrate our approach with prototype applications for diverse domains: calculus problem-solving and elementary reading. Formative off-line evaluations assess DT Tutor’s ability to select optimal actions quickly enough to keep a student engaged.
[2001 poster] Mostow, J., Aist, G. S., Burkhead, P., Corbett, A.,
Abstract: A year-long study of 144 second and third graders compared outcomes (gains in test scores) and process variables (e.g. words read) for Project LISTEN’s Reading Tutor, human tutors, and a classroom control. Human tutors beat the Reading Tutor only in word attack. Both beat the control in grade 3 word comprehension.
[pause video] Jack Mostow, Cathy Huang, and Brian Tobin. Pause the Video: Quick but quantitative expert evaluation of tutorial choices in a Reading Tutor that listens. In J. D. Moore, C. L. Redfield, and W. L. Johnson (Eds.), Artificial Intelligence in Education: AI-ED in the Wired and Wireless Future, pp. 343-353.
Abstract: To critique Project LISTEN’s automated Reading Tutor, we adapted a panel-of-judges methodology for evaluating expert systems. Three professional elementary educators watched 15 video clips of the Reading Tutor listening to second and third graders read aloud. Each expert chose which of 10 interventions to make in each situation. To keep the Reading Tutor’s choice from influencing the expert, we paused each video clip just before the Reading Tutor intervened. After the expert responded, we played back what the Reading Tutor had actually done. The expert then rated its intervention compared to hers.
Although the experts seldom agreed, they rated the Reading Tutor’s choices as better than their own in 5% of the cases, equally good in 36%, worse but OK in 41%, and inappropriate in only 19%. The lack of agreement and the surprisingly favorable ratings together suggest that either the Reading Tutor’s choices were better than we thought, the experts knew less than we hoped, or the clips showed less than they should.
[miscue mining] James Fogarty, Laura Dabbish, David Steck, and Jack Mostow. Mining a database of reading mistakes: For what should an automated Reading Tutor listen? In J. D. Moore, C. L. Redfield, and W. L. Johnson (Eds.), Artificial Intelligence in Education: AI-ED in the Wired and Wireless Future, pp. 422-433.
Abstract: Using a machine learning approach to mine a database of over 70,000 oral reading mistakes
[vocabulary gains] Aist, G. S., Mostow, J., Tobin, B., Burkhead, P., Corbett, A.,
Abstract: We describe results on helping children learn vocabulary during computer-assisted oral reading. This paper focuses on one aspect – vocabulary learning – of a larger study comparing computerized oral reading tutoring to classroom instruction and one-on-one human tutoring. 144 students in second and third grade were assigned to one of three conditions: (a) classroom instruction, (b) classroom instruction with one-on-one tutoring replacing part of the school day, and (c) computer instruction replacing part of the school day. For second graders, there were no significant differences between treatments in word comprehension gains. For third graders, however, the computer tutor showed an advantage over classroom instruction for gains in word comprehension (p = 0.042, effect size = 0.56) as measured by the Woodcock Reading Mastery Test. One-on-one human tutoring also showed an advantage over classroom instruction alone (p = 0.039, effect size = 0.72). Computer tutoring and one-on-one human tutoring were not significantly different in terms of word comprehension gains.
[factoids] Gregory S. Aist. Factoids: Automatically constructing and administering vocabulary assistance and assessment. In J. D. Moore, C. L. Redfield, and W. L. Johnson (Eds.), Artificial Intelligence in Education: AI-ED in the Wired and Wireless Future, pp.
Abstract: We address an important problem with a novel approach: helping children learn words during computer-assisted oral reading. We build on Project LISTEN's Reading Tutor, which is a computer program that adapts automatic speech recognition to listen to children read aloud, and helps them learn to read (http://www.cs.cmu.edu/~listen). In this paper, we focus on the problem of vocabulary acquisition. To learn a word from reading with the Reading Tutor, students must first encounter the word and then learn the meaning of the word from context. This paper describes how we modified the Reading Tutor to help students learn the meanings of new words by augmenting stories with WordNet-derived comparisons to other words – “factoids”. Furthermore, we report results from an embedded experiment designed to evaluate the effectiveness of including factoids in stories that children read with the Reading Tutor. Factoids helped – not for all students and all words, but for third graders seeing rare words, and for single-sense rare words tested one or two days later.
[2001 PhD] Aist, G. 2001. Helping Children Learn Vocabulary during Computer-Assisted Oral Reading.
[2000 SA] Aist, G. Identifying words to explain to a reader: A preliminary study. Student Abstract and Poster, Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), p.
[AAAI 2000 DC] Aist, G. Helping children learn vocabulary during computer assisted oral reading. SIGART/AAAI Doctoral Consortium, Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000),
[HMC 2000] Aist, G. Taking Turns Talking About Text in a Reading Tutor that Listens.
Abstract: In this paper we report on ongoing work on turn-taking in Project LISTEN's Reading Tutor (Mostow & Aist CALICO 1999). Project LISTEN’s Reading Tutor listens to children read aloud and helps them learn to read. The Reading Tutor’s repertoire of turn-taking behaviors includes not only alternating turns, but also backchanneling, interrupting, and prompting.
[ITS 2000 YR] Aist, G. An informal model of vocabulary acquisition during assisted oral reading and some implications for computerized instruction. In R. Nkambou (Ed.), ITS'2000 Young Researchers Track Proceedings, pp. 22-24. Fifth International Conference on Intelligent Tutoring Systems.
[ITS 2000 PA] Aist, G. and Mostow, J. Improving story choice in a reading tutor that listens. Proceedings of the Fifth International Conference on Intelligent Tutoring Systems (ITS’2000), p. 645.
[ITS 2000 HT] Aist, G. Human Tutor and Computer Tutor Story Choice in Listening to Children Read Aloud. In B. du Boulay (Ed.), Proceedings of the ITS'2000 Workshop on Modeling Human Teaching Tactics and Strategies, pp. 8-10. Fifth International Conference on Intelligent Tutoring Systems.
Abstract: A preliminary report on a comparison of human tutor story choice and mixed-initiative computer tutor story choice in Project LISTEN's Reading Tutor.
[ITS 2000 ML] Aist, G. and Mostow, J. Using Automated Within-Subject Invisible Experiments to Test the Effectiveness of Automated Vocabulary Assistance. In Joseph Beck (Ed.), Proceedings of ITS'2000 Workshop on Applying Machine Learning to ITS Design/Construction, pp. 4-8. Fifth International Conference on Intelligent Tutoring Systems.
Abstract: Machine learning offers the potential to allow an intelligent tutoring system to learn effective tutoring strategies. A necessary prerequisite to learning an effective strategy is being able to automatically test a strategy's effectiveness. We conducted an automated, within-subject “invisible experiment” to test the effectiveness of a particular form of vocabulary instruction in a Reading Tutor that listens. Both conditions were in the context of assisted oral reading with the computer. The control condition was encountering a word in a story. The experimental condition was first reading a short automatically generated "factoid" about the word, such as "cheetah can be a kind of cat. Is it here?" and then reading the sentence from the story containing the target word. The initial analysis revealed no significant difference between the conditions. Further inspection revealed that sometimes students benefited from receiving help on "hard" or infrequent words. Designing, implementing, and analyzing this experiment shed light not only on the particular vocabulary help tested, but also on the machine-learning-inspired methodology we used to test the effectiveness of this tutorial action.
Aist, G. and Mostow, J. Measuring the Effects of Backchanneling in Computerized Oral Reading.
Abstract: What is the effect of backchanneling on human-computer dialog, and how should such effects be measured? We present experiments designed to evaluate the immediate effects of backchanneling on computer-assisted oral reading tutoring. These experiments are implemented in a reading tutor that listens to children read aloud, and helps them learn to read. As a byproduct of designing, conducting, and evaluating these experiments, we are able to describe some unique methodological challenges in evaluating the effects of low-level turn-taking dialog behavior.
[USPTO 99] Mostow, J. and Aist, G. Reading and Pronunciation Tutor. United States Patent No. 5,920,838. Filed June 2, 1997; issued July 6, 1999. US Patent and Trademark Office.
Abstract: A computer implemented reading tutor comprises a player for outputting a response. An input block implementing a plurality of functions such as silence detection, speech recognition, etc. captures the read material. A tutoring function compares the output of the speech recognizer to the text which was supposed to have been read and generates a response, as needed, based on information in a knowledge base and an optional student model. The response is output to the user through the player. A quality control function evaluates the captured read material and stores the captured material in the knowledge base under certain conditions. An auto enhancement function uses information available to the tutor to create additional resources such as identifying rhyming words, words with common roots, etc., which can be used as responses.
[AAAI99] Mostow, J. and Aist, G. Authoring New Material in a Reading Tutor that Listens.
[CALICO99] Mostow, J. and Aist, G. Giving Help and Praise in a Reading Tutor that Listens.
[SRinCALL] G. Aist. Speech recognition in computer assisted language learning. In K. C. Cameron (ed.), Computer Assisted Language Learning (CALL): Media, Design, and Applications. Lisse: Swets & Zeitlinger, 1999.
[CHI99] G. Aist. Skill-specific spoken dialogs in a reading tutor that listens. Doctoral Consortium paper. In Proceedings of the Conference on Human Factors in Computing Systems: CHI 99 Extended Abstracts, pp. 55-56.
[LIS99] Mostow, J. (ed.), McClelland, J., Fiez, J., McCandliss, B., Plaut, D., and Schneider, W. Poster and short presentation at the NSF Learning & Intelligent Systems Principal Investigators' meeting, Washington, DC, May, 1999. At http://www.cnbc.cmu.edu/collaborative/lisweb/ppt/index.htm. In J. McClelland (PI), Intervention Strategies that Promote Learning: Their Basis and Use in Enhancing Literacy, at http://www.cnbc.cmu.edu/collaborative/lisweb
Mostow, J. Collaborative Research on Learning Technologies: An Automated Reading Assistant That Listens.
[HCIGW99 IS] Mostow, J. Guiding Spoken Dialogue with Computers by Responding to Prosodic Cues. Proceedings of the NSF Human Computer Interaction Grantees Workshop (HCIGW99),
[acoustic] Aist, G., Chan, P., Huang, X. D., Jiang, L., Kennedy, R., Latimer, D., Mostow, J., and Yeung, C. How effective is unsupervised data collection for children's speech recognition? International Conference on Spoken Language Processing (ICSLP98).
Abstract: Children present a unique challenge to automatic speech recognition. Today’s state-of-the-art speech recognition systems still have problems handling children’s speech because their acoustic models are trained on data collected from adult speakers. In this paper we describe an inexpensive way to address this problem. We collected children’s speech as they interacted with an automated reading tutor. These data were subsequently transcribed by a speech recognition system and automatically filtered. We studied how to use these automatically collected data to improve the performance of children’s speech recognition. Experiments indicate that automatically collected data can significantly reduce the error rate on children’s speech.
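The "automatically filtered" step in this abstract can be pictured with a hypothetical sketch (the paper's actual filtering criteria are not reproduced here; the field names and threshold are assumptions): keep an utterance for acoustic training only when the recognizer's hypothesis matches the prompted text with high confidence.

```python
def filter_for_training(utterances, min_confidence=0.9):
    """Keep automatically transcribed utterances only when the recognizer's
    hypothesis matches the prompted text and confidence is high.
    A hypothetical stand-in for the paper's filtering step."""
    return [u for u in utterances
            if u["confidence"] >= min_confidence
            and u["hypothesis"] == u["prompt"]]

data = [
    {"prompt": "the cat sat", "hypothesis": "the cat sat", "confidence": 0.95},
    {"prompt": "the cat sat", "hypothesis": "the hat sat", "confidence": 0.97},
    {"prompt": "dogs can run", "hypothesis": "dogs can run", "confidence": 0.62},
]
print(len(filter_for_training(data)))  # 1
```

The point of such a filter is that no human transcription is needed: mismatched or low-confidence utterances are simply discarded rather than corrected.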
[architecture] Aist, G. Expanding A Time-Sensitive Conversational Architecture For Turn-Taking To Handle Content-Driven Interruption. International Conference on Spoken Language Processing (ICSLP98).
Abstract: Turn-taking in spoken language systems has generally been push-to-talk or strict alternation (user speaks, system speaks, user speaks, …), with some systems such as telephone-based systems handling barge-in (interruption by the user). In this paper we describe our time-sensitive conversational architecture for turn-taking that allows not only alternating turns and barge-in, but other conversational behaviors as well. This architecture allows backchanneling, prompting the user by taking more than one turn if necessary, and overlapping speech. The architecture is implemented in a Reading Tutor that listens to children read aloud, and helps them. We extended this architecture to allow the Reading Tutor to interrupt the student based on a non-self-corrected mistake – “content-driven interruption”. To the best of our knowledge, the Reading Tutor is thus the first spoken language system to intentionally interrupt the user based on the content of the utterance.
[AAAI AMLDP 98] G. Aist and J. Mostow. Estimating the Effectiveness of Conversational Behaviors
Abstract: Project LISTEN's Reading Tutor listens to children read aloud, and helps them learn to read. Besides user satisfaction, a primary criterion for tutorial spoken dialogue agents should be educational effectiveness. In order to learn to be more effective, a spoken dialogue agent must be able to evaluate the effect of its own actions. When evaluating the effectiveness of individual actions, rather than comparing a conversational action to "nothing," an agent must compare it to reasonable alternative actions. We describe a methodology for analyzing the immediate effect of a conversational action, and some of the difficulties in doing so. We also describe some preliminary results on evaluating the effectiveness of conversational behaviors in a reading tutor that listens.
[AAAI IE 98] J. Kominek, G. Aist, and J. Mostow. When Listening Is Not Enough: Potential Uses of Vision for a Reading Tutor that Listens.
Abstract: Speech offers a powerful avenue between user and computer. However, if the user is not speaking, or is speaking to someone else, what is the computer to make of it? Project LISTEN's Reading Tutor is speech-aware software that strives to teach children to read. Because it is useful to know what the child is doing when reading, we are investigating some potential uses of computer vision. By recording and analyzing video of the Tutor in use, we measured the frequency of events that cannot be detected by speech alone. These include how often the child is visually distracted, and how often the teacher or another student provides assistance. This information helps us assess how vision might enhance the effectiveness of the Reading Tutor.
[CAHM 97] G. S. Aist and J. Mostow. A time to be silent and a time to speak: Time-sensitive communicative actions in a reading tutor that listens. AAAI Fall Symposium on Communicative Actions in Humans and Machines.
Abstract: Timing is important in discourse, and key in tutoring. Communicative actions that are too late or too early may be infelicitous. How can an agent engage in temporally appropriate behavior? We present a domain-independent architecture that models elapsed time as a critical factor in understanding the discourse. Our architecture also allows for "invisible experiments" where the agent varies its behavior and studies the effects of its behavior on the discourse. This architecture has been instantiated and is in use in an oral reading tutor that listens to children read aloud and helps them.
[PUI 97] G. S. Aist and J. Mostow. When Speech Input is Not an Afterthought: A Reading Tutor that Listens. Proceedings of the Workshop on Perceptual User Interfaces,
Abstract: Project LISTEN's Reading Tutor listens to children read aloud, and helps them. The first extended in-school use of the Reading Tutor suggests that for this task speech input can be natural, compelling, and effective.
[CALL 97] G. S. Aist and J. Mostow. Adapting Human Tutorial Interventions for a Reading Tutor that Listens.
Abstract: Human tutors make use of a wide range of input and output modalities, such as speech, vision, gaze, and gesture. Computer tutors are typically limited to keyboard and mouse input. Project LISTEN's Reading Tutor listens to children read aloud, and helps them. Why should a computer tutor listen? A computer tutor that listens can give help and give praise naturally and unobtrusively. In this paper, we address the following questions: When and how should a computer tutor that listens help students? When and how should a computer tutor that listens praise students? We examine how the advantages and disadvantages of speech recognition helped shape the design and implementation of the Reading Tutor. Despite its limitations, speech recognition enables the Reading Tutor to provide patient, unobtrusive, and natural assistance for reading out loud.
[ISGW97 CRLT] J. Mostow. Collaborative Research on Learning Technologies: An Automated Reading Assistant That Listens. Proceedings of the NSF Interactive Systems Grantees Workshop (ISGW97),
[ISGW97 IS] J. Mostow. Guiding Spoken Dialogue with Computers by Responding to Prosodic Cues. Proceedings of the NSF Interactive Systems Grantees Workshop (ISGW97),
[ISGW97 KIDS] J. Mostow and M. Eskenazi. A Database of Children's Speech. Proceedings of the NSF Interactive Systems Grantees Workshop (ISGW97),
[LDC KIDS] M. Eskenazi and J. Mostow. The CMU KIDS Speech Corpus. Corpus of children's read speech digitized and transcribed on two CD-ROMs, with assistance from Multicom Research and David Graff. Published by the Linguistic Data Consortium,
[AAAI97] J. Mostow and G. Aist. The Sounds of Silence: Towards Automated Evaluation of Student Learning in a Reading Tutor that Listens.
Abstract: We propose a paradigm for ecologically valid, authentic, unobtrusive, automatic, data-rich, fast, robust, and sensitive evaluation of computer-assisted student performance. We instantiate this paradigm in the context of a Reading Tutor that listens to children read aloud, and helps them. We introduce inter-word latency as a simple prosodic measure of assisted reading performance. Finally, to validate the measure and analyze performance improvement, we report initial experimental results from the first extended in-school deployment of the Reading Tutor.
[1997 video] J. Mostow. Pilot Evaluation of Project LISTEN's Reading Tutor (5-minute video). July, 1997. Presented at the Fourteenth National Conference on Artificial Intelligence (AAAI-97) and the Ninth National Conference on Innovative Applications of Artificial Intelligence (IAAI-97).
[MS 97] G. S. Aist. A General Architecture for a Real-Time Discourse Agent and a Case Study in Oral Reading.
[AAAI CMMII 97] G. S. Aist. Challenges for a mixed initiative spoken dialog system for oral reading tutoring. In Computational Models for Mixed Initiative Interaction: Working Notes of the AAAI 1997 Spring Symposium. March, 1997. Download paper in pdf format.
Abstract: Deciding when a task is complete and deciding when to intervene and provide assistance are two basic challenges for an intelligent tutoring system. This paper describes these decisions in the context of Project LISTEN, an oral reading tutor that listens to children read aloud and helps them. We present theoretical analysis and experimental results demonstrating that supporting mixed initiative interaction produces better decisions on the task completeness decision than either system-only or user-only initiative. We describe some desired characteristics of a solution to the intervention decision, and specify possible evaluation criteria for such a solution.
Abstract: We have collected a database of children reading age- and reading-level-appropriate text aloud. This (labelled) data, to be distributed in the near future, was primarily intended to be used in CMU's LISTEN tutor which employs speech recognition to monitor children's reading and then help correct errors. The speaker population was therefore chosen to represent good and poor readers and to incorporate dialects of the speakers for whom the reading coach is intended. Phonemic balance could not be achieved (although it has been calculated) since the primary concern in recording children reading is to present sentences that can effectively be read by first through third graders. The text is a series of sentences we adapted from text in the Weekly Reader series - most of the adaptation concerned the lack of the accompanying images. The text was chosen for its intrinsic interest and widespread use. Several trial recording sessions allowed us to develop a protocol that kept extraneous noises produced by the children at a minimum. We will discuss this and other problems inherent in recording children reading. Novel techniques developed for labelling this kind of speech will also be presented. This work was funded by NSF Grant No. IRI-9528984.
J. Mostow, A. Hauptmann, and S. Roth. Demonstration of a Reading Coach that Listens. In Proceedings of the Eighth Annual Symposium on User Interface Software and Technology, pp. 77-78. Sponsored by ACM SIGGRAPH and SIGCHI in cooperation with SIGSOFT,
Abstract: Project LISTEN stands for "Literacy Innovation that Speech Technology ENables." We will demonstrate a prototype automated reading coach that displays text on a screen, listens to a child read it aloud, and helps where needed. We have tested successive prototypes of the coach on several dozen second graders. Mostow et al [AAAI94] reports implementation details and evaluation results. Here we summarize its functionality, the issues it raises in human-computer interaction, and how it addresses them. We are redesigning the coach based on our experience, and will demonstrate its successor at UIST '95.
[NSF ISGW 95] J. Mostow & M. Eskenazi, summary of NSF project, November 1995,
[AAAI 94] J. Mostow, S. Roth, A. G. Hauptmann, and M. Kane, "A Prototype Reading Coach that Listens", Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), American Association for Artificial Intelligence, Seattle, WA, August 1994, pp. 785-792. Recipient of the AAAI-94 Outstanding Paper Award. Download paper in pdf format.
Abstract: We report progress on a new approach to combating illiteracy -- getting computers to listen to children read aloud. We describe a fully automated prototype coach for oral reading. It displays a story on the screen, listens as a child reads it, and decides whether and how to intervene. We report on pilot experiments with low-reading second graders to test whether these interventions are technically feasible to automate and pedagogically effective to perform. By adapting a continuous speech recognizer, we detected 49% of the misread words, with a false alarm rate under 4%. By incorporating the interventions in a simulated coach, we enabled the children to read and comprehend material at a reading level 0.6 years higher than what they could read on their own. We show how the prototype uses the recognizer to trigger these interventions automatically.
[AAAI 94 video] J. Mostow, S. Roth, A. Hauptmann, M. Kane, A. Swift, L. Chase, and B. Weide, "A Reading Coach that Listens (6-minute video)", Video Track of the Twelfth National Conference on Artificial Intelligence (AAAI94), American Association for Artificial Intelligence, Seattle, WA, August 1994. Download paper in pdf format.
[ARPA HLT 94] A. G. Hauptmann, J. Mostow, S. F. Roth, M. Kane, and A. Swift, "A Prototype Reading Coach that Listens: Summary of Project LISTEN." In C. Weinstein (ed.), Proceedings ARPA Workshop on Human Language Technology, March
[Eurospeech 93] A. G. Hauptmann, L. L. Chase, and J. Mostow, "Speech Recognition Applied to Reading Assistance for Children: A Baseline Language Model", Proceedings of the 3rd European Conference on Speech Communication and Technology (EUROSPEECH93),
Abstract: We describe an approach to using speech recognition in assisting children's reading. A state-of-the-art speaker independent continuous speech recognizer designed for large vocabulary dictation is adapted to the task of identifying substitutions and omissions in a known text. A baseline language model for this new task is detailed and evaluated against a corpus of children reading graded passages. We are able to identify words missed by a reader with an average false positive rate of 39% and a corresponding false negative rate of 37%. These preliminary results are encouraging for our long-term goal of providing automated coaching for children learning to read.
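The false positive and false negative rates reported in this abstract can be computed from per-word judgments as follows (an illustrative sketch, not the paper's scoring code): a false positive is a correctly read word the system flags as missed; a false negative is a misread word the system accepts.

```python
def miscue_rates(read_correctly, flagged):
    """Given parallel booleans per word (was it read correctly? did the
    system flag it as missed?), return (false positive rate, false
    negative rate). Illustrative only; not the paper's scoring code."""
    fp = sum(1 for ok, fl in zip(read_correctly, flagged) if ok and fl)
    fn = sum(1 for ok, fl in zip(read_correctly, flagged) if not ok and not fl)
    n_ok = sum(read_correctly)
    n_miss = len(read_correctly) - n_ok
    return fp / n_ok, fn / n_miss

fp_rate, fn_rate = miscue_rates(
    [True, True, False, True, False],   # 3 words read correctly, 2 miscues
    [True, False, False, False, True])  # system flagged the 1st and 5th words
print(round(fp_rate, 2), round(fn_rate, 2))  # 0.33 0.5
```

With this bookkeeping, the 39%/37% figures above say roughly a third of correct words were wrongly flagged and a third of real miscues slipped through in the 1993 baseline.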
[Video 93] J. Mostow, S. Roth, A. Hauptmann, M. Kane, A. Swift, L. Chase, and B. Weide, "Getting Computers to Listen to Children Read: A New Way to Combat Illiteracy (7-minute video)", Overview and research methodology of Project LISTEN as of July 1993.
[AAAI 93] J. Mostow, A. G. Hauptmann, L. L. Chase, and S. Roth, "Towards a Reading Coach that Listens: Automated Detection of Oral Reading Errors", Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI93), American Association for Artificial Intelligence, Washington, DC, July 1993, pp. 392-397. Download paper in pdf format.
Abstract: What skill is more important to teach than reading? Unfortunately, millions of Americans cannot read. Although a large body of educational software exists to help teach reading, its inability to hear the student limits what it can do.
This paper reports a significant step toward using automatic speech recognition to help children learn to read: an implemented system that displays a text, follows as a student reads it aloud, and automatically identifies which words he or she missed. We describe how the system works, and evaluate its performance on a corpus of second graders' oral reading that we have recorded and transcribed.