Text-Driven Forecasting (11-773)

Instructor: Prof. Noah Smith
History: Taught in Fall 2009 (Tuesday/Thursday 12-1:20 pm, Wean 5304)
Prerequisite: permission of instructor

Course Description

Text-driven forecasting is an emerging collection of problems in which text documents or document collections are automatically analyzed to make specific, testable predictions about the future. Well-known examples include predictions about stock or market behavior, product sales patterns, government elections, legislative activities, or public opinion polls.

While a research community focusing on these problems has yet to form, this course is based on the following observations:

This twelve-credit seminar-project hybrid course aims to begin identifying challenge problems and testing some solutions to them.


The time and location are TBD; please contact the instructor if you are interested in participating.

The course will meet twice a week for the first month or so, operating like a seminar with discussion of two or three papers per week and brainstorming. The remainder of the semester will focus on team projects, which will account for the bulk of the grade. Each team of approximately three students will build a system that uses a text database to make testable predictions about the future.

A student wishing to audit the course will be expected to attend the course meetings, serve as an informal consultant to one of the teams, and write a short "lessons learned" paper at the end of the semester.

This course counts as a "lab" for LTI students.


Grades will be assigned based on participation in class discussions (40%) and the course project (60%).

Course Plan and Readings

Part 1: Seminar (roughly 1/3 of the semester)

Date    Readings to discuss    Notes
Tu 8-25 None; introductions, administrivia, and high-level discussion about the course.
Th 8-27 Das and Chen, 2007: Yahoo! for Amazon: Sentiment extraction from small talk on the Web. This is a journal version of a much-cited 2001 paper you can find here. Note that the classification techniques in this paper are very simplistic, from the point of view of machine learning as well as computational linguistics. Brendan's notes.
Tu 9-1 Koppel and Shtrimberg, 2004: Good news or bad news? Let the market decide.
Lavrenko, Schmill, Lawrie, Ogilvie, Jensen, and Allen, 2000: Mining of concurrent text and time series.
Vasco's notes.
Th 9-3 Ghose, Ipeirotis, and Sundararajan, 2007: Opinion mining using econometrics: a case study on reputation systems.
Kogan, Levin, Routledge, Sagi, and Smith, 2009: Predicting risk from financial reports with regression.
Brendan's notes.
Tu 9-8 Antweiler and Frank, 2005: Do US stock markets typically overreact to corporate news stories?
Skim only: Antweiler and Frank, 2004: Is all that talk just noise? The information content of Internet message boards.
Mahesh's notes.
Th 9-10 Danescu-Niculescu-Mizil, Kossinets, Kleinberg, and Lee, 2009: How opinions are received by online communities: A case study on Amazon.com helpfulness votes. Mahesh's notes.
Tu 9-15 Monroe, Colaresi, and Quinn, 2009: Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Dipanjan's notes.
Th 9-17 Lerman, Gilder, Dredze, and Pereira, 2008: Reading the markets: Forecasting public opinion of political candidates by news analysis. Ramnath's notes.
Tu 9-22 (no meeting)
Th 9-24 Gentzkow and Shapiro, 2007: What drives media slant? Evidence from U.S. daily newspapers. Neel's notes.
Tu 9-29 Fader, Radev, Crespin, Monroe, Quinn, and Colaresi, 2007: MavenRank: Identifying influential members of the U.S. Senate using lexical centrality. Dipanjan's notes.
Th 10-1 Tausczik and Pennebaker, 2009: The psychological meaning of words: LIWC and computerized text analysis methods.

Part 2: Projects (roughly 2/3 of the semester)

After deciding on project topics and forming teams, we will usually meet as a class once a week to discuss issues that come up in the projects and hear interim reports from each team. There may be some additional readings as well.
Tu 10-6 Project proposals
Th 10-8 Project selection and division into teams
Tu 10-13 Zhang and Skiena, 2009: Improving movie gross prediction through news analysis.
Tu 10-20 Dodds and Danforth, 2009: Measuring the happiness of large-scale written expression: songs, blogs, and presidents.
Tu 10-27 Simonoff and Sparrow, 2000: Predicting movie grosses: Winners and losers, blockbusters and sleepers.
Tu 11-3 Friedman, Hastie, and Tibshirani, 2009: Regularization paths for generalized linear models via coordinate descent.
Tu 11-10 Mishne and Glance, 2006: Predicting movie sales from blogger sentiment.
Tu 11-17 (no paper)
Tu 11-24 Liang, Jordan, and Klein, 2009: Learning semantic correspondences with less supervision.
Th 12-3 Final project presentations (Thursday, not Tuesday!)

Useful Resources
