
Advanced Statistical Language Processing: Reading the Web (10-709)

Data and Software

Tom Mitchell
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Fall 2009


We have several data sets available to support class projects.

1. Co-occurrence statistics between noun phrases (e.g., 'New York City') and contexts (e.g., 'mayor of __'). Two sets of this data are available.

2. Browsable knowledge bases, including lists of instances of categories such as animals, shapes, and people; lists of instances of relations such as plays_sport(person,team); learned extraction patterns; and more. This data is available here, and software to access it programmatically is just below. Some items from this site that might be especially useful are:

3. (Coming soon.) Co-occurrence statistics between individual English words. In particular, a 50k by 50k array giving the frequency with which each of the 50k most frequent English words/tokens co-occurs with each of the others.
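
To make the shape of the word-by-word array in item 3 concrete, here is a minimal Python sketch that builds a small version of such a matrix from a toy corpus. It is an illustration only, not the software used to produce the course data. This page does not say how co-occurrence is defined, so the sketch assumes two tokens co-occur when they appear in the same sentence, counted once per sentence. The function name `cooccurrence_matrix`, its parameters, and the sample sentences are all hypothetical.

```python
from collections import Counter


def cooccurrence_matrix(sentences, vocab_size):
    """Count same-sentence co-occurrences among the most frequent tokens.

    Assumption (not from the course page): two tokens "co-occur" if both
    appear in the same sentence, and each pair is counted once per sentence.
    """
    # Rank tokens by corpus frequency; break ties alphabetically so the
    # vocabulary order is deterministic.
    freq = Counter(tok for sent in sentences for tok in sent)
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [tok for tok, _ in ranked[:vocab_size]]
    index = {tok: i for i, tok in enumerate(vocab)}

    # Dense square matrix of counts, one row and one column per vocab token.
    matrix = [[0] * len(vocab) for _ in vocab]
    for sent in sentences:
        # Distinct in-vocabulary token ids present in this sentence.
        ids = sorted({index[tok] for tok in sent if tok in index})
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                matrix[ids[a]][ids[b]] += 1
                matrix[ids[b]][ids[a]] += 1  # keep the matrix symmetric
    return vocab, matrix


# Toy corpus, invented for illustration.
sentences = [
    "the mayor of new york".split(),
    "the mayor of boston".split(),
    "new york is large".split(),
]
vocab, matrix = cooccurrence_matrix(sentences, vocab_size=4)
```

At the full 50k by 50k scale a dense list of lists would hold 2.5 billion cells, so a sparse representation (for example, a dictionary keyed by token-id pairs) would likely be the more practical choice.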