CMU Text-Learning Group Data Archive
Here you will find a collection of text data sets, ranging from newsgroup articles to collections of web pages. We are always looking for more data sets to include in the archive, so if you have any text-related data sets you would like to submit, please see the link below on adding to the archive.
All of the data referenced by this page can be found under the directory /afs/cs.cmu.edu/project/theo-3/. There, you will find a directory for data, one for results, one for training models, and one for data packages (tarred and gzipped bundles).
Info on adding to the archive
This section contains pointers to results obtained from the data listed above.
This section contains pointers to knowledge models obtained from the data listed above.
This section will (eventually) contain pointers to tarred and gzipped data sets that are publicly distributable.