The 4 Universities Data Set

This data set contains WWW-pages collected from computer science departments of various universities in January 1997 by the World Wide Knowledge Base (Web->Kb) project of the CMU text learning group. The 8,282 pages were manually classified into the following categories:

student (1641)
faculty (1124)
staff (137)
department (182)
course (930)
project (504)
other (3764)

The class other is a collection of pages that were not deemed the "main page" representing an instance of the previous six classes. (For example, a particular faculty member may be represented by a home page, a publications list, a vita and several research interests pages. Only the faculty member's home page was placed in the faculty class. The publications list, vita and research interests pages were all placed in the other category.)

For each class the data set contains pages from the four universities:

Cornell (867)
Texas (827)
Washington (1205)
Wisconsin (1263)

and 4,120 miscellaneous pages collected from other universities.

The data is available from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/webkb-data.gtar.gz
(GNU tar'ed and gzip'ped).

The files are organized into a directory structure, one directory for each class. Each of these seven directories contains 5 subdirectories, one for each of the 4 universities and one for the miscellaneous pages. These directories in turn contain the Web pages. The file name of each page corresponds to its URL, where '/' was replaced with '^'. Note that the pages start with a MIME header. Some of the pages do not contain useful information. For example, about 80 pages contain only information for redirecting the browser to a different location. These pages are not evenly distributed over the different classes.
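
As a rough illustration of the layout described above, the following Python sketch walks the class/university directories, recovers each page's URL from its file name, and separates the MIME header from the page body. It is not part of the data set distribution. The function names are hypothetical, and two things are assumptions: that the header ends at the first blank line (as in ordinary MIME messages), and that the files can be read as Latin-1 text.

```python
from pathlib import Path


def filename_to_url(name: str) -> str:
    """Undo the data set's file-name encoding: '/' in the URL was stored as '^'."""
    return name.replace("^", "/")


def split_mime_header(raw: str) -> tuple[str, str]:
    """Split a stored page into (header, body).

    Assumption: the MIME header ends at the first blank line. If no blank
    line is found, the whole text is treated as the body.
    """
    normalized = raw.replace("\r\n", "\n")
    header, sep, body = normalized.partition("\n\n")
    if not sep:
        return "", normalized
    return header, body


def iter_pages(root):
    """Yield (class_name, university, url, body) for every page under root.

    Expects the layout root/<class>/<university or misc>/<encoded URL>.
    """
    for path in sorted(Path(root).glob("*/*/*")):
        if not path.is_file():
            continue
        class_name, university = path.parts[-3], path.parts[-2]
        # Assumption: Latin-1 decoding, which never fails on arbitrary bytes.
        text = path.read_text(encoding="latin-1")
        _header, body = split_mime_header(text)
        yield class_name, university, filename_to_url(path.name), body
```

Each yielded record carries its class and university, so pages can be grouped later without consulting the directory tree again.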

Test/Train Splits

Since each university's web pages have their own idiosyncrasies, we do not recommend training and testing on pages from the same university. We recommend training on three of the universities plus the misc collection, and testing on the pages from a fourth, held-out university. There is a simple Perl script for creating a directory structure, which should make it easier to do this four-fold cross validation. No guarantees.
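
The recommended four-fold scheme can be sketched in a few lines of Python. This is not the Perl script mentioned above, only an illustration of the same idea. The directory names in UNIVERSITIES are assumptions about how the subdirectories are named, and each page record is assumed to be a tuple whose second element is the university (or the miscellaneous collection) it came from.

```python
# Assumed subdirectory names; adjust to match the unpacked archive.
UNIVERSITIES = ["cornell", "texas", "washington", "wisconsin"]


def leave_one_university_out(pages):
    """Yield (held_out, train, test) for each of the four universities.

    pages: list of tuples whose second element is the university name.
    Pages from any other source (the misc collection) always go to training.
    """
    for held_out in UNIVERSITIES:
        train = [p for p in pages if p[1] != held_out]
        test = [p for p in pages if p[1] == held_out]
        yield held_out, train, test
```

In every fold the test set contains only the held-out university, while the training set contains the other three universities together with the miscellaneous pages.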

How the Web->Kb Project Tokenized the Data

When using the rainbow software to classify these pages, we tokenized the text using the following rainbow options:

--skip-headers, to avoid tokenizing the MIME headers
--skip-html, to avoid tokenizing everything inside `<' and `>'
--lex-pipe-command=tag-digits, to tokenize numbers specially (where tag-digits is a file containing this Perl script)
--no-stoplist, to avoid rainbow's default behavior of removing all words in the standard SMART stoplist.
--prune-vocab-by-occur-count=2, to remove from the vocabulary all tokens that only occur once.
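
For readers without rainbow, the effect of some of these options can be approximated in Python. The sketch below is not rainbow's lexer and will not reproduce its tokens exactly. It assumes the MIME header has already been removed, strips everything between '<' and '>', keeps all words (no stoplist), and drops tokens that occur only once across the collection. The special handling of numbers by tag-digits is not reproduced; this sketch simply keeps alphabetic tokens only, which is an assumption. All function names are hypothetical.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]*>")    # anything between '<' and '>'
TOKEN_RE = re.compile(r"[a-z]+")   # assumption: alphabetic tokens only


def tokenize(body: str) -> list[str]:
    """Strip HTML tags, lowercase the text, and split it into word tokens."""
    text = TAG_RE.sub(" ", body)
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(documents, min_count=2):
    """Return the set of tokens that occur at least min_count times overall.

    documents: iterable of token lists. With min_count=2, tokens that occur
    only once in the whole collection are removed from the vocabulary.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return {token for token, count in counts.items() if count >= min_count}
```

The vocabulary should be built from the training pages of a split only, so that the held-out university does not influence which tokens are kept.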

For each test/train split, we performed feature selection by removing all but the 2000 words with highest Mutual Information with the class variable. We did this using rainbow's --prune-vocab-by-infogain=2000 option.

last update: January 11, 1998 (McCallum)
created: January, 1998 (Juffi)