For each class the data set contains pages from the four universities
The data is available from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/webkb-data.gtar.gz
(GNU tar'ed and gzip'ped).
The files are organized into a directory structure, with one directory for each class. Each of these seven directories contains five subdirectories, one for each of the four universities and one for the miscellaneous pages. These subdirectories in turn contain the web pages. The file name of each page corresponds to its URL, with '/' replaced by '^'. Note that each page starts with a MIME header. Some of the pages do not contain useful information. For example, about 80 pages contain only information for redirecting the browser to a different location. These redirect pages are not evenly distributed over the different classes.
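As a rough illustration of the naming convention and the MIME header noted above, the following Python sketch recovers a URL from a file name and separates the header from the page body. The function names are hypothetical, the example file name is made up, and the sketch assumes that the header ends at the first blank line, which is the usual MIME convention but has not been checked against every page in the archive.

```python
from typing import Tuple


def filename_to_url(file_name: str) -> str:
    """Undo the '/' -> '^' substitution used in the file names."""
    # Caveat: a URL that originally contained '^' cannot be told apart.
    return file_name.replace("^", "/")


def split_mime_header(raw: str) -> Tuple[str, str]:
    """Split a stored page into (header, body) at the first blank line.

    Assumption: the MIME header is terminated by an empty line.
    If no blank line is found, the whole text is returned as the body.
    """
    normalized = raw.replace("\r\n", "\n")
    header, separator, body = normalized.partition("\n\n")
    if not separator:
        return "", normalized
    return header, body
```

For instance, filename_to_url("http:^^www.example.edu^~someone^index.html") yields "http://www.example.edu/~someone/index.html".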
Since each university's web pages have their own idiosyncrasies, we do not recommend training and testing on pages from the same university. We recommend training on three of the universities plus the misc collection, and testing on the pages from a fourth, held-out university. There is a simple Perl script for creating a directory structure, which should make it easier to do this four-fold cross validation. No guarantees.
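The recommended split can also be sketched in a few lines of Python. This is not the Perl script mentioned above, only a minimal illustration that assumes the layout described earlier (class directory, then one subdirectory per source, then page files). The function name held_out_split and the subdirectory name passed as test_university are hypothetical; use whatever names the unpacked archive actually contains.

```python
from pathlib import Path
from typing import Dict, List, Tuple

Split = Dict[str, List[Path]]


def held_out_split(root: Path, test_university: str) -> Tuple[Split, Split]:
    """Return (train, test) page lists per class.

    Pages under the subdirectory named test_university go to the test
    set; every other subdirectory (the remaining universities and the
    miscellaneous pages) goes to the training set.
    """
    train: Split = {}
    test: Split = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        train[class_dir.name] = []
        test[class_dir.name] = []
        for source_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            pages = sorted(p for p in source_dir.iterdir() if p.is_file())
            target = test if source_dir.name == test_university else train
            target[class_dir.name].extend(pages)
    return train, test
```

Calling the function once per university, each time with a different held-out name, gives the four train/test splits of the cross validation.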
When using the rainbow software to classify these pages, we tokenized the text using the following rainbow options:
For each test/train split, we performed feature selection by removing
all but the 2000 words with highest Mutual Information with the class
variable. We did this using rainbow's
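
To make the feature selection step concrete, here is a small Python sketch of ranking words by mutual information with the class variable and keeping the top k. It is not rainbow's implementation and may differ from it in detail: it assumes each document is reduced to the set of words it contains (binary occurrence), and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple

Document = Tuple[Set[str], str]  # (words occurring in the page, class label)


def mutual_information_scores(docs: Sequence[Document]) -> Dict[str, float]:
    """Mutual information (in bits) between word occurrence and class."""
    n = len(docs)
    class_counts = Counter(label for _, label in docs)
    word_counts: Counter = Counter()
    joint_counts: Counter = Counter()
    for words, label in docs:
        for word in words:
            word_counts[word] += 1
            joint_counts[(word, label)] += 1

    scores: Dict[str, float] = {}
    for word, n_w in word_counts.items():
        mi = 0.0
        for label, n_c in class_counts.items():
            n_wc = joint_counts[(word, label)]
            # Sum over both events: word present, word absent.
            for n_joint, n_event in ((n_wc, n_w), (n_c - n_wc, n - n_w)):
                if n_joint > 0:
                    mi += (n_joint / n) * math.log2(n_joint * n / (n_event * n_c))
        scores[word] = mi
    return scores


def top_k_words(docs: Sequence[Document], k: int = 2000) -> List[str]:
    """Return the k highest-scoring words; ties are broken alphabetically."""
    scores = mutual_information_scores(docs)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

The scores should be computed from the training pages of each split only, so that the held-out university does not influence which words are kept.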
last update: January 11, 1998 (McCallum)
created: January, 1998 (Juffi)