Dataset Used in the Co-training Experiments for COLT 98

This data set contains a subset of the WWW-pages collected from computer science departments of various universities in January 1997 by the World Wide Knowledge Base (Web->KB) project of the CMU text learning group. The 1051 pages were manually classified into two categories, course and non-course.

The data is available as a GNU tar archive compressed with gzip.

The files are organized into a directory structure with two directories at the top level.

Under each of the two directories, there is one directory for each class (course, non-course). These directories in turn contain the Web-pages. The file name of each page corresponds to its URL, where '/' was replaced with '^'. Note that the pages start with a MIME-header.
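The layout described above can be read with a short script. The following Python sketch is illustrative only and is not part of the original distribution: the function names (filename_to_url, split_mime_header, load_view) are hypothetical, and it assumes that the MIME-header ends at the first blank line, that the pages can be decoded as Latin-1, and that no original URL contains a '^' of its own.

```python
from pathlib import Path
from typing import Iterator, Tuple


def filename_to_url(name: str) -> str:
    """Undo the '/' -> '^' substitution used in the file names.

    Assumption: the original URL contained no '^' character.
    """
    return name.replace("^", "/")


def split_mime_header(raw: str) -> Tuple[str, str]:
    """Split a stored page into (header, body).

    Assumption: the header block ends at the first blank line.
    If no blank line is found, the whole text is treated as body.
    """
    normalized = raw.replace("\r\n", "\n")
    header, sep, body = normalized.partition("\n\n")
    if not sep:
        return "", normalized
    return header, body


def load_view(view_dir: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (label, url, body) for each page under view_dir/<class>/.

    view_dir is one of the two top-level directories; the label is the
    name of the class directory (course or non-course).
    """
    for class_dir in sorted(Path(view_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        for page in sorted(class_dir.iterdir()):
            if not page.is_file():
                continue
            # Latin-1 maps every byte to a character, so decoding never fails.
            raw = page.read_text(encoding="latin-1")
            _, body = split_mime_header(raw)
            yield class_dir.name, filename_to_url(page.name), body
```

Calling load_view on each of the two top-level directories in turn gives one labeled collection per directory, with the MIME-header already stripped from every page.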

