Overview of Text Datasets

The WebKB dataset

The complete WebKB dataset consists of seven classes of web pages collected from computer science departments: student, faculty, course, project, department, staff and other.

Frequently, only four classes are used (student, faculty, course, project); this subset is typically called WebKB4. This is not to be confused with the 4 universities subset, which includes web pages from Cornell, Washington, Wisconsin and Texas, but not pages from the misc collection.
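As a minimal sketch of how the WebKB4 subset relates to the full set of classes, the snippet below filters labeled pages down to the four classes. The (label, text) record layout, the function name select_webkb4, and the sample records are assumptions made for illustration; they are not the dataset's actual file format or real pages from it.

```python
from collections import Counter

# The seven classes named in the description above.
ALL_CLASSES = {"student", "faculty", "course", "project",
               "department", "staff", "other"}

# The four classes that make up the WebKB4 subset.
WEBKB4_CLASSES = {"student", "faculty", "course", "project"}


def select_webkb4(pages):
    """Keep only (label, text) pairs whose label is a WebKB4 class."""
    return [(label, text) for label, text in pages
            if label in WEBKB4_CLASSES]


# Hypothetical toy records, not real WebKB pages.
pages = [
    ("student", "home page of a graduate student"),
    ("other", "departmental seminar schedule"),
    ("course", "syllabus for an operating systems course"),
    ("staff", "contact page for an administrator"),
    ("faculty", "research interests and publications"),
]

subset = select_webkb4(pages)
counts = Counter(label for label, _ in subset)
print(len(subset), sorted(counts))
```

Note that this filter works on class labels only. Restricting the data to the 4 universities subset would be a separate filter on the page's source, which this sketch does not model.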

Some learning algorithms use both the web page text and the hyperlink structure. A relational representation of the 4 universities pages and hyperlinks is available. Also available is a collection of anchor text and full text for discriminating between course and non-course pages in the 4 universities data.

The 20 Newsgroups dataset

The 20 Newsgroups dataset is a collection of about 20,000 Usenet news postings organized into 20 different newsgroups.

The Industry Sector dataset

The Industry Sector dataset is a collection of web pages belonging to companies from various economic sectors.

