Read The Web - Datasets

Spring 2006

1. A data set of web pages from CS and Biology departments.

This collection of web pages is available at /afs/cs/project/theo-21/dataset/. It contains web pages obtained by crawling university Biology departments (BIO/) and Computer Science departments (CS/), stored as one file per web page. The only pages cached during the crawl are those whose URLs end in html, htm, txt, /, or php. The stored pages are in raw HTML format, with all HTML tags intact. The raw page has been modified in just one way: the URL of the source page has been added as the first line of the file.
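As an illustration of the file format described above, here is a minimal Python sketch that separates the added URL line from the raw HTML and walks the BIO/ and CS/ directories. The function names split_page and iter_pages are hypothetical, not part of the dataset. The sketch also assumes that files may sit in nested subdirectories and that the text encoding is unknown, so undecodable bytes are replaced rather than raising an error.

```python
from pathlib import Path
from typing import Iterator, Tuple


def split_page(raw: str) -> Tuple[str, str]:
    """Split one stored file into (source URL, raw HTML).

    The first line is the URL added by the crawl; everything after
    the first newline is the page as it was fetched.
    """
    url, _, html = raw.partition("\n")
    return url.strip(), html


def iter_pages(root: Path) -> Iterator[Tuple[str, str, str]]:
    """Yield (label, url, html) for every file under root/BIO and root/CS.

    Assumption: subdirectory layout is not documented, so we search
    recursively. Encoding is also an assumption; bad bytes are replaced.
    """
    for label in ("BIO", "CS"):
        for path in sorted((root / label).rglob("*")):
            if path.is_file():
                raw = path.read_text(encoding="utf-8", errors="replace")
                url, html = split_page(raw)
                yield label, url, html
```

A call such as iter_pages(Path("/afs/cs/project/theo-21/dataset")) would then yield one labeled (url, html) pair per stored page, which is a convenient starting point for building a CS-versus-Biology classifier.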

Please send your questions and feedback to Sophie (last updated 6th Feb 2006).