1. A data set of web pages from CS and Biology departments.
This collection of web pages is available at
/afs/cs/project/theo-21/dataset/. It contains web pages collected
by crawling Biology departments (BIO/) and Computer Science departments
(CS/) in universities, stored in one file per web page. The pages
cached during the crawl are those whose URLs end in: html, htm, txt, /,
php. The stored pages are in raw HTML format with all HTML tags retained.
The raw page has been modified in just one way: the URL of the source
page has been added as the first line of the file.
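
The file layout described above (source URL on the first line, raw HTML after it) can be read with a short script. The following is only a minimal sketch: the function names read_page and iter_pages are hypothetical, and it assumes the files can be decoded as UTF-8 with undecodable bytes replaced, and that the page files sit somewhere under the BIO/ and CS/ directories.

```python
from pathlib import Path
from typing import Iterator, Tuple


def read_page(path: Path) -> Tuple[str, str]:
    """Return (source_url, raw_html) for one stored page."""
    # Assumption: UTF-8 with replacement, since crawled pages may use
    # other encodings that the data set description does not specify.
    text = path.read_text(encoding="utf-8", errors="replace")
    # The first line is the URL added by the crawler; the rest is the raw page.
    first_line, _, rest = text.partition("\n")
    return first_line.strip(), rest


def iter_pages(root: Path, department: str) -> Iterator[Tuple[str, str]]:
    """Yield (source_url, raw_html) for every file under root/department."""
    # Assumption: files may be nested in subdirectories, so search recursively.
    for path in sorted((root / department).rglob("*")):
        if path.is_file():
            yield read_page(path)
```

For example, iter_pages(Path("/afs/cs/project/theo-21/dataset"), "CS") would walk the Computer Science pages, provided that directory is mounted and readable.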
Please send your
questions and feedback to Sophie
( last update 6th Feb