SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers

Robert C. Miller and Krishna Bharat

Robert C. Miller and Krishna Bharat. "SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers." Proceedings of the Seventh International World Wide Web Conference (WWW7), Brisbane, Australia, April 1998. In Computer Networks and ISDN Systems, v. 30, pp. 119-130, 1998.

Abstract

Crawlers, also called robots and spiders, are programs that browse the World Wide Web autonomously. This paper describes SPHINX, a Java toolkit and interactive development environment for Web crawlers. Unlike other crawler development systems, SPHINX is geared towards developing crawlers that are Web-site-specific, personally customized, and relocatable. SPHINX allows site-specific crawling rules to be encapsulated and reused in content analyzers, known as classifiers. Personal crawling tasks can be performed (often without programming) in the Crawler Workbench, an interactive environment for crawler development and testing. For efficiency, relocatable crawlers developed using SPHINX can be uploaded and executed on a remote Web server.
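To make the idea of a classifier concrete, the following is a minimal sketch of how site-specific rules might be bundled into a reusable content analyzer. It is not SPHINX's Java API: the names `Page`, `SiteClassifier`, `classify`, and the host `news.example.com` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Page:
    """Hypothetical stand-in for a fetched page (not a SPHINX class)."""
    url: str
    title: str = ""
    labels: set = field(default_factory=set)


class SiteClassifier:
    """Hypothetical classifier: holds the rules for one Web site so that
    different crawlers can reuse them."""

    def __init__(self, host, rules):
        self.host = host    # the site these rules apply to
        self.rules = rules  # mapping: label -> predicate over a Page

    def classify(self, page):
        # Pages from other sites are returned unlabeled.
        if urlparse(page.url).hostname != self.host:
            return page
        # Attach every label whose rule matches the page.
        for label, predicate in self.rules.items():
            if predicate(page):
                page.labels.add(label)
        return page


# Assumed example rules for an imaginary news site.
news = SiteClassifier(
    "news.example.com",
    {
        "article": lambda p: "/story/" in urlparse(p.url).path,
        "index": lambda p: urlparse(p.url).path in ("", "/"),
    },
)

page = news.classify(Page("http://news.example.com/story/42", "Example"))
other = news.classify(Page("http://other.example.org/story/42"))
```

A crawler written against such labels can ask whether a page is an "article" without knowing the site's URL conventions, which is the kind of encapsulation and reuse the abstract attributes to classifiers.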

Figure: A screenshot from SPHINX.

