Read the Web: Web Intelligent Toolkit
This page describes a Java package for retrieving web pages in three ways. You may (1) specify the URL of the page you want, (2) specify a search query that will be passed to Google to retrieve page URLs matching the query, or (3) specify a start page and a domain that you would like to crawl.
Installation:
- The Java development kit (j2sdk 1.4 or later) is required. It is available from http://java.sun.com/
- Download "WIT.zip" and unzip it on your computer.
- Apply for a Gmail account, obtain your own Google Web Search license key, and replace the key in lib/key.txt with it.
- Google Desktop Search version 2 is also required; install it on your workstation.
Set Environment Variables:
If on Windows, define "WIT" as a system environment variable from "Control Panel -> System -> Advanced -> Environment Variables", e.g. set "WIT" to "C:\wit". If on Linux, define "WIT" with, e.g., "export WIT=/usr/user1/smith/wit".
From the "wit/" directory, execute:
".\script\setup.bat" if in DOS mode;
". script/setup.cygwin" if in a cygwin terminal on Windows;
". script/setup.linux" if on Linux.
Development:
Usage:
- GetURLsFromQuery.
Given a query, use Google Web search to get a number of relevant web page URLs. The retrieved URLs will be saved at "<current_directory>/LINKs/webs/<search_query>-webURL.txt".
Example of input parameters:
  <search_query>  -- was born in           % query
  <result_num>    -- 100 (by default 500)  % how many URLs to get
  <start_point>   -- 10 (by default 0)     % from which position to get URLs
  -out                                     % print information to the screen
  -exactMatch                              % require an exact match
A <start_point> of 0 means retrieving pages from the first hit.
Example 1: retrieve up to 10 URLs, from the 0th position, that contain the exact string "Elvis Presley" (the same as typing "Elvis Presley" into Google Web Search):
  java edu.cmu.cs.readweb.GetURLsFromQuery -query Elvis Presley -num 10 -startPoint 0 -out -exactMatch
Example 2: retrieve up to 100 URLs, from the 10th position, that match "Elvis Presley" without requiring an exact match (the same as typing Elvis Presley into Google without double quotes):
  java edu.cmu.cs.readweb.GetURLsFromQuery -query Elvis Presley -num 100 -startPoint 10 -out
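If you want to use the toolkit from your own Java code rather than the command line, the sketch below runs the Example 1 query in-process and reads back the saved URL list. It assumes only what this page documents: the class's main entry point and the output path pattern. The exact output file name for a multi-word query is an assumption.

  // A minimal sketch; not part of the WIT distribution.
  import java.io.BufferedReader;
  import java.io.FileReader;

  public class QueryDemo {
      public static void main(String[] args) throws Exception {
          // Same arguments as Example 1 above.
          edu.cmu.cs.readweb.GetURLsFromQuery.main(new String[] {
              "-query", "Elvis", "Presley", "-num", "10",
              "-startPoint", "0", "-out", "-exactMatch"
          });
          // Read the URLs the tool saved; the file name below assumes the
          // query text appears verbatim in the documented output pattern.
          String path = "LINKs/webs/Elvis Presley-webURL.txt";
          BufferedReader in = new BufferedReader(new FileReader(path));
          String url;
          while ((url = in.readLine()) != null) {
              System.out.println(url);
          }
          in.close();
      }
  }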
- GetURLsFromTwoQueries.
This is the same as GetURLsFromQuery above, except that it allows you to specify two query strings, each of which must be exactly matched. Given two queries, use Google Web search to get a number of relevant web page URLs. The retrieved URLs will be saved at "<current_directory>/LINKs/webs/<search_query>-webURL.txt".
Example of input parameters:
  <search_query1> -- Elvis Presley          % query1
  <search_query2> -- in 1935                % query2
  <result_num>    -- 100 (by default 500)   % how many URLs to get
  <start_point>   -- 10 (by default 0)      % from which position to get URLs
  -out                                      % print information to the screen
  -exactMatch                               % require an exact match for both queries
Example: retrieve up to 50 URLs, from the 10th position, that exactly match the two strings "Elvis Presley" and "January 1935":
  java edu.cmu.cs.readweb.GetURLsFromTwoQueries -query1 Elvis Presley -query2 January 1935 -num 50 -startPoint 10 -out -exactMatch
- GetPageFromURL.
Given a URL string, first check whether the corresponding web page is already in the cache; fetch the file from the cache if so, otherwise crawl the web page and save it into the file <save_file_name>.
Example of input parameters:
  <url>            -- http://www.cmu.edu/~smith
  <save_file_name> -- smith.html
  <dir_name>       -- BIO/
Example: retrieve the web page at the URL http://www.cmu.edu/~smith and store the page in the file "C:/smith.html". Before going to the web, first look for the page in the cache directory "C:/webcache/":
  java edu.cmu.cs.readweb.GetPageFromURL -url http://www.cmu.edu/~smith -saveAs C:/smith.html -dir C:/webcache
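The cache-then-fetch behavior above follows a common pattern: check the cache directory first, and only go to the web on a miss. The sketch below shows one way to implement it with the standard library; it is illustrative only and is not the package's own code.

  // A minimal sketch of "check the cache, else fetch and save".
  import java.io.*;
  import java.net.URL;

  public class CachedFetch {
      // Return the page body, reading the cache file if present, otherwise
      // downloading the URL and writing the cache file for next time.
      public static String fetch(String url, File cacheFile) throws IOException {
          if (cacheFile.exists()) {
              return readAll(new FileReader(cacheFile));      // cache hit
          }
          String page = readAll(new InputStreamReader(        // cache miss:
              new URL(url).openStream()));                    // go to the web
          FileWriter out = new FileWriter(cacheFile);
          out.write(page);
          out.close();
          return page;
      }

      private static String readAll(Reader r) throws IOException {
          BufferedReader in = new BufferedReader(r);
          StringBuffer sb = new StringBuffer();
          String line;
          while ((line = in.readLine()) != null) {
              sb.append(line).append('\n');
          }
          in.close();
          return sb.toString();
      }
  }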
- GetPageTextFromURL.
Given a URL string, first check whether the corresponding web page is already in the cache; fetch the file from the cache if so, otherwise crawl the web page and save its text into the file <save_file_name>. The saved text has all HTML tags removed.
Example of input parameters:
  <url>            -- http://www.cmu.edu/bio/
  <save_file_name> -- cmu_bio.txt
  <dir_name>       -- BIO/
Example: retrieve the web page at the URL http://www.cmu.edu/bio/, remove all HTML tags, and store the result in the file "C:/cmu_bio.txt". Before going to the web, first look for the page in the cache directory "C:/BIO/":
  java edu.cmu.cs.readweb.GetPageTextFromURL -url http://www.cmu.edu/bio/ -saveAs C:/cmu_bio.txt -dir C:/BIO/
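Tag removal of the kind GetPageTextFromURL performs can be illustrated with a short regex pass. The sketch below is one simple way to do it, not the package's actual stripping code, which may handle more cases.

  // A minimal sketch of HTML tag removal.
  public class StripTags {
      public static String toPlainText(String html) {
          String text = html.replaceAll("(?is)<script.*?</script>", " "); // drop script bodies
          text = text.replaceAll("<[^>]*>", " ");                         // drop remaining tags
          return text.replaceAll("\\s+", " ").trim();                     // collapse whitespace
      }

      public static void main(String[] args) {
          System.out.println(toPlainText("<html><body><b>Elvis</b> Presley</body></html>"));
          // prints: Elvis Presley
      }
  }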
- GetPageLinksFromURL.
Given a URL string, first check whether the corresponding web page is already in the cache; fetch the file from the cache if so, otherwise crawl the web page and save the extracted links into the file <save_file_name>. The links are the URL strings appearing in the page, together with their anchor texts.
Example of input parameters:
  <url>            -- http://www.cmu.edu/~smith
  <save_file_name> -- smith-links.txt
  <dir_name>       -- BIO/
Example: retrieve the web page at the URL http://www.cmu.edu/~smith, extract all URL links and anchor texts, then store the result in the file "C:/smith-links.txt". Before going to the web, first look for the page in the cache directory "C:/BIO/":
  java edu.cmu.cs.readweb.GetPageLinksFromURL -url http://www.cmu.edu/~smith -saveAs C:/smith-links.txt -dir C:/BIO/
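Link extraction of this kind is commonly done with a regular expression over the page source. The sketch below shows the idea; it is illustrative only and is not the package's own extractor.

  // A minimal sketch of pulling (URL, anchor text) pairs out of HTML.
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class ExtractLinks {
      private static final Pattern ANCHOR = Pattern.compile(
          "(?is)<a\\s[^>]*href=[\"']?([^\"'\\s>]+)[\"']?[^>]*>(.*?)</a>");

      public static void main(String[] args) {
          String html = "<a href=\"http://www.cmu.edu/~smith\">Smith's home page</a>";
          Matcher m = ANCHOR.matcher(html);
          while (m.find()) {
              String url = m.group(1);                                      // the href target
              String anchor = m.group(2).replaceAll("<[^>]*>", "").trim();  // visible anchor text
              System.out.println(url + "\t" + anchor);
          }
      }
  }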
- CrawlDomain.
Collect multiple web pages from the World Wide Web, beginning the crawl at the specified URL and collecting only web pages within the specified domain. As it visits URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, recursively browsing the Web. It performs a breadth-first crawl until either it has collected MaxNumPagesToCollect pages or there are no further pages within the domain. The only web pages visited or stored are those whose extensions are included in {'.html', '.htm', '.txt', '/', '.php'}. Web pages are cached in the directory given by CacheDirectory. These pages are stored in their original HTML format, and the page URL is added as the first line of the file.
Example of input parameters:
  <start_page>               -- http://www.cmu.edu/bio/  % StartPage
  <domain_name>              -- http://www.cmu.edu/bio/  % DomainName
  <cache_dir>                -- C:/webcache/             % CacheDirectory
  <max_num_page_to_collect>  -- 10000                    % MaxNumPagesToCollect
Example: crawl up to 1000 pages, beginning at http://www.cmu.edu/bio and remaining inside the domain http://www.cmu.edu/:
  java edu.cmu.cs.readweb.CrawlDomain http://www.cmu.edu/bio http://www.cmu.edu/ C:/webcache/ 1000
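The breadth-first strategy described above boils down to a FIFO queue of URLs, a visited set, a domain filter, and a page budget. The sketch below shows that skeleton; fetching and link extraction are stubbed out (see the CachedFetch and ExtractLinks sketches above), and none of this is the package's own code.

  // A minimal sketch of a breadth-first, domain-restricted crawl.
  import java.util.HashSet;
  import java.util.Iterator;
  import java.util.LinkedList;
  import java.util.List;
  import java.util.Set;

  public class BfsCrawl {
      public static void crawl(String startPage, String domain, int maxPages) {
          LinkedList queue = new LinkedList();   // URLs still to visit (FIFO)
          Set visited = new HashSet();           // URLs already enqueued
          queue.addLast(startPage);
          visited.add(startPage);
          int collected = 0;
          while (!queue.isEmpty() && collected < maxPages) {
              String url = (String) queue.removeFirst();
              String page = fetchAndCache(url);  // fetch the page, write it to the cache
              collected++;
              List links = extractLinks(page);   // hyperlinks found in the page
              for (Iterator it = links.iterator(); it.hasNext();) {
                  String link = (String) it.next();
                  // Stay inside the domain, keep only the allowed extensions,
                  // and never enqueue the same URL twice.
                  if (link.startsWith(domain) && hasAllowedExtension(link)
                          && visited.add(link)) {
                      queue.addLast(link);
                  }
              }
          }
      }

      private static boolean hasAllowedExtension(String url) {
          return url.endsWith(".html") || url.endsWith(".htm")
              || url.endsWith(".txt") || url.endsWith("/") || url.endsWith(".php");
      }

      // Stubs standing in for the real work; see the sketches above.
      private static String fetchAndCache(String url) { return ""; }
      private static List extractLinks(String page) { return new LinkedList(); }
  }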
- GetFilesFromQuery.
Given a query, use Google Desktop search to get a number of relevant file paths. The retrieved paths will be saved at "<current_directory>/LINKs/files/<search_query>-fileURL.txt".
Example of input parameters:
  <search_query>  -- was born in           % query
  <result_num>    -- 100 (by default 500)  % how many paths to get
  <start_point>   -- 10 (by default 0)     % from which position to get paths
  <search_dir>    -- CS                    % only search files in the CS directory
  -out                                     % print information to the screen
  -exactMatch                              % require an exact match
A <start_point> of 0 means retrieving results from the first hit.
Example: retrieve up to 10 file paths under the CS directory, from the 0th position, that contain the exact string "Elvis Presley" (the same as typing "Elvis Presley" into Google Desktop Search):
  java edu.cmu.cs.readweb.GetFilesFromQuery -query Elvis Presley -num 10 -startPoint 0 -searchDir CS -out -exactMatch
FAQ:
frequently asked questions so far...
Change Log:
records updates in WIT thus far.
Please send us your questions and feedback.
( last update 7th Feb 2006 )
Copyright © 2005 CALD CMU. All rights reserved.