Read the Web: Web Intelligent Toolkit

This page describes a Java package for retrieving web pages in three ways. You may (1) specify the URL of the page you want, (2) specify a search query, which is passed to Google to retrieve the URLs of pages that match the query, or (3) specify a start page and a domain that you would like to crawl.

Installation:


Set Environment Variables:

          On Windows, define "WIT" as a system environment variable under "Control Panel -> System -> Advanced -> Environment Variables", e.g. set "WIT" to "C:\wit". On Linux, define "WIT" in the shell, e.g. "export WIT=/usr/user1/smith/wit".
          Then, from the "wit/" directory, execute one of the following:

".\script\setup.bat" if in DOS mode;

". script/setup.cygwin" if in a Cygwin terminal on Windows;

". script/setup.linux" if on Linux.

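Once the variable is set, a Java program can locate the installation through it. The following is a minimal sketch and not part of WIT itself: the class name WitHome and the method resolve are hypothetical, and the only thing taken from the steps above is the variable name "WIT".

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the WIT distribution.
final class WitHome {

    // Validates a candidate value of the WIT variable and returns it as a path.
    static Path resolve(String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("WIT is not set; see the installation steps");
        }
        Path home = Paths.get(value.trim());
        if (!Files.isDirectory(home)) {
            throw new IllegalStateException("WIT does not point to a directory: " + home);
        }
        return home;
    }

    public static void main(String[] args) {
        try {
            System.out.println("WIT home: " + resolve(System.getenv("WIT")));
        } catch (IllegalStateException e) {
            // Report the problem instead of failing with a stack trace.
            System.out.println("Setup problem: " + e.getMessage());
        }
    }
}
```

Checking the variable up front gives a clear message when the setup script has not been run in the current shell.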
Development:


Usage:

            
     Example of input parameters:
     <search_query>   --  was born in             % query
     <result_num>     --  100 (by default 500)    % how many URLs to get
     <start_point>    --  10 (by default 0)       % position from which to get URLs
     -out                                         % prints information to the screen
     -exactMatch                                  % require exact match

A <start_point> of 0 means that retrieval starts from the first hit.

Example 1: retrieve up to 10 URLs, starting from the 0th position, that contain the exact string "Elvis Presley" (the same as typing "Elvis Presley", with double quotes, into Google Web Search).

Example 2: retrieve up to 100 URLs, starting from the 10th position, that match "Elvis Presley" without requiring an exact match (the same as typing Elvis Presley, without double quotes, into Google).
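The difference between the two examples can be sketched in code. The snippet below is an illustration only. It assumes that -exactMatch corresponds to wrapping the query in double quotes, as the examples suggest, and that <start_point> and <result_num> select a window of hits. SearchWindow, toQueryString, and window are hypothetical names, not WIT classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the search parameters; not WIT code.
final class SearchWindow {

    // Wraps the query in double quotes when an exact match is requested.
    static String toQueryString(String query, boolean exactMatch) {
        String q = query.trim();
        return exactMatch ? "\"" + q + "\"" : q;
    }

    // Returns at most resultNum hits, starting at the 0-based position startPoint.
    static <T> List<T> window(List<T> hits, int startPoint, int resultNum) {
        if (startPoint < 0 || resultNum < 0) {
            throw new IllegalArgumentException("startPoint and resultNum must be >= 0");
        }
        int from = Math.min(startPoint, hits.size());
        // Use long arithmetic so a very large resultNum cannot overflow.
        int to = (int) Math.min((long) hits.size(), (long) from + resultNum);
        return new ArrayList<>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("url0", "url1", "url2", "url3");
        System.out.println(toQueryString("Elvis Presley", true));
        System.out.println(toQueryString("Elvis Presley", false));
        System.out.println(window(hits, 0, 2));
    }
}
```

With startPoint 0 the window begins at the first hit, which matches the meaning of <start_point> given above.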

     Example of input parameters:
     <search_query1>  --  Elvis Presley           % query1
     <search_query2>  --  in 1935                 % query2
     <result_num>     --  100 (by default 500)    % how many URLs to get
     <start_point>    --  10 (by default 0)       % position from which to get URLs
     -out                                         % prints information to the screen
     -exactMatch                                  % require exact match for both queries

Example: retrieve up to 50 URLs, starting from the 10th position, that exactly match the two strings "Elvis Presley" and "January 1935".
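For the two-query case, one plausible reading is that both strings must appear in a matching page. How WIT actually combines the two queries is not stated here, so the following is only a hedged sketch: the class MultiQuery and the method combine are hypothetical, and the approach (quote each phrase separately, join with spaces) is simply how a conjunction of exact phrases is commonly typed into a search box.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the WIT implementation.
final class MultiQuery {

    // Joins several queries into one search string, quoting each phrase
    // individually when an exact match is requested. Blank queries are skipped.
    static String combine(boolean exactMatch, String... queries) {
        List<String> parts = new ArrayList<>();
        for (String q : queries) {
            String t = q.trim();
            if (t.isEmpty()) {
                continue;
            }
            parts.add(exactMatch ? "\"" + t + "\"" : t);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(combine(true, "Elvis Presley", "January 1935"));
    }
}
```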

 

     Example of input parameters:
     <url>             --  http://www.cmu.edu/~smith
     <save_file_name>  --  smith.html
     <dir_name>        --  BIO/

Example: retrieve the web page for the URL http://www.cmu.edu/~smith and store it in the file "C:/smith.html". Before going to the web, first look for the page in the cache directory "C:/webcache/".
java edu.cmu.cs.readweb.GetPageFromURL -url http://www.cmu.edu/~smith -saveAs C:/smith.html -dir C:/webcache
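
The cache-first behaviour described above can be sketched as follows. This is an illustration under stated assumptions rather than the code behind GetPageFromURL: the class CachedFetch, the scheme for naming cache files (the URL-encoded address), and the pluggable fetcher function are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Function;

// Hypothetical cache-first lookup; not the WIT implementation.
final class CachedFetch {

    // Returns the page for url, reading it from cacheDir when present and
    // calling fetcher (then saving the result) only on a cache miss.
    static String getPage(String url, Path cacheDir, Function<String, String> fetcher) {
        try {
            // Assumed naming scheme: the URL-encoded address is the file name.
            // A production scheme would need to handle more special characters.
            Path cached = cacheDir.resolve(URLEncoder.encode(url, "UTF-8"));
            if (Files.exists(cached)) {
                return new String(Files.readAllBytes(cached), StandardCharsets.UTF_8);
            }
            String page = fetcher.apply(url);
            Files.createDirectories(cacheDir);
            Files.write(cached, page.getBytes(StandardCharsets.UTF_8));
            return page;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "wit-cache-demo");
        // A stub fetcher keeps the demo offline; a real one would download the page.
        Function<String, String> stub = u -> "<html>stub for " + u + "</html>";
        System.out.println(getPage("http://www.cmu.edu/~smith", dir, stub));
    }
}
```

Passing the fetcher as a function keeps the caching logic separate from the network code, so the lookup order can be exercised without going to the web.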
 
     Example of input parameters:
     <url>             --  http://www.cmu.edu/bio/
     <save_file_name>  --  cmu_bio.txt
     <dir_name>        --  BIO/

Example: retrieve the web page for the URL http://www.cmu.edu/bio/, remove all HTML tags, and store the result in the file "C:/cmu_bio.txt". Before going to the web, first look for the page in the cache directory "C:/BIO/".
java edu.cmu.cs.readweb.GetPageTextFromURL -url http://www.cmu.edu/bio/ -saveAs C:/cmu_bio.txt -dir C:/BIO/
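
To give a feel for what "remove all HTML tags" involves, here is a rough sketch. It is not the method used by GetPageTextFromURL, which is not described here. TagStripper and stripTags are hypothetical names, and the regular-expression approach is a simplification: it does not decode entities such as &amp;amp; and can be fooled by malformed markup.

```java
// Hypothetical regex-based tag removal; a crude stand-in for a real HTML parser.
final class TagStripper {

    static String stripTags(String html) {
        // Drop script and style elements together with their contents.
        String noScripts = html.replaceAll("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>", " ");
        // Drop HTML comments.
        String noComments = noScripts.replaceAll("(?s)<!--.*?-->", " ");
        // Replace every remaining tag with a space so adjacent words stay apart.
        String noTags = noComments.replaceAll("<[^>]+>", " ");
        // Collapse runs of whitespace into single spaces.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```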

     Example of input parameters:
     <url>             --  http://www.cmu.edu/~smith
     <save_file_name>  --  smith-links.txt
     <dir_name>        --  BIO/

Example: retrieve the web page for the URL http://www.cmu.edu/~smith, extract all URL links and anchor texts, then store the result in the file "C:/smith-links.txt". Before going to the web, first look for the page in the cache directory "C:/BIO/".
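
Link and anchor-text extraction can be illustrated with a small sketch. It is an assumption-based example, not the extractor shipped with WIT: LinkExtractor and extract are hypothetical names, and the regular expression only handles anchors whose href value is quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based link extraction; a real tool would use an HTML parser.
final class LinkExtractor {

    // Group 2 is the quoted href value, group 3 is the content of the anchor.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1[^>]*>(.*?)</a\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one {url, anchorText} pair per link found in the page.
    static List<String[]> extract(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String url = m.group(2).trim();
            // Remove nested tags from the anchor text and normalise whitespace.
            String text = m.group(3).replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
            links.add(new String[] {url, text});
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.cmu.edu/bio/\">Bio <b>page</b></a>";
        for (String[] link : extract(html)) {
            System.out.println(link[0] + "\t" + link[1]);
        }
    }
}
```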
            
     Example of input parameters:
     <start_page>               --  http://www.cmu.edu/bio/    % StartPage
     <domain_name>              --  http://www.cmu.edu/bio/    % DomainName
     <cache_dir>                --  C:/webcache/               % CacheDirectory
     <max_num_page_to_collect>  --  10000                      % MaxNumPagesToCollect


Example: crawl up to 1000 pages, beginning at http://www.cmu.edu/bio and remaining inside the domain http://www.cmu.edu.
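
A domain-restricted crawl with a page limit can be sketched as a breadth-first traversal. The traversal order WIT uses is not stated here, so breadth-first is an assumption, as is treating the domain as a URL prefix. DomainCrawler, crawl, and the linksOf function (which stands in for "download the page and extract its links") are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical crawler sketch; not the WIT implementation.
final class DomainCrawler {

    // Visits pages breadth-first from startPage, follows only links that begin
    // with domainPrefix, and stops once maxPages pages have been collected.
    static List<String> crawl(String startPage, String domainPrefix, int maxPages,
                              Function<String, List<String>> linksOf) {
        List<String> collected = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startPage);
        seen.add(startPage);
        while (!queue.isEmpty() && collected.size() < maxPages) {
            String page = queue.poll();
            collected.add(page);
            for (String link : linksOf.apply(page)) {
                // seen.add returns false for URLs that were already queued.
                if (link.startsWith(domainPrefix) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" keeps the demo offline.
        Map<String, List<String>> web = new HashMap<>();
        web.put("http://www.cmu.edu/bio",
                Arrays.asList("http://www.cmu.edu/bio/a", "http://example.org/x"));
        List<String> pages = crawl("http://www.cmu.edu/bio", "http://www.cmu.edu", 1000,
                u -> web.getOrDefault(u, Collections.<String>emptyList()));
        System.out.println(pages);
    }
}
```

The seen set prevents a page from being queued twice, which matters on sites where pages link back to one another.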

            
     Example of input parameters:
     <search_query>   --  was born in             % query
     <result_num>     --  100 (by default 500)    % how many URLs to get
     <start_point>    --  10 (by default 0)       % position from which to get URLs
     <search_dir>     --  CS                      % only search files in the CS directory
     -out                                         % prints information to the screen
     -exactMatch                                  % require exact match

A <start_point> of 0 means that retrieval starts from the first hit.

Example: retrieve up to 10 file paths under the CS directory, starting from the 0th position, that contain the exact string "Elvis Presley" (the same as typing "Elvis Presley", with double quotes, into Google Desktop Search).

FAQ     frequently asked questions so far.
 
Change Log     records updates in WIT thus far.
 


Please send your questions and feedback to     ( last update 7th Feb 2006 )


        Copyright © 2005 CALD CMU. All rights reserved.