Read the Web: Web Intelligent Toolkit
This page describes a Java package for retrieving web pages in three ways. You may (1) specify the URL of the page you want, (2) specify a search query that will be passed to Google to retrieve page URLs matching the query, or (3) specify a start page and a domain that you would like to crawl.
Installation:
- The Java development kit (j2sdk 1.4 or later) is required. It is available from http://java.sun.com/
- Download "WIT.zip" and unzip it on your computer.
- Apply for a Gmail account, obtain your own Google Web Search license key, and replace the key in lib/key.txt with it.
- Google Desktop Search version 2 is also required; install it on your workstation.
Set Environment Variables:
If on Windows, define "WIT" as a system environment variable from "Control Panel -> System -> Advanced -> Environment Variables", e.g. set "WIT" to "C:\wit". If on Linux, define "WIT" with, e.g., "export WIT=/usr/user1/smith/wit".
From the "wit/" directory, execute:
".\script\setup.bat" if in DOS mode;
". script/setup.cygwin" if in a cygwin terminal on Windows;
". script/setup.linux" if on Linux.
Development:
Usage:
- GetURLsFromQuery.
Given a query, use Google Web search to get a number of relevant web page URLs. The retrieved URLs will be saved at "<current_directory>/LINKs/webs/<search_query>-webURL.txt".
Example of input parameters:
  <search_query>  -- was born in           % query
  <result_num>    -- 100 (by default 500)  % how many URLs to get
  <start_point>   -- 10 (by default 0)     % from which position to get URLs
  -out                                     % print information to the screen
  -exactMatch                              % require an exact match
A <start_point> of 0 means retrieving pages from the first hit.
Example 1: retrieve up to 10 URLs, from the 0th position, that contain the exact string "Elvis Presley" (the same as typing "Elvis Presley" into Google Web Search):
  java edu.cmu.cs.readweb.GetURLsFromQuery -query Elvis Presley -num 10 -startPoint 0 -out -exactMatch
Example 2: retrieve up to 100 URLs, from the 10th position, that match "Elvis Presley" without requiring an exact match (the same as typing Elvis Presley into Google without double quotes):
  java edu.cmu.cs.readweb.GetURLsFromQuery -query Elvis Presley -num 100 -startPoint 10 -out
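If you want to use the toolkit from your own Java code rather than the command line, the sketch below runs the Example 1 query in-process and reads back the saved URL list. It assumes only what this page documents: the class's main entry point and the output path pattern. The exact output file name for a multi-word query is an assumption.

  // A minimal sketch; not part of the WIT distribution.
  import java.io.BufferedReader;
  import java.io.FileReader;

  public class QueryDemo {
      public static void main(String[] args) throws Exception {
          // Same arguments as Example 1 above.
          edu.cmu.cs.readweb.GetURLsFromQuery.main(new String[] {
              "-query", "Elvis", "Presley", "-num", "10",
              "-startPoint", "0", "-out", "-exactMatch"
          });
          // Read the URLs the tool saved; the file name below assumes the
          // query text appears verbatim in the documented output pattern.
          String path = "LINKs/webs/Elvis Presley-webURL.txt";
          BufferedReader in = new BufferedReader(new FileReader(path));
          String url;
          while ((url = in.readLine()) != null) {
              System.out.println(url);
          }
          in.close();
      }
  }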
- GetURLsFromTwoQueries.
This is the same as GetURLsFromQuery above, except that it allows you to specify two query strings, each of which must be exactly matched. Given two queries, use Google Web search to get a number of relevant web page URLs. The retrieved URLs will be saved at "<current_directory>/LINKs/webs/<search_query>-webURL.txt".
Example of input parameters:
  <search_query1> -- Elvis Presley          % query1
  <search_query2> -- in 1935                % query2
  <result_num>    -- 100 (by default 500)   % how many URLs to get
  <start_point>   -- 10 (by default 0)      % from which position to get URLs
  -out                                      % print information to the screen
  -exactMatch                               % require an exact match for both queries
Example: retrieve up to 50 URLs, from the 10th position, that exactly match the two strings "Elvis Presley" and "January 1935":
  java edu.cmu.cs.readweb.GetURLsFromTwoQueries -query1 Elvis Presley -query2 January 1935 -num 50 -startPoint 10 -out -exactMatch
- GetPageFromURL.
Given a URL string, first check whether the corresponding web page is already in the cache; fetch the file from the cache if so, otherwise crawl the web page and save it into the file <save_file_name>.
Example of input parameters:
  <url>            -- http://www.cmu.edu/~smith
  <save_file_name> -- smith.html
  <dir_name>       -- BIO/
Example: retrieve the web page at the URL http://www.cmu.edu/~smith and store the page in the file "C:/smith.html". Before going to the web, first look for the page in the cache directory "C:/webcache/":
  java edu.cmu.cs.readweb.GetPageFromURL -url http://www.cmu.edu/~smith -saveAs C:/smith.html -dir C:/webcache
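The cache-then-fetch behavior above follows a common pattern: check the cache directory first, and only go to the web on a miss. The sketch below shows one way to implement it with the standard library; it is illustrative only and is not the package's own code.

  // A minimal sketch of "check the cache, else fetch and save".
  import java.io.*;
  import java.net.URL;

  public class CachedFetch {
      // Return the page body, reading the cache file if present, otherwise
      // downloading the URL and writing the cache file for next time.
      public static String fetch(String url, File cacheFile) throws IOException {
          if (cacheFile.exists()) {
              return readAll(new FileReader(cacheFile));      // cache hit
          }
          String page = readAll(new InputStreamReader(        // cache miss:
              new URL(url).openStream()));                    // go to the web
          FileWriter out = new FileWriter(cacheFile);
          out.write(page);
          out.close();
          return page;
      }

      private static String readAll(Reader r) throws IOException {
          BufferedReader in = new BufferedReader(r);
          StringBuffer sb = new StringBuffer();
          String line;
          while ((line = in.readLine()) != null) {
              sb.append(line).append('\n');
          }
          in.close();
          return sb.toString();
      }
  }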
- GetPageTextFromURL.
Given a URL string, first check whether the corresponding web page is already in the cache; fetch the file from the cache if so, otherwise crawl the web page and save its text into the file <save_file_name>. The saved text has all HTML tags removed.
Example of input parameters:
  <url>            -- http://www.cmu.edu/bio/
  <save_file_name> -- cmu_bio.txt
  <dir_name>       -- BIO/
Example: retrieve the web page at the URL http://www.cmu.edu/bio/, remove all HTML tags, and store the result in the file "C:/cmu_bio.txt". Before going to the web, first look for the page in the cache directory "C:/BIO/":
  java edu.cmu.cs.readweb.GetPageTextFromURL -url http://www.cmu.edu/bio/ -saveAs C:/cmu_bio.txt -dir C:/BIO/
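Tag removal of the kind GetPageTextFromURL performs can be illustrated with a short regex pass. The sketch below is one simple way to do it, not the package's actual stripping code, which may handle more cases.

  // A minimal sketch of HTML tag removal.
  public class StripTags {
      public static String toPlainText(String html) {
          String text = html.replaceAll("(?is)<script.*?</script>", " "); // drop script bodies
          text = text.replaceAll("<[^>]*>", " ");                         // drop remaining tags
          return text.replaceAll("\\s+", " ").trim();                     // collapse whitespace
      }

      public static void main(String[] args) {
          System.out.println(toPlainText("<html><body><b>Elvis</b> Presley</body></html>"));
          // prints: Elvis Presley
      }
  }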
- GetPageLinksFromURL.
Given a URL string, first check whether the corresponding web page is already in the cache; fetch the file from the cache if so, otherwise crawl the web page and save the extracted links into the file <save_file_name>. The links are the URL strings appearing in the page, together with their anchor texts.
Example of input parameters:
  <url>            -- http://www.cmu.edu/~smith
  <save_file_name> -- smith-links.txt
  <dir_name>       -- BIO/
Example: retrieve the web page at the URL http://www.cmu.edu/~smith, extract all URL links and anchor texts, then store the result in the file "C:/smith-links.txt". Before going to the web, first look for the page in the cache directory "C:/BIO/":
  java edu.cmu.cs.readweb.GetPageLinksFromURL -url http://www.cmu.edu/~smith -saveAs C:/smith-links.txt -dir C:/BIO/
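Link extraction of this kind is commonly done with a regular expression over the page source. The sketch below shows the idea; it is illustrative only and is not the package's own extractor.

  // A minimal sketch of pulling (URL, anchor text) pairs out of HTML.
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class ExtractLinks {
      private static final Pattern ANCHOR = Pattern.compile(
          "(?is)<a\\s[^>]*href=[\"']?([^\"'\\s>]+)[\"']?[^>]*>(.*?)</a>");

      public static void main(String[] args) {
          String html = "<a href=\"http://www.cmu.edu/~smith\">Smith's home page</a>";
          Matcher m = ANCHOR.matcher(html);
          while (m.find()) {
              String url = m.group(1);                                      // the href target
              String anchor = m.group(2).replaceAll("<[^>]*>", "").trim();  // visible anchor text
              System.out.println(url + "\t" + anchor);
          }
      }
  }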
- CrawlDomain.
Collect multiple web pages from the World Wide Web, beginning the crawl at the specified URL and collecting only web pages within the specified domain. As it visits URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, recursively browsing the Web. It performs a breadth-first crawl until either it has collected MaxNumPagesToCollect pages or there are no further pages within the domain. The only web pages visited or stored are those whose extensions are included in {'.html', '.htm', '.txt', '/', '.php'}. Web pages are cached in the directory given by CacheDirectory. These pages are stored in their original HTML format, and the page URL is added as the first line of the file.
Example of input parameters:
  <start_page>               -- http://www.cmu.edu/bio/  % StartPage
  <domain_name>              -- http://www.cmu.edu/bio/  % DomainName
  <cache_dir>                -- C:/webcache/             % CacheDirectory
  <max_num_page_to_collect>  -- 10000                    % MaxNumPagesToCollect
Example: crawl up to 1000 pages, beginning at http://www.cmu.edu/bio and remaining inside the domain http://www.cmu.edu/:
  java edu.cmu.cs.readweb.CrawlDomain http://www.cmu.edu/bio http://www.cmu.edu/ C:/webcache/ 1000
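The breadth-first strategy described above boils down to a FIFO queue of URLs, a visited set, a domain filter, and a page budget. The sketch below shows that skeleton; fetching and link extraction are stubbed out (see the CachedFetch and ExtractLinks sketches above), and none of this is the package's own code.

  // A minimal sketch of a breadth-first, domain-restricted crawl.
  import java.util.HashSet;
  import java.util.Iterator;
  import java.util.LinkedList;
  import java.util.List;
  import java.util.Set;

  public class BfsCrawl {
      public static void crawl(String startPage, String domain, int maxPages) {
          LinkedList queue = new LinkedList();   // URLs still to visit (FIFO)
          Set visited = new HashSet();           // URLs already enqueued
          queue.addLast(startPage);
          visited.add(startPage);
          int collected = 0;
          while (!queue.isEmpty() && collected < maxPages) {
              String url = (String) queue.removeFirst();
              String page = fetchAndCache(url);  // fetch the page, write it to the cache
              collected++;
              List links = extractLinks(page);   // hyperlinks found in the page
              for (Iterator it = links.iterator(); it.hasNext();) {
                  String link = (String) it.next();
                  // Stay inside the domain, keep only the allowed extensions,
                  // and never enqueue the same URL twice.
                  if (link.startsWith(domain) && hasAllowedExtension(link)
                          && visited.add(link)) {
                      queue.addLast(link);
                  }
              }
          }
      }

      private static boolean hasAllowedExtension(String url) {
          return url.endsWith(".html") || url.endsWith(".htm")
              || url.endsWith(".txt") || url.endsWith("/") || url.endsWith(".php");
      }

      // Stubs standing in for the real work; see the sketches above.
      private static String fetchAndCache(String url) { return ""; }
      private static List extractLinks(String page) { return new LinkedList(); }
  }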
- GetFilesFromQuery.
Given a query, use Google Desktop search to get a number of relevant file paths. The retrieved paths will be saved at "<current_directory>/LINKs/files/<search_query>-fileURL.txt".
Example of input parameters:
  <search_query>  -- was born in           % query
  <result_num>    -- 100 (by default 500)  % how many paths to get
  <start_point>   -- 10 (by default 0)     % from which position to get paths
  <search_dir>    -- CS                    % only search files in the CS directory
  -out                                     % print information to the screen
  -exactMatch                              % require an exact match
A <start_point> of 0 means retrieving results from the first hit.
Example: retrieve up to 10 file paths under the CS directory, from the 0th position, that contain the exact string "Elvis Presley" (the same as typing "Elvis Presley" into Google Desktop Search):
  java edu.cmu.cs.readweb.GetFilesFromQuery -query Elvis Presley -num 10 -startPoint 0 -searchDir CS -out -exactMatch
FAQ:
frequently asked questions so far...
Change Log:
records updates in WIT thus far.
Please send us your questions and feedback.
( last update 7th Feb 2006 )
Copyright © 2005 CALD CMU. All rights reserved.