WIT - Developer Descriptions
This page describes sample usages for calling WIT functions in Java code.
You should have basic knowledge of Java programming. It is also very helpful
to consult the on-line Java API Specification.
Usage:
- GetURLsFromQuery. Given a query, use Google Web search to get a number
  of relevant web page URLs, which are returned in a Vector (a usage sketch
  follows the two-query variant below).
  - webVector = GWS.getPageURLs(queryString, numOfResult, startPoint, printOut, exactMatch);
- GetURLsFromTwoQueries. Given two queries, use Google Web search to get a
  number of relevant web page URLs, which are returned in a Vector.
  - webVector = GWS.getPageURLsFromTwoQueries(queryString1, queryString2, numOfResult, startPoint, printOut, exactMatch);
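A minimal usage sketch covering the two calls above. The concrete argument
values and parameter types are illustrative assumptions based on the
signatures listed:

    import java.util.Vector;

    // Illustrative values only; the flag types are assumptions.
    String queryString = "information retrieval";
    int numOfResult = 20;      // how many URLs to return
    int startPoint = 0;        // offset into the search results
    boolean printOut = true;   // print progress (assumed meaning)
    boolean exactMatch = false;

    // Single-query search: relevant URLs are returned in a Vector.
    Vector webVector = GWS.getPageURLs(queryString, numOfResult,
            startPoint, printOut, exactMatch);

    // Two-query variant.
    Vector webVector2 = GWS.getPageURLsFromTwoQueries("information retrieval",
            "text mining", numOfResult, startPoint, printOut, exactMatch);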
- GetPageFromURL. Given a URL string, first check whether the corresponding
  web page is already in the cache; fetch the file if so, otherwise crawl
  the web page. The returned page stream is a string.
- GetPageTextFromURL. Given a URL string, first check whether the
  corresponding web page is already in the cache directory; fetch the file
  if so, otherwise crawl the web page and parse it. doc.getDocStream()
  returns a string with all HTML tags removed. If you do not want to check
  whether the required page has already been crawled into the cache
  directory, simply create a DocumentItem object from a URL string.
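A minimal sketch of the direct path described above (no cache check),
assuming the DocumentItem constructor takes the URL string as stated:

    // Construct a DocumentItem directly from a URL string (skips the cache check).
    DocumentItem doc = new DocumentItem("http://www.cmu.edu/index.html");

    // Plain text of the page, with all HTML tags removed.
    String text = doc.getDocStream();
    System.out.println(text);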
- GetPageLinksFromURL. Similar to the previous function: once you have
  created a DocumentItem object from a URL string, doc.getDocLinks() returns
  a string containing the URLs appearing in the page together with their
  anchor texts (the clickable text attached to each link). The URL strings
  and anchor texts are wrapped in HTML <a>...</a> tags, so you need to
  postprocess the result to retrieve each URL string and anchor text
  separately.
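One possible postprocessing sketch using java.util.regex; the exact layout
of the <a>...</a> tags in the getDocLinks() output is an assumption:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    DocumentItem doc = new DocumentItem("http://www.cmu.edu/index.html");
    String links = doc.getDocLinks();   // e.g. <a href="...">anchor text</a> ...

    // Pull the URL (href value) and the anchor text out of each tag.
    Pattern p = Pattern.compile("<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(links);
    while (m.find()) {
        String url = m.group(1);         // the link target
        String anchorText = m.group(2);  // the clickable text
        System.out.println(url + "\t" + anchorText);
    }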
- CrawlDomain. Collect multiple web pages from the World Wide Web, beginning
  the crawl at the specified URL and collecting only web pages within the
  specified domain. As it visits URLs, it identifies all the hyperlinks in
  each page and adds them to the list of URLs to visit, recursively browsing
  the Web. It performs a breadth-first crawl until either it has collected
  MaxNumPagesToCollect pages or there are no further pages within the domain.
  The only web pages visited or stored are those whose extensions are
  included in {'.html', '.htm', '.txt', '/', '.php'}. Web pages are cached
  in the directory given by CacheDirectory. These pages are stored in their
  original HTML format, and the page URL is added as the first line of the
  file.
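The crawl loop described above can be sketched in plain Java. This is not
the WIT implementation; it is a minimal illustration of the breadth-first
strategy, the extension filter, the page limit, and the URL-as-first-line
caching, with deliberately simplified fetching and link extraction:

    import java.io.*;
    import java.net.URL;
    import java.util.*;
    import java.util.regex.*;

    public class DomainCrawlSketch {

        private static final List<String> ALLOWED =
                Arrays.asList(".html", ".htm", ".txt", "/", ".php");
        private static final Pattern HREF =
                Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

        public static void crawlDomain(String startURL, String domain,
                                       int maxNumPagesToCollect,
                                       String cacheDirectory) throws IOException {
            Queue<String> frontier = new ArrayDeque<String>();  // FIFO => breadth-first
            Set<String> seen = new HashSet<String>();
            frontier.add(startURL);
            seen.add(startURL);
            int collected = 0;

            while (!frontier.isEmpty() && collected < maxNumPagesToCollect) {
                String url = frontier.poll();
                String page = fetch(url);
                if (page == null) continue;

                // Cache the page in its original HTML; the URL is the first line.
                File out = new File(cacheDirectory, "page" + (++collected) + ".html");
                PrintWriter w = new PrintWriter(out, "UTF-8");
                w.println(url);
                w.print(page);
                w.close();

                // Enqueue unseen in-domain links with allowed extensions.
                Matcher m = HREF.matcher(page);
                while (m.find()) {
                    String link = m.group(1);
                    if (link.contains(domain) && allowedExtension(link)
                            && seen.add(link)) {
                        frontier.add(link);
                    }
                }
            }
        }

        private static boolean allowedExtension(String url) {
            for (String ext : ALLOWED)
                if (url.endsWith(ext)) return true;
            return false;
        }

        private static String fetch(String url) {
            try {
                BufferedReader r = new BufferedReader(
                        new InputStreamReader(new URL(url).openStream(), "UTF-8"));
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = r.readLine()) != null) sb.append(line).append('\n');
                r.close();
                return sb.toString();
            } catch (IOException e) {
                return null;   // skip pages that fail to download
            }
        }
    }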
- WitQuery. Given a query, get a number of relevant documents, URLs, or
  snippets by searching the web or retrieving them from precached web pages
  (a combined sketch of the three query calls follows the WebQuery item
  below).
  - Query.witQuery(queryString, numOfResult, resultTypeCode, cacheDir, indexDir, exactMatch);
- cacheQuery. Given a query, get a number of relevant documents, URLs, or
  snippets only by retrieving them from precached web pages.
  - Query.cacheQuery(queryString, numOfResult, resultTypeCode, cacheDir, indexDir, exactMatch);
  - Query.cacheQuery(queryString, numOfResult, resultTypeCode, cacheDir, indexDir, exactMatch, numOfTokenBetweenTwoPhases);
- WebQuery. Given a query, get a number of relevant documents, URLs, or
  snippets by searching the web; the web pages searched are newly fetched
  rather than taken from the precached web pages.
  - Query.webQuery(queryString, numOfResult, resultTypeCode, cacheDir, indexDir, exactMatch);
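A minimal usage sketch combining the three query calls above. The concrete
argument values, the parameter types, and the Vector return type are
illustrative assumptions; the meaning of resultTypeCode is whatever the WIT
API defines:

    import java.util.Vector;

    String queryString = "latent semantic indexing";  // example query (assumed)
    int numOfResult = 10;
    int resultTypeCode = 0;        // e.g. documents; assumed encoding
    String cacheDir = "./cache/";
    String indexDir = "./index/";
    boolean exactMatch = false;

    // Web search and/or precached pages.
    Vector witResults = Query.witQuery(queryString, numOfResult,
            resultTypeCode, cacheDir, indexDir, exactMatch);

    // Precached pages only.
    Vector cacheResults = Query.cacheQuery(queryString, numOfResult,
            resultTypeCode, cacheDir, indexDir, exactMatch);

    // Web search only, returning pages not already in the cache.
    Vector webResults = Query.webQuery(queryString, numOfResult,
            resultTypeCode, cacheDir, indexDir, exactMatch);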
- BuildIndexFun. Build an Indri index for a given directory that stores
  cached web pages. The index directory also needs to be specified. When the
  third argument equals "false", the function creates a new index for the
  cached web pages; when it is "true", the function appends the index to the
  original index files.
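A possible usage sketch; the hosting class name below is a placeholder, and
the argument order (cache directory, index directory, append flag as a
string) is inferred from the description above:

    String cacheDir = "./cache/";
    String indexDir = "./index/";

    // "false": build a fresh Indri index over the cached pages.
    BuildIndex.BuildIndexFun(cacheDir, indexDir, "false");

    // "true": append to the existing index files instead.
    BuildIndex.BuildIndexFun(cacheDir, indexDir, "true");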
WIT examples: WITExample1.java, WITExample2.java.
Please send your questions and feedback to
(last update: 31st Aug 2006)
Copyright © 2005-2006 CALD, CMU. All rights reserved.