WIT - Developer Descriptions
This page describes sample usages for calling WIT functions in Java code.
You should have basic knowledge of Java programming. It is also very helpful
to consult the on-line Java API Specification.
Usage:
- GetURLsFromQuery. Given a query, use Google Web search to get a number
  of relevant web page URLs, which are returned in a Vector (a usage sketch
  follows the two-query variant below).
  - webVector = GWS.getPageURLs(queryString, numOfResult, startPoint, printOut, exactMatch);
- GetURLsFromTwoQueries. Given two queries, use Google Web search to get a
  number of relevant web page URLs, which are returned in a Vector.
  - webVector = GWS.getPageURLsFromTwoQueries(queryString1, queryString2, numOfResult, startPoint, printOut, exactMatch);
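A minimal usage sketch covering the two calls above. The concrete argument
values and parameter types are illustrative assumptions based on the
signatures listed:

    import java.util.Vector;

    // Illustrative values only; the flag types are assumptions.
    String queryString = "information retrieval";
    int numOfResult = 20;      // how many URLs to return
    int startPoint = 0;        // offset into the search results
    boolean printOut = true;   // print progress (assumed meaning)
    boolean exactMatch = false;

    // Single-query search: relevant URLs are returned in a Vector.
    Vector webVector = GWS.getPageURLs(queryString, numOfResult,
            startPoint, printOut, exactMatch);

    // Two-query variant.
    Vector webVector2 = GWS.getPageURLsFromTwoQueries("information retrieval",
            "text mining", numOfResult, startPoint, printOut, exactMatch);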
- GetPageFromURL. Given a URL string, first check whether the corresponding
  web page is already in the cache; fetch the file if so, otherwise crawl
  the web page. The returned page stream is a string.
- GetPageTextFromURL. Given a URL string, first check whether the
  corresponding web page is already in the cache directory; fetch the file
  if so, otherwise crawl the web page and parse it. doc.getDocStream()
  returns a string with all HTML tags removed. If you do not want to check
  whether the required page has already been crawled into the cache
  directory, simply create a DocumentItem object from a URL string.
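A minimal sketch of the direct path described above (no cache check),
assuming the DocumentItem constructor takes the URL string as stated:

    // Construct a DocumentItem directly from a URL string (skips the cache check).
    DocumentItem doc = new DocumentItem("http://www.cmu.edu/index.html");

    // Plain text of the page, with all HTML tags removed.
    String text = doc.getDocStream();
    System.out.println(text);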
- GetPageLinksFromURL. Similar to the previous function: once you have
  created a DocumentItem object from a URL string, doc.getDocLinks() returns
  a string containing the URLs appearing in the page together with their
  anchor texts (the clickable text attached to each link). The URL strings
  and anchor texts are wrapped in HTML <a>...</a> tags, so you need to
  postprocess the result to retrieve each URL string and anchor text
  separately.
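One possible postprocessing sketch using java.util.regex; the exact layout
of the <a>...</a> tags in the getDocLinks() output is an assumption:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    DocumentItem doc = new DocumentItem("http://www.cmu.edu/index.html");
    String links = doc.getDocLinks();   // e.g. <a href="...">anchor text</a> ...

    // Pull the URL (href value) and the anchor text out of each tag.
    Pattern p = Pattern.compile("<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(links);
    while (m.find()) {
        String url = m.group(1);         // the link target
        String anchorText = m.group(2);  // the clickable text
        System.out.println(url + "\t" + anchorText);
    }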
- CrawlDomain. Collect multiple web pages from the World Wide Web, beginning
  the crawl at the specified URL and collecting only web pages within the
  specified domain. As it visits URLs, it identifies all the hyperlinks in
  each page and adds them to the list of URLs to visit, recursively browsing
  the Web. It performs a breadth-first crawl until either it has collected
  MaxNumPagesToCollect pages or there are no further pages within the domain.
  The only web pages visited or stored are those whose extensions are
  included in {'.html', '.htm', '.txt', '/', '.php'}. Web pages are cached
  in the directory given by CacheDirectory. These pages are stored in their
  original HTML format, and the page URL is added as the first line of the
  file.
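The crawl loop described above can be sketched in plain Java. This is not
the WIT implementation; it is a minimal illustration of the breadth-first
strategy, the extension filter, the page limit, and the URL-as-first-line
caching, with deliberately simplified fetching and link extraction:

    import java.io.*;
    import java.net.URL;
    import java.util.*;
    import java.util.regex.*;

    public class DomainCrawlSketch {

        private static final List<String> ALLOWED =
                Arrays.asList(".html", ".htm", ".txt", "/", ".php");
        private static final Pattern HREF =
                Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

        public static void crawlDomain(String startURL, String domain,
                                       int maxNumPagesToCollect,
                                       String cacheDirectory) throws IOException {
            Queue<String> frontier = new ArrayDeque<String>();  // FIFO => breadth-first
            Set<String> seen = new HashSet<String>();
            frontier.add(startURL);
            seen.add(startURL);
            int collected = 0;

            while (!frontier.isEmpty() && collected < maxNumPagesToCollect) {
                String url = frontier.poll();
                String page = fetch(url);
                if (page == null) continue;

                // Cache the page in its original HTML; the URL is the first line.
                File out = new File(cacheDirectory, "page" + (++collected) + ".html");
                PrintWriter w = new PrintWriter(out, "UTF-8");
                w.println(url);
                w.print(page);
                w.close();

                // Enqueue unseen in-domain links with allowed extensions.
                Matcher m = HREF.matcher(page);
                while (m.find()) {
                    String link = m.group(1);
                    if (link.contains(domain) && allowedExtension(link)
                            && seen.add(link)) {
                        frontier.add(link);
                    }
                }
            }
        }

        private static boolean allowedExtension(String url) {
            for (String ext : ALLOWED)
                if (url.endsWith(ext)) return true;
            return false;
        }

        private static String fetch(String url) {
            try {
                BufferedReader r = new BufferedReader(
                        new InputStreamReader(new URL(url).openStream(), "UTF-8"));
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = r.readLine()) != null) sb.append(line).append('\n');
                r.close();
                return sb.toString();
            } catch (IOException e) {
                return null;   // skip pages that fail to download
            }
        }
    }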
- WitQuery. Given a query, get a number of relevant documents, URLs, or
  snippets by searching the web or retrieving them from precached web pages
  (a combined sketch of the three query calls follows the WebQuery item
  below).
  - Query.witQuery(queryString, numOfResult, resultTypeCode, cacheDir, indexDir, exactMatch);
- cacheQuery. Given a query, get a number of relevant documents, URLs, or
  snippets only by retrieving them from precached web pages.
  - Query.cacheQuery(queryString, numOfResult, resultTypeCode, cacheDir, indexDir, exactMatch);
  - Query.cacheQuery(queryString, numOfResult, resultTypeCode, cacheDir, indexDir, exactMatch, numOfTokenBetweenTwoPhases);
- WebQuery. Given a query, get a number of relevant documents, URLs, or
  snippets by searching the web; the web pages searched are newly fetched
  rather than taken from the precached web pages.
  - Query.webQuery(queryString, numOfResult, resultTypeCode, cacheDir, indexDir, exactMatch);
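A minimal usage sketch combining the three query calls above. The concrete
argument values, the parameter types, and the Vector return type are
illustrative assumptions; the meaning of resultTypeCode is whatever the WIT
API defines:

    import java.util.Vector;

    String queryString = "latent semantic indexing";  // example query (assumed)
    int numOfResult = 10;
    int resultTypeCode = 0;        // e.g. documents; assumed encoding
    String cacheDir = "./cache/";
    String indexDir = "./index/";
    boolean exactMatch = false;

    // Web search and/or precached pages.
    Vector witResults = Query.witQuery(queryString, numOfResult,
            resultTypeCode, cacheDir, indexDir, exactMatch);

    // Precached pages only.
    Vector cacheResults = Query.cacheQuery(queryString, numOfResult,
            resultTypeCode, cacheDir, indexDir, exactMatch);

    // Web search only, returning pages not already in the cache.
    Vector webResults = Query.webQuery(queryString, numOfResult,
            resultTypeCode, cacheDir, indexDir, exactMatch);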
- BuildIndexFun. Build an Indri index for a given directory that stores
  cached web pages. The index directory also needs to be specified. When the
  third argument equals "false", the function creates a new index for the
  cached web pages; when it is "true", the function appends the index to the
  original index files.
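A possible usage sketch; the hosting class name below is a placeholder, and
the argument order (cache directory, index directory, append flag as a
string) is inferred from the description above:

    String cacheDir = "./cache/";
    String indexDir = "./index/";

    // "false": build a fresh Indri index over the cached pages.
    BuildIndex.BuildIndexFun(cacheDir, indexDir, "false");

    // "true": append to the existing index files instead.
    BuildIndex.BuildIndexFun(cacheDir, indexDir, "true");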
WIT examples: WITExample1.java, WITExample2.java.
Please send your questions and feedback to
(last update: 31st Aug 2006)
Copyright © 2005-2006 CALD, CMU. All rights reserved.