edu.cmu.cs.readweb.util
Class Crawler

java.lang.Object
  extended by edu.cmu.cs.readweb.util.Crawler

public class Crawler
extends java.lang.Object


Field Summary
static java.lang.String DISALLOW
           
static int pageAddesToSearch
           
static int pageStreamCalled
           
static java.util.Vector vectorToSearch
           
 
Constructor Summary
Crawler()
          Creates a crawler to collect web pages from the World Wide Web.
 
Method Summary
static java.lang.String convertLegalUrlName(java.lang.String uName)
          We store each HTML file under its URL name.
static void CrawlDomain(java.lang.String startURL, java.lang.String startDomain, java.lang.String cacheDir, int SEARCH_LIMIT)
          Crawls web pages starting from a given start URL, constrained to a given domain.
static java.lang.String CrawlPage(java.lang.String inputURL, java.lang.String cacheDir)
          Crawls a page at the given URL.
static java.lang.String getPageStream(java.lang.String inputUrl)
          Gets web page content from a given URL.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

vectorToSearch

public static java.util.Vector vectorToSearch

pageStreamCalled

public static int pageStreamCalled

pageAddesToSearch

public static int pageAddesToSearch

DISALLOW

public static final java.lang.String DISALLOW
See Also:
Constant Field Values
Constructor Detail

Crawler

public Crawler()
Creates a crawler to collect web pages from the World Wide Web. In general, the crawler starts with a URL. As it visits that URL, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.
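
As a rough illustration of this visit-and-enqueue loop, here is a minimal sketch. The queue, visited set, and naive regex link extractor are assumptions for illustration only, not the class's actual internals; it does, however, call the documented getPageStream method:

    import edu.cmu.cs.readweb.util.Crawler;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CrawlSketch {
        // Naive href extractor; a real crawler would use an HTML parser.
        private static final Pattern HREF =
                Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

        public static void crawl(String startUrl, int limit) throws Exception {
            Queue<String> toVisit = new ArrayDeque<String>(); // the list of URLs to visit
            Set<String> seen = new HashSet<String>();         // pages already visited
            toVisit.add(startUrl);
            while (!toVisit.isEmpty() && seen.size() < limit) {
                String url = toVisit.poll();
                if (!seen.add(url)) continue;                 // skip already-visited pages
                String html = Crawler.getPageStream(url);     // fetch the page content
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    toVisit.add(m.group(1));                  // enqueue discovered hyperlinks
                }
            }
        }
    }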

Method Detail

CrawlPage

public static java.lang.String CrawlPage(java.lang.String inputURL,
                                         java.lang.String cacheDir)
Crawls a page at the given URL. It first checks the crawled pages in the cache directory and loads the page from file if it exists there; otherwise it crawls the page from the Internet directly.
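
A hedged usage example; the URL and cache directory below are illustrative only, not taken from the source:

    // Illustrative arguments; the returned String is presumably the page
    // content, loaded from the cache directory when available.
    String html = Crawler.CrawlPage("http://www.cs.cmu.edu/index.html", "./cache");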


CrawlDomain

public static void CrawlDomain(java.lang.String startURL,
                               java.lang.String startDomain,
                               java.lang.String cacheDir,
                               int SEARCH_LIMIT)
Crawls web pages starting from a given start URL, constrained to a given domain.

Parameters:
startURL - a start URL from which to crawl web pages
startDomain - a URL used to constrain the domain of crawled web pages
cacheDir - name of the directory in which to cache crawled web pages
SEARCH_LIMIT - the maximum number of pages to crawl
Example: , <100000>
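
A hedged usage example; the URLs and directory are illustrative (the original example's start URL was lost in the source), while the limit of 100000 echoes the example above:

    // Illustrative arguments only; substitute a real start URL and domain.
    Crawler.CrawlDomain("http://www.cs.cmu.edu/", "http://www.cs.cmu.edu",
                        "./cache", 100000);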


convertLegalUrlName

public static java.lang.String convertLegalUrlName(java.lang.String uName)
We store each HTML file under its URL name. However, a file name is not allowed to contain '/' or ':'. Here we replace '/' with '^', ':' with '_', and '?' with '&'.
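
A minimal sketch of this replacement; the one-line body is an assumption, not the actual source:

    public static String convertLegalUrlName(String uName) {
        // Replace characters that are illegal in file names, as documented above:
        // '/' -> '^', ':' -> '_', '?' -> '&'
        return uName.replace('/', '^').replace(':', '_').replace('?', '&');
    }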


getPageStream

public static java.lang.String getPageStream(java.lang.String inputUrl)
                                      throws java.net.MalformedURLException
Gets web page content from a given URL.

Throws:
java.net.MalformedURLException
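
A minimal sketch of fetching a page's content this way, reading via java.net.URL; the buffering details, and the extra IOException beyond the documented signature, are assumptions:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    public class PageFetchSketch {
        public static String getPageStream(String inputUrl)
                throws MalformedURLException, IOException {
            URL url = new URL(inputUrl);               // throws MalformedURLException on bad input
            StringBuilder content = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    content.append(line).append('\n'); // accumulate the page text
                }
            }
            return content.toString();
        }
    }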