websphinx
Class Page

java.lang.Object
  |
  +--websphinx.Region
        |
        +--websphinx.Page

public class Page
extends Region

A Web page. Although a Page can represent any MIME type, it mainly supports HTML pages, which are automatically parsed. The parsing produces a list of tags, a list of words, an HTML parse tree, and a list of links.


Field Summary
 
Fields inherited from class websphinx.Region
end, names, source, start, TRUE
 
Constructor Summary
Page(byte[] content)
          Make a Page from a byte array of content.
Page(Link link)
          Make a Page by downloading and parsing a Link.
Page(Link link, DownloadParameters dp)
          Make a Page by downloading a Link.
Page(Link link, DownloadParameters dp, HTMLParser parser)
          Make a Page by downloading a Link.
Page(java.lang.String content)
          Make a Page from a string of content.
Page(java.net.URL url, java.lang.String html)
          Make a Page from a URL and a string of HTML.
Page(java.net.URL url, java.lang.String html, HTMLParser parser)
          Make a Page from a URL and a string of HTML.
 
Method Summary
 void discardContent()
          Unlock the page's content (allowing it to be garbage-collected, to save space during a Web crawl).
 void download(DownloadParameters dp, HTMLParser parser)
           
 java.net.URL getBase()
          Get the base URL, relative to which the page's links were interpreted.
 java.lang.String getContent()
          Get the content of the page as a String.
 byte[] getContentBytes()
          Get the content of the page as an array of bytes.
 java.lang.String getContentEncoding()
          Get content encoding of page.
 java.lang.String getContentType()
          Get MIME type of page.
 int getDepth()
          Get depth of page in crawl.
 Element[] getElements()
          Get the HTML elements in the page.
 long getExpiration()
          Get expiration date of page.
 long getLastModified()
          Get last-modified date of page.
 Link[] getLinks()
          Get the links found in the page.
 Link getOrigin()
          Get the Link that points to this page.
 int getResponseCode()
          Get response code returned by the Web server.
 java.lang.String getResponseMessage()
          Get response message returned by the Web server.
 Element getRootElement()
          Get the root HTML element of the page.
 Tag[] getTags()
          Get the tag sequence of the page.
 java.lang.String getTitle()
          Get the title of the page.
 Region[] getTokens()
          Get the token sequence of the page.
 java.net.URL getURL()
          Get the URL.
 Text[] getWords()
          Get the words in the page.
 boolean hasContent()
          Test if page content is available.
 boolean isHTML()
          Test whether page is HTML.
 boolean isImage()
          Test whether page is a GIF or JPEG image.
 boolean isParsed()
          Test whether page has been parsed.
 void keepContent()
          Lock the page's content (to prevent it from being discarded).
static void main(java.lang.String[] args)
           
 void parse(HTMLParser parser)
          Parse the page.
 void setContentEncoding(java.lang.String encoding)
          Set content encoding of page.
 void setContentType(java.lang.String type)
          Set MIME type of page.
 void setExpiration(long expire)
          Set expiration date of page.
 void setLastModified(long last)
          Set last-modified date of page.
 java.lang.String substringCanonicalTags(int start, int end)
          Get canonicalized HTML tags found in a region.
 java.lang.String substringContent(int start, int end)
          Get raw content found in a region.
 java.lang.String substringHTML(int start, int end)
          Get HTML found in a region.
 java.lang.String substringTags(int start, int end)
          Get HTML tags found in a region.
 java.lang.String substringText(int start, int end)
          Get tagless text found in a region.
 java.lang.String toDescription()
          Generate a human-readable description of the page.
 java.lang.String toString()
          Get page containing the region.
 java.lang.String toURL()
          Convert the link's URL to a String.
 
Methods inherited from class websphinx.Region
enumerateObjectLabels, findEnd, findStart, getEnd, getField, getFields, getLabel, getLabel, getLength, getNumericLabel, getObjectLabel, getObjectLabels, getSource, getStart, hasAllLabels, hasAllLabels, hasAnyLabels, hasAnyLabels, hasLabel, removeLabel, setField, setFields, setLabel, setLabel, setObjectLabel, span, toHTML, toTags, toText
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Page

public Page(Link link)
     throws java.io.IOException
Make a Page by downloading and parsing a Link.

Parameters:
link - Link to download

Page

public Page(Link link,
            DownloadParameters dp)
     throws java.io.IOException
Make a Page by downloading a Link.

Parameters:
link - Link to download
dp - Download parameters to use

Page

public Page(Link link,
            DownloadParameters dp,
            HTMLParser parser)
     throws java.io.IOException
Make a Page by downloading a Link.

Parameters:
link - Link to download
dp - Download parameters to use
parser - HTML parser to use

Page

public Page(java.net.URL url,
            java.lang.String html)
Make a Page from a URL and a string of HTML. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.

Parameters:
url - URL to use as a base for relative links on the page
html - the HTML content of the page

Page

public Page(java.net.URL url,
            java.lang.String html,
            HTMLParser parser)
Make a Page from a URL and a string of HTML. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.

Parameters:
url - URL to use as a base for relative links on the page
html - the HTML content of the page
parser - HTML parser to use

Page

public Page(java.lang.String content)
Make a Page from a string of content. The content is not parsed. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.

Parameters:
content - HTML content of the page

Page

public Page(byte[] content)
Make a Page from a byte array of content. The content is not parsed. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.

Parameters:
content - byte content of the page

Method Detail

download

public void download(DownloadParameters dp,
                     HTMLParser parser)
              throws java.io.IOException

parse

public void parse(HTMLParser parser)
Parse the page. Assumes the page has already been downloaded.

Parameters:
parser - HTML parser to use
Throws:
java.lang.RuntimeException - if an error occurs in downloading the page

isParsed

public boolean isParsed()
Test whether page has been parsed. A page is parsed during download only if its MIME type is HTML or unspecified.

Returns:
true if page was parsed, false if not

isHTML

public boolean isHTML()
Test whether page is HTML.

Returns:
true if page is HTML.

isImage

public boolean isImage()
Test whether page is a GIF or JPEG image.

Returns:
true if page is a GIF or JPEG image, false if not
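
The MIME-type tests above (isHTML, isImage, and the rule that only HTML or unspecified types are parsed during download) can be illustrated with a small sketch. It is not WebSPHINX source: the class MimeSketch and its method names are hypothetical, and it assumes that an unspecified type is represented as null and that any parameters after a semicolon are ignored.

```java
import java.util.Locale;

// Hypothetical sketch of MIME-type checks; not the WebSPHINX implementation.
class MimeSketch {
    // Strip parameters such as "; charset=UTF-8" and normalize case.
    private static String bareType(String type) {
        int semi = type.indexOf(';');
        String bare = semi >= 0 ? type.substring(0, semi) : type;
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    static boolean isHtml(String type) {
        return type != null && bareType(type).equals("text/html");
    }

    // Only GIF and JPEG count as images, matching the wording of isImage().
    static boolean isImage(String type) {
        if (type == null) return false;
        String bare = bareType(type);
        return bare.equals("image/gif") || bare.equals("image/jpeg");
    }

    // Assumption: an unspecified MIME type is represented as null.
    static boolean parsedOnDownload(String type) {
        return type == null || isHtml(type);
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8")); // true
        System.out.println(isImage("image/png"));                // false
        System.out.println(parsedOnDownload(null));              // true
    }
}
```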

keepContent

public void keepContent()
Lock the page's content (to prevent it from being discarded). This method increments a lock counter, representing all the callers interested in preserving the content. The lock counter is set to 1 when the page is initially downloaded.


discardContent

public void discardContent()
Unlock the page's content (allowing it to be garbage-collected, to save space during a Web crawl). This method decrements a lock counter. If the counter falls to 0 (meaning no callers are interested in the content), the content is released. At least the following fields are discarded: content, tokens, tags, words, elements, and root. After the content has been discarded, calling getContent() (or getTokens(), getTags(), etc.) forces the page to be downloaded again, ideally from the cache.

Links are not considered part of the content, and are not subject to discarding by this method. Also, if the page was created from a string (rather than by downloading), its content is not subject to discarding (since there would be no way to recover it).
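
The lock counter shared by keepContent() and discardContent() behaves like a reference count. The following is a minimal sketch of that idea under stated assumptions: ContentLockSketch and its members are hypothetical names, and the real class also discards tokens, tags, words, and elements, which are left out here.

```java
// Hypothetical sketch of a lock-counted content holder; not WebSPHINX source.
class ContentLockSketch {
    private String content;   // null once the content has been released
    private int locks;        // number of callers still interested in the content

    ContentLockSketch(String downloaded) {
        this.content = downloaded;
        this.locks = 1;       // the counter starts at 1 after the initial download
    }

    // Counterpart of keepContent(): register one more interested caller.
    void keep() {
        locks++;
    }

    // Counterpart of discardContent(): release the content when the counter reaches 0.
    void discard() {
        if (locks > 0 && --locks == 0) {
            content = null;
        }
    }

    boolean hasContent() {
        return content != null;
    }

    public static void main(String[] args) {
        ContentLockSketch page = new ContentLockSketch("<html></html>");
        page.keep();                            // counter: 2
        page.discard();                         // counter: 1, content kept
        System.out.println(page.hasContent());  // true
        page.discard();                         // counter: 0, content released
        System.out.println(page.hasContent());  // false
    }
}
```

Each caller that invokes keep() is expected to invoke discard() exactly once when it is finished, so the content survives until the last interested caller lets go.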


hasContent

public final boolean hasContent()
Test if page content is available.

Returns:
true if content is downloaded and available, false if content has not been downloaded or has been discarded.

getDepth

public int getDepth()
Get depth of page in crawl.

Returns:
depth of page from root (depth of page is same as depth of its originating link)

getOrigin

public Link getOrigin()
Get the Link that points to this page.

Returns:
the Link object that was used to download this page.

getBase

public java.net.URL getBase()
Get the base URL, relative to which the page's links were interpreted. The base URL defaults to the URL of the Link that was used to download the page. If any redirects occur while downloading the page, the final location becomes the new base URL. Lastly, if a <BASE> element is found in the page, that becomes the new base URL.

Returns:
the page's base URL.
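
The precedence described for the base URL (link URL first, then the final redirect location, then a base declared inside the page) can be sketched with the standard java.net.URI class. BaseUrlSketch and chooseBase are hypothetical names used only for illustration; this is not how WebSPHINX itself is implemented.

```java
import java.net.URI;

// Hypothetical sketch of base-URL precedence; not WebSPHINX source.
class BaseUrlSketch {
    // Later sources override earlier ones: in-page base, then redirect target, then link URL.
    static URI chooseBase(URI linkUrl, URI redirectTarget, URI inPageBase) {
        if (inPageBase != null) return inPageBase;
        if (redirectTarget != null) return redirectTarget;
        return linkUrl;
    }

    public static void main(String[] args) {
        URI link = URI.create("http://a.example/docs/index.html");
        URI redirect = URI.create("http://b.example/moved/");

        // Relative links are resolved against whichever base wins.
        System.out.println(chooseBase(link, null, null).resolve("pic.gif"));
        // http://a.example/docs/pic.gif
        System.out.println(chooseBase(link, redirect, null).resolve("pic.gif"));
        // http://b.example/moved/pic.gif
    }
}
```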

getURL

public java.net.URL getURL()
Get the URL.

Returns:
the URL of the link that was used to download this page

getTitle

public java.lang.String getTitle()
Get the title of the page.

Returns:
the page's title, or null if the page hasn't been parsed.

getContent

public java.lang.String getContent()
Get the content of the page as a String. May not work properly for binary data like images; use getContentBytes instead.

Returns:
the String content of the page.

getContentBytes

public byte[] getContentBytes()
Get the content of the page as an array of bytes.

Returns:
the content of the page in binary form.
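
To see why getContent() may not suit binary data, consider what happens when arbitrary bytes are decoded into a String and encoded back. The sketch below assumes UTF-8 as the charset (the charset WebSPHINX actually uses is not stated here), and BinaryContentSketch is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch showing that bytes do not always survive a String round trip.
class BinaryContentSketch {
    static boolean survivesStringRoundTrip(byte[] data) {
        // Invalid byte sequences are replaced with U+FFFD during decoding.
        String asText = new String(data, StandardCharsets.UTF_8);
        byte[] back = asText.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {0x47, 0x49, 0x46, (byte) 0xFF};  // "GIF" plus a byte that is invalid in UTF-8

        System.out.println(survivesStringRoundTrip(text));    // true
        System.out.println(survivesStringRoundTrip(binary));  // false
    }
}
```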

getTokens

public Region[] getTokens()
Get the token sequence of the page. Tokens are tags and whitespace-delimited text.

Returns:
token regions in the page, or null if the page hasn't been downloaded or parsed.

getTags

public Tag[] getTags()
Get the tag sequence of the page.

Returns:
tags in the page, or null if the page hasn't been downloaded or parsed.

getWords

public Text[] getWords()
Get the words in the page. Words are whitespace- and tag-delimited text.

Returns:
words in the page, or null if the page hasn't been downloaded or parsed.
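
The difference between tokens (tags plus whitespace-delimited text) and words (text only) can be illustrated with a small regular-expression tokenizer. This is a rough sketch, not the WebSPHINX parser: TokenSketch is a hypothetical name, the results are plain strings rather than Region or Text objects, and a ">" inside a quoted attribute value would end the tag early.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of token and word extraction; not the WebSPHINX parser.
class TokenSketch {
    // A token is either a whole tag or a run of characters containing no whitespace and no '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^\\s<]+");

    static List<String> tokens(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Words are the tokens that are not tags.
    static List<String> words(String html) {
        List<String> out = new ArrayList<>();
        for (String token : tokens(html)) {
            if (!token.startsWith("<")) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>Hello  <b>big</b> world</p>";
        System.out.println(tokens(html)); // [<p>, Hello, <b>, big, </b>, world, </p>]
        System.out.println(words(html));  // [Hello, big, world]
    }
}
```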

getElements

public Element[] getElements()
Get the HTML elements in the page. All elements in the page are included in the list, in the order they would appear in an inorder traversal of the HTML parse tree.

Returns:
HTML elements in the page, in inorder traversal order, or null if the page hasn't been downloaded or parsed.

getRootElement

public Element getRootElement()
Get the root HTML element of the page.

Overrides:
getRootElement in class Region
Returns:
first top-level HTML element in the page, or null if the page hasn't been downloaded or parsed.

getLinks

public Link[] getLinks()
Get the links found in the page.

Returns:
links in the page, or null if the page hasn't been downloaded or parsed.

toURL

public java.lang.String toURL()
Convert the link's URL to a String.

Returns:
the URL represented as a string

toDescription

public java.lang.String toDescription()
Generate a human-readable description of the page.

Returns:
a description of the page, in the form "title [url]".

toString

public java.lang.String toString()
Get page containing the region.

Overrides:
toString in class Region
Returns:
page containing the region

getLastModified

public long getLastModified()
Get last-modified date of page.

Returns:
the date when the page was last modified, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.

setLastModified

public void setLastModified(long last)
Set last-modified date of page.

Parameters:
last - the date when the page was last modified, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.

getExpiration

public long getExpiration()
Get expiration date of page.

Returns:
the expiration date of the page, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.

setExpiration

public void setExpiration(long expire)
Set expiration date of page.

Parameters:
expire - the expiration date of the page, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.
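
The long values used for the last-modified and expiration dates can be converted to java.time.Instant for comparison. The sketch below takes the description above at face value and assumes the values are seconds since the epoch, with 0 meaning "not known"; if the values turn out to be in a different unit, the conversion call would need to change. PageDateSketch and its methods are hypothetical names.

```java
import java.time.Instant;

// Hypothetical sketch for interpreting page dates; assumes seconds since the epoch.
class PageDateSketch {
    // Returns null for the documented "not known" value 0.
    static Instant toInstant(long secondsSinceEpoch) {
        return secondsSinceEpoch == 0 ? null : Instant.ofEpochSecond(secondsSinceEpoch);
    }

    // A page with an unknown expiration date is treated as not expired.
    static boolean isExpired(long expirationSeconds, Instant now) {
        Instant expiration = toInstant(expirationSeconds);
        return expiration != null && expiration.isBefore(now);
    }

    public static void main(String[] args) {
        System.out.println(toInstant(86400L));                              // 1970-01-02T00:00:00Z
        System.out.println(isExpired(86400L, Instant.ofEpochSecond(90000))); // true
        System.out.println(isExpired(0L, Instant.now()));                    // false
    }
}
```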

getContentType

public java.lang.String getContentType()
Get MIME type of page.

Returns:
the MIME type of page, such as "text/html", or null if not known.

setContentType

public void setContentType(java.lang.String type)
Set MIME type of page.

Parameters:
type - the MIME type of page, such as "text/html", or null if not known.

getContentEncoding

public java.lang.String getContentEncoding()
Get content encoding of page.

Returns:
the encoding type of page, such as "base-64", or null if not known.

setContentEncoding

public void setContentEncoding(java.lang.String encoding)
Set content encoding of page.

Parameters:
encoding - the encoding type of page, such as "base-64", or null if not known.

getResponseCode

public int getResponseCode()
Get response code returned by the Web server. For a list of possible values, see java.net.HttpURLConnection.

Returns:
response code, such as 200 (for OK) or 404 (not found). Code is -1 if unknown.
See Also:
HttpURLConnection
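
A caller might interpret the returned code using the constants defined in java.net.HttpURLConnection. The following sketch shows one possible classification, including the documented -1 for an unknown code; ResponseCodeSketch and describe are hypothetical names, and the category labels are arbitrary.

```java
import java.net.HttpURLConnection;

// Hypothetical sketch for classifying a response code; the labels are illustrative only.
class ResponseCodeSketch {
    static String describe(int code) {
        if (code == -1) return "unknown";
        if (code == HttpURLConnection.HTTP_OK) return "ok";                // 200
        if (code == HttpURLConnection.HTTP_NOT_FOUND) return "not found";  // 404
        if (code >= 300 && code < 400) return "redirect";
        if (code >= 400) return "error";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(describe(200)); // ok
        System.out.println(describe(404)); // not found
        System.out.println(describe(-1));  // unknown
    }
}
```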

getResponseMessage

public java.lang.String getResponseMessage()
Get response message returned by the Web server.

Returns:
response message, such as "OK" or "Not Found". The response message is null if the page could not be fetched or the message is not known.

substringContent

public java.lang.String substringContent(int start,
                                         int end)
Get raw content found in a region.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
raw HTML contained in the region

substringHTML

public java.lang.String substringHTML(int start,
                                      int end)
Get HTML found in a region.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
representation of region as HTML

substringText

public java.lang.String substringText(int start,
                                      int end)
Get tagless text found in a region. Runs of whitespace and tags are reduced to a single space character.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
tagless text contained in the region
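
The reduction of tags and whitespace to a single space can be approximated with one regular expression. This is a sketch of the described behavior, not the library's code: TaglessTextSketch is a hypothetical name, the input is a plain string rather than a pair of offsets into a page, and leading or trailing spaces are left in place because the description does not say whether they are trimmed.

```java
// Hypothetical sketch of tag and whitespace collapsing; not WebSPHINX source.
class TaglessTextSketch {
    static String toText(String html) {
        // Each run made of tags and/or whitespace becomes exactly one space.
        return html.replaceAll("(?:<[^>]*>|\\s)+", " ");
    }

    public static void main(String[] args) {
        System.out.println(toText("Hello,<br>\n  <b>world</b>!")); // Hello, world !
    }
}
```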

substringTags

public java.lang.String substringTags(int start,
                                      int end)
Get HTML tags found in a region. Whitespace and text among the tags are deleted.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
tags contained in the region

substringCanonicalTags

public java.lang.String substringCanonicalTags(int start,
                                               int end)
Get canonicalized HTML tags found in a region. A canonicalized tag looks like the following:
 <tagname#index attr=value attr=value attr=value ...>
 
 where tagname and attr are all lowercase, index is the tag's
 index in the page's tokens array.  Attributes are sorted in
 increasing order by attribute name. Attributes without values
 omit the entire "=value" portion.  Values are delimited by a 
 space.  All occurrences of <, >, space, and % characters
 in a value are URL-encoded (e.g., space is converted to %20).
 Thus the only occurrences of these characters in the canonical
 tag are the tag delimiters.

 

For example, raw HTML that looks like:

 <IMG SRC="http://foo.com/map<>.gif" ISMAP>Image</IMG>
 
would be canonicalized to:
 <img ismap src=http://foo.com/map%3C%3E.gif></img>
 

Comment and declaration tags (whose tag name is !) are omitted from the canonicalization.

Parameters:
start - starting offset of region
end - ending offset of region
Returns:
canonicalized tags contained in the region
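
The canonicalization rules for a single tag can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: CanonicalTagSketch is a hypothetical name, the tag is supplied already split into a name and an attribute map (a null value stands for an attribute without a value), and the index 3 used in main is arbitrary. The sketch follows the format line and emits the #index suffix, which the worked example above leaves out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of tag canonicalization; not WebSPHINX source.
class CanonicalTagSketch {
    // URL-encode only the four characters named in the description.
    static String encodeValue(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '<': out.append("%3C"); break;
                case '>': out.append("%3E"); break;
                case ' ': out.append("%20"); break;
                case '%': out.append("%25"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // attrs maps attribute name to value; a null value means the "=value" part is omitted.
    static String canonicalize(String name, int index, Map<String, String> attrs) {
        // TreeMap sorts the lowercased attribute names in increasing order.
        Map<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sorted.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        StringBuilder out = new StringBuilder("<");
        out.append(name.toLowerCase(Locale.ROOT)).append('#').append(index);
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            out.append(' ').append(e.getKey());
            if (e.getValue() != null) {
                out.append('=').append(encodeValue(e.getValue()));
            }
        }
        return out.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("SRC", "http://foo.com/map<>.gif");
        attrs.put("ISMAP", null);
        System.out.println(canonicalize("IMG", 3, attrs));
        // <img#3 ismap src=http://foo.com/map%3C%3E.gif>
    }
}
```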

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception