
How Should Pages, Links and Goals be Represented?

In order to learn and utilize knowledge of the target function, it is necessary to first choose an appropriate representation for it. This representation must be compatible with available learning methods, and must allow the agent to evaluate learned knowledge efficiently (i.e., with a delay negligible compared to typical page-access delays on the web). One issue here is that web pages, information associated with hyperlinks, and user information goals are all predominantly text-based, whereas most machine learning methods assume a more structured data representation such as a feature vector. We have experimented with a variety of representations that re-represent the arbitrary-length text associated with pages, links, and goals as a fixed-length feature vector. This idea is common within information retrieval systems [Salton and McGill, 1983]. It offers the advantage that the information in an arbitrary amount of text is summarized in a fixed-length feature vector compatible with current machine learning methods. It also carries the disadvantage that much information is lost in this re-representation.
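The re-representation described above can be illustrated with a minimal sketch (not the system's actual code): a pre-chosen vocabulary fixes the vector length, and each feature records whether its word occurs in the text. The function and vocabulary here are illustrative assumptions.

```python
# Minimal sketch: re-represent arbitrary-length text as a fixed-length
# boolean feature vector over a vocabulary chosen in advance.
def encode(text, vocabulary):
    """Return one boolean feature per vocabulary word: 1 iff the word occurs."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocab = ["learning", "paper", "machine"]  # tiny illustrative vocabulary
print(encode("Machine Learning papers and tutorials", vocab))  # [1, 0, 1]
```

Note how "papers" does not match the vocabulary word "paper": with this exact-match scheme, any morphological variation (and any word outside the vocabulary) is information lost by the re-representation.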

  
Table: Encoding of selected information for a given Page, Link, and Goal.

The experiments described here all use the same representation. Information about the current Page, the user's information search Goal, and a particular outgoing Link is represented by a vector of approximately 530 boolean features, each feature indicating the occurrence of a particular word within the text that originally defined these three attributes. The vector is composed of four concatenated subvectors:

  1. Underlined words in the hyperlink. 200 boolean features are allocated to encode selected words that occur within the scope of the hypertext link (i.e., the underlined words seen by the user). These 200 features correspond to only the 200 words found to be most informative over all links in the training data (see below).

  2. Words in the sentence containing the hyperlink. 200 boolean features are allocated to indicate the occurrence of 200 selected words within the sentence (if any) that contains Link.

  3. Words in the headings associated with the hyperlink. 100 boolean features are allocated to indicate selected words that occur in the headings (if any) under which Link is found. This includes words occurring in headings at any level of nesting, as long as Link is within the scope of the heading. For example, in Figure 4, any of the words in the headings Machine Learning Information Services and General ML Information Sources may be used as features to describe the link that was highlighted.

  4. Words used to define the user goal. These features indicate words entered by the user while defining the information search goal. In our experiments, the only goals considered were searches for technical papers, for which the user could optionally enter the title, author, organization, etc. (see Figure 3). All words entered in this way throughout the training set were included (approximately 30 words, though the exact number varied with the training set used in the particular experiment). The encoding of the boolean feature in this case is assigned a 1 if and only if the word occurs in the user-specified goal and occurs in the hyperlink, sentence, or headings associated with this example.
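The four subvectors above can be sketched as follows. This is an illustrative reconstruction, not the system's code: the vocabularies are tiny stand-ins for the real 200/200/100/~30-word lists selected from training data, and all function and parameter names are assumptions.

```python
def bool_vec(words, vocab):
    """One boolean feature per vocabulary word: 1 iff the word occurs."""
    present = {w.lower() for w in words}
    return [1 if v in present else 0 for v in vocab]

def encode_example(link_words, sentence_words, heading_words, goal_words,
                   link_vocab, sent_vocab, head_vocab, goal_vocab):
    context = {w.lower() for w in link_words + sentence_words + heading_words}
    goal = {w.lower() for w in goal_words}
    # Goal features (rule 4 above): 1 iff the word occurs in the user's goal
    # AND in the hyperlink, sentence, or headings of this example.
    goal_feats = [1 if v in goal and v in context else 0 for v in goal_vocab]
    return (bool_vec(link_words, link_vocab)        # subvector 1: link text
            + bool_vec(sentence_words, sent_vocab)  # subvector 2: sentence
            + bool_vec(heading_words, head_vocab)   # subvector 3: headings
            + goal_feats)                           # subvector 4: goal words

x = encode_example(
    link_words=["Reinforcement", "Learning", "Archive"],
    sentence_words=["a", "list", "of", "paper", "archives"],
    heading_words=["Machine", "Learning"],
    goal_words=["learning", "author"],
    link_vocab=["reinforcement", "learning"],
    sent_vocab=["paper", "archive"],
    head_vocab=["machine"],
    goal_vocab=["learning", "author"])
print(x)  # [1, 1, 1, 0, 1, 1, 0]
```

In this toy run the goal word "learning" fires because it also appears in the link text, while "author" stays 0 despite being part of the goal, since it occurs nowhere in the link, sentence, or headings of this example.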

To choose the encodings for the first three fields, it was necessary to select which words would be considered. In each case, the words were selected by first gathering every distinct word that occurred over the training set, then ranking these according to their mutual information with respect to correctly classifying the training data, and finally choosing the top N words in this ranking. Mutual information is a common statistical measure (see, e.g., [Quinlan, 1993]) of the degree to which an individual feature (in this case a word) can correctly classify the observed data.
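This selection step can be sketched as follows, assuming the standard definition of mutual information between a binary word-occurrence feature and the class label; the function names and the toy training set are illustrative, not taken from the paper.

```python
import math

def mutual_information(docs, labels, word):
    """MI (in bits) between occurrence of `word` and the class label.
    `docs` is a list of word sets; `labels` the matching class labels."""
    n = len(docs)
    mi = 0.0
    for w_val in (True, False):
        p_w = sum(1 for d in docs if (word in d) == w_val) / n
        for c in set(labels):
            p_c = sum(1 for l in labels if l == c) / n
            joint = sum(1 for d, l in zip(docs, labels)
                        if (word in d) == w_val and l == c) / n
            if joint > 0:
                mi += joint * math.log2(joint / (p_w * p_c))
    return mi

def top_n_words(docs, labels, n):
    """Rank every distinct word in the training set by MI; keep the top n."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda w: mutual_information(docs, labels, w),
                  reverse=True)[:n]

docs = [{"paper", "learning"}, {"paper", "sports"},
        {"home", "sports"}, {"home", "news"}]
labels = [1, 1, 0, 0]  # 1 = link was followed, 0 = not (toy data)
print(top_n_words(docs, labels, 2))
```

Here "paper" and "home" each predict the label perfectly (MI = 1 bit) and so head the ranking, while "sports", which occurs equally often in both classes, scores 0 bits.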

Figure 1 summarizes the encoding of information about the current Page, Link, and Goal.






Thorsten Joachims
Thu Feb 9 16:27:34 EST 1995