To learn and use knowledge of the target function, it is first necessary to choose an appropriate representation for it. This representation must be compatible with available learning methods, and must allow the agent to evaluate learned knowledge efficiently (i.e., with a delay negligible compared to typical page access delays on the web). One difficulty is that web pages, the information associated with hyperlinks, and user information goals are all predominantly text-based, whereas most machine learning methods assume a more structured data representation such as a feature vector. We have experimented with a variety of representations that re-represent the arbitrary-length text associated with pages, links, and goals as a fixed-length feature vector. This idea is common within information retrieval systems [Salton and McGill, 1983]. It offers the advantage that the information in an arbitrary amount of text is summarized in a fixed-length feature vector compatible with current machine learning methods; it carries the disadvantage that much information is lost in this re-representation.
Table: Encoding of selected information for a given Page, Link, and Goal.
The experiments described here all use the same representation. Information about the current Page, the user's information search Goal, and a particular outgoing Link is represented by a vector of approximately 530 boolean features, each feature indicating the occurrence of a particular word within the text that originally defines these three attributes. The vector of 530 features is composed of four concatenated subvectors, as summarized in the table above.
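The encoding described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the vocabularies and example texts below are hypothetical stand-ins (the real word lists are chosen by mutual information, as described next, and are far larger).

```python
import re

def bow_features(text, vocabulary):
    """Encode arbitrary-length text as a fixed-length boolean vector:
    feature i is 1 iff the i-th vocabulary word occurs in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if w in words else 0 for w in vocabulary]

# Hypothetical per-field vocabularies (the real system selects each
# word list by mutual information over the training set).
page_vocab = ["faculty", "research", "course"]
link_vocab = ["publications", "projects", "home"]
goal_vocab = ["machine", "learning", "papers"]

page = bow_features("Research projects and course pages", page_vocab)
link = bow_features("List of publications", link_vocab)
goal = bow_features("machine learning papers", goal_vocab)

# Concatenate the per-field subvectors into one fixed-length vector.
vector = page + link + goal
# → [0, 1, 1, 1, 0, 0, 1, 1, 1]
```

The key property is that any amount of text maps to the same vector length, so the same learned classifier can be applied to every page, link, and goal.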
To choose the encodings for the first three fields, it was necessary to select which words would be considered. In each case, the words were selected by first gathering every distinct word that occurred over the training set, then ranking these according to their mutual information with respect to correctly classifying the training data, and finally choosing the top N words in this ranking. Mutual information is a common statistical measure (see, e.g., [Quinlan, 1993]) of the degree to which an individual feature (in this case a word) can correctly classify the observed data.
Figure 1 summarizes the encoding of information about the current Page, Link, and Goal.