Next: ATTac-2001 Up: Hotel Price Prediction Previous: Hotel Price Prediction

## The Learning Algorithm

Having described how we set up the learning problem, we are now ready to describe the learning algorithm that we used. Briefly, we solved this learning problem by first reducing to a multiclass, multi-label classification problem (or alternatively a multiple logistic regression problem), and then applying boosting techniques developed by Schapire and Singer SchapireSi99,SchapireSi00 combined with a modification of boosting algorithms for logistic regression proposed by Collins, Schapire and Singer CollinsScSi02. The result is a new machine-learning algorithm for solving conditional density estimation problems, described in detail in the remainder of this section. Table 5 shows pseudo-code for the entire algorithm.

Abstractly, we are given pairs where each belongs to a space and each is in . In our case, the 's are the auction-specific feature vectors described above; for some , . Each target quantity is the difference between closing price and current price. Given a new , our goal is to estimate the conditional distribution of given .

We proceed with the working assumption that all training and test examples are i.i.d. (i.e, drawn independently from identical distributions). Although this assumption is false in our case (since the agents, including ours, are changing over time), it seems like a reasonable approximation that greatly reduces the difficulty of the learning task.

Our first step is to reduce the estimation problem to a classification problem by breaking the range of the 's into bins:

for some breakpoints where for our problem, we chose .7 The endpoints and are chosen to be the smallest and largest values observed during training. We choose the remaining breakpoints so that roughly an equal number of training labels fall into each bin. (More technically, breakpoints are chosen so that the entropy of the distribution of bin frequencies is maximized.)

For each of the breakpoints (), our learning algorithm attempts to estimate the probability that a new (given ) will be at least . Given such estimates for each , we can then estimate the probability that is in the bin by (and we can then use a constant density within each bin). We thus have reduced the problem to one of estimating multiple conditional Bernoulli variables corresponding to the event , and for this, we use a logistic regression algorithm based on boosting techniques as described by CollinsScSi02 CollinsScSi02.

Our learning algorithm constructs a real-valued function with the interpretation that

 (9)

is our estimate of the probability that , given . The negative log likelihood of the conditional Bernoulli variable corresponding to being above or below is then

where
 (10)

We attempt to minimize this quantity for all training examples and all breakpoints . Specifically, we try to find a function minimizing

We use a boosting-like algorithm described by CollinsScSi02 CollinsScSi02 for minimizing objective functions of exactly this form. Specifically, we build the function in rounds. On each round , we add a new base function . Let

be the accumulating sum. Following Collins, Schapire and Singer, to construct each , we first let

be a set of weights on example-breakpoint pairs. We then choose to minimize
 (11)

over some space of simple'' base functions . For this work, we considered all decision stumps'' of the form

where is one of the features described above, and , , and are all real numbers. In other words, such an simply compares one feature to a threshold and returns a vector of numbers that depends only on whether is unknown (), or above or below . SchapireSi00 SchapireSi00 show how to efficiently search for the best such over all possible choices of , , , and . (We also employed their technique for smoothing'' , and .)

When computed by this sort of iterative procedure, CollinsScSi02 CollinsScSi02 prove the asymptotic convergence of to the minimum of the objective function in Equation (11) over all linear combinations of the base functions. For this problem, we fixed the number of rounds to . Let be the final predictor.

As noted above, given a new feature vector , we compute as in Equation (9) to be our estimate for the probability that , and we let and . For this to make sense, we need , or equivalently, , a condition that may not hold for the learned function . To force this condition, we replace by a reasonable (albeit heuristic) approximation that is nonincreasing in , namely, where (respectively, ) is the pointwise minimum (respectively, maximum) of all nonincreasing functions that everywhere upper bound (respectively, lower bound ).

With this modified function , we can compute modified probabilities . To sample a single point according to the estimated distribution on associated with , we choose bin with probability , and then select a point from this bin uniformly at random. Expected value according to this distribution is easily computed as

Although we present results using this algorithm in the trading agent context, we did not test its performance on more general learning problems, nor did we compare to other methods for conditional density estimation, such as those studied by Stone94 Stone94. This clearly should be an area for future research.

Next: ATTac-2001 Up: Hotel Price Prediction Previous: Hotel Price Prediction
Peter Stone 2003-09-24