next up previous contents
Next: Multivariate Learning Up: Memory Based Learning Previous: Weighting function

How to fit the local points

Figure 8: Kernel regression on three different data sets, with localness values of 4, 3, and 3, respectively.

The fourth and last thing we need to define for a memory based learner is the function it will use to fit the local data. Kernel regression was an improvement over k-nearest neighbor on the data sets we've seen so far. However, it can still look fairly bad. The three data sets in fig. 8 show this.

File -> Open -> g1.mbl
Edit -> Metacode -> Localness 4: Very local
Model -> Graph -> Graph

File -> Open -> d1.mbl
Edit -> Metacode -> Localness 3: Very very local
Model -> Graph -> Graph

File -> Open -> h1.mbl
Model -> Graph -> Graph

Again, the kernel widths were chosen manually. Notice how the first two do an extremely poor job of extrapolating away from the data. The middle graph also has a strange bump where the gap in the data is. The data in h1.mbl was generated with a three segment piecewise linear function plus noise. Unfortunately, the gradient between the first two segments is not very trustworthy.

This problem can be addressed by changing the local model being fit. K-nearest neighbors and kernel regression use a constant, or averaging, local model. Sometimes, the fitted function can be improved by using a more sophisticated local model. Next, we consider a linear local model.

At the beginning of this tutorial we saw how global linear regression looks on some sample data sets. Now, we'll briefly summarize the math behind linear regression. It is fine to skim this section if looking at equations is not your thing. Assume that all of the data was generated from the following relationship:


where the k subscript is the data point number, tex2html_wrap_inline1458 is the kth input, tex2html_wrap_inline1462 is the kth output, tex2html_wrap_inline1466 is the coefficient for the input, and tex2html_wrap_inline1468 is drawn from a Gaussian distribution with mean zero and standard deviation tex2html_wrap_inline1470 . tex2html_wrap_inline1472 may be a vector of outputs, and tex2html_wrap_inline1474 may be a vector of inputs (which would make tex2html_wrap_inline1476 a vector and tex2html_wrap_inline1478 a matrix), but we leave them scalar here to simplify. The problem is to find the value tex2html_wrap_inline1480 which makes the given data set most likely. The tex2html_wrap_inline1482 which satisfies this condition turns out to be:


In order to find the value for tex2html_wrap_inline1484 which minimizes the right hand side of eq. 6, take the derivative of the right hand side with respect to tex2html_wrap_inline1486 and set it to zero. Doing so gives the following:


Usually, it is preferable to write this equation in its matrix form:


where tex2html_wrap_inline1488 is a matrix with one column for each input attribute and one row for each data point, tex2html_wrap_inline1490 is a matrix with one column for each output attribute and one row for each data point, and tex2html_wrap_inline1492 is a matrix with the number of rows equal to the number of outputs and the number of columns equal to the number of inputs.

Eq. 5 assumes that the regression line passes through the origin. In order to remove that constraint, it is necessary to add an extra ``input attribute'' whose value is always 1 for every data point. Therefore, if a data set contains n input attributes, the tex2html_wrap_inline1496 matrix has n+1 columns.

Fig. 2 shows global linear regression on several data sets. We can improve on the fit to these by fitting the linear model locally, just as we computed weighted averages earlier. A weight for each data point is computed just as before. Rather than computing a weighted average of the outputs as done in kernel regression, the weight is used on the inputs and outputs of the the data point. In matrix form, the weights for the data points are composed into a diagonal matrix, W, where the value of the ith diagonal element is the weight on the ith point. The regression equation becomes:


Fig. 9 show a graphical comparison between the way global linear regression works and the way local linear regression works.

Figure 9: Comparison between global and local linear regression. a) Global linear regression minimizes the sum of squared distances between the data and the line. The result is a line fit to the data. b) Local linear regression fits a line at the query point. It minimizes the sum of weighted squared distances between the data and the line. The line widths reflect the relative strengths of the data points in influencing the fit at the indicated query. The final result is a nonlinear curve fit to the data.

Figure 10: Local linear regression on three different data sets, with localness values of 4, 2, and 5, respectively.

Fig. 2 shows what global linear regression does on some sample data files. Vizier can show us what locally weighted linear regression does on these same files.

File -> Open -> j1.mbl
Edit -> Metacode -> Localness   4: Very local
                    Regression  L: Linear
Model -> Graph -> Graph

File -> Open -> k1.mbl
Edit -> Metacode -> Localness   2: Ultra local
Model -> Graph -> Graph

File -> Open -> a1.mbl
Edit -> Metacode -> Localness   5: Fairly local
Model -> Graph -> Graph

These graphs are shown in fig. 10. The fits appear better than the global linear regression graphs and the local averaging in fig. 7. Linear regression is also preferable to averaging because it can be used to make estimates of gradients (there will be more about gradients later).

Figure 11: Kernel, linear, and quadratic regression on one data set. The best localness values are determined manually to be 2, 3, and 4, respectively.

After observing that fitting a local linear model can be better than a local averaging model, we might ask whether there are other useful models. An obvious candidate is local quadratic regression. This is done by adding the quadratic terms to the tex2html_wrap_inline1504 matrix. For example, the full tex2html_wrap_inline1506 matrix for a two input quadratic regression would include separate columns for the terms: tex2html_wrap_inline1508 . This increases the dimension of the resulting tex2html_wrap_inline1510 matrix accordingly, but the equation for computing it is the same. In order to see the effects of quadratic regression, we can compare it against the average and linear models.

File -> Open -> i1.mbl
Edit -> Metacode -> Localness   2: Ultra local
                    Regression  A: Average
Model -> Graph -> Graph

Edit -> Metacode -> Localness   3: Very very local
                    Regression  L: Linear
Model -> Graph -> Graph

Edit -> Metacode -> Localness   4: Very local
                    Regression  Q: Quadratic
Model -> Graph -> Graph

Fig. 11 shows these graphs. The graphs get successively smoother as the regression goes from average to linear to quadratic. Also note that the best localness value increases. One way to get a smoother graph while using averaging is to increase its localness value, but this has the negative side effect of adding bias and making the fit worse (you can test this yourself with Vizier). One of the advantages of higher order polynomial regressions is that they allow you to use a wider smoothing kernel without causing bias problems. A wider kernel is advantageous because it makes use of more data in its fit. While linear regression has the advantage of giving gradient estimates, quadratic regression has the additional advantage of giving estimates of the local Hessian, which can be useful when doing optimization. The drawback of quadratic regression is that it has more parameters which means it requires more data to fit them well and it requires more computation to do the fit.

Vizier has two other models which are part way between linear and quadratic. The ``Ellipses'' model has all the pure quadratic terms (the tex2html_wrap_inline1512 ), but none of the cross terms ( tex2html_wrap_inline1514 for tex2html_wrap_inline1516 ). The ``Circles'' model is a linear model plus a single extra term for the sum of all the pure quadratic terms ( tex2html_wrap_inline1518 ). In some cases, one of these two models is a good compromise between linear and quadratic. It doesn't make sense to use them on the one dimensional data sets we've been using so far because one dimensional inputs have only one pure quadratic term and no cross terms.

At this point we have covered the four things that describe a locally weighted learning: the distance metric, the number of nearest neighbors, the weighting function, and the local model. In Vizier the choice of these four things is specified by the metacode. Using different metacode settings, you can create a variety of different locally weighted learners.

next up previous contents
Next: Multivariate Learning Up: Memory Based Learning Previous: Weighting function

Jeff Schneider
Fri Feb 7 18:00:08 EST 1997