Figure 5: Nine different weighting functions. The Gaussian function used by Vizier is the leftmost function in the middle row.
With nearest neighbor, a prediction at any point is made from a simple average of a small subset of nearby points. All the other points in the data set are completely ignored. The result is a discontinuous step function. All queries made in a region where the same subset of points is the nearest neighbor get exactly the same prediction, and there are discontinuities at the boundaries of these regions. An alternative which will smooth out the function is to use a weighted average instead. This is called kernel regression. Every point in the data set will receive a weight between 0.0 and 1.0 based on how close it is to the query. There are numerous different weighting functions that can be used. Fig. 5 shows nine common types. The first two at the top are undesirable because the weight goes to infinity as the distance goes to zero. This would defeat our attempts to average out noise in the data. The ``Uniform'' weighting is undesirable because it behaves much like k-nearest neighbor. Some subset of nearby points (all those within a fixed distance) get a weight of 1.0 while all the others get 0.0. This would cause the same discontinuities we saw with k-nearest neighbor. The functions on the bottom row are reasonable except that the weight drops completely to 0.0 after a certain distance. That's undesirable when a query is made far from the data. We would still like to give some weight to the nearest data points to give us some hint about what to predict. The functions in the center and the upper right have exactly the opposite problem. Their weight does not drop off fast enough and the result is that the numerous points far from the query collectively outweigh the points near the query and produce poor predictions. This leaves only the Gaussian function at the left of the middle row. Its tail drops off very quickly, but never becomes 0.0. We will focus only on the gaussian weighting function, which is what Vizier uses. A weight for each point is computed as follows:
Then a prediction is made with the weighted average:
Figure 6: Kernel regression with different kernel widths. The graphs are of localness = 3, 4, and 6, respectively.
Just as the choice of k in k-nearest neighbor is important for good predictions, the choice of (known as the kernel width) is important for good predictions with kernel regression. As gets large, the weight of all points approaches 1.0, and predictions become the global average of all the points. As gets small, all but the very nearest points get a weight of 0.0, and the fitted function starts to look like nearest neighbor.
We can observe this behavior with Vizier. For this example we will use a data file with only 9 data points so it will be more obvious what is happening. These plots are all shown in fig. 6
File -> Open -> kerex.mbl Edit -> Metacode -> Localness 3: Very very local Neighbors -> Neighbors 0: No nearest neighbors Model -> Graph -> Graph
The kernel width in Vizier is set by the ``localness'' parameter of the metacode editor. A localness value of 3 means the kernel width is 1/16 of the range of the x axis. Especially near the bottom of the graph, the function flattens out at the values of the actual data points rather than interpolating smoothly between them. This is reminiscent of the nearest neighbor plots. We can improve the model by increasing the kernel width.
Edit -> Metacode -> Localness 4: Very local Model -> Graph -> Graph
This kernel width is 1/8 of the range of the x axis. The general relationship between the kernel width and the localness number is:
The new graph is much better. It has smoothed out the noise and we might even trust its estimate of the function's gradient. Increasing the kernel width further will cause it to over-smooth the data.
Edit -> Metacode -> Localness 6: Somewhat local Model -> Graph -> Graph
This kernel width is half the range of the x axis and we can see that it has almost smeared all the data together into one flat line which is a poor fit to the data. This is a good time to try the other localness values to see how they effect the fitted function. If the localness is set lower than 3, you will notice the function going to a value of 375 when there are no nearby data points even though the nearest data points do not have values around 375. This is an artifact of the Bayesian priors which are described later in this tutorial. You may also notice that the label written next to localness 9 is ``global.'' 9 is a special case that does not follow eq. 4, but instead gives all the data a weight of 1.0.
Figure 7: Kernel regression on three different data sets, with localness values of 3, 3, and 4, respectively.
Now we return to the three data sets we used earlier and see how kernel regression does with them. They are shown in fig. 7.
File -> Open -> j1.mbl Edit -> Metacode -> Localness 3: Very very local Model -> Graph -> Graph File -> Open -> k1.mbl Model -> Graph -> Graph File -> Open -> a1.mbl Edit -> Metacode -> Localness 4: Very local Model -> Graph -> Graph
These three graphs look much nicer than the ones we got with linear regression or with k-nearest neighbor. They have smooth functions and do a decent job of fitting the data. The bumps in the graphs for j1.mbl and a1.mbl are a little worrisome. In general, the way to smooth out bumps is by increasing the kernel width. Unfortunately, doing that in these cases will cause over-smoothing of the data and a poorer fit. The kernel widths for these data sets were hand chosen and they demonstrate how important it is to choose the right kernel width for each data set. Finally, it should be noted that k-nearest neighbor can be combined with kernel regression. In that case, the k nearest data points automatically receive a weight of 1.0, while all the others are weighted by the weighting function.