**Figure 5:** Nine different weighting functions. The Gaussian function used by
Vizier is the leftmost function in the middle row.

With nearest neighbor, a prediction at any point is made from a simple
average of a small subset of nearby points. All the other points in the
data set are completely ignored. The result is a discontinuous step function.
All queries made in a region where the same subset of points forms the nearest
neighbors get exactly the same prediction, and there are discontinuities at
the boundaries of these regions. An alternative that smooths out the
function is to use a *weighted* average instead. This is called *kernel
regression*. Every point in the data set will receive a weight between
0.0 and 1.0 based on how close it is to the query. There are numerous
different weighting functions that can be used. Fig. 5
shows nine common types. The first two at the top are undesirable because
the weight goes to infinity as the distance goes to zero. This would defeat
our attempts to average out noise in the data. The ``Uniform'' weighting is
undesirable because it behaves much like k-nearest neighbor. Some subset of
nearby points (all those within a fixed distance) get a weight of 1.0 while
all the others get 0.0. This would cause the same discontinuities we saw
with k-nearest neighbor. The functions on the bottom row are reasonable
except that the weight drops completely to 0.0 after a certain distance.
That's undesirable when a query is made far from the data. We would still
like to give some weight to the nearest data points to give us some hint
about what to predict. The functions in the center and the upper right have
exactly the opposite problem. Their weight does not drop off fast enough and
the result is that the numerous points far from the query collectively
outweigh the points near the query and produce poor predictions. This leaves
only the Gaussian function at the left of the middle row. Its tail drops off
very quickly, but never becomes 0.0. We will focus only on the Gaussian
weighting function, which is what Vizier uses. A weight for each point is
computed from its distance to the query and the kernel width K:

    w_i = exp( -Distance(x_i, x_query)^2 / K^2 )

Then a prediction is made with the weighted average:

    prediction = ( sum_i w_i * y_i ) / ( sum_i w_i )
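The two formulas above can be sketched in a few lines of Python. This is a minimal 1-D illustration, not Vizier's implementation; the data, the Euclidean distance, and the function names are ours:

```python
import math

def gaussian_weight(x_i, x_query, K):
    """Gaussian weight: exp(-d^2 / K^2), where K is the kernel width."""
    d = x_i - x_query
    return math.exp(-(d * d) / (K * K))

def kernel_regression(x_query, points, K):
    """Weighted average of all y values, weighted by closeness to the query."""
    weights = [gaussian_weight(x, x_query, K) for x, _ in points]
    return sum(w * y for w, (_, y) in zip(weights, points)) / sum(weights)

# Toy 1-D data: a noisy line y ~= 2x
points = [(0.0, 0.1), (1.0, 2.2), (2.0, 3.9), (3.0, 6.1), (4.0, 8.0)]
print(kernel_regression(2.0, points, K=0.5))
```

Note that every point contributes to every prediction; distant points simply contribute very little, which is what makes the fitted function smooth.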

**Figure 6:** Kernel regression with different kernel widths. The graphs are of
localness = 3, 4, and 6, respectively.

Just as the choice of k in k-nearest neighbor is important for good predictions, the choice of K (known as the kernel width) is important for good predictions with kernel regression. As K gets large, the weight of all points approaches 1.0, and predictions become the global average of all the points. As K gets small, all but the very nearest points get weights near 0.0, and the fitted function starts to look like nearest neighbor.
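These two limiting behaviors are easy to verify numerically. In this illustrative sketch (our own toy data and function names), a huge K reproduces the global mean and a tiny K reproduces the nearest point's value:

```python
import math

def predict(xq, data, K):
    """Gaussian kernel regression prediction at xq."""
    ws = [math.exp(-((x - xq) ** 2) / (K * K)) for x, _ in data]
    return sum(w * y for w, (_, y) in zip(ws, data)) / sum(ws)

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

wide = predict(0.0, data, K=1e6)     # huge K: every weight is ~1.0
narrow = predict(0.0, data, K=0.05)  # tiny K: only the nearest point matters

print(wide, narrow)  # wide ~ 4.0 (global mean), narrow ~ 1.0 (nearest y)
```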

We can observe this behavior with Vizier. For this example we will use a data file with only 9 data points so it will be more obvious what is happening. These plots are all shown in fig. 6.

    File -> Open -> kerex.mbl
    Edit -> Metacode -> Localness 3: Very very local
    Neighbors -> Neighbors 0: No nearest neighbors
    Model -> Graph -> Graph

The kernel width in Vizier is set by the ``localness'' parameter of the metacode editor. A localness value of 3 means the kernel width is 1/16 of the range of the x axis. Especially near the bottom of the graph, the function flattens out at the values of the actual data points rather than interpolating smoothly between them. This is reminiscent of the nearest neighbor plots. We can improve the model by increasing the kernel width.

    Edit -> Metacode -> Localness 4: Very local
    Model -> Graph -> Graph

This kernel width is 1/8 of the range of the x axis. The general relationship between the kernel width and the localness number is:

    kernel width = (range of the x axis) * 2^(localness - 7)        (4)

which gives 1/16 of the range at localness 3, 1/8 at localness 4, and 1/2 at localness 6.
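The doubling pattern (1/16 of the range at localness 3, 1/8 at 4, 1/2 at 6) can be checked with a one-line helper. This is our own sketch of the rule, inferred from the examples in this section:

```python
def kernel_width_fraction(localness):
    """Kernel width as a fraction of the x-axis range: 2**(localness - 7).
    Matches the examples in the text: 1/16 at 3, 1/8 at 4, 1/2 at 6."""
    return 2.0 ** (localness - 7)

for loc in (3, 4, 6):
    print(loc, kernel_width_fraction(loc))  # 0.0625, 0.125, 0.5
```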

The new graph is much better. It has smoothed out the noise and we might even trust its estimate of the function's gradient. Increasing the kernel width further will cause it to over-smooth the data.

    Edit -> Metacode -> Localness 6: Somewhat local
    Model -> Graph -> Graph

This kernel width is half the range of the x axis, and we can see that it has smeared almost all the data together into one flat line, which is a poor fit to the data. This is a good time to try the other localness values to see how they affect the fitted function. If the localness is set lower than 3, you will notice the function going to a value of 375 when there are no nearby data points, even though the nearest data points do not have values around 375. This is an artifact of the Bayesian priors, which are described later in this tutorial. You may also notice that the label written next to localness 9 is ``global.'' 9 is a special case that does not follow eq. 4, but instead gives all the data a weight of 1.0.

**Figure 7:** Kernel regression on three different data sets, with localness values
of 3, 3, and 4, respectively.

Now we return to the three data sets we used earlier and see how kernel regression does with them. They are shown in fig. 7.

    File -> Open -> j1.mbl
    Edit -> Metacode -> Localness 3: Very very local
    Model -> Graph -> Graph
    File -> Open -> k1.mbl
    Model -> Graph -> Graph
    File -> Open -> a1.mbl
    Edit -> Metacode -> Localness 4: Very local
    Model -> Graph -> Graph

These three graphs look much nicer than the ones we got with linear
regression or with k-nearest neighbor. They have smooth functions and do a
decent job of fitting the data. The bumps in the graphs for *j1.mbl* and
*a1.mbl* are a little worrisome. In general, the way to smooth out bumps
is by increasing the kernel width. Unfortunately, doing that in these cases
will cause over-smoothing of the data and a poorer fit. The kernel widths
for these data sets were hand-chosen, and they demonstrate how important it
is to choose the right kernel width for each data set. Finally, it should be
noted that k-nearest neighbor can be combined with kernel regression. In
that case, the k nearest data points automatically receive a weight of 1.0,
while all the others are weighted by the weighting function.
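The combined scheme described above can be sketched as follows. The function name and toy data are ours; the idea is simply that the k nearest points are pinned to weight 1.0 while the rest keep their Gaussian weights:

```python
import math

def combined_weights(x_query, xs, k, K):
    """k nearest points get weight 1.0; every other point keeps its
    Gaussian weight exp(-d^2 / K^2)."""
    d = [abs(x - x_query) for x in xs]
    # Indices of the k smallest distances
    nearest = set(sorted(range(len(xs)), key=lambda i: d[i])[:k])
    return [1.0 if i in nearest else math.exp(-(d[i] ** 2) / (K * K))
            for i in range(len(xs))]

xs = [0.0, 1.0, 2.0, 5.0]
print(combined_weights(0.0, xs, k=2, K=1.0))
```

With k=2 the two closest points (at 0.0 and 1.0) get weight 1.0, the point at 2.0 keeps a small Gaussian weight, and the point at 5.0 contributes almost nothing.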

Fri Feb 7 18:00:08 EST 1997