
## Linear regression

Figure 2: Global linear regression on some one-dimensional data sets

Linear regression is a long-established statistical method for determining relationships between variables. It finds the linear function (in the 1-d case, a straight line) that minimizes the sum of squared errors between the function and the data points.
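To make the "minimizes the sum of squared errors" idea concrete, here is a minimal sketch of a 1-d least-squares line fit in plain Python. It is not how Vizier implements regression. The function names `fit_line` and `sum_squared_error` and the sample points are made up for illustration and do not come from any Vizier data file.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single input variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the line and the data points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustrative points (an assumption, not Vizier data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

slope, intercept = fit_line(xs, ys)
best = sum_squared_error(xs, ys, slope, intercept)
# Any other line, such as one with a slightly different slope, has a larger error.
worse = sum_squared_error(xs, ys, slope + 0.1, intercept)
print(slope, intercept, best, worse)
```

Perturbing the fitted slope or intercept in either direction should only increase the printed error, which is what "minimizes" means here.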

We can use Vizier to see what happens when linear regression is applied to some sample data sets. All of the data sets in this tutorial can be found in the data subdirectory of the Vizier installation.

```
File -> Open -> j1.mbl
Edit -> Metacode -> Regression   L: Linear
                    Localness    9: Global
                    Neighbors    0: No Nearest Neighbors
Model -> Graph -> Graph
```

In the previous operations, you did three separate things. First, you loaded the data file named j1.mbl. Second, you specified that you wanted to do linear regression. Don't worry about the meaning of the various fields in the Metacode editor yet; they'll be described in more detail later. Finally, you drew a graph showing the data in the file and the line fitted by linear regression. The many graphing options will also be described later. If you are not using Vizier to draw the graph, you can see what it looks like in fig. 2a. The graph shows that linear regression has captured a significant trend in the data but has not accurately modeled the relationship.

Next, we apply linear regression to another data set.

```
File -> Open -> k1.mbl
Model -> Graph -> Graph
```

The resulting graph is shown in fig. 2b. Here there is no noise in the data, and linear regression fits better. Unfortunately, it is also obvious that the fitted line glosses over some of the relationship.

Finally, we apply linear regression to one more data set.

```
File -> Open -> a1.mbl
Model -> Graph -> Graph
```

The resulting graph is shown in fig. 2c. In this case linear regression appears to be a reasonable choice. There is significant noise in the data, but the underlying relationship seems mostly linear.

The models for the first two graphs suffer from undesirable bias. Bias refers to the assumption that a particular function approximator makes about the form of the relationship. In these examples, the assumption is that the relationship is a straight line, and any data not matching that assumption is poorly represented.
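The effect of bias can be sketched outside Vizier with synthetic data. The example below assumes two hypothetical noise-free data sets (a straight line and a parabola, neither taken from the tutorial's .mbl files) and fits a least-squares line to each. The helper names `fit_line` and `residuals` are illustrative only.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Data minus fitted line at each input point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Synthetic, noise-free data (assumptions for illustration only).
xs = [i / 10 for i in range(11)]          # inputs from 0.0 to 1.0
straight = [2.0 * x + 1.0 for x in xs]    # matches the straight-line assumption
curved = [x * x for x in xs]              # violates the straight-line assumption

straight_res = residuals(xs, straight)
curved_res = residuals(xs, curved)

# Largest residual on the straight data: zero up to rounding error.
print(max(abs(r) for r in straight_res))
# Residuals on the curved data: systematic, not random.
print([round(r, 3) for r in curved_res])
```

On the curved data the residuals are positive at both ends and negative in the middle even though there is no noise. That systematic pattern is the signature of bias: the error comes from the straight-line assumption, not from the data.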


Jeff Schneider
Fri Feb 7 18:00:08 EST 1997