15-859B: Assignment 1: Regression

Due: Tue, 3 Oct, 5pm at my office (Newell-Simon 4205), or in class earlier that day.

Summary

Part 1: Find or compute some points on a fairly smooth function, guess the form of the function, use least squares techniques to solve for the coefficients of the function, check the fit, and repeat guessing and testing until you get a satisfactory fit. Fitting a function in this way is called regression. Write the fitting program using QR factorization in your own low-level Matlab code.

Part 2: Fit a power series to samples of a cosine curve. Do the fitting two ways and compare. First, fit using the QR code from part 1; second, using the normal equations (this can be implemented using high-level Matlab calls).

Part 1: QR Factorization

Pick a data set. This could be, for example:

The CPU time or number of flops required by some algorithm, e.g. Gaussian elimination, sort, or even QR factorization. The program you time could be written by you or not.
Data points that purportedly demonstrate Moore's Law or exponential population growth. To get data, try searching for "moore's law" at google.com, for example.
The number of primes less than x. (Use some large x). See a number theory book to get the predicted functional form, or Knuth, The Art of Computer Programming, volume 2, under "prime numbers, distribution".
Other interesting data of your choice. The data can be noisy, but should appear, by eye, to be reasonably approximated by some smooth function. Functions with step or slope discontinuities are probably poor candidates, as are fractal-like functions like the Dow Jones average!

Use enough data points. Probably at least 10 and possibly many more if the data is noisy.

Guess the functional form.

If your data is the time for Gaussian elimination, you'd guess that y=c*x³, where x is the number of rows in the matrix and y is CPU time.
If it's the time for quicksort, you'd guess y=c*x*log(x), where x is the array length.
If it's Moore's Law, you expect the number of transistors per chip to go as c1*exp(c2*time). This is not linear in c2, so we can't apply the linear least squares technique to it directly, but if you let y=log(ntransistors) and x=time, then you can.

Implement a linear least squares solver using QR factorization and Householder transformations, as described in chapter 3 of Heath's book. Implement factorization and solving using low-level Matlab code. By low-level, I mean that you shouldn't use Matlab's matrix multiply operator on a 2-D matrix, or use any of Matlab's operators or functions for solving a system or inverting a matrix, such as the "\" operator. You can: store your data in a matrix, extract vectors from a matrix, write a vector into a submatrix, compute inner products of vectors, and use Matlab's vector and scalar operators. Within these constraints, make your code concise. Comment it for clarity. Your QR factorization routine should work on any m by n matrix A, for m>=n. I suggest you test this code on points on a line to check that it's working.

See the course software web page for links to Matlab info.

Solve for the coefficients of the functional form you chose earlier, on your test data. Compute the RMS error (related to the 2-norm of the residual: RMS error = ||b-Ax||₂/sqrt(m)) and plot and eyeball the results to decide if you've got a good fit. When computing the error you can use matrix multiply.

If you don't have a good fit, consider changing your functional form and repeating. Perhaps the function is not cubic but quadratic, or perhaps it's got significant lower-order terms, e.g. c3*x³+c2*x²+c1*x+c0. The number of unknown coefficients in your functional form is n. Beware of adding too many terms, however.

Pick the results you like best, because of good fit and/or simple functional form, and graph them (both data points and fit curve). Make sure the graphs are well labeled. Write up a discussion of at least two paragraphs but not more than one page about your approach and your conclusions. Among other things, this discussion should state the coefficient values and RMS error of your best fit, and whether you got the result expected. Also turn in a listing of your Matlab code.

Part 2: Compare QR and Normal Equations

Run your code from part 1 on m data points from the function y=cos(x). Fit with a power series with n terms, all even powers of x.

Now implement the normal equations method. You're allowed to use Matlab's matrix multiply and solution operators and functions this time. But don't solve the overdetermined system Ax=b directly, or Matlab will be smart and use QR factorization. Instead compute the matrices A^TA and A^Tb and solve the normal equations, A^TAx=A^Tb.

Compare the results, for various n, to determine when the normal equations method becomes inaccurate. How does adding random noise, as discussed in computer problem 3.6 of Heath (page 111), affect the results? How do the coefficients of your truncated power series compare with the infinite power series?

Turn in plots of at least two cases, the first where both methods fit well, and another where the fit of the normal equations method is poor, on each showing the data points, the curves, and the formulas from the two fitting methods. Turn in code listing (for the new code) and discussion.

Optional: compare your results to those found in the book Handbook of mathematical functions with formulas, graphs, and mathematical tables, edited by M. Abramowitz and I. A. Stegun, 1964, (in CMU's E&S library). Look up "circular functions, polynomial approximations" in the index.

15-859B, Introduction to Scientific Computing
Paul Heckbert
22 Sept. 2000.
24 Sept. modified to clarify that vector inner product and assigning vector to submatrix are OK.