Introduction to ML Concepts, Regression, and Classification

Introduction

Welcome to the programming component of this assignment!

This assignment includes an autograder for you to grade your answers on your machine. This can be run with the command:

python3.6 autograder.py

The code for this assignment consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download and unzip all the code, data, and supporting files from hw2_programming.zip.

Files you will edit

`regression.py`	Your code to implement regression tasks.
`classification.py`	Your code to implement handwritten digit classification tasks.
`additional_code.py`	Any additional code that you need to write to answer various questions will go here. This code should be runnable by calling `python3.6 additional_code.py`, but there are no requirements on the format and it will not be executed by the autograder.

Files you might want to look at

`util.py`	Convenience methods to generate various plots that will be needed in this assignment.
`test_cases/Q/.py`	These are the unit tests that the autograder runs. Ideally, you would be writing these unit tests yourself, but we are saving you a bit of time and allowing the autograder to check these things. You should definitely be looking at these to see what is and is not being tested. The autograder on Gradescope may run a different version of these unit tests.

Files you can safely ignore

`autograder.py`	Autograder infrastructure code.

Files to Edit and Submit: You will fill in portions of regression.py, classification.py, and additional_code.py during the assignment. You should submit these files containing your code and comments to the Programming component on Gradescope. Please do not change the other files in this distribution or submit any of our original files other than these files. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder.

Report: Many of the sections in this programming assignment will contain questions that are not autograded. You will place the requested results in the appropriate locations within the PDF of the Written component of this assignment.

Evaluation: Your assignment will be assessed based on your code, the output of the autograder, and the required contents of the Written component.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, recitation, and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these assignments to be rewarding and instructional, not frustrating and demoralizing. But we don't know when or how to help unless you ask.

PyTorch Installation

Install PyTorch and look through some of the tutorials. Specifically, take a look at the What is PyTorch, Neural Networks, and Training a Classifier sections within the 60 Minute Blitz tutorial.

Install

You should be able to install PyTorch on the unix.andrew.cmu.edu machines by adding the --user option to pip3 install:

pip3 install --user torch==1.4.0+cpu torchvision==0.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

You will not need the CUDA (GPU) option. The autograder on Gradescope will not be using it.

Verify Installation

You can verify that your installation is successful by running the following Python 3.6 code:

import torch
x = torch.rand(5, 3)
print(x)

Make sure you ask for help if you are having issues installing PyTorch, as you won't be able to complete this assignment without it.

Question 1 Regression: Load and split data

In regression.py, implement the load_and_split_data function to load our regression data and split it into training and validation sets. See the function docstring for details.

You are required to use the Python NumPy library to implement code in this and many other questions in this course. If you are not familiar with NumPy, please take some time to walk through an online tutorial, such as https://docs.scipy.org/doc/numpy/user/quickstart.html.
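As a rough illustration of the kind of NumPy indexing a train/validation split involves, the sketch below shuffles row indices and slices two paired arrays. The helper name `split_arrays`, the 80/20 fraction, the shuffling, and the fixed seed are all assumptions made for this example; the actual requirements are the ones stated in the load_and_split_data docstring.

```python
import numpy as np

def split_arrays(x, y, train_fraction=0.8, seed=0):
    """Hypothetical helper: shuffle paired arrays and split them in two."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(x.shape[0])          # shuffled row indices
    n_train = int(train_fraction * x.shape[0])   # size of the training part
    train_idx, val_idx = order[:n_train], order[n_train:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

# Toy data (assumed): ten points on the line y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
x_train, y_train, x_val, y_val = split_arrays(x, y)
print(x_train.shape, x_val.shape)
```

Indexing both arrays with the same index vector keeps each input paired with its output after shuffling.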

You may run the following command to run a quick unit test on your Q1 implementation:

python3.6 autograder.py -q Q1

We encourage you to write your own code to test out your implementation as you work through the assignment. For example, you may want to use some of the functions in util.py to plot the data that you just loaded.

Question for the write-up: Why didn't we give you a test set?

Question 2 Regression: Closed Form Solution

In regression.py, implement the setup_design_matrix and linear_closed_form_fit functions to calculate the closed form solution to this linear least squares problem.

We place the input training vector, $\boldsymbol{x} \in \mathbb{R}^{N\times 1}$, into a design matrix, $\boldsymbol{X} \in \mathbb{R}^{N\times 2}$, so that we can account for a bias term in our linear model. We do this by inserting a column of ones as the first column of $\boldsymbol{X}$. This also means that the weight is now a weight vector, $\boldsymbol{w} = [b, w]^T$.

Using the design matrix and the output training data, $\boldsymbol{y}$, we can then directly solve for the optimal weight vector, $\boldsymbol{w}^*$: $$\boldsymbol{w}^* = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$

We require that you use numpy operations, rather than for loops, to implement this closed form solution. You may use numpy.linalg.inv, but you may NOT use numpy solvers, such as numpy.linalg.solve or numpy.linalg.lstsq.
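To make the formula above concrete, here is a minimal sketch on made-up data. The four toy points and the variable names are assumptions for illustration and are not taken from the assignment's starter code; only `numpy.linalg.inv` and matrix products are used, in line with the restriction above.

```python
import numpy as np

# Toy data on the line y = 1 + 2x (an assumption for illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones (bias term) followed by the inputs.
X = np.column_stack([np.ones_like(x), x])     # shape (N, 2)

# Normal equations: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_star)   # approximately [1. 2.], i.e. [b, w]
```

Because the ones column comes first, the first entry of the result is the bias and the second is the slope, matching $\boldsymbol{w} = [b, w]^T$.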

You may run the following command to run a quick unit test on your implementation:

python3.6 autograder.py -q Q2

The autograder will also plot data and the hypothesis function line and save the plot as regression_closed_form.png in a new directory named figures. You are required to include this figure as part of the written component of this assignment.

Question for the write-up: What is the mean squared error on the training set? (Don't divide by 2.)

Question for the write-up: What is the mean squared error on the validation set? (Don't divide by 2.) You will need to write additional code to answer this.
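For reference, the "don't divide by 2" convention means the plain average of squared residuals. A small sketch, with a hypothetical function name and made-up numbers:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared residuals, with no extra factor of 1/2."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(residuals ** 2))

# Residuals are 0, 0, and -2, so the result is (0 + 0 + 4) / 3.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```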

Additional code: Any additional code that you write to answer these questions should be included in additional_code.py

Question 3 Regression: PyTorch SGD

So far, we have been using NumPy; this is where we transition to PyTorch. PyTorch is designed for building neural networks, but in this question we are going to leverage PyTorch to do stochastic gradient descent on our 1-D linear regression model.

See the PyTorch Installation section to make sure you have PyTorch installed, and take some time to work through some PyTorch tutorials.

In regression.py, implement the LinearRegressionNet.__init__ and LinearRegressionNet.forward methods. Follow https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#define-a-convolutional-neural-network as a template, but instead use only one torch.nn.Linear layer with one input and one output. No need for convolution, ReLU, or pool layers. (The x.view is not necessary either.)
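The general shape of a module with a single linear layer might look like the sketch below. The class name `OneFeatureLinear` and the attribute name `fc` are hypothetical; the starter code and its docstrings define what the autograder actually expects.

```python
import torch
import torch.nn as nn

class OneFeatureLinear(nn.Module):
    """Hypothetical module: a single affine map y = w * x + b."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 1)   # one input feature, one output

    def forward(self, x):
        return self.fc(x)

net = OneFeatureLinear()
batch = torch.rand(4, 1)            # 4 samples, 1 feature each
print(net(batch).shape)             # torch.Size([4, 1])
```

Note that `torch.nn.Linear` expects the feature dimension last, so a 1-D regression input is a column of shape (N, 1) rather than a flat vector of length N.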

The code to set up the loss function, the SGD optimization algorithm, and actually run the training is provided for you in train_linear_regression_net. You should become familiar with this code, as you will need to implement a similar version later in this assignment.
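For orientation, a generic PyTorch training loop with a mean squared error loss and SGD often looks something like the sketch below. This is not the provided train_linear_regression_net; the toy data, the learning rate, the number of epochs, and the one-sample-per-update scheme are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data from y = 2x + 1 (assumed for illustration).
x = torch.linspace(0, 1, 20).unsqueeze(1)   # shape (20, 1)
y = 2.0 * x + 1.0

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

initial_loss = criterion(model(x), y).item()
for epoch in range(200):
    for i in range(x.shape[0]):             # one sample per update
        optimizer.zero_grad()               # clear gradients from the last step
        loss = criterion(model(x[i:i + 1]), y[i:i + 1])
        loss.backward()                     # backpropagate
        optimizer.step()                    # update weight and bias
final_loss = criterion(model(x), y).item()
print(initial_loss, final_loss)
```

The four calls inside the inner loop (zero the gradients, compute the loss, backpropagate, step) are the pattern to look for when reading the provided code.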

You may run the following command to run a unit test on your implementation:

python3.6 autograder.py -q Q3

The autograder will also plot data and the hypothesis function line and save the plot as regression_sgd.png in the figures directory. You are required to include this figure as part of the written component of this assignment.

Question 4 Regression: PyTorch Neural Network

Neural networks for 1-D regression!

In regression.py, implement the RegressionNeuralNet.__init__ and RegressionNeuralNet.forward methods. Follow https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#define-a-convolutional-neural-network as a template, but instead of convolution and pool layers, use as many torch.nn.Linear and torch.nn.functional.relu layers as you like. You can also set the number of outputs for each linear layer however you like, with the exception of the last one, which should have just one output. Make sure that the number of inputs argument to torch.nn.Linear is the same as the number of outputs in the previous layer.
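The rule about matching input and output sizes can be seen in a small sketch. The class name `TinyRegressionMLP`, the choice of two hidden layers, and the hidden width of 16 are arbitrary assumptions for this example, not recommended settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRegressionMLP(nn.Module):
    """Hypothetical 1-D regression network with two hidden layers."""

    def __init__(self, hidden=16):
        super().__init__()
        self.fc1 = nn.Linear(1, hidden)
        self.fc2 = nn.Linear(hidden, hidden)   # inputs match fc1's outputs
        self.fc3 = nn.Linear(hidden, 1)        # last layer has a single output

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                     # no ReLU on the regression output

net = TinyRegressionMLP()
out = net(torch.rand(5, 1))
print(out.shape)
```

A mismatch between one layer's outputs and the next layer's inputs typically surfaces as a shape error on the first forward pass, so running a small random batch through the network is a quick sanity check.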

The goal in this question is to design a neural network that has even better mean squared error on the training set than the closed form solution (Q2) or the linear SGD solution (Q3).

Similar to Q3, the code to set up the loss function, the SGD optimization algorithm, and actually run the training is provided for you in train_regression_neural_net.

You may run the following command to run a unit test on your implementation:

python3.6 autograder.py -q Q4

The autograder will also plot data and the hypothesis function line and save the plot as regression_net.png in the figures directory. You are required to include this figure as part of the written component of this assignment.

Question for the write-up: What is the mean squared error on the training set? (Don't divide by 2.)

Question for the write-up: What is the mean squared error on the validation set? (Don't divide by 2.) You will need to write additional code to answer this.

For the write-up, it is ok if these numbers come from a different training run than the autograder.

Additional code: Any additional code that you write to answer these questions should be included in additional_code.py

Question 5 Classification: Load and split data

In classification.py, implement the load_and_split_data function to load handwritten digit data from the MNIST dataset and split it into training and validation sets. See the function docstring for details.

You may run the following command to run a quick unit test on your Q5 implementation:

python3.6 autograder.py -q Q5

Question 6 Classification: Using Linear Regression!

In classification.py, implement the following functions to formulate this classification problem as a linear regression problem and solve using the closed form solution:

setup_design_matrix
setup_onehot_label_matrix
linear_closed_form_fit
predict_labels_from_regression
compute_accuracy
confusion_matrix

You will have to rely on work from Q2 on the written component of this assignment for how to formulate this as a linear least squares problem and solve.

We require that you use numpy operations, rather than for loops, to implement this closed form solution. You may use numpy.linalg.pinv, but you may NOT use numpy solvers, such as numpy.linalg.solve or numpy.linalg.lstsq.
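The overall idea of regressing onto one-hot labels and then picking the highest-scoring class can be sketched on a tiny made-up problem. Everything here (three classes, four samples, features equal to the one-hot codes, the variable names) is an assumption for illustration and says nothing about how the assignment's functions must be structured.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])
num_classes = 3

# One-hot label matrix: row i has a 1 in column labels[i].
Y = np.eye(num_classes)[labels]                         # shape (4, 3)

# Toy features (assumed): here simply the one-hot codes themselves.
features = Y.copy()
X = np.column_stack([np.ones(len(labels)), features])   # bias column first

# X^T X is singular here (the ones column is the sum of the other columns),
# so the pseudo-inverse is used instead of a plain inverse.
W = np.linalg.pinv(X) @ Y                               # shape (4, 3)

predicted = np.argmax(X @ W, axis=1)                    # highest score wins
print(predicted)   # [0 2 1 2]
```

The toy design matrix is deliberately rank deficient: `numpy.linalg.pinv` still returns a least-squares solution in that case, whereas a plain inverse of $\boldsymbol{X}^T\boldsymbol{X}$ would not exist. Constant pixel columns in image data can cause the same kind of rank deficiency.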

You may run the following command to run a quick unit test on your implementation:

python3.6 autograder.py -q Q6

Question for the write-up: How many times is an eight in the training set incorrectly labelled as a nine?

Question for the write-up: What is the accuracy on the training set?

Question for the write-up: What is the accuracy on the validation set? You will need to write additional code to answer this.
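As a small illustration of how accuracy and confusion counts relate to each other, consider the sketch below. The function names are hypothetical, and the convention that rows index the true label and columns index the predicted label is an assumption; check the docstrings in classification.py for the convention the assignment uses.

```python
import numpy as np

def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    return float(np.mean(true_labels == predicted_labels))

def confusion_counts(true_labels, predicted_labels, num_classes):
    """counts[i, j] = number of samples with true label i predicted as j."""
    counts = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(counts, (true_labels, predicted_labels), 1)   # vectorized tally
    return counts

true = np.array([0, 1, 1, 2, 2, 2])
pred = np.array([0, 1, 2, 2, 2, 1])
cm = confusion_counts(true, pred, 3)
print(accuracy(true, pred))   # 4 of 6 correct
print(cm[2, 1])               # 1: a true 2 predicted as 1, once
```

Under this convention, the diagonal holds the correct predictions, so the accuracy equals the trace of the matrix divided by the number of samples.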

Additional code: Any additional code that you write to answer these questions should be included in additional_code.py

Question 7 Classification: PyTorch Neural Network

Neural networks for digit classification!

In classification.py, implement the following methods and functions:

DigitNet.__init__
DigitNet.forward
train_neural_net
predict_labels_from_network

In this question, you will implement the following specific neural network in the DigitNet class:

$$\text{Input}_{784} \rightarrow \text{Linear}_{50} \rightarrow \text{ReLU} \rightarrow \text{Linear}_{50} \rightarrow \text{ReLU} \rightarrow \text{Linear}_{10}$$

where the $N$ in $\text{Linear}_N$ is the number of output values for that linear function.
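One way to sanity-check the layer sizes in this diagram is to build the same stack with `torch.nn.Sequential` and push a random batch through it. This is only a shape check written for illustration; it is not the required DigitNet class, which should follow the class-based template described in this question.

```python
import torch
import torch.nn as nn

# The same sequence of layers as in the diagram, as a throwaway container.
stack = nn.Sequential(
    nn.Linear(784, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 10),
)

batch = torch.rand(8, 784)                     # 8 flattened 28x28 images
scores = stack(batch)                          # one score per digit class
num_params = sum(p.numel() for p in stack.parameters())
print(scores.shape, num_params)                # torch.Size([8, 10]) 42310
```

The parameter count follows from the diagram: (784·50 + 50) + (50·50 + 50) + (50·10 + 10) = 42310 weights and biases.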

In the classification.train_neural_net function, you'll have to provide the code to set up the PyTorch data loader, loss function, SGD optimization, and for loops to train the network. This code is a lot like the code we provided for you in Q3 and Q4. See the docstring in the code for more details on exactly which settings you may use for batch size, learning rate, number of iterations, etc.
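The pieces named above (data loader, loss function, SGD, nested loops) fit together roughly as in the following sketch. The random stand-in data, the placeholder single-layer model, the choice of `CrossEntropyLoss`, and the batch size, learning rate, and epoch count are all assumptions for illustration; the docstring in classification.py specifies the settings you may actually use.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data (assumed): 64 "images" of 784 values, labels 0-9.
inputs = torch.rand(64, 784)
targets = torch.randint(0, 10, (64,))

loader = DataLoader(TensorDataset(inputs, targets), batch_size=16, shuffle=True)
model = nn.Linear(784, 10)           # placeholder model to keep the sketch short
criterion = nn.CrossEntropyLoss()    # expects raw scores and integer class labels
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2):                          # outer loop: passes over the data
    for batch_inputs, batch_targets in loader:  # inner loop: mini-batches
        optimizer.zero_grad()
        loss = criterion(model(batch_inputs), batch_targets)
        loss.backward()
        optimizer.step()
print(len(loader), loss.item())
```

Wrapping NumPy arrays for a `TensorDataset` usually involves `torch.from_numpy` plus a dtype conversion (float inputs, integer labels), which is a common source of errors worth checking first.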

Again, it will be helpful to follow https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#define-a-convolutional-neural-network as a template, but don't use convolution and pool layers, just use torch.nn.Linear and torch.nn.functional.relu layers. Make sure that the number of inputs argument to torch.nn.Linear is the same as the number of outputs in the previous layer.

You may run the following command to run a unit test on your implementation:

python3.6 autograder.py -q Q7

Question for the write-up: What is the accuracy on the training set?

Question for the write-up: What is the accuracy on the validation set? You will need to write additional code to answer this.
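Turning network outputs into predicted labels and an accuracy figure can be sketched as follows. The placeholder model, random inputs, and made-up labels are assumptions for illustration; with a trained network and real validation data the same steps apply.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(784, 10)            # placeholder for a trained network
images = torch.rand(5, 784)
labels = torch.tensor([3, 1, 4, 1, 5])

model.eval()
with torch.no_grad():                 # no gradients needed for evaluation
    scores = model(images)            # shape (5, 10)
    predicted = scores.argmax(dim=1)  # index of the largest score per row

accuracy = (predicted == labels).float().mean().item()
print(predicted.shape, accuracy)
```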

For the write-up, it is ok if these numbers come from a different training run than the autograder.

Additional code: Any additional code that you write to answer these questions should be included in additional_code.py

Submission

Complete all questions as specified in the above instructions. Then upload regression.py, classification.py, and additional_code.py to Gradescope. Your submission should finish running within 20 minutes, after which it will time out on Gradescope.

Don't forget to include any requested results in the PDF of the Written component, which is to be submitted on Gradescope as well.

You may submit to Gradescope as many times as you like. You may also run the autograder on your own machine to speed up the development process. Just note that the autograder on Gradescope will be slightly different than the local autograder. The autograder can be invoked on your own machine using the command:

python3.6 autograder.py

Note that running the autograder locally will not register your grades with us. Remember to submit your code when you want to register your grades for this assignment.

The autograder on Gradescope might take a while but don't worry: so long as you submit before the deadline, it's not late.