Recommender Systems,
K-means, and GMM

Introduction

Welcome to the programming component of this assignment!

This assignment includes an autograder for you to grade your answers on your machine. This can be run with the command:

python3.6 autograder.py

The code for this assignment consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download and unzip all the code, data, and supporting files from hw10_programming.zip.

Files you will edit

recommender.py Your code to implement matrix factorization for recommender systems.
kmeans.py Your code to implement the K-means clustering algorithm.
gmm.py Your code to implement clustering with Gaussian mixture models.
additional_code.py Any additional code that you need to write to answer various questions goes here. This code should be runnable by calling python3.6 additional_code.py, but there are no requirements on its format, and it will not be executed by the autograder.

Files you might want to look at

util.py Convenience methods to generate various plots that will be needed in this assignment.
test_cases/Q*/*.py These are the unit tests that the autograder runs. Ideally, you would be writing these unit tests yourself, but we are saving you a bit of time and allowing the autograder to check these things. You should definitely be looking at these to see what is and is not being tested. These test cases also generate plots related to the task; the plots are saved in a directory named figures. The autograder on Gradescope may run a different version of these unit tests.

Files you can safely ignore

autograder.py Autograder infrastructure code.

Files to Edit and Submit: You will fill in portions of recommender.py, kmeans.py, and gmm.py (and additional_code.py, if necessary) during the assignment. You should submit these files containing your code and comments to the Programming component on Gradescope. Please do not change the other files in this distribution or submit any of our original files other than these. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder.

Report: Many of the sections in this programming assignment will contain questions that are not autograded. You will place the requested results in the appropriate locations within the PDF of the Written component of this assignment.

Evaluation: Your assignment will be assessed based on your code, the output of the autograder, and the required contents of the Written component.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, recitation, and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these assignments to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.

Question 1: Recommender Systems

In this question, you will train a recommender system to predict book ratings! The data used to train your system consists of actual book ratings pulled from goodreads.com via UCSD Book Graph.

You'll be using the young adult book dataset with 500 books, 1,002 users, and 10,000 ratings. The book information, including the book index, may be found in data/books.csv, and the ratings for user/book pairs are in data/ratings_train.csv and data/ratings_val.csv.

In recommender.py, implement the matrix_factorization_alt_min function to perform matrix factorization of ratings data using alternating minimization. See function docstring for details.

We'll be considering the following objective function:

\[J(\boldsymbol{U}, \boldsymbol{V}) = \frac{1}{2} \sum_{(i,j)\in\mathcal{S}} \left(R_{i,j} - \boldsymbol{u}_i^T \boldsymbol{v}_j\right)^2\]

You must implement stochastic gradient descent with alternating minimization. Specifically, loop through ratings, one at a time and in order, and for each triple \((i, j, r)\), update the two associated vectors in the rows of \(\boldsymbol{U}\) and \(\boldsymbol{V}\):

\[\boldsymbol{u}_i^{(t+1)} \leftarrow \boldsymbol{u}_i^{(t)} - \alpha \nabla_{\boldsymbol{u}_i} J(\boldsymbol{U}^{(t)}, \boldsymbol{V}^{(t)})\]

\[\boldsymbol{v}_j^{(t+1)} \leftarrow \boldsymbol{v}_j^{(t)} - \alpha \nabla_{\boldsymbol{v}_j} J(\boldsymbol{U}^{(t+1)}, \boldsymbol{V}^{(t)})\]

In order to be consistent with the autograder, the initial values in both of your matrices should be all ones.
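To make the updates concrete, here is a minimal sketch of the training loop, assuming the ratings arrive as a list of (i, j, r) triples along with the matrix sizes; the parameter names here are illustrative, and the exact signature and return values are specified in the recommender.py docstring:

import numpy as np

def matrix_factorization_alt_min_sketch(ratings, num_users, num_books, K, alpha, num_epoch):
    # Initialize both factor matrices to all ones, per the assignment spec.
    U = np.ones((num_users, K))
    V = np.ones((num_books, K))
    for _ in range(num_epoch):
        # Loop through the ratings one at a time, in order.
        for i, j, r in ratings:
            # For a single rating, the gradient of (1/2)(r - u_i^T v_j)^2
            # with respect to u_i is -(r - u_i^T v_j) v_j.
            err = r - U[i] @ V[j]
            U[i] = U[i] + alpha * err * V[j]   # update u_i first,
            err = r - U[i] @ V[j]              # recompute the error with the new u_i,
            V[j] = V[j] + alpha * err * U[i]   # then update v_j

    return U, V

Note that the error is recomputed between the two updates, matching the use of \(\boldsymbol{U}^{(t+1)}\) in the second update rule above.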

You may run the following command to run some unit tests on your Q1 implementation:

python3.6 autograder.py -q Q1

We encourage you to write your own code to test out your implementation as you work through the assignment.

Question for the write-up: After running matrix_factorization_alt_min with K=20, alpha=0.001, and num_epoch=200, which of the first 10 users (user indices 0-9) do you predict would rate The Lightning Thief (book index 6) the highest? Which would rate it the lowest? What are these respective ratings?
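One way to read off these predictions in additional_code.py is a small helper along the following lines, assuming U and V are the factor matrices returned by your matrix_factorization_alt_min; the names and call pattern are illustrative:

import numpy as np

def rating_extremes(U, V, book_idx=6, num_users=10):
    # Predicted ratings are the inner products u_i^T v_j.
    preds = U[:num_users] @ V[book_idx]
    hi, lo = int(np.argmax(preds)), int(np.argmin(preds))
    print(f"User {hi} rates it highest ({preds[hi]:.2f}); "
          f"user {lo} rates it lowest ({preds[lo]:.2f})")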

Feel free to have fun with this dataset and your recommender code: rate a few books yourself and see what your resulting system recommends!

Question 2: K-means

In kmeans.py, implement the kmeans function to perform K-means clustering on the MNIST dataset. See function docstring for details.

As you'll see in the code, we initialize the cluster centers to the first K points in the dataset. While this isn't an ideal initialization, it removes any randomness so the autograder can check your results deterministically.
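For reference, the overall loop looks something like the sketch below; the required behavior, signature, and stopping condition are specified in the kmeans.py docstring:

import numpy as np

def kmeans_sketch(X, K, num_iters):
    centers = X[:K].astype(float)  # initialize to the first K points, per the assignment
    for _ in range(num_iters):
        # Assignment step: assign each point to its nearest center by
        # squared Euclidean distance; dists has shape (N, K).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

The broadcasted distance computation builds an (N, K, D) intermediate array, which is fine for a sketch but memory-hungry on large datasets; a per-cluster loop is a reasonable alternative.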

You may run the following command to run some unit tests on your Q2 implementation:

python3.6 autograder.py -q Q2

The autograder will also render your 784-dimensional K-means centers as images. It will save these plots as PNG files in a new directory named figures. You are required to include some of these figures as part of the written component of this assignment. See the written component for details about which images to include.

Question 3: Gaussian Mixture Models

In this question, we'll be testing our GMM skills on an MNIST dataset with just the zero and one digits. Of course, since this is an unsupervised clustering task, we don't have any of the class labels for this dataset. To make our probabilistic clustering easier to visualize, we'll preprocess the MNIST dataset with PCA to compress it down to a 2-dimensional feature space.

PCA

In gmm.py, implement the pca function to use PCA to project data onto the first K principal components. See function docstring for details.

You may use either numpy.linalg.eig or numpy.linalg.svd to compute your eigenvectors. We recommend numpy.linalg.svd as it returns sorted arrays. You may NOT use high-level PCA solvers, for instance sklearn.decomposition.PCA.

NOTE: Eigenvectors are not unique. They can be scaled and still be correct eigenvectors. To make the autograder happy, you must pass your eigenvector matrix through gmm.consistent_scale_eigenvectors before using them to transform your data.
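Under these constraints, the projection looks roughly like the sketch below, which is meant to run inside gmm.py where consistent_scale_eigenvectors is defined. Whether to center the data and exactly what to return are specified in the docstring, and the consistent_scale_eigenvectors call is assumed to take and return the eigenvector matrix:

import numpy as np

def pca_sketch(X, K):
    X_centered = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, already sorted by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:K].T  # (D, K) matrix whose columns are the top-K eigenvectors
    W = consistent_scale_eigenvectors(W)  # provided in gmm.py; assumed call pattern
    return X_centered @ W  # project onto the first K principal components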

You may run the following command to run some unit tests on your Q3 PCA implementation:

python3.6 autograder.py -t test_cases/Q3/test1.py
python3.6 autograder.py -t test_cases/Q3/test2.py

EM for GMM

We've already given you the high-level structure of the EM algorithm for Gaussian mixture models in gmm.learn_gmm(), which alternates between: E) updating the conditional class probabilities given the data and the parameters, and M) updating the parameters given the data and the conditional class probabilities. While you shouldn't edit the core content of this learn_gmm() function, you may want to uncomment some of the printing and plotting statements to help you better understand what is happening at each iteration.

We've also provided a function to compute the PDF of a multivariate Gaussian, gmm.gaussian_pdf(), and a function to compute the log likelihood of the GMM, gmm.log_likelihood(), that we are trying to maximize:

\[ \ell(\boldsymbol{\theta}; \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \cdots, \boldsymbol{x}^{(N)}) = \sum_{i=1}^N \log p\left(\boldsymbol{x}^{(i)} \mid \boldsymbol{\theta}\right) = \sum_{i=1}^N \log \sum_{k=1}^K \pi_k \mathcal{N}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right) \]

where \(\pi_k = p\left(z_k^{(i)} = 1\right), \forall i \).
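As a sanity check on the notation, the provided log_likelihood amounts to something like the sketch below; the real function's signature and numerical details may differ, and gaussian_pdf here stands in for gmm.gaussian_pdf, whose exact call pattern is in its docstring:

import numpy as np

def log_likelihood_sketch(X, pi, mus, Sigmas, gaussian_pdf):
    ll = 0.0
    for x in X:
        # Mixture density of one point: sum_k pi_k N(x; mu_k, Sigma_k)
        mixture = sum(pi[k] * gaussian_pdf(x, mus[k], Sigmas[k]) for k in range(len(pi)))
        ll += np.log(mixture)
    return ll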

E-step

Implement the update_Z() function in gmm.py. See the function docstring for details.

For a fixed set of Gaussian mixture model parameters, \(\boldsymbol{\theta}^{(t)}\), update the probability that each point, \(\boldsymbol{x}^{(i)}\), belongs to cluster \(k\):

\[p\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right) \leftarrow \frac{\pi_k^{(t)} \mathcal{N}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\right)}{\sum_{j=1}^K \pi_j^{(t)} \mathcal{N}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)}\right)}, \forall i, k\]
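In code, this update amounts to filling an N x K matrix of responsibilities and normalizing each row, roughly as in this sketch (again, gaussian_pdf stands in for the provided gmm.gaussian_pdf, and the real spec is in the update_Z docstring):

import numpy as np

def update_Z_sketch(X, pi, mus, Sigmas, gaussian_pdf):
    N, K = X.shape[0], len(pi)
    Z = np.zeros((N, K))
    for i in range(N):
        for k in range(K):
            # Numerator: pi_k N(x^(i); mu_k, Sigma_k)
            Z[i, k] = pi[k] * gaussian_pdf(X[i], mus[k], Sigmas[k])
        Z[i] /= Z[i].sum()  # the denominator normalizes over all K clusters
    return Z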

M-step

Implement the update_parameters() function in gmm.py. See the function docstring for details.

\[ \pi_k^{(t+1)} \leftarrow \frac{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right) }{N}, \forall k \]

\[ \boldsymbol{\mu}_k^{(t+1)} \leftarrow \frac{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right) \boldsymbol{x}^{(i)} }{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right)}, \forall k \]

\[ \boldsymbol{\Sigma}_k^{(t+1)} \leftarrow \frac{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right) \left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k^{(t+1)}\right) \left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k^{(t+1)}\right)^T }{\sum_{i=1}^N P\left(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)}\right)}, \forall k \]
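Vectorized over a responsibility matrix Z, with Z[i, k] holding \(P(z_k^{(i)} = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}^{(t)})\), the three updates look roughly like this sketch; the real spec is in the update_parameters docstring:

import numpy as np

def update_parameters_sketch(X, Z):
    N, K = Z.shape
    Nk = Z.sum(axis=0)             # effective number of points per cluster
    pi = Nk / N                    # new mixing weights
    mus = (Z.T @ X) / Nk[:, None]  # responsibility-weighted means, shape (K, D)
    Sigmas = []
    for k in range(K):
        diff = X - mus[k]          # deviations from the *new* mean, shape (N, D)
        Sigmas.append((Z[:, k, None] * diff).T @ diff / Nk[k])
    return pi, mus, np.array(Sigmas)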

You may run the following command to run some unit tests on your Q3 implementation:

python3.6 autograder.py -q Q3

The autograder will also save various plots of the PCA and GMM results as PNG files in a new directory named figures. You are required to include some of these figures as part of the written component of this assignment. See the written component for details about which plots to include.

Submission

Complete all questions as specified in the above instructions. Then upload recommender.py, kmeans.py, gmm.py, and additional_code.py to Gradescope. Your submission should finish running within 20 minutes, after which it will time out on Gradescope.

Don't forget to include any requested results in the PDF of the Written component, which is to be submitted on Gradescope as well.

You may submit to Gradescope as many times as you like. You may also run the autograder on your own machine to speed up the development process. Just note that the autograder on Gradescope will be slightly different from the local autograder. The autograder can be invoked on your own machine using the command:

python3.6 autograder.py

Note that running the autograder locally will not register your grades with us. Remember to submit your code when you want to register your grades for this assignment.

The autograder on Gradescope might take a while, but don't worry: as long as you submit before the deadline, it's not late.