Description
Conceptual Questions

The answers to these questions should be answerable without referring to external materials. Briefly justify your answers with a few words.


[2 points] True or False: Given a data matrix X ∈ R^{n×d} where d is much smaller than n and k = rank(X), if we project our data onto a k dimensional subspace using PCA, our projection will have zero reconstruction error (in other words, we find a perfect representation of our data, with no information loss).



[2 points] True or False: Suppose that an n × n matrix X has a singular value decomposition of U SV ^{⊤}, where S is a diagonal n × n matrix. Then, the rows of V are equal to the eigenvectors of X^{⊤}X.



[2 points] True or False: choosing k to minimize the kmeans objective (see Equation (1) below) is a good way to find meaningful clusters.



[2 points] True or False: The singular value decomposition of a matrix is unique.



[2 points] True or False: The rank of a square matrix equals the number of its nonzero eigenvalues.



[2 points] True or False: Autoencoders, where the encoder and decoder functions are both neural networks with nonlinear activations, can capture more variance of the data in its encoded representation than PCA using the same number of dimensions.

What to Submit:
• Parts af: 12 sentence explanation containing your answer.

The first part of this problem (parts a, b) explores how you would apply machine learning theory and techniques to realworld problems. There are two scenarios detailing a setting, a dataset, and a specific result we hope to achieve. Your job is to describe how you would handle each of the below scenarios with the tools we’ve learned in this class. Your response should include


any preprocessing steps you would take (i.e., data acquisition and processing),



the specific machine learning pipeline you would use (i.e., algorithms and techniques learned in this class),



how your setup acknowledges the constraints and achieves the desired result.

You should also aim to leverage some of the theory we have covered in this class. Some things to consider may be: the nature of the data (i.e., How hard is it to learn? Do we need more data? Are the data sources good? ), the effectiveness of the pipeline (i.e., How strong is the model when properly trained and tuned? ), and the time needed to effectively perform the pipeline.
a. [5 points] Scenario 1: Disease Susceptibility Predictor


Setting: You are tasked by a research institute to create an algorithm that learns the factors that contribute most to acquiring a specific disease.



Dataset: A rich dataset of personal demographic information, location information, risk factors, and whether a person has the disease or not.



Result: The company wants a system that can determine how susceptible someone is to this disease when they enter in personal information. The pipeline should take limited amount of personal data from a new user and infer more detailed metrics about the person.


[5 points] Scenario 2: Social Media App Facial Recognition Technology


Setting: You are tasked with developing a machine learning pipeline that can quickly map someone’s face for the application of filters (i.e., Snapchat, Instagram).



Dataset: A set of face images compiled from the company’s employees and their families.



Result: The company wants an algorithm that can quickly identify the key features of a person’s face

to apply a filter. (Note: Do not worry about describing the actual filter application).
The second part of this problem (parts c, d) focuses on exploring possible shortcomings of these models, and what realworld implications might follow from ignoring these issues.
c. [5 points] Recall in Homework 2 we trained models to predict crime rates using various features. It is important to note that datasets describing crime have various shortcomings in describing the entire landscape of illegal behavior in a city, and that these shortcomings often fall disproportionately on minority communities. Some of these shortcomings include that crimes are reported at different rates in different neighborhoods, that police respond differently to the same crime reported or observed in different neighborhoods, and that police spend more time patrolling in some neighborhoods than others. What realworld implications might follow from ignoring these issues?

[5 points] Pick one of either Scenario 1 or Scenario 2 (in parts a and b). Briefly describe (1) some potential shortcomings of your training process that may result in your algorithm having different accuracy on different populations, and (2) how you may modify your procedure to address these shortcomings.
What to Submit:

For parts (a) and (b): One short paragraph (47) sentences for each of the described scenarios.

For part (c): One short paragraph on realworld implications that may follow from ignoring dataset issues.

For part (d): Clear and wellthoughtout answers addressing (1) and (2) (as described in the problem). Two short paragraphs or one medium paragraph suffice. You only need to pick one of the scenarios to expand on here.
3. Given a dataset x_{1}, …, x_{n} ∈ R^{d} and an integer 1 ≤ k ≤ n, recall the following kmeans objective function





k
1
min
^{∥}^{x}j ^{−} ^{µ}i^{∥}_{2}
,
^{µ}^{i}^{ =} π_{i} _{j}
_{π}_{i} ^{x}j ^{.}
(1)
π_{1}
^{,…,π}k _{i=1} _{j π}_{i}
∈
∈




Above, {π_{i}}^{k}_{i=1} is a partition of {1, 2, …, n}. The objective (1) is NPhard^{1} to find a global minimizer of. Nevertheless the commonlyused algorithm we discussed in lecture (this is called “Lloyd’s algorithm”), typically works well in practice.
a. [5 points] Implement Lloyd’s algorithm for solving the kmeans objective (1). Do not use any offtheshelf implementations, such as those found in scikitlearn. Include your code in your submission.
b. [5 points] Run the algorithm on the training dataset of MNIST with k = 10, plotting the objective function
(1) as a function of the iteration number. Visualize (and include in your report) the cluster centers as a
28 × 28 image.
c. [5 points] 
For k = {2, 4, 8, 16, 32, 64} run the algorithm on the training dataset to obtain centers {µ_{i}}_{i}^{k}_{=1}. 

If 
{ 
(x 
, y 
) 
n 
and 
{ 
(x^{′}, y^{′}) 
m 
denote the training and test sets, respectively, plot the training error 

1 
n ^{i} 
i 
^{}}i=1 
i i 
^{}}i=1 
1 
m 

_{i}_{=1} min_{j=1,…,k} ∥µ_{j} − x_{i}∥_{2}^{2} and test error 
_{i}_{=1} min_{j=1,…,k} ∥µ_{j} − x_{i}^{′}∥_{2}^{2} as a function of k on the same 

n 
m 
plot.
What to Submit:

For part (a): Llyod’s algorithm code

For part (b): Plot of objective function. 10 images of cluster centers.

For part (c): Plot of training and test error as function of k.

Code for parts ac

To be more precise, it is both NPhard in d when k = 2 and k when d = 2. See the references on the wikipedia page for kmeans for more details.
^{n}train
1
PCA

Let’s do PCA on MNIST dataset and reconstruct the digits in the dimensionalityreduced PCA basis. You will actually compute your PCA basis using the training dataset only, and evaluate the quality of the basis on
the test set, similar to the kmeans reconstructions of above. We have n_{train} = 50, 000 training examples of size
28 × 28. Begin by flattening each example to a vector to obtain X_{train} ∈ R^{50,000×d} and X_{test} ∈ R^{10,000×d} for
d := 784.
Let µ ∈ R^{d} denote the average of the training examples in X_{train}, i.e., µ = X_{train}^{⊤}1^{⊤}. Now let Σ =
(X_{train} − 1µ^{⊤})^{⊤}(X_{train} − 1µ^{⊤})/50000 denote the sample covariance matrix of the training examples, and let

= U DU^{T} denote the eigenvalue decomposition of Σ.
a. [2 points] If λ_{i} denotes the ith largest eigenvalue of Σ, what are the eigenvalues λ_{1}, λ_{2}, λ_{10}, λ_{30}, and λ_{50}?

What is the sum of eigenvalues
d
λ_{i}?
i=1

[5 points] Let x ∈ R^{d} and k ∈ 1, 2, . . . , d. Write a formula for the rankk PCA approximation of x.

[5 points] Using this approximation, plot the reconstruction error from k = 1 to 100 (the Xaxis is k and the Y axis is the meansquared error reconstruction error) on the training set and the test set (using the

µ and the basis learned from the training set). On a separate plot, plot 1
−
k
^{λ}i
from k = 1 to 100.
d
i=1
_{i}_{=1} ^{λ}i
d. [3 points] Now let us get a sense of what the top PCA directions are capturing. Display the first 10 eigenvectors as images, and provide a brief interpretation of what you think they capture.

[3 points] Finally, visualize a set of reconstructed digits from the training set for different values of k. In particular provide the reconstructions for digits 2, 6, 7 with values k = 5, 15, 40, 100 (just choose an image from each digit arbitrarily). Show the original image sidebyside with its reconstruction. Provide a brief interpretation, in terms of your perceptions of the quality of these reconstructions and the dimensionality you used.
What to Submit:

For part (a): Eigenvalues 1, 2, 10, 30, and 50 and the sum. At least 6 leading digits.

For part (b): The Formula. If you are defining new variables/matrices make sure their definition is stated clearly.
Unsupervised Learning with Autoencoders

In this exercise, we will train two simple autoencoders to perform dimensionality reduction on MNIST. As discussed in lecture, autoencoders are a longstudied neural network architecture comprised of an encoder component to summarize the latent features of input data and a decoder component to try and reconstruct the original data from the latent features.
Weight Initialization and PyTorch
Last assignment, we had you refrain from using torch.nn modules. For this assignment, we recommend using nn.Linear for your linear layers. You will not need to initialize the weights yourself; the default He/Kaiming uniform initialization in PyTorch will be sufficient for this problem. Hint: we also recommend using the nn.Sequential module to organize your network class and simplify the process of writing the forward pass. However, you may choose to organize your code however you’d like.
Training
Use optim.Adam for this question. Feel free to experiment with different learning rates, though you can use 5 · 10^{−5} as mentioned in the code. Use mean squared error (nn.MSELoss() or F.mse loss()) for the loss function.

[10 points] Use a network with a single linear layer. Let W_{e} ∈ R^{h×d} and W_{d} ∈ R^{d×h}. Given some x ∈ R^{d}, the forward pass is formulated as
F_{1}(x) = W_{d}W_{e}x.
Run experiments for h ∈ {32, 64, 128}. For each of the different h values, report your final training error and visualize a set of 10 reconstructed digits, sidebyside with the original image. Note: we omit the bias term in the formulation for notational convenience since nn.Linear learns bias parameters alongside weight parameters by default.

[10 points] Use a singlelayer network with nonlinearity. Let W_{e} ∈ R^{h×d}, W_{d} ∈ R^{d×h}, and activation σ : R→ R, where σ is the ReLU function. Given some x ∈ R^{d}, the forward pass is formulated as
F_{2}(x) = σ(W_{d}σ(W_{e}x)).
Report the same findings as asked for in part a (for h ∈ {32, 64, 128}).

[5 points] Now, evaluate F_{1}(x) and F_{2}(x) (use h = 128 here) on the test set. Provide the test reconstruction errors in a table.

[5 points] In a few sentences, compare the quality of the reconstructions from these two autoencoders with those of PCA from problem A5. You may need to rerun your code for PCA using the ranks k ∈ {32, 64, 128} to match the h values used above.
What to Submit:

For parts (a, b): Final training error and set of 10 reconstructed images of digits, sidebyside with the original image (10 images for each h).

For part (c): Errors of networks from part a and b on testing set.

For part (d): 23 sentences on differences in quality of solutions between PCA and Autoencoders, with example images

Code for parts ac
Extra Credit: Image Classification on CIFAR10

In this problem we will explore different deep learning architectures for image classification on the CIFAR10 dataset. Make sure that you are familiar with tensors, twodimensional convolutions (nn.Conv2d) and fullyconnected layers (nn.Linear), ReLU nonlinearities (F.relu), pooling (nn.MaxPool2d), and tensor reshaping (view).
A few preliminaries:

Each network f maps an image x^{in} ∈ R^{32×32×3} (3 channels for RGB) to an output f(x^{in}) = x^{out} ∈ R^{10}. The class label is predicted as arg max_{i=0,1,…,9} x^{out}_{i}. An error occurs if the predicted label differs from the true label for a given image.

The network is trained via multiclass crossentropy loss.

Create a validation dataset by appropriately partitioning the train dataset. Hint: look at the documentation for torch.utils.data.random split. Make sure to tune hyperparameters like network architecture and step size on the validation dataset. Do NOT validate your hyperparameters on the test dataset.

At the end of each epoch (one pass over the training data), compute and print the training and validation classification accuracy.

While one would usually train a network for hundreds of epochs to reach convergence and maximize accuracy, this can be prohibitively timeconsuming, so feel free to train for just a dozen or so epochs.
For parts (a)–(c), apply a hyperparameter tuning method (e.g. random search, grid search, etc.) using the validation set, report the hyperparameter configurations you evaluated and the best set of hyperparameters from this set, and plot the training and validation classification accuracy as a function of epochs. Produce a separate line or plot for each hyperparameter configuration evaluated (top 5 configurations is sufficient to keep the plots clean). Finally, evaluate your best set of hyperparameters on the test data and report the test accuracy.
Note: If you are attempting this problem and do not have access to GPU we highly recommend using Google Colab. You can copy this notebook, which will show how to enable/use GPU on that platform.
Here are the network architectures you will construct and compare.
a. [14 points] Fullyconnected output, 0 hidden layers (logistic regression): this network has no hidden layers and linearly maps the input layer to the output layer. This can be written as
x^{out} = W (x^{in}) + b
where x^{out} ∈ R^{10}, x^{in} ∈ R^{32×32×3}, W ∈ R^{10×3072}, b ∈ R^{10} since 3072 = 32 ·32 ·3. For a tensor x ∈ R^{a×b×c}, we let (x) ∈ R^{abc} be the reshaped form of the tensor into a vector (in an arbitrary but consistent pattern). There is no required benchmark validation accuracy for this part.
b. [18 points] Fullyconnected output, 1 fullyconnected hidden layer: this network has one hidden layer denoted as x^{hidden} ∈ R^{M} where M will be a hyperparameter you choose (M could be in the hundreds). The nonlinearity applied to the hidden layer will be the relu (relu(x) = max{0, x}. This network can be written as
x^{out} = W_{2}relu(W_{1}(x^{in}) + b_{1}) + b_{2}
where W_{1} ∈ R^{M×3072}, b_{1} ∈ R^{M} , W_{2} ∈ R ^{10×M} , b_{2} ∈ R^{10}. Tune the different hyperparameters and train for a sufficient number of epochs to achieve avalidation accuracy of at least 50%. Provide the hyperparameter configuration used to achieve this performance.
c. [18 points] Convolutional layer with maxpool and fullyconnected output: for a convolutional layer W_{1} with filters of sizek × k × 3, and M filters (reasonable choices areM = 100, k = 5), we have that Conv2d(x^{in}, W_{1}) ∈ R^{(33−k)×(33−k)×M} .

Each convolution will have its own offset applied to each of the output pixels of the convolution; we denote this as Conv2d(x^{in}, W ) + b_{1} where b_{1} is parameterized in R^{M} . Apply a relu activation to the result of the convolutional layer.

Next, use a maxpool of size N × N (a reasonable choice is N = 14 to pool to 2 × 2 with k = 5) we have that MaxPool(relu(Conv2d(x^{in}, W_{1}) + b_{1})) ∈ R^{⌊} ^{33}N^{−k} ^{ × } ^{33}N^{−k} ^{ ×M} .

We will then apply a fullyconnected layer to the output to get a final network given as
x^{output} = W_{2}(MaxPool(relu(Conv2d(x^{input}, W_{1}) + b_{1}))) + b_{2}
where W_{2} ∈ R^{10×M(}^{⌊} ^{33}N^{−k} ^{⌋}^{)}^{2} , b_{2} ∈ R^{10}.
The parameters M, k, N (in addition to the step size and momentum) are all hyperparameters, but you can choose a reasonable value. Tune the different hyperparameters (number of convolutional filters, filter sizes, dimensionality of the fullyconnected layers, stepsize, etc.) and train for a sufficient number of epochs to achieve a validation accuracy of at least 65%. Provide the hyperparameter configuration used to achieve this performance. Make sure to save this model so that you can do the next part.
The number of hyperparameters to tune, combined with the slow training times, will hopefully give you a taste of how difficult it is to construct networks with good generalization performance. Stateoftheart networks can have dozens of layers, each with their own hyperparameters to tune. Additional hyperparameters you are welcome to play with if you are so inclined, include: changing the activation function, replace maxpool with averagepool, adding more convolutional or fully connected layers, and experimenting with batch normalization or dropout.
What to Submit:

Parts ac: Code in PDF (in addition to code submission).

Part d: Loss and accuracy on both validation and training sets for each of 3 three different types of models. Also what parameters were used to achieve these values.

Part e: Few sentences on modification of architecture.

Code for parts ad
Administrative
7. [2 points] About how many hours did you spend on this homework? There is no right or wrong answer 🙂
8