Description
Conceptual Questions
-
The answers to these questions should be answerable without referring to external materials. Briefly justify your answers with a few words.
-
-
[2 points] True or False: Given a data matrix X ∈ Rn×d where d is much smaller than n and k = rank(X), if we project our data onto a k dimensional subspace using PCA, our projection will have zero reconstruction error (in other words, we find a perfect representation of our data, with no information loss).
-
-
-
[2 points] True or False: Suppose that an n × n matrix X has a singular value decomposition of U SV ⊤, where S is a diagonal n × n matrix. Then, the rows of V are equal to the eigenvectors of X⊤X.
-
-
-
[2 points] True or False: choosing k to minimize the k-means objective (see Equation (1) below) is a good way to find meaningful clusters.
-
-
-
[2 points] True or False: The singular value decomposition of a matrix is unique.
-
-
-
[2 points] True or False: The rank of a square matrix equals the number of its nonzero eigenvalues.
-
-
-
[2 points] True or False: Autoencoders, where the encoder and decoder functions are both neural networks with nonlinear activations, can capture more variance of the data in its encoded representation than PCA using the same number of dimensions.
-
What to Submit:
• Parts a-f: 1-2 sentence explanation containing your answer.
-
The first part of this problem (parts a, b) explores how you would apply machine learning theory and techniques to real-world problems. There are two scenarios detailing a setting, a dataset, and a specific result we hope to achieve. Your job is to describe how you would handle each of the below scenarios with the tools we’ve learned in this class. Your response should include
-
-
any pre-processing steps you would take (i.e., data acquisition and processing),
-
-
-
the specific machine learning pipeline you would use (i.e., algorithms and techniques learned in this class),
-
-
-
how your setup acknowledges the constraints and achieves the desired result.
-
You should also aim to leverage some of the theory we have covered in this class. Some things to consider may be: the nature of the data (i.e., How hard is it to learn? Do we need more data? Are the data sources good? ), the effectiveness of the pipeline (i.e., How strong is the model when properly trained and tuned? ), and the time needed to effectively perform the pipeline.
a. [5 points] Scenario 1: Disease Susceptibility Predictor
-
-
Setting: You are tasked by a research institute to create an algorithm that learns the factors that contribute most to acquiring a specific disease.
-
-
-
Dataset: A rich dataset of personal demographic information, location information, risk factors, and whether a person has the disease or not.
-
-
-
Result: The company wants a system that can determine how susceptible someone is to this disease when they enter in personal information. The pipeline should take limited amount of personal data from a new user and infer more detailed metrics about the person.
-
-
[5 points] Scenario 2: Social Media App Facial Recognition Technology
-
-
Setting: You are tasked with developing a machine learning pipeline that can quickly map someone’s face for the application of filters (i.e., Snapchat, Instagram).
-
-
-
Dataset: A set of face images compiled from the company’s employees and their families.
-
-
-
Result: The company wants an algorithm that can quickly identify the key features of a person’s face
-
to apply a filter. (Note: Do not worry about describing the actual filter application).
The second part of this problem (parts c, d) focuses on exploring possible shortcomings of these models, and what real-world implications might follow from ignoring these issues.
c. [5 points] Recall in Homework 2 we trained models to predict crime rates using various features. It is important to note that datasets describing crime have various shortcomings in describing the entire landscape of illegal behavior in a city, and that these shortcomings often fall disproportionately on minority communities. Some of these shortcomings include that crimes are reported at different rates in different neighborhoods, that police respond differently to the same crime reported or observed in different neighborhoods, and that police spend more time patrolling in some neighborhoods than others. What real-world implications might follow from ignoring these issues?
-
[5 points] Pick one of either Scenario 1 or Scenario 2 (in parts a and b). Briefly describe (1) some potential shortcomings of your training process that may result in your algorithm having different accuracy on different populations, and (2) how you may modify your procedure to address these shortcomings.
What to Submit:
-
For parts (a) and (b): One short paragraph (4-7) sentences for each of the described scenarios.
-
For part (c): One short paragraph on real-world implications that may follow from ignoring dataset issues.
-
For part (d): Clear and well-thought-out answers addressing (1) and (2) (as described in the problem). Two short paragraphs or one medium paragraph suffice. You only need to pick one of the scenarios to expand on here.
3. Given a dataset x1, …, xn ∈ Rd and an integer 1 ≤ k ≤ n, recall the following k-means objective function
-
-
-
-
-
k
1
min
∥xj − µi∥2
,
µi = |πi| j
πi xj .
(1)
π1
,…,πk i=1 j πi
∈
∈
-
-
-
-
Above, {πi}ki=1 is a partition of {1, 2, …, n}. The objective (1) is NP-hard1 to find a global minimizer of. Nevertheless the commonly-used algorithm we discussed in lecture (this is called “Lloyd’s algorithm”), typically works well in practice.
a. [5 points] Implement Lloyd’s algorithm for solving the k-means objective (1). Do not use any off-the-shelf implementations, such as those found in scikit-learn. Include your code in your submission.
b. [5 points] Run the algorithm on the training dataset of MNIST with k = 10, plotting the objective function
(1) as a function of the iteration number. Visualize (and include in your report) the cluster centers as a
28 × 28 image.
c. [5 points] |
For k = {2, 4, 8, 16, 32, 64} run the algorithm on the training dataset to obtain centers {µi}ik=1. |
|||||||||||||
If |
{ |
(x |
, y |
) |
n |
and |
{ |
(x′, y′) |
m |
denote the training and test sets, respectively, plot the training error |
||||
1 |
n i |
i |
}i=1 |
i i |
}i=1 |
1 |
m |
|||||||
i=1 minj=1,…,k ∥µj − xi∥22 and test error |
i=1 minj=1,…,k ∥µj − xi′∥22 as a function of k on the same |
|||||||||||||
n |
m |
plot.
What to Submit:
-
For part (a): Llyod’s algorithm code
-
For part (b): Plot of objective function. 10 images of cluster centers.
-
For part (c): Plot of training and test error as function of k.
-
Code for parts a-c
-
To be more precise, it is both NP-hard in d when k = 2 and k when d = 2. See the references on the wikipedia page for k-means for more details.
ntrain
1
PCA
-
Let’s do PCA on MNIST dataset and reconstruct the digits in the dimensionality-reduced PCA basis. You will actually compute your PCA basis using the training dataset only, and evaluate the quality of the basis on
the test set, similar to the k-means reconstructions of above. We have ntrain = 50, 000 training examples of size
28 × 28. Begin by flattening each example to a vector to obtain Xtrain ∈ R50,000×d and Xtest ∈ R10,000×d for
d := 784.
Let µ ∈ Rd denote the average of the training examples in Xtrain, i.e., µ = Xtrain⊤1⊤. Now let Σ =
(Xtrain − 1µ⊤)⊤(Xtrain − 1µ⊤)/50000 denote the sample covariance matrix of the training examples, and let
-
= U DUT denote the eigenvalue decomposition of Σ.
a. [2 points] If λi denotes the ith largest eigenvalue of Σ, what are the eigenvalues λ1, λ2, λ10, λ30, and λ50?
-
What is the sum of eigenvalues
d
λi?
i=1
-
[5 points] Let x ∈ Rd and k ∈ 1, 2, . . . , d. Write a formula for the rank-k PCA approximation of x.
-
[5 points] Using this approximation, plot the reconstruction error from k = 1 to 100 (the X-axis is k and the Y -axis is the mean-squared error reconstruction error) on the training set and the test set (using the
-
µ and the basis learned from the training set). On a separate plot, plot 1
−
k
λi
from k = 1 to 100.
d
i=1
i=1 λi
d. [3 points] Now let us get a sense of what the top PCA directions are capturing. Display the first 10 eigenvectors as images, and provide a brief interpretation of what you think they capture.
-
[3 points] Finally, visualize a set of reconstructed digits from the training set for different values of k. In particular provide the reconstructions for digits 2, 6, 7 with values k = 5, 15, 40, 100 (just choose an image from each digit arbitrarily). Show the original image side-by-side with its reconstruction. Provide a brief interpretation, in terms of your perceptions of the quality of these reconstructions and the dimensionality you used.
What to Submit:
-
For part (a): Eigenvalues 1, 2, 10, 30, and 50 and the sum. At least 6 leading digits.
-
For part (b): The Formula. If you are defining new variables/matrices make sure their definition is stated clearly.
Unsupervised Learning with Autoencoders
-
In this exercise, we will train two simple autoencoders to perform dimensionality reduction on MNIST. As discussed in lecture, autoencoders are a long-studied neural network architecture comprised of an encoder component to summarize the latent features of input data and a decoder component to try and reconstruct the original data from the latent features.
Weight Initialization and PyTorch
Last assignment, we had you refrain from using torch.nn modules. For this assignment, we recommend using nn.Linear for your linear layers. You will not need to initialize the weights yourself; the default He/Kaim-ing uniform initialization in PyTorch will be sufficient for this problem. Hint: we also recommend using the nn.Sequential module to organize your network class and simplify the process of writing the forward pass. However, you may choose to organize your code however you’d like.
Training
Use optim.Adam for this question. Feel free to experiment with different learning rates, though you can use 5 · 10−5 as mentioned in the code. Use mean squared error (nn.MSELoss() or F.mse loss()) for the loss function.
-
[10 points] Use a network with a single linear layer. Let We ∈ Rh×d and Wd ∈ Rd×h. Given some x ∈ Rd, the forward pass is formulated as
F1(x) = WdWex.
Run experiments for h ∈ {32, 64, 128}. For each of the different h values, report your final training error and visualize a set of 10 reconstructed digits, side-by-side with the original image. Note: we omit the bias term in the formulation for notational convenience since nn.Linear learns bias parameters alongside weight parameters by default.
-
[10 points] Use a single-layer network with non-linearity. Let We ∈ Rh×d, Wd ∈ Rd×h, and activation σ : R→ R, where σ is the ReLU function. Given some x ∈ Rd, the forward pass is formulated as
F2(x) = σ(Wdσ(Wex)).
Report the same findings as asked for in part a (for h ∈ {32, 64, 128}).
-
[5 points] Now, evaluate F1(x) and F2(x) (use h = 128 here) on the test set. Provide the test reconstruction errors in a table.
-
[5 points] In a few sentences, compare the quality of the reconstructions from these two autoencoders with those of PCA from problem A5. You may need to re-run your code for PCA using the ranks k ∈ {32, 64, 128} to match the h values used above.
What to Submit:
-
For parts (a, b): Final training error and set of 10 reconstructed images of digits, side-by-side with the original image (10 images for each h).
-
For part (c): Errors of networks from part a and b on testing set.
-
For part (d): 2-3 sentences on differences in quality of solutions between PCA and Autoencoders, with example images
-
Code for parts a-c
Extra Credit: Image Classification on CIFAR-10
-
In this problem we will explore different deep learning architectures for image classification on the CIFAR-10 dataset. Make sure that you are familiar with tensors, two-dimensional convolutions (nn.Conv2d) and fully-connected layers (nn.Linear), ReLU non-linearities (F.relu), pooling (nn.MaxPool2d), and tensor reshaping (view).
A few preliminaries:
-
Each network f maps an image xin ∈ R32×32×3 (3 channels for RGB) to an output f(xin) = xout ∈ R10. The class label is predicted as arg maxi=0,1,…,9 xouti. An error occurs if the predicted label differs from the true label for a given image.
-
The network is trained via multiclass cross-entropy loss.
-
Create a validation dataset by appropriately partitioning the train dataset. Hint: look at the documenta-tion for torch.utils.data.random split. Make sure to tune hyperparameters like network architecture and step size on the validation dataset. Do NOT validate your hyperparameters on the test dataset.
-
At the end of each epoch (one pass over the training data), compute and print the training and validation classification accuracy.
-
While one would usually train a network for hundreds of epochs to reach convergence and maximize accuracy, this can be prohibitively time-consuming, so feel free to train for just a dozen or so epochs.
For parts (a)–(c), apply a hyperparameter tuning method (e.g. random search, grid search, etc.) using the validation set, report the hyperparameter configurations you evaluated and the best set of hyperparameters from this set, and plot the training and validation classification accuracy as a function of epochs. Produce a separate line or plot for each hyperparameter configuration evaluated (top 5 configurations is sufficient to keep the plots clean). Finally, evaluate your best set of hyperparameters on the test data and report the test accuracy.
Note: If you are attempting this problem and do not have access to GPU we highly recommend using Google Colab. You can copy this notebook, which will show how to enable/use GPU on that platform.
Here are the network architectures you will construct and compare.
a. [14 points] Fully-connected output, 0 hidden layers (logistic regression): this network has no hidden layers and linearly maps the input layer to the output layer. This can be written as
xout = W (xin) + b
where xout ∈ R10, xin ∈ R32×32×3, W ∈ R10×3072, b ∈ R10 since 3072 = 32 ·32 ·3. For a tensor x ∈ Ra×b×c, we let (x) ∈ Rabc be the reshaped form of the tensor into a vector (in an arbitrary but consistent pattern). There is no required benchmark validation accuracy for this part.
b. [18 points] Fully-connected output, 1 fully-connected hidden layer: this network has one hidden layer denoted as xhidden ∈ RM where M will be a hyperparameter you choose (M could be in the hundreds). The nonlinearity applied to the hidden layer will be the relu (relu(x) = max{0, x}. This network can be written as
xout = W2relu(W1(xin) + b1) + b2
where W1 ∈ RM×3072, b1 ∈ RM , W2 ∈ R 10×M , b2 ∈ R10. Tune the different hyperparameters and train for a sufficient number of epochs to achieve avalidation accuracy of at least 50%. Provide the hyperparameter configuration used to achieve this performance.
c. [18 points] Convolutional layer with max-pool and fully-connected output: for a convolutional layer W1 with filters of sizek × k × 3, and M filters (reasonable choices areM = 100, k = 5), we have that Conv2d(xin, W1) ∈ R(33−k)×(33−k)×M .
-
Each convolution will have its own offset applied to each of the output pixels of the convolution; we denote this as Conv2d(xin, W ) + b1 where b1 is parameterized in RM . Apply a relu activation to the result of the convolutional layer.
-
Next, use a max-pool of size N × N (a reasonable choice is N = 14 to pool to 2 × 2 with k = 5) we have that MaxPool(relu(Conv2d(xin, W1) + b1)) ∈ R⌊ 33N−k × 33N−k ×M .
-
We will then apply a fully-connected layer to the output to get a final network given as
xoutput = W2(MaxPool(relu(Conv2d(xinput, W1) + b1))) + b2
where W2 ∈ R10×M(⌊ 33N−k ⌋)2 , b2 ∈ R10.
The parameters M, k, N (in addition to the step size and momentum) are all hyperparameters, but you can choose a reasonable value. Tune the different hyperparameters (number of convolutional filters, filter sizes, dimensionality of the fully-connected layers, stepsize, etc.) and train for a sufficient number of epochs to achieve a validation accuracy of at least 65%. Provide the hyperparameter configuration used to achieve this performance. Make sure to save this model so that you can do the next part.
The number of hyperparameters to tune, combined with the slow training times, will hopefully give you a taste of how difficult it is to construct networks with good generalization performance. State-of-the-art networks can have dozens of layers, each with their own hyperparameters to tune. Additional hyperparameters you are welcome to play with if you are so inclined, include: changing the activation function, replace max-pool with average-pool, adding more convolutional or fully connected layers, and experimenting with batch normalization or dropout.
What to Submit:
-
Parts a-c: Code in PDF (in addition to code submission).
-
Part d: Loss and accuracy on both validation and training sets for each of 3 three different types of models. Also what parameters were used to achieve these values.
-
Part e: Few sentences on modification of architecture.
-
Code for parts a-d
Administrative
7. [2 points] About how many hours did you spend on this homework? There is no right or wrong answer 🙂
8