Description
Please review all homework guidance posted on the website before submitting to Gradescope. Reminders:
Please provide succinct answers along with succinct reasoning for all your answers. Points may be deducted if long answers demonstrate a lack of clarity. Similarly, when discussing the experimental results, concisely create tables and/or figures when appropriate to organize the experimental results. In other words, all your explanations, tables, and figures for any particular part of a question must be grouped together.
When submitting to Gradescope, please link each question from the homework in Gradescope to the location of its answer in your homework PDF. Failure to do so may result in point deductions. For instructions, see https://www.gradescope.com/get_started#studentsubmission.
Please recall that B problems, indicated in boxed text, are only graded for 546 students, and that they will be weighted at most 0.2 of your final GPA (see website for details). In Gradescope there is a place to submit solutions to A and B problems separately. You are welcome to create just a single PDF that contains answers to both, submit the same PDF twice, but associate the answers with the individual questions in Gradescope.
If you collaborate on this homework with others, you must indicate who you worked with on your homework. Failure to do so may result in accusations of plagiarism.
Short Answer and "True or False" Conceptual Questions
A.0 The answers to these questions should be answerable without referring to external materials.
[2 points] In your own words, describe what bias and variance are. What is the bias-variance tradeoff? [2 points] What happens to bias and variance when the model complexity increases/decreases?
[1 point] True or False: The bias of a model increases as the amount of training data available increases.
[1 point] True or False: The variance of a model decreases as the amount of training data available increases.
[1 point] True or False: A learning algorithm will generalize better if we use fewer features to represent our data.
[2 points] To get better generalization, should we use the train set or the test set to tune our hyperparameters?
[1 point] True or False: The training error of a function on the training set provides an overestimate of the true error of that function.
Maximum Likelihood Estimation (MLE)
A.1. You're a Reign FC fan, and the team is five games into its 2018 season. The number of goals scored by the team in each game so far are given below:
[2, 0, 1, 1, 2].
Let's call these scores x_1, …, x_5. Based on your (assumed i.i.d.) data, you'd like to build a model to understand how many goals the Reign are likely to score in their next game. You decide to model the number of goals scored per game using a Poisson distribution. The Poisson distribution with parameter λ assigns every nonnegative integer x = 0, 1, 2, … a probability given by

Poi(x | λ) = e^{−λ} λ^x / x!

So, for example, if λ = 1.5, then the probability that the Reign score 2 goals in their next game is e^{−1.5} · 1.5^2 / 2! ≈ 0.25. To check your understanding of the Poisson, make sure you have a sense of whether raising λ will mean more goals in general, or fewer.
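As a quick sanity check of this worked example, the Poisson pmf is easy to evaluate directly (a minimal sketch; the function name is our own):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam), with x a nonnegative integer."""
    return exp(-lam) * lam ** x / factorial(x)

# The worked example from the text: lam = 1.5, x = 2 goals.
print(poisson_pmf(2, 1.5))  # roughly 0.251, i.e. about 0.25
```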
a. [5 points] Derive an expression for the maximum-likelihood estimate of the parameter λ governing the Poisson distribution, in terms of your goal counts x_1, …, x_5. (Hint: remember that the log of the likelihood has the same maximum as the likelihood function itself.)

b. [5 points] Suppose the team scores 4 goals in its sixth game. Derive the same expression for the estimate of the parameter λ as in the prior part, now using the 6 games x_1, …, x_5, x_6 = 4.

c. [5 points] Given the goal counts, please give numerical estimates of λ after 5 and 6 games.
A.2. [10 points] In World War 2, the Allies attempted to estimate the total number of tanks the Germans had manufactured by looking at the serial numbers of the German tanks they had destroyed. The idea was that if there were n total tanks with serial numbers {1, …, n}, then it's reasonable to expect the observed serial numbers of the destroyed tanks constituted a uniform random sample (without replacement) from this set. The exact maximum likelihood estimator for this so-called German tank problem is nontrivial and quite challenging to work out (try it!). For our homework, we will consider a much easier problem with a similar flavor.
Let x_1, …, x_n be independent, uniformly distributed on the continuous domain [0, θ] for some θ. What is the maximum likelihood estimate for θ?
Overfitting
A.3. Suppose we have N labeled samples S = {(x_i, y_i)}_{i=1}^N drawn i.i.d. from an underlying distribution D. Suppose we decide to break this set into a set S_train of size N_train and a set S_test of size N_test samples for our training and test set, so N = N_train + N_test, and S = S_train ∪ S_test. Recall the definition of the true least squares error of f:

ε(f) = E_{(x,y)∼D}[(f(x) − y)^2],

where the subscript (x, y) ∼ D makes clear that our input-output pairs are sampled according to D. Our training and test losses are defined as:

ε̂_train(f) = (1/N_train) Σ_{(x,y)∈S_train} (f(x) − y)^2

ε̂_test(f) = (1/N_test) Σ_{(x,y)∈S_test} (f(x) − y)^2
We then train our algorithm (for example, using linear least squares regression) using the training set to obtain f̂.
a. [3 points] (bias: the test error) For all fixed f (before we've seen any data) show that

E_train[ε̂_train(f)] = E_test[ε̂_test(f)] = ε(f).

Use a similar line of reasoning to show that the test error is an unbiased estimate of our true error for f̂. Specifically, show that:

E_test[ε̂_test(f̂)] = ε(f̂)

b. [4 points] (bias: the train/dev error) Is the above equation true (in general) with regards to the training loss? Specifically, does E_train[ε̂_train(f̂)] equal E_train[ε(f̂)]? If so, why? If not, give a clear argument as to where your previous argument breaks down.

c. [8 points] Let F = (f_1, f_2, …) be a collection of functions and let f̂_train minimize the training error such that ε̂_train(f̂_train) ≤ ε̂_train(f) for all f ∈ F. Show that

E_train[ε̂_train(f̂_train)] ≤ E_{train,test}[ε̂_test(f̂_train)].
(Hint: note that

E_{train,test}[ε̂_test(f̂_train)] = Σ_{f∈F} E_{train,test}[ε̂_test(f) 1{f̂_train = f}]
= Σ_{f∈F} E_test[ε̂_test(f)] E_train[1{f̂_train = f}] = Σ_{f∈F} E_test[ε̂_test(f)] P_train(f̂_train = f)

where the second equality follows from the independence between the train and test set.)
B.1. For i = 1, …, n let x_i = i/n and y_i = f(x_i) + ε_i where ε_i ∼ N(0, σ^2), for some unknown f we wish to approximate at values {x_i}_{i=1}^n. We will approximate f with a step function estimator. For some m ≤ n such that n/m is an integer, define the estimator

f̂_m(x) = Σ_{j=1}^{n/m} c_j 1{x ∈ ((j−1)m/n, jm/n]}   where   c_j = (1/m) Σ_{i=(j−1)m+1}^{jm} y_i.

Note that this estimator just partitions {1, …, n} into intervals {1, …, m}, {m+1, …, 2m}, …, {n−m+1, …, n} and predicts the average of the observations within each interval (see Figure 1).
Figure 1: Step function estimator with n = 256, m = 16, and σ^2 = 1.
By the bias-variance decomposition at some x_i we have

E[(f̂_m(x_i) − f(x_i))^2] = (E[f̂_m(x_i)] − f(x_i))^2 + E[(f̂_m(x_i) − E[f̂_m(x_i)])^2],

where the first term on the right is Bias^2(x_i) and the second is Variance(x_i).

a. [5 points] Intuitively, how do you expect the bias and variance to behave for small values of m? What about large values of m?

b. [5 points] If we define f̄^{(j)} = (1/m) Σ_{i=(j−1)m+1}^{jm} f(x_i) and the average bias-squared as (1/n) Σ_{i=1}^n (E[f̂_m(x_i)] − f(x_i))^2, show that

(1/n) Σ_{i=1}^n (E[f̂_m(x_i)] − f(x_i))^2 = (1/n) Σ_{j=1}^{n/m} Σ_{i=(j−1)m+1}^{jm} (f̄^{(j)} − f(x_i))^2
c. [5 points] If we define the average variance as E[(1/n) Σ_{i=1}^n (f̂_m(x_i) − E[f̂_m(x_i)])^2], show (both equalities):

E[(1/n) Σ_{i=1}^n (f̂_m(x_i) − E[f̂_m(x_i)])^2] = (1/n) Σ_{j=1}^{n/m} m E[(c_j − f̄^{(j)})^2] = σ^2/m
d. [15 points] Let n = 256, σ^2 = 1, and f(x) = 4 sin(πx) cos(6πx^2). For values of m = 1, 2, 4, 8, 16, 32, plot the average empirical error (1/n) Σ_{i=1}^n (f̂_m(x_i) − f(x_i))^2 using randomly drawn data, as a function of m on the x-axis. On the same plot, using parts b and c above, plot the average bias-squared, the average variance, and their sum (the average error). Thus, there should be 4 lines on your plot, each described in a legend.
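A possible structure for this experiment, assuming the setup above (the function and variable names are our own choices, and the closed form σ^2/m from part c is used for the average variance):

```python
import numpy as np

def f(x):
    return 4 * np.sin(np.pi * x) * np.cos(6 * np.pi * x ** 2)

def step_errors(n=256, sigma2=1.0, ms=(1, 2, 4, 8, 16, 32), seed=0):
    """Return {m: (empirical error, average bias^2, average variance)}."""
    rng = np.random.default_rng(seed)
    x = np.arange(1, n + 1) / n
    fx = f(x)
    y = fx + rng.normal(0.0, np.sqrt(sigma2), size=n)
    out = {}
    for m in ms:
        c = y.reshape(n // m, m).mean(axis=1)       # c_j: average y over each block
        fhat = np.repeat(c, m)                      # step estimator at each x_i
        fbar = np.repeat(fx.reshape(n // m, m).mean(axis=1), m)  # \bar f^{(j)}
        emp = np.mean((fhat - fx) ** 2)             # average empirical error
        bias2 = np.mean((fbar - fx) ** 2)           # part (b)
        var = sigma2 / m                            # part (c), closed form
        out[m] = (emp, bias2, var)
    return out

errors = step_errors()
# Plot emp, bias2, var, and bias2 + var against m; a log y-axis works well.
```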
e. [5 points] By the mean value theorem we have that min_{i=(j−1)m+1,…,jm} f(x_i) ≤ f̄^{(j)} ≤ max_{i=(j−1)m+1,…,jm} f(x_i). Suppose f is L-Lipschitz so that |f(x_i) − f(x_j)| ≤ (L/n)|i − j| for all i, j ∈ {1, …, n} for some L > 0. Show that the average bias-squared is O(L^2 m^2 / n^2). Using the expression for average variance above, the total error behaves like O(L^2 m^2 / n^2 + σ^2/m). Minimize this expression with respect to m. Does this value of m, and the total error when you plug this value of m back in, behave in an intuitive way with respect to n, L, σ^2? That is, how does m scale with each of these parameters? It turns out that this simple estimator (with the optimized choice of m) obtains the best achievable error rate up to a universal constant in this setup for this class of L-Lipschitz functions (see Tsybakov's Introduction to Nonparametric Estimation for details).
This setup of each x_i deterministically placed at i/n is a good approximation for the more natural setting where each x_i is drawn uniformly at random from [0, 1]. In fact, one can redo this problem and obtain nearly identical conclusions, but the calculations are messier.
Polynomial Regression
Relevant Files^1:
polyreg.py
test_polyreg_univariate.py
test_polyreg_learningCurve.py
linreg_closedform.py
data/polydata.dat
A.4. [10 points] Recall that polynomial regression learns a function h_θ(x) = θ_0 + θ_1 x + θ_2 x^2 + … + θ_d x^d. In this case, d represents the polynomial's degree. We can equivalently write this in the form of a linear model

h_θ(x) = θ_0 φ_0(x) + θ_1 φ_1(x) + θ_2 φ_2(x) + … + θ_d φ_d(x),   (1)

using the basis expansion φ_j(x) = x^j. Notice that, with this basis expansion, we obtain a linear model where the features are various powers of the single univariate x. We're still solving a linear regression problem, but are fitting a polynomial function of the input.
Implement regularized polynomial regression in polyreg.py. You may implement it however you like, using gradient descent or a closed-form solution. However, I would recommend the closed-form solution since the data sets are small; for this reason, we've included an example closed-form implementation of linear regression in linreg_closedform.py (you are welcome to build upon this implementation, but make CERTAIN you understand it, since you'll need to change several lines of it). You are also welcome to build upon your implementation from the previous assignment, but you must follow the API below. Note that all matrices are actually 2D numpy arrays in the implementation.
init(degree=1, regLambda=1E-8): constructor with arguments of d and λ
fit(X, Y): method to train the polynomial regression model
predict(X): method to use the trained polynomial regression model for prediction
polyfeatures(X, degree): expands the given n × 1 matrix X into an n × d matrix of polynomial features of degree d. Note that the returned matrix will not include the zeroth power.
Note that the polyfeatures(X, degree) function maps the original univariate data into its higher-order powers. Specifically, X will be an n × 1 matrix (X ∈ R^{n×1}) and this function will return the polynomial expansion of this data, an n × d matrix. Note that this function will not add in the zeroth-order feature (i.e., x_0 = 1). You should add the x_0 feature separately, outside of this function, before training the model. By not including the x_0 column in the matrix returned by polyfeatures(), the function is more general, so it could be applied to multivariate data as well. (If it did add the x_0 feature, we'd end up with multiple columns of 1's for multivariate data.)
Also, notice that the resulting features will be badly scaled if we use them in raw form. For example, with a polynomial of degree d = 8 and x = 20, the basis expansion yields x^1 = 20 while x^8 = 2.56 × 10^{10}, an absolutely huge difference in range. Consequently, we will need to standardize the data before solving linear regression. Standardize the data in fit() after you perform the polynomial feature expansion. You'll need to apply the same standardization transformation in predict() before you apply it to new data.
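One way the expansion and standardization might fit together (a sketch under the API above; the statistics stored at fit time must be reused verbatim at predict time):

```python
import numpy as np

def polyfeatures(X, degree):
    """Expand an n-by-1 array into [X, X^2, ..., X^degree] (no ones column)."""
    X = np.asarray(X, dtype=float).reshape(-1, 1)
    return np.hstack([X ** d for d in range(1, degree + 1)])

# Inside fit(): expand, then standardize, remembering the training statistics.
X = np.array([[1.0], [2.0], [20.0]])
Phi = polyfeatures(X, 8)
mean, std = Phi.mean(axis=0), Phi.std(axis=0)
Phi_std = (Phi - mean) / std        # predict() must reuse this same (mean, std)

print(Phi[2, 0], Phi[2, 7])         # 20.0 vs roughly 2.56e10: badly scaled
```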
Figure 2: Fit of polynomial regression with λ = 0 and d = 8
Run test_polyreg_univariate.py to test your implementation, which will plot the learned function. In this case, the script fits a polynomial of degree d = 8 with no regularization (λ = 0). From the plot, we see that the function fits the data well, but will not generalize well to new data points. Try increasing the amount of regularization, and examine the resulting effect on the function.
^1 Bold text indicates files or functions that you will need to complete; you should not need to modify any of the other files.
A.5. [10 points] In this problem we will examine the bias-variance tradeoff through learning curves. Learning curves provide a valuable mechanism for evaluating the bias-variance tradeoff. Implement the learningCurve() function in polyreg.py to compute the learning curves for a given training/test set. The learningCurve(Xtrain, ytrain, Xtest, ytest, degree, regLambda) function should take in the training data (Xtrain, ytrain), the testing data (Xtest, ytest), and values for the polynomial degree d and regularization parameter λ.
The function should return two arrays, errorTrain (the array of training errors) and errorTest (the array of testing errors). The i-th index (starting from 0) of each array should return the training error (or testing error) for learning with i + 1 training instances. Note that the 0-th index actually won't matter, since we typically start displaying the learning curves with two or more instances.
When computing the learning curves, you should learn on Xtrain[0:i] for i = 1, …, numInstances(Xtrain) + 1, each time computing the testing error over the entire test set. There is no need to shuffle the training data, or to average the error over multiple trials; just produce the learning curves for the given training/testing sets with the instances in their given order. Recall that the error for regression problems is given by

(1/n) Σ_{i=1}^n (h_θ(x_i) − y_i)^2.   (2)
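The loop might be organized as follows (a sketch: `model_cls` is a hypothetical stand-in for the polynomial regression class you implement in polyreg.py):

```python
import numpy as np

def learningCurve(Xtrain, ytrain, Xtest, ytest, degree, regLambda, model_cls):
    """errorTrain[i] / errorTest[i] hold the errors after training on the
    first i + 1 instances; index 0 is left at zero since it is never shown."""
    n = len(Xtrain)
    errorTrain, errorTest = np.zeros(n), np.zeros(n)
    for i in range(1, n):
        model = model_cls(degree=degree, regLambda=regLambda)
        model.fit(Xtrain[: i + 1], ytrain[: i + 1])
        # Training error on the instances seen so far, eq. (2).
        errorTrain[i] = np.mean((model.predict(Xtrain[: i + 1]) - ytrain[: i + 1]) ** 2)
        # Testing error over the entire test set.
        errorTest[i] = np.mean((model.predict(Xtest) - ytest) ** 2)
    return errorTrain, errorTest
```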
Once the function is written to compute the learning curves, run the test_polyreg_learningCurve.py script to plot the learning curves for various values of λ and d. You should see plots similar to the following:
Notice the following:
The y-axis is using a log scale and the ranges of the y-scale are all different for the plots. The dashed black line indicates the y = 1 line as a point of reference between the plots.
The plot of the unregularized model with d = 1 shows poor training error, indicating a high bias (i.e., it is a standard univariate linear regression fit).
The plot of the unregularized model (λ = 0) with d = 8 shows that the training error is low, but that the testing error is high. There is a huge gap between the training and testing errors caused by the model overfitting the training data, indicating a high variance problem.
As the regularization parameter increases (e.g., λ = 1) with d = 8, we see that the gap between the training and testing error narrows, with both the training and testing errors converging to a low value. We can see that the model fits the data well and generalizes well, and therefore does not have either a high bias or a high variance problem. Effectively, it has a good tradeoff between bias and variance.
Once the regularization parameter is too high (λ = 100), we see that the training and testing errors are once again high, indicating a poor fit. Effectively, there is too much regularization, resulting in high bias.
Make absolutely certain that you understand these observations, and how they relate to the learning curve plots.
In practice, we can choose the value for λ via cross-validation to achieve the best bias-variance tradeoff.
Ridge Regression on MNIST
A.6. In this problem we will implement a regularized least squares classifier for the MNIST data set. The task is to classify handwritten images of numbers between 0 to 9.
You are NOT allowed to use any of the pre-built classifiers in sklearn. Feel free to use any method from numpy or scipy. Remember: if you are inverting a matrix in your code, you are probably doing something wrong (Hint: look at scipy.linalg.solve).
Get the data from https://pypi.python.org/pypi/python-mnist.
Load the data as follows:
import numpy as np
from mnist import MNIST

def load_dataset():
    mndata = MNIST('./data/')
    X_train, labels_train = map(np.array, mndata.load_training())
    X_test, labels_test = map(np.array, mndata.load_testing())
    X_train = X_train/255.0
    X_test = X_test/255.0
    return X_train, labels_train, X_test, labels_test
Each example has features x_i ∈ R^d (with d = 28 × 28 = 784) and label z_j ∈ {0, …, 9}. You can visualize a single example x_i with imshow after reshaping it to its original 28 × 28 image shape (and noting that the label z_j is accurate). We wish to learn a predictor f̂ that takes as input a vector in R^d and outputs an index in {0, …, 9}. We define our training and testing classification error on a predictor f as

ε̂_train(f) = (1/N_train) Σ_{(x,z)∈Training Set} 1{f(x) ≠ z}

ε̂_test(f) = (1/N_test) Σ_{(x,z)∈Test Set} 1{f(x) ≠ z}
We will use one-hot encoding of the labels, i.e. of (x, z) the original label z ∈ {0, …, 9} is mapped to the standard basis vector e_z, where e_z is a vector of all zeros except for a 1 in the z-th position. We adopt the notation where we have n data points in our training objective with features x_i ∈ R^d and label one-hot encoded as y_i ∈ {0, 1}^k, where in this case k = 10 since there are 10 digits.
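One-hot encoding is a couple of lines with numpy (a minimal sketch; the function name is our own):

```python
import numpy as np

def one_hot(labels, k=10):
    """Map each integer label z in {0, ..., k-1} to the basis vector e_z."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, k))
    Y[np.arange(labels.size), labels] = 1.0
    return Y

Y = one_hot([3, 0, 9])   # rows are e_3, e_0, e_9
```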
a. [10 points] In this problem we will choose a linear classifier to minimize the regularized least squares objective:

Ŵ = argmin_{W ∈ R^{d×k}} Σ_{i=1}^n ‖W^T x_i − y_i‖_2^2 + λ‖W‖_F^2

Note that ‖W‖_F corresponds to the Frobenius norm of W, i.e. ‖W‖_F^2 = Σ_{i=1}^d Σ_{j=1}^k W_{i,j}^2.

To classify a point x we will use the rule argmax_{j=0,…,9} e_j^T Ŵ^T x. Note that if W = [w_1 … w_k] then

Σ_{i=1}^n ‖W^T x_i − y_i‖_2^2 + λ‖W‖_F^2 = Σ_{j=1}^k [ Σ_{i=1}^n (e_j^T W^T x_i − e_j^T y_i)^2 + λ‖W e_j‖^2 ]
= Σ_{j=1}^k [ Σ_{i=1}^n (w_j^T x_i − e_j^T y_i)^2 + λ‖w_j‖^2 ]
= Σ_{j=1}^k [ ‖X w_j − Y e_j‖^2 + λ‖w_j‖^2 ]

where X = [x_1 … x_n]^T ∈ R^{n×d} and Y = [y_1 … y_n]^T ∈ R^{n×k}. Show that

Ŵ = (X^T X + λI)^{−1} X^T Y

b. [10 points]
Code up a function train that takes as input X ∈ R^{n×d}, Y ∈ {0, 1}^{n×k}, λ > 0 and returns Ŵ.
Code up a function predict that takes as input W ∈ R^{d×k}, X' ∈ R^{m×d} and returns an m-length vector with the i-th entry equal to argmax_{j=0,…,9} e_j^T W^T x'_i, where x'_i is a column vector representing the i-th example from X'.
Train Ŵ on the MNIST training data with λ = 10^{−4} and make label predictions on the test data. What is the training and testing error? Note that they should both be about 15%.
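Both functions are short given the closed form above (a sketch; numpy.linalg.solve is used here, though the scipy.linalg.solve hinted at earlier works the same way):

```python
import numpy as np

def train(X, Y, reg_lambda):
    """Return W_hat = (X^T X + lambda I)^{-1} X^T Y via a linear solve,
    never forming the inverse explicitly."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg_lambda * np.eye(d), X.T @ Y)

def predict(W, Xprime):
    """m-length vector of argmax_j e_j^T W^T x'_i for each row x'_i."""
    return np.argmax(Xprime @ W, axis=1)
```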
B.2
a. [10 points] We just fit a classifier that was linear in the pixel intensities to the MNIST data. For classification of digits the raw pixel values are very, very bad features: it's pretty hard to separate digits with linear functions in pixel space. The standard solution to this is to come up with some transform h : R^d → R^p of the original pixel values such that the transformed points are (more easily) linearly separable. In this problem, you'll use the feature transform:

h(x) = cos(Gx + b),

where G ∈ R^{p×d}, b ∈ R^p, and the cosine function is applied elementwise. We'll choose G to be a random matrix, with each entry sampled i.i.d. from a Gaussian with mean μ = 0 and variance σ^2 = 0.1, and b to be a random vector sampled i.i.d. from the uniform distribution on [0, 2π]. The big question is: how do we choose p? Using cross-validation, of course!

Randomly partition your training set into proportions 80/20 to use as a new training set and validation set, respectively. Using the train function you wrote above, train a Ŵ^p for different values of p and plot the classification training error and validation error on a single plot with p on the x-axis. Be careful, your computer may run out of memory and slow to a crawl if p is too large (p ≤ 6000 should fit into 4 GB of memory, which is a minimum for most computers, but if you're having trouble you can set p in the several hundreds). You can use the same value of λ as above, but feel free to study the effect of using different values of λ and σ^2 for fun.
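The transform and the 80/20 split might look like this (a sketch; the names, the seed, and the stand-in training-set size are our own choices):

```python
import numpy as np

def random_cosine_features(X, p, sigma2=0.1, rng=None):
    """h(x) = cos(Gx + b) applied to each row of X, with G ~ N(0, sigma2)
    entrywise and b ~ Uniform[0, 2*pi)."""
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    G = rng.normal(0.0, np.sqrt(sigma2), size=(p, d))
    b = rng.uniform(0.0, 2 * np.pi, size=p)
    return np.cos(X @ G.T + b)

# 80/20 train/validation split of the original training set.
rng = np.random.default_rng(0)
n = 100                                  # stand-in for the MNIST training size
perm = rng.permutation(n)
train_idx, val_idx = perm[:int(0.8 * n)], perm[int(0.8 * n):]
```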

b. [5 points] Instead of reporting just the test error, which is an unbiased estimate of the true error, we would like to report a confidence interval around the test error that contains the true error.
Lemma 1. (Hoeffding's inequality) Fix δ ∈ (0, 1). If for all i = 1, …, m we have that X_i are i.i.d. random variables with X_i ∈ [a, b] and E[X_i] = μ, then

P( | (1/m) Σ_{i=1}^m X_i − μ | ≥ sqrt( (b − a)^2 log(2/δ) / (2m) ) ) ≤ δ
We will use the above inequality to construct a confidence interval around the true classification error ε(f̂) = E_test[ε̂_test(f̂)], since the test error ε̂_test(f̂) is just the average of indicator variables taking values in {0, 1} corresponding to the i-th test example being classified correctly or not, respectively, where an error happens with probability μ = ε(f̂) = E_test[ε̂_test(f̂)], the true classification error.

Let p̂ be the value of p that approximately minimizes the validation error on the plot you just made, and use f̂(x) = argmax_j x^T Ŵ^{p̂} e_j to compute the classification test error ε̂_test(f̂). Use Hoeffding's inequality above to compute a confidence interval that contains E_test[ε̂_test(f̂)] (i.e., the true error) with probability at least 0.95 (i.e., δ = 0.05). Report ε̂_test(f̂) and the confidence interval.
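Concretely, with the indicator losses in [0, 1] the interval half-width follows directly from Lemma 1 (a sketch; the test-error value below is a placeholder, not a result):

```python
import numpy as np

def hoeffding_interval(err, m, delta=0.05, a=0.0, b=1.0):
    """Interval around the observed average err that contains the true mean
    with probability at least 1 - delta, for i.i.d. losses in [a, b]."""
    half = np.sqrt((b - a) ** 2 * np.log(2.0 / delta) / (2.0 * m))
    return err - half, err + half

# e.g. a placeholder test error of 0.15 on m = 10000 test examples:
lo, hi = hoeffding_interval(0.15, m=10000)
```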