Description
All questions have multiplechoice answers ([a], [b], [c], …). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people’s solutions. Your solutions should be based on your own work. De nitions and notation follow the lectures.
Note about the homework
The goal of the homework is to facilitate a deeper understanding of the course material. The questions are not designed to be puzzles with catchy answers. They are meant to make you roll up your sleeves, face uncertainties, and approach the problem from di erent angles.
The problems range from easy to di cult, and from practical to theoretical. Some problems require running a full experiment to arrive at the answer.
The answer may not be obvious or numerically close to one of the choices, but one (and only one) choice will be correct if you follow the instructions precisely in each problem. You are encouraged to explore the problem further by experimenting with variations on these instructions, for the learning bene t.
You are also encouraged to take part in the forum http://book.caltech.edu/bookforum
where there are many threads about each homework set. We hope that you will contribute to the discussion as well. Please follow the forum guidelines for posting answers (see the \BEFORE posting answers” announcement at the top there).
c 20122015 Yaser AbuMostafa. All rights reserved. No redistribution in any format. No translation or derivative products without written permission.
1
Linear Regression Error
Consider a noisy target y = w ^{T} x + , where x 2 R^{d} (with the added coordinate x_{0} = 1), y 2 R, w is an unknown vector, and is a noise term with zero mean and
2
variance. Assume is independent of x and of all other ’s. If linear regression is carried out using a training data set D = f(x_{1}; y_{1}); : : : ; (x_{N} ; y_{N} )g, and outputs the parameter vector w_{lin}, it can be shown that the expected insample error E_{in} with respect to D is given by:

E_{D}[E_{in}(w_{lin})] = ^{2}
1
d + 1
N


For = 0:1 and d = 8, which among the following choices is the smallest number of examples N that will result in an expected E_{in} greater than 0.008?




10





25





100





500





1000


Nonlinear Transforms
In linear classi cation, consider the feature transform : R^{2} ! R^{2} (plus the added zeroth coordinate) given by:
(1; x_{1}; x_{2}) = (1; x^{2}_{1}; x^{2}_{2})

Which of the following sets of constraints on the weights in the Z space could correspond to the hyperbolic decision boundary in X depicted in the gure?
x _{2}
−1 +1 −1
^{x} 1
You may assume that w~_{0} can be selected to achieve the desired boundary.
2

w~_{1} = 0; w~_{2} > 0

w~_{1} > 0; w~_{2} = 0

w~_{1} > 0; w~_{2} > 0

w~_{1} < 0; w~_{2} > 0

w~_{1} > 0; w~_{2} < 0
Now, consider the 4th order polynomial transform from the input space R^{2}:
_{4} : x ! (1; x_{1}; x_{2}; x^{2}_{1}; x_{1}x_{2}; x^{2}_{2}; x^{3}_{1}; x^{2}_{1}x_{2}; x_{1}x^{2}_{2}; x^{3}_{2}; x^{4}_{1}; x^{3}_{1}x_{2}; x^{2}_{1}x^{2}_{2}; x_{1}x^{3}_{2}; x^{4}_{2})


What is the smallest value among the following choices that is not smaller than the VC dimension of a linear model in this transformed space?




3





5





15





20





21


Gradient Descent
Consider the nonlinear error surface E(u; v) = (ue^{v} 2ve ^{u})^{2}. We start at the point (u; v) = (1; 1) and minimize this error using gradient descent in the uv space. Use
= 0:1 (learning rate, not step size).


What is the partial derivative of E(u; v) with respect to u, i.e., ^{@E}_{@u} ?




(ue^{v} 2ve ^{u})^{2}





2(ue^{v} 2ve ^{u})





2(e^{v} + 2ve ^{u})





2(e^{v} 2ve ^{u})(ue^{v} 2ve ^{u})


[e] 2(e^{v} + 2ve ^{u})(ue^{v} 2ve ^{u})

How many iterations (among the given choices) does it take for the error E(u; v) to fall below 10 ^{14} for the rst time? In your programs, make sure to use double precision to get the needed accuracy.


1

3


3



5



10



17


After running enough iterations such that the error has just dropped below 10 ^{14}, what are the closest values (in Euclidean distance) among the following choices to the nal (u; v) you got in Problem 5?


(1:000; 1:000)



(0:713; 0:045)



(0:016; 0:112)



( 0:083; 0:029)



(0:045; 0:024)


Now, we will compare the performance of \coordinate descent.” In each iteration, we have two steps along the 2 coordinates. Step 1 is to move only along the u coordinate to reduce the error (assume rstorder approximation holds like in gradient descent), and step 2 is to reevaluate and move only along the v coordinate to reduce the error (again, assume rstorder approximation holds). Use the same learning rate of = 0:1 as we did in gradient descent. What will the error E(u; v) be closest to after 15 full iterations (30 steps)?

10

10

10

10

10
1
7
14
17
20
Logistic Regression
In this problem you will create your own target function f (probability in this case) and data set D to see how Logistic Regression works. For simplicity, we will take f to be a 0=1 probability so y is a deterministic function of x.
Take d = 2 so you can visualize the problem, and let X = [ 1; 1] [ 1; 1] with uniform probability of picking each x 2 X . Choose a line in the plane as the boundary between f(x) = 1 (where y has to be +1) and f(x) = 0 (where y has to be 1) by taking two random, uniformly distributed points from X and taking the line passing through
4
them as the boundary between y = 1. Pick N = 100 training points at random from X , and evaluate the outputs y_{n} for each of these points x_{n}.
Run Logistic Regression with Stochastic Gradient Descent to nd g, and estimate E_{out} (the cross entropy error) by generating a su ciently large, separate set of points to evaluate the error. Repeat the experiment for 100 runs with di erent targets and take the average. Initialize the weight vector of Logistic Regression to all zeros in each run. Stop the algorithm when kw^{(t} ^{1)} w^{(t)}k < 0:01, where w^{(t)} denotes the weight vector at the end of epoch t. An epoch is a full pass through the N data points (use a random permutation of 1; 2; ; N to present the data points to the algorithm within each epoch, and use di erent permutations for di erent epochs). Use a learning rate of 0.01.


Which of the following is closest to E_{out} for N = 100?




0.025





0.050





0.075





0.100





0.125




How many epochs does it take on average for Logistic Regression to converge for N = 100 using the above initialization and termination rules and the speci ed learning rate? Pick the value that is closest to your results.




350





550





750





950





1750


PLA as SGD


The Perceptron Learning Algorithm can be implemented as SGD using which of the following error functions e_{n}(w)? Ignore the points w at which e_{n}(w) is not twice di erentiable.




e_{n}(w) = e ^{y}n^{w}^{}^{x}n





e_{n}(w) = y_{n}w^{}x_{n}





e_{n}(w) = (y_{n} w^{}x_{n})^{2}

e_{n}(w) = ln(1 + e ^{y}n^{w}^{}^{x}n )

e_{n}(w) = min(0; y_{n}w^{}x_{n})


5