Description
All questions have multiple-choice answers ([a], [b], [c], ...). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people's solutions. Your solutions should be based on your own work. Definitions and notation follow the lectures.
Note about the final
There are twice as many problems in this final as there are in a homework set, and some problems require packages that will need time to get to work properly.
Problems cover different parts of the course. To facilitate your search for relevant lecture parts, an indexed version of the lecture video segments can be found at the Machine Learning Video Library:
http://work.caltech.edu/library
To discuss the final, you are encouraged to take part in the forum
http://book.caltech.edu/bookforum
where there is a dedicated subforum for this final.
Please follow the forum guidelines for posting answers (see the "BEFORE posting answers" announcement at the top there).
© 2012-2015 Yaser Abu-Mostafa. All rights reserved. No redistribution in any format. No translation or derivative products without written permission.
Nonlinear transforms

1. The polynomial transform of order Q = 10 applied to X of dimension d = 2 results in a Z space of what dimensionality (not counting the constant coordinate x_{0} = 1 or z_{0} = 1)?

[a] 12
[b] 20
[c] 35
[d] 100
[e] None of the above


Bias and Variance

2. Recall that the average hypothesis ḡ was based on training the same model H on different data sets D to get g^{(D)} ∈ H, and taking the expected value of g^{(D)} w.r.t. D to get ḡ. Which of the following models H could result in ḡ ∉ H?

[a] A singleton H (H has one hypothesis)
[b] H is the set of all constant, real-valued hypotheses
[c] H is the linear regression model
[d] H is the logistic regression model
[e] None of the above


Overfitting

3. Which of the following statements is false?

[a] If there is overfitting, there must be two or more hypotheses that have different values of E_{in}.
[b] If there is overfitting, there must be two or more hypotheses that have different values of E_{out}.
[c] If there is overfitting, there must be two or more hypotheses that have different values of (E_{out} − E_{in}).
[d] We can always determine if there is overfitting by comparing the values of (E_{out} − E_{in}).
[e] We cannot determine overfitting based on one hypothesis only.
4. Which of the following statements is true?

[a] Deterministic noise cannot occur with stochastic noise.
[b] Deterministic noise does not depend on the hypothesis set.
[c] Deterministic noise does not depend on the target function.
[d] Stochastic noise does not depend on the hypothesis set.
[e] Stochastic noise does not depend on the target distribution.


Regularization

5. The regularized weight w_{reg} is a solution to:

minimize (1/N) Σ_{n=1}^{N} (w^{T}x_{n} − y_{n})^{2}   subject to   w^{T}Γ^{T}Γw ≤ C,

where Γ is a matrix. If w_{lin}^{T}Γ^{T}Γw_{lin} ≤ C, where w_{lin} is the linear regression solution, then what is w_{reg}?



[a] w_{reg} = w_{lin}
[b] w_{reg} = Γw_{lin}
[c] w_{reg} = Γ^{T}Γw_{lin}
[d] w_{reg} = √C w_{lin}
[e] w_{reg} = Cw_{lin}




6. Soft-order constraints that regularize polynomial models can be

[a] written as hard-order constraints
[b] translated into augmented error
[c] determined from the value of the VC dimension
[d] used to decrease both E_{in} and E_{out}
[e] None of the above is true


Regularized Linear Regression

We are going to experiment with linear regression for classification on the processed US Postal Service Zip Code data set from Homework 8. Download the data (extracted features of intensity and symmetry) for training and testing:
http://www.amlbook.com/data/zip/features.train
http://www.amlbook.com/data/zip/features.test
(the format of each row is: digit intensity symmetry). We will train two types of binary classifiers; one-versus-one (one digit is class +1 and another digit is class −1, with the rest of the digits disregarded), and one-versus-all (one digit is class +1 and the rest of the digits are class −1). When evaluating E_{in} and E_{out}, use binary classification error. Implement the regularized least-squares linear regression for classification that minimizes

(1/N) Σ_{n=1}^{N} (w^{T}z_{n} − y_{n})^{2} + (λ/N) w^{T}w

where w includes w_{0}.
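Since the note at the top warns that implementations take time to get working, here is a minimal pure-Python sketch of this minimization, assuming the normal-equations route (Z^{T}Z + λI)w = Z^{T}y, which the augmented error above reduces to. The helper names (`solve`, `ridge_fit`, `one_vs_all`) and the tiny Gaussian-elimination solver are illustrative, not part of the problem statement.

```python
# Regularized least-squares linear regression for classification,
# via the normal equations (Z^T Z + lambda*I) w = Z^T y.

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = s / M[r][r]
    return w

def ridge_fit(Z, y, lam):
    """w = (Z^T Z + lam*I)^{-1} Z^T y  (w includes the bias w_0)."""
    d = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    Zty = [sum(z[i] * yn for z, yn in zip(Z, y)) for i in range(d)]
    return solve(ZtZ, Zty)

def binary_error(w, Z, y):
    """Fraction of points where sign(w^T z) disagrees with y."""
    wrong = sum(1 for z, yn in zip(Z, y)
                if (sum(wi * zi for wi, zi in zip(w, z)) >= 0) != (yn > 0))
    return wrong / len(y)

def one_vs_all(data, digit, lam):
    """data: list of (digit, intensity, symmetry) rows, as in features.train."""
    Z = [[1.0, x1, x2] for _, x1, x2 in data]
    y = [1.0 if d == digit else -1.0 for d, _, _ in data]
    w = ridge_fit(Z, y, lam)
    return w, binary_error(w, Z, y)
```

As a sanity check, with λ = 0 and exactly linear data the fit recovers the line: `ridge_fit([[1,0],[1,1],[1,2]], [1,3,5], 0.0)` gives w = (1, 2).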

7. Set λ = 1 and do not apply a feature transform (i.e., use z = x = (1, x_{1}, x_{2})). Which among the following classifiers has the lowest E_{in}?

[a] 5 versus all
[b] 6 versus all
[c] 7 versus all
[d] 8 versus all
[e] 9 versus all


8. Now, apply the feature transform z = (1, x_{1}, x_{2}, x_{1}x_{2}, x_{1}^{2}, x_{2}^{2}), and set λ = 1. Which among the following classifiers has the lowest E_{out}?

[a] 0 versus all
[b] 1 versus all
[c] 2 versus all
[d] 3 versus all
[e] 4 versus all


9. If we compare using the transform versus not using it, and apply that to '0 versus all' through '9 versus all', which of the following statements is correct for λ = 1?

[a] Overfitting always occurs when we use the transform.
[b] The transform always improves the out-of-sample performance by at least 5% (E_{out} with transform ≤ 0.95 E_{out} without transform).
[c] The transform does not make any difference in the out-of-sample performance.
[d] The transform always worsens the out-of-sample performance by at least 5%.
[e] The transform improves the out-of-sample performance of '5 versus all,' but by less than 5%.




10. Train the '1 versus 5' classifier with z = (1, x_{1}, x_{2}, x_{1}x_{2}, x_{1}^{2}, x_{2}^{2}) with λ = 0.01 and λ = 1. Which of the following statements is correct?

[a] Overfitting occurs (from λ = 1 to λ = 0.01).
[b] The two classifiers have the same E_{in}.
[c] The two classifiers have the same E_{out}.
[d] When λ goes up, both E_{in} and E_{out} go up.
[e] When λ goes up, both E_{in} and E_{out} go down.


Support Vector Machines

11. Consider the following training set generated from a target function f : X → {−1, +1} where X = R^{2}:

x_{1} = (1, 0), y_{1} = −1
x_{2} = (0, 1), y_{2} = −1
x_{3} = (0, −1), y_{3} = −1
x_{4} = (−1, 0), y_{4} = +1
x_{5} = (0, 2), y_{5} = +1
x_{6} = (0, −2), y_{6} = +1
x_{7} = (−2, 0), y_{7} = +1

Transform this training set into another two-dimensional space Z:

z_{1} = x_{2}^{2} − 2x_{1} − 1
z_{2} = x_{1}^{2} − 2x_{2} + 1

Using geometry (not quadratic programming), what values of w (without w_{0}) and b specify the separating plane w^{T}z + b = 0 that maximizes the margin in the Z space? The values of w_{1}, w_{2}, b are:

[a] 1, −1, −0.5
[b] −1, 1, −0.5
[c] 1, 0, −0.5
[d] 0, 1, −0.5
[e] None of the above would work.

12. Consider the same training set of the previous problem, but instead of explicitly transforming the input space X, apply the hard-margin SVM algorithm with the kernel

K(x, x′) = (1 + x^{T}x′)^{2}

(which corresponds to a second-order polynomial transformation). Set up the expression for L(α_{1}, ..., α_{7}) and solve for the optimal α_{1}, ..., α_{7} (numerically, using a quadratic programming package). The number of support vectors you get is in what range?

[a] 0-1
[b] 2-3
[c] 4-5
[d] 6-7
[e] > 7
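The set-up above can be sketched in pure Python up to the point where a quadratic-programming package takes over: build the kernel (Gram) matrix and the dual objective L(α), which the package maximizes subject to α_{n} ≥ 0 and Σ_{n} α_{n} y_{n} = 0. The function names here are illustrative.

```python
# Dual set-up for the hard-margin SVM with the polynomial kernel.
# The QP solver itself is left to whatever package you use.

X = [(1, 0), (0, 1), (0, -1), (-1, 0), (0, 2), (0, -2), (-2, 0)]
Y = [-1, -1, -1, 1, 1, 1, 1]

def kernel(x, xp):
    """Second-order polynomial kernel K(x, x') = (1 + x^T x')^2."""
    return (1 + sum(a * b for a, b in zip(x, xp))) ** 2

# Kernel (Gram) matrix over the 7 training points.
K = [[kernel(x, xp) for xp in X] for x in X]

def dual_objective(alpha):
    """L(alpha) = sum_n alpha_n - 1/2 sum_n sum_m alpha_n alpha_m y_n y_m K_nm."""
    lin = sum(alpha)
    quad = sum(alpha[n] * alpha[m] * Y[n] * Y[m] * K[n][m]
               for n in range(len(X)) for m in range(len(X)))
    return lin - 0.5 * quad

def support_vectors(alpha, tol=1e-6):
    """Indices with alpha_n > 0; their count is what the problem asks for."""
    return [n for n, a in enumerate(alpha) if a > tol]
```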

Radial Basis Functions
We experiment with the RBF model, both in regular form (Lloyd + pseudo-inverse) with K centers:

sign( Σ_{k=1}^{K} w_{k} exp(−γ ||x − μ_{k}||^{2}) + b )

(notice that there is a bias term), and in kernel form (using the RBF kernel in hard-margin SVM):

sign( Σ_{α_{n}>0} α_{n} y_{n} exp(−γ ||x − x_{n}||^{2}) + b ).

The input space is X = [−1, 1] × [−1, 1] with uniform probability distribution, and the target is

f(x) = sign(x_{2} − x_{1} + 0.25 sin(π x_{1}))

which is slightly nonlinear in the X space. In each run, generate 100 training points at random using this target, and apply both forms of RBF to these training points. Here are some guidelines:

- Repeat the experiment for as many runs as needed to get the answer to be stable (statistically away from flipping to the closest competing answer).
- In case a data set is not separable in the 'Z space' by the RBF kernel using hard-margin SVM, discard the run but keep track of how often this happens, if ever.
- When you use Lloyd's algorithm, initialize the centers to random points in X and iterate until there is no change from iteration to iteration. If a cluster becomes empty, discard the run and repeat.
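A minimal sketch of the data generation and of Lloyd's algorithm as the guidelines describe, in pure Python; the function names are illustrative. The convergence test compares centers for exact equality, matching "no change from iteration to iteration", and an empty cluster is signaled so the run can be discarded.

```python
# Data generation for the stated target, and Lloyd's algorithm for the
# K centers of the regular RBF form.
import math
import random

def target(x1, x2):
    """f(x) = sign(x2 - x1 + 0.25 sin(pi x1)); treat 0 as +1."""
    return 1 if x2 - x1 + 0.25 * math.sin(math.pi * x1) >= 0 else -1

def make_data(n=100):
    """n points uniform on [-1,1] x [-1,1], labeled by the target."""
    pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
    return pts, [target(x1, x2) for x1, x2 in pts]

def lloyd(points, k):
    """Lloyd's algorithm: random initial centers, alternate assignment and
    mean update until the centers stop changing. Returns None if a cluster
    becomes empty (discard the run and repeat, per the guidelines)."""
    centers = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(k)]
    while True:
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2
                                            + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        if any(not c for c in clusters):
            return None  # empty cluster
        # Move each center to the mean of its cluster.
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               for c in clusters]
        if new == centers:
            return centers
        centers = new
```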


13. For γ = 1.5, how often do you get a data set that is not separable by the RBF kernel (using hard-margin SVM)? Hint: Run the hard-margin SVM, then check that the solution has E_{in} = 0.

[a] ≤ 5% of the time
[b] > 5% but ≤ 10% of the time
[c] > 10% but ≤ 20% of the time
[d] > 20% but ≤ 40% of the time
[e] > 40% of the time




14. If we use K = 9 for regular RBF and take γ = 1.5, how often does the kernel form beat the regular form (excluding runs mentioned in Problem 13 and runs with empty clusters, if any) in terms of E_{out}?

[a] ≤ 15% of the time
[b] > 15% but ≤ 30% of the time
[c] > 30% but ≤ 50% of the time
[d] > 50% but ≤ 75% of the time
[e] > 75% of the time




15. If we use K = 12 for regular RBF and take γ = 1.5, how often does the kernel form beat the regular form (excluding runs mentioned in Problem 13 and runs with empty clusters, if any) in terms of E_{out}?

[a] ≤ 10% of the time
[b] > 10% but ≤ 30% of the time
[c] > 30% but ≤ 60% of the time
[d] > 60% but ≤ 90% of the time
[e] > 90% of the time




16. Now we focus on regular RBF only, with γ = 1.5. If we go from K = 9 clusters to K = 12 clusters (only 9 and 12), which of the following 5 cases happens most often in your runs (excluding runs with empty clusters, if any)? Up or down means strictly so.

[a] E_{in} goes down, but E_{out} goes up.
[b] E_{in} goes up, but E_{out} goes down.
[c] Both E_{in} and E_{out} go up.
[d] Both E_{in} and E_{out} go down.
[e] E_{in} and E_{out} remain the same.




17. For regular RBF with K = 9, if we go from γ = 1.5 to γ = 2 (only 1.5 and 2), which of the following 5 cases happens most often in your runs (excluding runs with empty clusters, if any)? Up or down means strictly so.

[a] E_{in} goes down, but E_{out} goes up.
[b] E_{in} goes up, but E_{out} goes down.
[c] Both E_{in} and E_{out} go up.
[d] Both E_{in} and E_{out} go down.
[e] E_{in} and E_{out} remain the same.




18. What is the percentage of time that regular RBF achieves E_{in} = 0 with K = 9 and γ = 1.5 (excluding runs with empty clusters, if any)?

[a] ≤ 10% of the time
[b] > 10% but ≤ 20% of the time
[c] > 20% but ≤ 30% of the time
[d] > 30% but ≤ 50% of the time
[e] > 50% of the time


Bayesian Priors

19. Let f ∈ [0, 1] be the unknown probability of getting a heart attack for people in a certain population. Notice that f is just a constant, not a function, for simplicity. We want to model f using a hypothesis h ∈ [0, 1]. Before we see any data, we assume that P(h = f) is uniform over h ∈ [0, 1] (the prior). We pick one person from the population, and it turns out that he or she had a heart attack. Which of the following is true about the posterior probability that h = f given this sample point?

[a] The posterior is uniform over [0, 1].
[b] The posterior increases linearly over [0, 1].
[c] The posterior increases nonlinearly over [0, 1].
[d] The posterior is a delta function at 1 (implying f has to be 1).
[e] The posterior cannot be evaluated based on the given information.


Aggregation

20. Given two learned hypotheses g_{1} and g_{2}, we construct the aggregate hypothesis g given by g(x) = (1/2)(g_{1}(x) + g_{2}(x)) for all x ∈ X. If we use mean-squared error, which of the following statements is true?

[a] E_{out}(g) cannot be worse than E_{out}(g_{1}).
[b] E_{out}(g) cannot be worse than the smaller of E_{out}(g_{1}) and E_{out}(g_{2}).
[c] E_{out}(g) cannot be worse than the average of E_{out}(g_{1}) and E_{out}(g_{2}).
[d] E_{out}(g) has to be between E_{out}(g_{1}) and E_{out}(g_{2}) (including the end values of that interval).
[e] None of the above

