Description
All questions have multiple-choice answers ([a], [b], [c], …). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people's solutions. Your solutions should be based on your own work. Definitions and notation follow the lectures.
Note about the final
There are twice as many problems in this final as there are in a homework set, and some problems require packages that will need time to get to work properly.
Problems cover different parts of the course. To facilitate your search for relevant lecture parts, an indexed version of the lecture video segments can be found at the Machine Learning Video Library:
http://work.caltech.edu/library
To discuss the final, you are encouraged to take part in the forum http://book.caltech.edu/bookforum where there is a dedicated subforum for this final.
Please follow the forum guidelines for posting answers (see the "BEFORE posting answers" announcement at the top there).
© 2012-2015 Yaser Abu-Mostafa. All rights reserved. No redistribution in any format. No translation or derivative products without written permission.
Nonlinear transforms
1. The polynomial transform of order Q = 10 applied to X of dimension d = 2 results in a Z space of what dimensionality (not counting the constant coordinate x0 = 1 or z0 = 1)?

[a] 12
[b] 20
[c] 35
[d] 100
[e] None of the above
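If you want to sanity-check your count of the coordinates, the monomials of the transform can be enumerated directly; a minimal sketch in Python (the enumeration bound is specific to d = 2):

```python
# Count the monomials x1^i * x2^j with 1 <= i + j <= Q: these are exactly
# the non-constant coordinates of the order-Q polynomial transform for d = 2.
Q = 10
dim_Z = sum(1 for i in range(Q + 1) for j in range(Q + 1) if 1 <= i + j <= Q)
print(dim_Z)  # Z-space dimensionality, excluding z0 = 1
```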
Bias and Variance
2. Recall that the average hypothesis ḡ was based on training the same model H on different data sets D to get g^(D) ∈ H, and taking the expected value of g^(D) w.r.t. D to get ḡ. Which of the following models H could result in ḡ ∉ H?

[a] A singleton H (H has one hypothesis)
[b] H is the set of all constant, real-valued hypotheses
[c] H is the linear regression model
[d] H is the logistic regression model
[e] None of the above
Overfitting
3. Which of the following statements is false?

[a] If there is overfitting, there must be two or more hypotheses that have different values of Ein.
[b] If there is overfitting, there must be two or more hypotheses that have different values of Eout.
[c] If there is overfitting, there must be two or more hypotheses that have different values of (Eout - Ein).
[d] We can always determine if there is overfitting by comparing the values of (Eout - Ein).
[e] We cannot determine overfitting based on one hypothesis only.
4. Which of the following statements is true?

[a] Deterministic noise cannot occur with stochastic noise.
[b] Deterministic noise does not depend on the hypothesis set.
[c] Deterministic noise does not depend on the target function.
[d] Stochastic noise does not depend on the hypothesis set.
[e] Stochastic noise does not depend on the target distribution.
Regularization
5. The regularized weight wreg is a solution to

$$\text{minimize} \quad \frac{1}{N}\sum_{n=1}^{N}\left(w^T x_n - y_n\right)^2 \quad \text{subject to} \quad w^T \Gamma^T \Gamma\, w \le C,$$

where Γ is a matrix. If wlin^T Γ^T Γ wlin ≤ C, where wlin is the linear regression solution, then what is wreg?

[a] wreg = wlin
[b] wreg = Γ wlin
[c] wreg = Γ^T Γ wlin
[d] wreg = √C wlin
[e] wreg = C wlin
6. Soft-order constraints that regularize polynomial models can be

[a] written as hard-order constraints
[b] translated into augmented error
[c] determined from the value of the VC dimension
[d] used to decrease both Ein and Eout
[e] None of the above is true
Regularized Linear Regression
We are going to experiment with linear regression for classification on the processed US Postal Service Zip Code data set from Homework 8. Download the data (extracted features of intensity and symmetry) for training and testing:

http://www.amlbook.com/data/zip/features.train
http://www.amlbook.com/data/zip/features.test

(the format of each row is: digit intensity symmetry). We will train two types of binary classifiers; one-versus-one (one digit is class +1 and another digit is class -1, with the rest of the digits disregarded), and one-versus-all (one digit is class +1 and the rest of the digits are class -1). When evaluating Ein and Eout, use binary classification error. Implement the regularized least-squares linear regression for classification that minimizes

$$\frac{1}{N}\sum_{n=1}^{N}\left(w^T z_n - y_n\right)^2 + \frac{\lambda}{N}\, w^T w$$

where w includes w0.
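As a starting point for Problems 7-10, here is a minimal sketch of the one-shot solution via the normal equations; the local file path, the parsing helper, and the '5 versus all' labeling are illustrative assumptions, not part of the problem statement:

```python
import numpy as np

def load_features(path):
    # each row of the file: digit intensity symmetry
    data = np.loadtxt(path)
    return data[:, 0].astype(int), data[:, 1:]

def ridge_fit(Z, y, lam):
    # w_reg = (Z^T Z + lambda * I)^{-1} Z^T y
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def binary_error(Z, y, w):
    # fraction of points misclassified by sign(w^T z)
    return np.mean(np.sign(Z @ w) != y)

digits, X = load_features("features.train")  # assumed local copy of the data
Z = np.c_[np.ones(len(X)), X]                # z = (1, x1, x2), no transform
y = np.where(digits == 5, 1.0, -1.0)         # '5 versus all' as an example
w = ridge_fit(Z, y, lam=1.0)
print("Ein =", binary_error(Z, y, w))
```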
7. Set λ = 1 and do not apply a feature transform (i.e., use z = x = (1, x1, x2)). Which among the following classifiers has the lowest Ein?

[a] 5 versus all
[b] 6 versus all
[c] 7 versus all
[d] 8 versus all
[e] 9 versus all
8. Now, apply a feature transform z = (1, x1, x2, x1x2, x1², x2²), and set λ = 1. Which among the following classifiers has the lowest Eout?

[a] 0 versus all
[b] 1 versus all
[c] 2 versus all
[d] 3 versus all
[e] 4 versus all
9. If we compare using the transform versus not using it, and apply that to '0 versus all' through '9 versus all', which of the following statements is correct for λ = 1?

[a] Overfitting always occurs when we use the transform.
[b] The transform always improves the out-of-sample performance by at least 5% (Eout with transform ≤ 0.95 · Eout without transform).
[c] The transform does not make any difference in the out-of-sample performance.
[d] The transform always worsens the out-of-sample performance by at least 5%.
[e] The transform improves the out-of-sample performance of '5 versus all,' but by less than 5%.
10. Train the '1 versus 5' classifier with z = (1, x1, x2, x1x2, x1², x2²) with λ = 0.01 and λ = 1. Which of the following statements is correct?

[a] Overfitting occurs (from λ = 1 to λ = 0.01).
[b] The two classifiers have the same Ein.
[c] The two classifiers have the same Eout.
[d] When λ goes up, both Ein and Eout go up.
[e] When λ goes up, both Ein and Eout go down.
Support Vector Machines
11. Consider the following training set generated from a target function f: X → {-1, +1} where X = ℝ²:

x1 = (1, 0), y1 = -1
x2 = (0, 1), y2 = -1
x3 = (0, -1), y3 = -1
x4 = (-1, 0), y4 = +1
x5 = (0, 2), y5 = +1
x6 = (0, -2), y6 = +1
x7 = (-2, 0), y7 = +1

Transform this training set into another two-dimensional space Z:

$$z_1 = x_2^2 - 2x_1 - 1, \qquad z_2 = x_1^2 - 2x_2 + 1$$

Using geometry (not quadratic programming), what values of w (without w0) and b specify the separating plane w^T z + b = 0 that maximizes the margin in the Z space? The values of w1, w2, b are:

[a] 1, -1, -0.5
[b] -1, 1, -0.5
[c] 1, 0, -0.5
[d] 0, 1, -0.5
[e] None of the above would work.
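If you want to double-check the geometry numerically, the transformed training points can be listed with a few lines of Python (the tuples below just restate the data from the problem):

```python
# Map each training point through z1 = x2^2 - 2*x1 - 1, z2 = x1^2 - 2*x2 + 1
# and print the Z-space coordinates next to the labels.
points = [((1, 0), -1), ((0, 1), -1), ((0, -1), -1), ((-1, 0), +1),
          ((0, 2), +1), ((0, -2), +1), ((-2, 0), +1)]
for (x1, x2), y in points:
    z = (x2**2 - 2*x1 - 1, x1**2 - 2*x2 + 1)
    print(z, y)
```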
12. Consider the same training set of the previous problem, but instead of explicitly transforming the input space X, apply the hard-margin SVM algorithm with the kernel

$$K(x, x') = (1 + x^T x')^2$$

(which corresponds to a second-order polynomial transformation). Set up the expression for L(α1, ..., α7) and solve for the optimal α1, ..., α7 (numerically, using a quadratic programming package). The number of support vectors you get is in what range?

[a] 0-1
[b] 2-3
[c] 4-5
[d] 6-7
[e] > 7
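Any quadratic programming package will do for the dual; as one possible shortcut, scikit-learn's SVC with a polynomial kernel and a very large C approximates the hard-margin SVM (an assumption about tooling, not a requirement of the problem):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 0], [0, 1], [0, -1], [-1, 0], [0, 2], [0, -2], [-2, 0]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])

# (1 + x^T x')^2 is kernel='poly' with degree=2, gamma=1, coef0=1;
# a very large C emulates the hard margin.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e10)
clf.fit(X, y)
print("number of support vectors:", len(clf.support_))
```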
Radial Basis Functions
We experiment with the RBF model, both in regular form (Lloyd + pseudo-inverse) with K centers:

$$\text{sign}\left(\sum_{k=1}^{K} w_k \exp\left(-\gamma \left\|x - \mu_k\right\|^2\right) + b\right)$$

(notice that there is a bias term), and in kernel form (using the RBF kernel in hard-margin SVM):

$$\text{sign}\left(\sum_{\alpha_n > 0} \alpha_n y_n \exp\left(-\gamma \left\|x - x_n\right\|^2\right) + b\right).$$

The input space is X = [-1, 1] × [-1, 1] with uniform probability distribution, and the target is

$$f(x) = \text{sign}\left(x_2 - x_1 + 0.25 \sin(\pi x_1)\right)$$

which is slightly nonlinear in the X space. In each run, generate 100 training points at random using this target, and apply both forms of RBF to these training points. Here are some guidelines:

- Repeat the experiment for as many runs as needed to get the answer to be stable (statistically away from flipping to the closest competing answer).
- In case a data set is not separable in the 'Z space' by the RBF kernel using hard-margin SVM, discard the run but keep track of how often this happens, if ever.
- When you use Lloyd's algorithm, initialize the centers to random points in X and iterate until there is no change from iteration to iteration. If a cluster becomes empty, discard the run and repeat.
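A minimal sketch of one run of the regular form under these guidelines (plain NumPy; the helper names and the random seed are my own choices):

```python
import numpy as np

def target(X):
    return np.sign(X[:, 1] - X[:, 0] + 0.25 * np.sin(np.pi * X[:, 0]))

def lloyd(X, K, rng):
    # Initialize centers to random points in [-1,1]x[-1,1]; iterate until the
    # cluster memberships stop changing; return None on an empty cluster.
    centers = rng.uniform(-1, 1, size=(K, 2))
    members = None
    while True:
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_members = d2.argmin(axis=1)
        if any((new_members == k).sum() == 0 for k in range(K)):
            return None                      # empty cluster: discard the run
        if members is not None and np.all(new_members == members):
            return centers
        members = new_members
        for k in range(K):
            centers[k] = X[members == k].mean(axis=0)

def rbf_features(X, centers, gamma):
    # Feature matrix with a bias column followed by the K RBF responses.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.c_[np.ones(len(X)), np.exp(-gamma * d2)]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = target(X)
centers = lloyd(X, K=9, rng=rng)
if centers is not None:
    Phi = rbf_features(X, centers, gamma=1.5)
    w = np.linalg.pinv(Phi) @ y              # pseudo-inverse for the weights
    print("Ein =", np.mean(np.sign(Phi @ w) != y))
```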
13. For γ = 1.5, how often do you get a data set that is not separable by the RBF kernel (using hard-margin SVM)? Hint: Run the hard-margin SVM, then check that the solution has Ein = 0.

[a] ≤ 5% of the time
[b] > 5% but ≤ 10% of the time
[c] > 10% but ≤ 20% of the time
[d] > 20% but ≤ 40% of the time
[e] > 40% of the time
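For the hint above, one possible check (again assuming scikit-learn, with C large enough to emulate the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sign(X[:, 1] - X[:, 0] + 0.25 * np.sin(np.pi * X[:, 0]))

# Hard-margin RBF SVM: K(x, x') = exp(-gamma * ||x - x'||^2), gamma = 1.5.
clf = SVC(kernel="rbf", gamma=1.5, C=1e10).fit(X, y)
separable = np.mean(clf.predict(X) != y) == 0   # Ein = 0 means separable
print("separable:", separable)
```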
14. If we use K = 9 for regular RBF and take γ = 1.5, how often does the kernel form beat the regular form (excluding runs mentioned in Problem 13 and runs with empty clusters, if any) in terms of Eout?

[a] ≤ 15% of the time
[b] > 15% but ≤ 30% of the time
[c] > 30% but ≤ 50% of the time
[d] > 50% but ≤ 75% of the time
[e] > 75% of the time
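For Problems 14-18, Eout has no closed form here, so it can be estimated on a large fresh sample drawn from the same input distribution; a sketch (the test-set size is an arbitrary choice, and `predict` stands for whichever interface your two classifiers expose):

```python
import numpy as np

rng = np.random.default_rng(2)
X_test = rng.uniform(-1, 1, size=(10000, 2))   # fresh test sample
y_test = np.sign(X_test[:, 1] - X_test[:, 0]
                 + 0.25 * np.sin(np.pi * X_test[:, 0]))

def eout(clf):
    # clf is any fitted classifier exposing .predict (e.g., the SVC above,
    # or the regular-form RBF wrapped in a small class with the same method)
    return np.mean(clf.predict(X_test) != y_test)

# Tally over many runs, e.g.:  kernel_wins += eout(svm) < eout(regular_rbf)
```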
15. If we use K = 12 for regular RBF and take γ = 1.5, how often does the kernel form beat the regular form (excluding runs mentioned in Problem 13 and runs with empty clusters, if any) in terms of Eout?

[a] ≤ 10% of the time
[b] > 10% but ≤ 30% of the time
[c] > 30% but ≤ 60% of the time
[d] > 60% but ≤ 90% of the time
[e] > 90% of the time
16. Now we focus on regular RBF only, with γ = 1.5. If we go from K = 9 clusters to K = 12 clusters (only 9 and 12), which of the following 5 cases happens most often in your runs (excluding runs with empty clusters, if any)? Up or down means strictly so.

[a] Ein goes down, but Eout goes up.
[b] Ein goes up, but Eout goes down.
[c] Both Ein and Eout go up.
[d] Both Ein and Eout go down.
[e] Ein and Eout remain the same.
17. For regular RBF with K = 9, if we go from γ = 1.5 to γ = 2 (only 1.5 and 2), which of the following 5 cases happens most often in your runs (excluding runs with empty clusters, if any)? Up or down means strictly so.

[a] Ein goes down, but Eout goes up.
[b] Ein goes up, but Eout goes down.
[c] Both Ein and Eout go up.
[d] Both Ein and Eout go down.
[e] Ein and Eout remain the same.
18. What is the percentage of time that regular RBF achieves Ein = 0 with K = 9 and γ = 1.5 (excluding runs with empty clusters, if any)?

[a] ≤ 10% of the time
[b] > 10% but ≤ 20% of the time
[c] > 20% but ≤ 30% of the time
[d] > 30% but ≤ 50% of the time
[e] > 50% of the time
Bayesian Priors
19. Let f ∈ [0, 1] be the unknown probability of getting a heart attack for people in a certain population. Notice that f is just a constant, not a function, for simplicity. We want to model f using a hypothesis h ∈ [0, 1]. Before we see any data, we assume that P(h = f) is uniform over h ∈ [0, 1] (the prior). We pick one person from the population, and it turns out that he or she had a heart attack. Which of the following is true about the posterior probability that h = f given this sample point?

[a] The posterior is uniform over [0, 1].
[b] The posterior increases linearly over [0, 1].
[c] The posterior increases nonlinearly over [0, 1].
[d] The posterior is a delta function at 1 (implying f has to be 1).
[e] The posterior cannot be evaluated based on the given information.
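Recall how the posterior is formed: by Bayes' rule,

$$P(h = f \mid \text{sample}) \;\propto\; P(\text{sample} \mid h = f)\, P(h = f),$$

with the uniform prior P(h = f) given in the problem.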
Aggregation
20. Given two learned hypotheses g1 and g2, we construct the aggregate hypothesis g given by g(x) = (1/2)(g1(x) + g2(x)) for all x ∈ X. If we use mean-squared error, which of the following statements is true?

[a] Eout(g) cannot be worse than Eout(g1).
[b] Eout(g) cannot be worse than the smaller of Eout(g1) and Eout(g2).
[c] Eout(g) cannot be worse than the average of Eout(g1) and Eout(g2).
[d] Eout(g) has to be between Eout(g1) and Eout(g2) (including the end values of that interval).
[e] None of the above