Description
All questions have multiple-choice answers ([a], [b], [c], ...). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people's solutions. Your solutions should be based on your own work. Definitions and notation follow the lectures.
Note about the homework
The goal of the homework is to facilitate a deeper understanding of the course material. The questions are not designed to be puzzles with catchy answers. They are meant to make you roll up your sleeves, face uncertainties, and approach the problem from different angles.
The problems range from easy to difficult, and from practical to theoretical. Some problems require running a full experiment to arrive at the answer.
The answer may not be obvious or numerically close to one of the choices, but one (and only one) choice will be correct if you follow the instructions precisely in each problem. You are encouraged to explore the problem further by experimenting with variations on these instructions, for the learning benefit.
You are also encouraged to take part in the forum http://book.caltech.edu/bookforum
where there are many threads about each homework set. We hope that you will contribute to the discussion as well. Please follow the forum guidelines for posting answers (see the "BEFORE posting answers" announcement at the top there).
© 2012–2015 Yaser Abu-Mostafa. All rights reserved. No redistribution in any format. No translation or derivative products without written permission.
Validation
In the following problems, use the data provided in the files in.dta and out.dta for Homework # 6. We are going to apply linear regression with a nonlinear transformation for classification (without regularization). The nonlinear transformation is given by Φ_{0} through Φ_{7}, which transform (x_{1}, x_{2}) into

1, x_{1}, x_{2}, x^{2}_{1}, x^{2}_{2}, x_{1}x_{2}, |x_{1} − x_{2}|, |x_{1} + x_{2}|

To illustrate how taking out points for validation affects the performance, we will consider the hypotheses trained on D_{train} (without restoring the full D for training after validation is done).
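As a concrete starting point, the sketch below implements the transform and the one-shot linear regression fit, assuming the .dta files have been loaded into an N×2 input matrix X and a label vector y (e.g. via numpy.loadtxt); the function and variable names are illustrative, not prescribed by the homework.

```python
import numpy as np

def transform(X, k):
    """Apply Phi_0 .. Phi_k to rows (x1, x2) of X; here k ranges over 3..7."""
    x1, x2 = X[:, 0], X[:, 1]
    Z = np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2,
                         x1 * x2, np.abs(x1 - x2), np.abs(x1 + x2)])
    return Z[:, :k + 1]

def linear_regression(Z, y):
    """One-shot least squares: w = (pseudo-inverse of Z) y."""
    return np.linalg.pinv(Z) @ y

def classification_error(Z, y, w):
    """Fraction of points where sign(Z w) disagrees with the label y."""
    return float(np.mean(np.sign(Z @ w) != y))
```

Model selection then amounts to training on D_train for each k, computing classification_error on the validation set, and keeping the k with the smallest value.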

1. Split in.dta into training (first 25 examples) and validation (last 10 examples). Train on the 25 examples only, using the validation set of 10 examples to select between five models that apply linear regression to Φ_{0} through Φ_{k}, with k = 3, 4, 5, 6, 7. For which model is the classification error on the validation set smallest?

[a] k = 3

[b] k = 4

[c] k = 5

[d] k = 6

[e] k = 7
2. Evaluate the out-of-sample classification error using out.dta on the 5 models to see how well the validation set predicted the best of the 5 models. For which model is the out-of-sample classification error smallest?

[a] k = 3

[b] k = 4

[c] k = 5

[d] k = 6

[e] k = 7
3. Reverse the role of training and validation sets; now training with the last 10 examples and validating with the first 25 examples. For which model is the classification error on the validation set smallest?

[a] k = 3

[b] k = 4

[c] k = 5

[d] k = 6

[e] k = 7
4. Once again, evaluate the out-of-sample classification error using out.dta on the 5 models to see how well the validation set predicted the best of the 5 models. For which model is the out-of-sample classification error smallest?

[a] k = 3

[b] k = 4

[c] k = 5

[d] k = 6

[e] k = 7
5. What values are closest in Euclidean distance to the out-of-sample classification error obtained for the model chosen in Problems 1 and 3, respectively?

[a] 0.0, 0.1

[b] 0.1, 0.2

[c] 0.1, 0.3

[d] 0.2, 0.2

[e] 0.2, 0.3
Validation Bias


6. Let e_{1} and e_{2} be independent random variables, distributed uniformly over the interval [0, 1]. Let e = min(e_{1}, e_{2}). The expected values of e_{1}, e_{2}, e are closest to

[a] 0.5, 0.5, 0

[b] 0.5, 0.5, 0.1

[c] 0.5, 0.5, 0.25

[d] 0.5, 0.5, 0.4

[e] 0.5, 0.5, 0.5

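If the exact expectations are not at your fingertips, they are easy to estimate empirically. The sketch below (illustrative names, standard library only) approximates all three expected values by simple Monte Carlo; with enough trials the first two estimates settle near 0.5, narrowing the comparison to the third value.

```python
import random

def simulate_min_of_two_uniforms(trials=100_000, seed=0):
    """Monte Carlo estimate of E[e1], E[e2], E[min(e1, e2)] for e1, e2 ~ U[0, 1]."""
    rng = random.Random(seed)
    sum1 = sum2 = sum_min = 0.0
    for _ in range(trials):
        e1, e2 = rng.random(), rng.random()
        sum1 += e1
        sum2 += e2
        sum_min += min(e1, e2)
    return sum1 / trials, sum2 / trials, sum_min / trials
```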

Cross Validation
7. You are given the data points (x, y): (−1, 0), (ρ, 1), (1, 0), ρ ≥ 0, and a choice between two models: constant {h_{0}(x) = b} and linear {h_{1}(x) = ax + b}. For which value of ρ would the two models be tied using leave-one-out cross-validation with the squared error measure?
[a] √(√3 + 4)

[b] √(√3 − 1)

[c] √(9 + 4√6)

[d] √(9 − √6)

[e] None of the above

PLA vs. SVM
Notice: Quadratic Programming packages sometimes need tweaking and have numerical issues, and this is characteristic of packages you will use in practical ML situations. Your understanding of support vectors will help you get to the correct answers.
In the following problems, we compare PLA to SVM with hard margin^{1} on linearly separable data sets. For each run, you will create your own target function f and data set D. Take d = 2 and choose a random line in the plane as your target function (do this by taking two random, uniformly distributed points on [−1, 1] × [−1, 1] and taking the line passing through them), where one side of the line maps to +1 and the other maps to −1. Choose the inputs x_{n} of the data set as random points in X = [−1, 1] × [−1, 1], and evaluate the target function on each x_{n} to get the corresponding output y_{n}. If all data points are on one side of the line, discard the run and start a new run.
Start PLA with the all-zero vector and pick the misclassified point for each PLA iteration at random. Run PLA to find the final hypothesis g_{PLA} and measure the disagreement between f and g_{PLA} as P[f(x) ≠ g_{PLA}(x)] (you can either calculate this exactly, or approximate it by generating a sufficiently large, separate set of points to evaluate it). Now, run SVM on the same data to find the final hypothesis g_{SVM} by solving

min_{w,b}  (1/2) w^{T}w

s.t.  y_{n}(w^{T}x_{n} + b) ≥ 1

using quadratic programming on the primal or the dual problem. Measure the disagreement between f and g_{SVM} as P[f(x) ≠ g_{SVM}(x)], and count the number of support vectors you get in each run.

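One way to organize a single run is sketched below. This is an illustrative implementation, not the required one: the hard-margin primal is solved with scipy.optimize.minimize (SLSQP) since the problem is tiny, but any quadratic programming package works, and all function names are our own.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def random_target():
    """A random line through two uniform points in [-1, 1]^2.

    The sign of the 2D cross product tells which side of the line a point is on."""
    p, q = rng.uniform(-1, 1, size=(2, 2))
    def f(X):
        return np.sign((q[0] - p[0]) * (X[:, 1] - p[1])
                       - (q[1] - p[1]) * (X[:, 0] - p[0]))
    return f

def pla(X, y):
    """PLA from the all-zero vector; X carries a leading column of ones."""
    w = np.zeros(X.shape[1])
    while True:
        mis = np.flatnonzero(np.sign(X @ w) != y)
        if mis.size == 0:
            return w
        i = rng.choice(mis)            # random misclassified point
        w = w + y[i] * X[i]

def svm_hard_margin(X, y):
    """Solve the primal  min (1/2)||w||^2  s.t.  y_n (w.x_n + b) >= 1.

    The variable is v = [b, w1, w2]; X here has no column of ones."""
    def objective(v):
        return 0.5 * v[1:] @ v[1:]
    constraints = {"type": "ineq",
                   "fun": lambda v: y * (X @ v[1:] + v[0]) - 1.0}
    res = minimize(objective, np.zeros(X.shape[1] + 1),
                   constraints=constraints)
    return res.x
```

For each run, estimate both disagreements on a fresh sample of test points; the support vectors of g_SVM are the training points whose constraint y_n(w·x_n + b) = 1 is (numerically) active.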
8. For N = 10, repeat the above experiment for 1000 runs. How often is g_{SVM} better than g_{PLA} in approximating f? The percentage of time is closest to:

[a] 20%

^{1}For hard margin in SVM packages, set C → ∞.


[b] 40%

[c] 60%

[d] 80%

[e] 100%


9. For N = 100, repeat the above experiment for 1000 runs. How often is g_{SVM} better than g_{PLA} in approximating f? The percentage of time is closest to:

[a] 10%

[b] 30%

[c] 50%

[d] 70%

[e] 90%


10. For the case N = 100, which of the following is the closest to the average number of support vectors of g_{SVM} (averaged over the 1000 runs)?

[a] 2

[b] 3

[c] 5

[d] 10

[e] 20
