Description
Ground rules
For this assignment, you may use any of our EE 559 course materials, and you may use your computer for coding and accessing the assignment, EE 559 materials, piazza, and D2L. You may also use the internet for information about Python, Python coding, and Python libraries that you are allowed to use for the problem you are working on (e.g., looking up information for sklearn functions that you might want to use in a problem that allows the sklearn library).
You are not allowed to use the internet to look up answers or partial answers to any of this assignment’s problems, or to communicate with other people (e.g., other students, people on discussion forums, etc.) on topics related to this assignment.
For all coding in this assignment, you are allowed to use built-in Python functions, NumPy, and matplotlib (and pandas only for reading/writing csv files). Other libraries are allowed only where explicitly stated in the problem.
For questions on this assignment posted to piazza, please consider whether the question is appropriate for all students (e.g., clarifying the problem statement, what is allowed, or suspected typos or omissions in the problem statement) or only for the instructors (e.g., anything that includes your approach or solution); the latter should be posted as a “private” post visible only to the professor, the TAs, and you.
Please respect your classmates and follow all of these guidelines, so that everyone can be graded fairly. Any violations that are detected will result in substantial penalties.
Uploading your solutions. Please upload the following files before the due date and time:

A single pdf file of all your solutions/answers.

A single computer-readable pdf file of all your code.
Problems and points. This Midterm Assignment has 3 problems, worth a total of 100 points possible. There will be partial credit on most problem parts. Good luck!

Kernel nearest means.
In this problem you will write code for, and run, a nonlinear version of a nearest-means classifier on data that we provide. This problem has d = 2 features.
Coding: you must use Python, and code the functions yourself. sklearn and other supplementary libraries are not allowed for this problem. As in Homework 4, you may use Python built-in functions, NumPy, and matplotlib. You may also use the plotting functions provided in our previous homework assignments, and modify them as you like. pandas may be used only for reading and writing csv files.
Datasets: there are 2 synthetic datasets provided. Pr1_dataset1 is the synthetic dataset from HW1, but divided up to include a validation set. Pr1_dataset2 is also synthetic but is new. Each of these 2 datasets is given as 3 sets: training set, validation set, and test set.
Comment: This problem uses model selection with a validation set (covered in Lecture 13); the procedure is straightforward, and we will answer any questions you have regarding the model selection procedure.


(a) Write an expression for the discriminant function g(x) for a 2-class nearest-means classifier. No need to derive it, but do express it in simplest form.



(b) From (a), develop and give the dual representation of the discriminant function; call it g_d(x).
Hint: see Lecture 12 for the multiclass (MVM) version.


(c) Use kernel substitution to give an expression for the discriminant function g_RBF(x) for a nearest-means classifier using an RBF kernel, with parameter γ (as defined in Lecture 12).



(d) Code a 2-class kernel nearest-means classifier that can use an RBF kernel or a linear kernel. Use model selection with the validation set to find an optimal value of γ (call it γ*) in the case of the RBF kernel, separately for each dataset.

Tip: not knowing a priori what range is best for γ, you might first try a range covering a few orders of magnitude, and use a log scale for the increment: e.g., to cover 0.01 ≤ γ ≤ 100, one could loop over an index k in the range −2 ≤ k ≤ 2 with some increment (step size) specified for k. The value of γ for each k is then computed as γ = 10^k.
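The sweep described in the tip can be sketched in a couple of lines of NumPy. This is only an illustration of the log-scale grid; the step size for k (0.25 here) is an arbitrary choice, not part of the problem statement:

```python
import numpy as np

# Index k covers -2 <= k <= 2 in steps of 0.25 (17 values total).
k_values = np.linspace(-2.0, 2.0, 17)

# gamma = 10^k, so the grid spans 0.01 <= gamma <= 100 on a log scale.
gammas = 10.0 ** k_values
```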

(e) Plot the validation-set classification error as a function of γ, for each dataset, for the RBF kernel. (If you used a log scale for your increments as described in the tip of (d), then use a log scale for γ in your plot; or, equivalently, plot vs. k instead of vs. γ.) Pick the optimal value γ* for each dataset based on the validation-set classification error.
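A plot of this kind can be made with matplotlib’s log-scale x-axis. In this sketch the error values are random placeholders, not results; only the plotting pattern and the argmin selection of γ* are the point:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

gammas = 10.0 ** np.linspace(-2, 2, 17)
# Placeholder error values for illustration only:
errors = np.random.default_rng(0).uniform(0.1, 0.4, size=gammas.size)

plt.semilogx(gammas, errors, marker="o")   # log scale for gamma
plt.xlabel("gamma")
plt.ylabel("validation-set classification error")

# gamma* = value of gamma minimizing validation-set error
best = gammas[np.argmin(errors)]
```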

(f) Compare the test-set error using the linear kernel with the test-set error using the RBF kernel, for each dataset. Comment on the results.

(g) For the linear kernel, plot the training data, decision regions, and decision boundary in the feature space, for each dataset.
(h) For the RBF kernel with the optimal γ = γ*, plot the training data and decision regions in the original feature space, for each dataset.
Tip: any point in the original feature space can be classified (using nearest means with the RBF kernel) by your code.
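One common way to draw decision regions is to classify every point of a dense grid over the feature space and color the grid by predicted class, exactly as the tip suggests. In this sketch `classify` is a stand-in for your own classifier (its name and signature are placeholder assumptions), and the toy classifier at the end is purely illustrative:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

def plot_decision_regions(classify, X_train, y_train, steps=200):
    """Color the 2-D feature space by predicted class.

    `classify` takes an (M, 2) array of points and returns M labels.
    """
    pad = 1.0
    x1 = np.linspace(X_train[:, 0].min() - pad, X_train[:, 0].max() + pad, steps)
    x2 = np.linspace(X_train[:, 1].min() - pad, X_train[:, 1].max() + pad, steps)
    g1, g2 = np.meshgrid(x1, x2)
    grid = np.column_stack([g1.ravel(), g2.ravel()])
    labels = classify(grid).reshape(g1.shape)

    plt.contourf(g1, g2, labels, alpha=0.3)                 # decision regions
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors="k")
    plt.xlabel("x1")
    plt.ylabel("x2")
    return labels

# Toy example: classifier returning sign of x1 + x2 (illustrative only).
X = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0], [-0.5, 0.5]])
y = np.array([1, 1, 0, 0])
regions = plot_decision_regions(lambda P: (P.sum(axis=1) > 0).astype(int), X, y)
```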


(i) For the RBF kernel, repeat part (h) except with
γ = 0.01 γ*, 0.1 γ*, 0.3 γ*, 3 γ*, 10 γ*, 100 γ*. Comment on the results.

Learning algorithm from criterion function.
In this problem you will derive and interpret a learning algorithm based on the following criterion function for a 2-class problem, using augmented notation:

J(w) = − Σ_{n=1}^{N} 𝟙[ z_n wᵀx_n ≤ 0 ] (z_n wᵀx_n) / ‖w‖₂

in which 𝟙[⋯] denotes the indicator function, x_n are the (augmented) data points, and z_n ∈ {−1, +1} are their class labels.
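As a numerical sanity check while working with this criterion (reading it as a sum, over misclassified points only, of the normalized quantities −z_n wᵀx_n/‖w‖₂), it can be evaluated directly. The data below are made up purely for illustration:

```python
import numpy as np

def J(w, X, z):
    """Criterion as read above: only misclassified points contribute,
    each by -z_n w^T x_n / ||w||_2 (its distance to the hyperplane)."""
    margins = z * (X @ w)          # z_n w^T x_n for every point
    miscl = margins <= 0           # indicator [z_n w^T x_n <= 0]
    return -np.sum(margins[miscl]) / np.linalg.norm(w)

# Made-up example (not from the assignment):
X = np.array([[1.0, 0.0], [0.0, 1.0]])
z = np.array([1.0, -1.0])
w = np.array([1.0, 1.0])
# margins = [1, -1]: only the second point is misclassified,
# so J(w) = -(-1)/sqrt(2) = 1/sqrt(2).
```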

(a) Interpret the meaning of the criterion function; use a figure if it helps to explain it. Tip: just stating what is different from the Perceptron criterion function is not sufficient.

(b) Derive a gradient descent (GD) algorithm for each GD method given below. Simplify your answers algebraically, and give a statement of the entire algorithm in each case. Give both:

(i) Batch GD

(ii) Stochastic GD (Variant 1)


(c) In this part use nonaugmented notation. Consider a problem with d = 2 features. Draw a plot in 2-D weight space (w₁, w₂), for the case w₀ = 0, showing items (i)–(v) below. Label your axes with numbers and tick marks, so that the lines and vectors drawn are unambiguous.
Let η(i) = 1. For the first epoch, the data has already been shuffled, so take the data points in the order given (x₁, x₂, x₃, x₄).
Tip: This plot can be done by hand. If you prefer to do it by computer, that is also OK.


(i) The following data points (drawn as lines z_n wᵀx_n = 0 in weight space), each with an arrow pointing to its correctly classified side.

Class 1: x₁ = [−4, 1]ᵀ, x₂ = [−3, −2]ᵀ
Class 2: x₃ = [−1, 3]ᵀ, x₄ = [3, −2]ᵀ

(ii) The weight vector at the second iteration (i = 1):
w(1) = [−1, 4]ᵀ.

(iii) The vector showing the weight update at iteration i = 1 (based on data point x₂). Also state the vector’s value in numbers.

(iv) The updated weight vector w(2). Also give the vector’s value in numbers.



(v) The solution region; if there is no solution region, state so.




(d) This part also uses nonaugmented notation. For the scalar case (w and each x_n are scalars), with w₀ = 0: is the criterion function convex? Prove your answer.


Nonlinear MSE regression – 2 approaches.
In this problem you will compare two approaches to using a degree-3 polynomial nonlinearity in a regression problem. Assume there is some reason to believe that a degree-3 polynomial is a good model for the data. This problem has d = 3 (original) features, and the training dataset has N data points.
Please use augmented notation throughout this problem.
Comment: many (but not all) of the parts below have short answers, so the problem is shorter than it probably looks.
Approach A.1.
We use a degree-3 polynomial mapping from the original feature space (x space) to the expanded feature space (u space); then we use a linear MSE regressor in u space. Use w′ to denote the weight vector in u space (as in lecture). The nonlinear mapping uses the most general form of a third-degree polynomial of the vector x, for the case of d = 3 features (that is, it includes all terms of the third-degree polynomial).
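The monomials of such a "most general" degree-3 polynomial in d = 3 variables can be enumerated programmatically. This is only a sketch: the ordering of monomials, and the inclusion of the constant term (as the augmented component), are illustrative conventions, not requirements of the problem:

```python
from itertools import combinations_with_replacement

import numpy as np

def degree3_feature_map(x):
    """All monomials of degree 0..3 in the components of x.

    Degree 0 contributes the constant 1 (the augmented component).
    The ordering of monomials is an arbitrary choice.
    """
    u = []
    for deg in range(4):   # degrees 0, 1, 2, 3
        for idx in combinations_with_replacement(range(len(x)), deg):
            u.append(np.prod([x[i] for i in idx]))   # empty product = 1
    return np.array(u)

u = degree3_feature_map(np.array([1.0, 2.0, 3.0]))
# Under these conventions, d = 3 original features expand to
# C(3+3, 3) = 20 components.
```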
Approach A.2.
This is different than what we have done in class. We use a linear MSE regressor in the original feature space to obtain the regression function ĝ(x) = wᵀx, and then we apply a degree-3 polynomial to its output value ĝ(x) to obtain the final output of the nonlinear regressor, f(x), such that:

f(x) = w′₁ [ĝ(x)] + w′₂ [ĝ(x)]² + w′₃ [ĝ(x)]³

Note that here w′ is defined differently than in Approach A.1, and that there is no w′₀. In this approach all the weights (w and w′) are learned from the data during the learning procedure.
Answer the following questions.

(a) Give the MSE criterion function J_{A.1}(w′) for Approach A.1. You may write this in terms of u if you also define the vector u (e.g., by writing out its components in terms of the components of x). State how many components there are in the vector w′.

(b) How many d.o.f. are there during learning using Approach A.1? Briefly justify your answer.

(c) Give a linear-algebra solution for the optimal weight vector ŵ′ for Approach A.1. No need to derive it. Be sure to define all variables and quantities in your solution that aren’t already defined in this problem.

(d) Give the MSE criterion function J_{A.2}(w, w′) for Approach A.2. You may write it in terms of f(x) if you also define all terms used in the expression for f(x), including writing out your expression for ĝ(x).

(e) How many d.o.f. are there during learning using Approach A.2? Briefly justify your answer.

(f) In Approach A.2, is f(x) a linear or nonlinear function of x? Is f(x) a linear or nonlinear function of w? Is f(x) a linear or nonlinear function of w′? (No need to prove or justify your answers.)

(g) Is the criterion function of (d) a continuous, differentiable function of w? Is it a continuous, differentiable function of w′? (No need to prove or justify your answers.)

(h) Derive the weight-update formulas for w and w′ for Approach A.2, for sequential GD, if weight-update formulas exist. Use η(i) for the learning-rate parameter for the w update, and η′(i) for the learning-rate parameter for the w′ update.
If the weight-update formulas don’t exist, justify why not.

(i) Give the criterion function for Approach A.1 that also includes L₂ regularization of all the weights.

(j) Give the criterion function for Approach A.2 that also includes L₂ regularization of all the weights. Use regularization parameter λ for w, and regularization parameter λ′ for w′. Be sure to define all quantities you use.

(k) Which approach is likely to give lower error on unknowns if there is plenty of data? Briefly justify.

(l) Which approach is likely to give lower error on unknowns if there is a very limited amount of data? Briefly justify.