Description

1. We perform Parzen window density estimation, using the trapezoidal window function given (in unnormalized form) in the figure below: (15 pts)
[Figure: the window φ(u) equals 1 for |u| ≤ h/2 and falls linearly to 0 at |u| = h; the horizontal axis is marked at −h, −h/2, 0, h/2, h.]
Choose h = 1. Assume that we have the following data:
D_{ω_1} = {0, 2, 5}
D_{ω_2} = {4, 7}


(a) Sketch or plot the Parzen window estimates of the pdfs p(x|ω_1) and p(x|ω_2). Please label pertinent values on both axes.
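As a cross-check for the sketch, the estimate can be computed numerically. This is a sketch assuming the window in the figure equals 1 on [−h/2, h/2] and falls linearly to 0 at ±h, so its area is 3h/2 and each term must be divided by that to make the window a pdf:

```python
def phi(u, h=1.0):
    """Unnormalized trapezoidal window: 1 on [-h/2, h/2], linear down to 0 at |u| = h."""
    a = abs(u)
    if a <= h / 2:
        return 1.0
    if a <= h:
        return 2.0 * (1.0 - a / h)
    return 0.0

def parzen_pdf(x, data, h=1.0):
    """p(x) = (1/n) * sum_i phi(x - x_i) / (3h/2), where 3h/2 is the window's area."""
    return sum(phi(x - xi, h) for xi in data) / (len(data) * 1.5 * h)

D1 = [0.0, 2.0, 5.0]   # D_{omega_1}
D2 = [4.0, 7.0]        # D_{omega_2}
```

For example, parzen_pdf(0.0, D1) = 1/(3 · 1.5) = 2/9, since only the window centered at x_i = 0 is nonzero there; evaluating on a fine grid of x values reproduces the plot to be sketched.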



(b) Estimate the prior probabilities based on the frequency of occurrence of the prototypes in each class.



(c) Use the estimates you have developed above to find the decision boundaries and regions for a Bayes minimum-error classifier based on Parzen windows. Only the part of feature space where at least one density is nonzero needs to be classified.
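The hand-derived regions can be checked numerically. A sketch (it assumes the trapezoidal window equals 1 on [−h/2, h/2] and falls linearly to 0 at ±h, and that the priors are 3/5 and 2/5 from the class counts in part (b)):

```python
def phi(u, h=1.0):
    """Trapezoidal window from the figure, unnormalized (area 3h/2)."""
    a = abs(u)
    return 1.0 if a <= h / 2 else (2.0 * (1.0 - a / h) if a <= h else 0.0)

def classify(x, h=1.0):
    """Return 1 or 2 by comparing P(w_i) * p(x | w_i); return 0 where both densities vanish."""
    D1, D2 = [0.0, 2.0, 5.0], [4.0, 7.0]
    p1 = sum(phi(x - xi, h) for xi in D1) / (len(D1) * 1.5 * h)
    p2 = sum(phi(x - xi, h) for xi in D2) / (len(D2) * 1.5 * h)
    g1, g2 = (3 / 5) * p1, (2 / 5) * p2
    if g1 == g2 == 0.0:
        return 0
    return 1 if g1 >= g2 else 2   # exact ties are broken arbitrarily toward omega_1 here
```

Scanning x in small steps and recording where classify(x) changes value locates the decision boundaries.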


2. We perform k-nearest-neighbor density estimation, using k = 2. Assume that you are given the following training set for a 2-class problem with one feature: (20 pts)
D_{ω_1} = {2, 5}
D_{ω_2} = {4, 7}

(a) Sketch or plot the k-nearest-neighbor estimates of the pdf p(x|ω_1). Please label pertinent values on both axes. Also give the density estimates algebraically, for each region in feature space.
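A numerical sketch of the estimate, using the usual 1-D form p(x) ≈ k / (n · V(x)) with V(x) = 2 × (distance from x to its k-th nearest prototype). Note that this estimate is not a true pdf (its tails decay like 1/|x|, so it integrates to infinity), which the sketch should reflect:

```python
def knn_pdf(x, data, k=2):
    """1-D k-NN density estimate: k / (n * 2 * r_k), r_k = distance to the k-th nearest point."""
    dists = sorted(abs(x - xi) for xi in data)
    r_k = dists[k - 1]
    return k / (len(data) * 2.0 * r_k)

D1 = [2.0, 5.0]   # D_{omega_1}
```

For example, at the midpoint x = 3.5 both prototypes are 1.5 away, so knn_pdf(3.5, D1) = 2 / (2 · 2 · 1.5) = 1/3.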

(b) Estimate the prior probabilities based on the frequency of occurrence of the prototypes in each class.

(c) Use the estimates you have developed above to find the decision boundaries and regions for a Bayes minimum-error classifier based on k nearest neighbors.
Homework 6 EE 559, Instructor: Mohammad Reza Rajati


(d) Derive a classifier based on using kNN as a discriminative technique that estimates p(ω_i|x) directly using nearest neighbors, and compare it to the classifier you obtained in 2c. If there are ties, break them in favor of ω_2.
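The discriminative rule can be prototyped directly: pool both training sets, take the k = 2 nearest prototypes to x, estimate p(ω_i|x) ≈ k_i / k, and break ties in favor of ω_2. A sketch:

```python
def knn_classify(x, k=2):
    """Estimate p(w_i | x) as k_i / k over the pooled prototypes; ties go to w_2."""
    labeled = [(2.0, 1), (5.0, 1), (4.0, 2), (7.0, 2)]
    nearest = sorted(labeled, key=lambda p: abs(x - p[0]))[:k]
    k1 = sum(1 for _, lab in nearest if lab == 1)
    k2 = k - k1
    return 1 if k1 > k2 else 2   # the tie k1 == k2 is broken in favor of w_2
```

With these four prototypes and k = 2, the two nearest neighbors never both come from ω_1 (that would require x < 3 and x > 4.5 simultaneously), so the tie-break plays a decisive role; compare this behavior with the regions found in 2c.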


3. Consider the following training data set:
x_1 = [1, 0]^T, z_1 = 1
x_2 = [0, 1]^T, z_2 = 1
x_3 = [0, −1]^T, z_3 = 1
x_4 = [−1, 0]^T, z_4 = 1
x_5 = [0, 2]^T, z_5 = −1
x_6 = [0, −2]^T, z_6 = −1
x_7 = [−2, 0]^T, z_7 = −1
Use the following nonlinear transformation of the input vector x = [x_1, x_2]^T to the transformed vector u = [φ_1(x), φ_2(x)]^T: φ_1(x) = x_2^2 − 2x_1 + 3 and φ_2(x) = x_1^2 − 2x_2 − 3. What is the equation of the optimal separating "hyperplane" in the u space? (15 pts)
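Applying the transformation to the seven points is mechanical; this sketch tabulates the u vectors so they can be plotted in the u plane. It assumes x_3 = [0, −1]^T, x_4 = [−1, 0]^T, x_6 = [0, −2]^T, x_7 = [−2, 0]^T, and z_5 = z_6 = z_7 = −1 (the minus signs appear to have been lost in typesetting):

```python
def transform(x1, x2):
    """u = [phi_1(x), phi_2(x)] with phi_1 = x2^2 - 2*x1 + 3 and phi_2 = x1^2 - 2*x2 - 3."""
    return (x2 ** 2 - 2 * x1 + 3, x1 ** 2 - 2 * x2 - 3)

# (x, z) pairs, with the reconstructed minus signs
points = [((1, 0), 1), ((0, 1), 1), ((0, -1), 1), ((-1, 0), 1),
          ((0, 2), -1), ((0, -2), -1), ((-2, 0), -1)]

for (x1, x2), z in points:
    print(transform(x1, x2), z)
```

Under this reconstruction the z = 1 points map to u_1 ∈ {1, 4, 5} while every z = −1 point maps to u_1 = 7, so the two classes become linearly separable in the u plane and the margin can be read off the plot.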
4. Consider the following training data set: (25 pts)
x_1 = [0, 0]^T, z_1 = −1
x_2 = [1, 0]^T, z_2 = 1
x_3 = [0, 1]^T, z_3 = 1
x_4 = [−1, 0]^T, z_4 = 1
Note that in the following, you need to use equations that describe w and give rise to the dual optimization problem.

(a) Write down the dual optimization problem for training a Support Vector Machine with this data set using the polynomial kernel function
K(x_i, x_j) = (x_i^T x_j + 1)^2

(b) Solve the optimization problem and find the optimal α_i's using results about quadratic forms, and check the results with Wolfram Alpha or any software package.
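For checking the algebra, the Gram matrix K_ij = (x_i^T x_j + 1)^2 and the matrix Q with Q_ij = z_i z_j K_ij that defines the quadratic form in the dual can be built mechanically. A sketch (it assumes the labels z = [−1, 1, 1, 1], i.e., the origin is the negative-class point; verify against your reading of the data):

```python
import numpy as np

X = np.array([[0, 0], [1, 0], [0, 1], [-1, 0]], dtype=float)
z = np.array([-1, 1, 1, 1], dtype=float)  # assumed labels; check against the problem

# Gram matrix of the polynomial kernel K(x_i, x_j) = (x_i . x_j + 1)^2
K = (X @ X.T + 1.0) ** 2

# Dual: maximize sum(a) - 0.5 * a^T Q a  subject to  a >= 0 and z . a = 0
Q = np.outer(z, z) * K
```

Maximizing the dual objective by hand over this 4 × 4 quadratic form (or feeding it to any QP solver) gives the optimal α_i's to be checked.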

(c) Show that the equation of the decision boundary in a kernel SVM, w^T u + w_0 = 0, can be represented as g(x) = Σ_{i=1}^{N} α_i z_i K(x_i, x) + w_0.
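The key step is the stationarity condition of the Lagrangian with respect to w, which expresses w as an expansion over the training points and turns every inner product into a kernel evaluation:

```latex
\frac{\partial L}{\partial \mathbf{w}} = 0
  \;\Rightarrow\;
  \mathbf{w} = \sum_{i=1}^{N} \alpha_i z_i \mathbf{u}_i,
\qquad\text{so}\qquad
g(\mathbf{x}) = \mathbf{w}^T \mathbf{u} + w_0
  = \sum_{i=1}^{N} \alpha_i z_i \, \mathbf{u}_i^T \mathbf{u} + w_0
  = \sum_{i=1}^{N} \alpha_i z_i \, K(\mathbf{x}_i, \mathbf{x}) + w_0 .
```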

(d) We learned that for vectors that do not violate the margin^1 (i.e., z_j(w^T u_j + w_0) − 1 > 0), the Lagrange multiplier is zero, i.e., α_j = 0. On the other hand, for vectors on the margin (z_j(w^T u_j + w_0) − 1 = 0), α_j ≠ 0. Show that, consequently, one can find a vector x_j for which α_j ≠ 0 and calculate w_0 as w_0 = 1/z_j − Σ_{i=1}^{N} α_i z_i K(x_i, x_j).

^1 For simplicity, consider kernel SVM with hard margins, i.e., no slack variables.


(e) Sketch the decision boundary for this data set based on parts (4c) and (4d).


5. In the following figure, there are different SVMs with different decision boundaries. The training data are labeled z_i ∈ {−1, 1}, represented as circles and squares respectively. Support vectors are drawn as solid circles. Determine which of the scenarios described below matches one of the 6 plots (note that one of the plots does not match any scenario). Each scenario should be matched to a unique plot. Explain your reasoning for matching each figure to each scenario. (10 pts)

(a) A soft-margin linear SVM with C = 0.02

(b) A soft-margin linear SVM with C = 20

(c) A hard-margin kernel SVM with K(x_i, x_j) = x_i^T x_j + (x_i^T x_j)^2

(d) A hard-margin kernel SVM with K(x_i, x_j) = exp(−5‖x_i − x_j‖^2)
(e) A hard-margin kernel SVM with K(x_i, x_j) = exp(−(1/5)‖x_i − x_j‖^2)

6. Programming Part: Multi-class and Multi-Label Classification Using Support Vector Machines


(a) Download the Anuran Calls (MFCCs) Data Set from https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs%29. Choose 70% of the data randomly as the training set.



(b) Each instance has three labels: Families, Genus, and Species. Each of the labels has multiple classes. We wish to solve a multi-class and multi-label problem. One of the most important approaches to multi-label classification is to train a classifier for each label. We first try this approach:


i. Research exact match and Hamming score/loss methods for evaluating multi-label classification and use them in evaluating the classifiers in this problem.
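A minimal sketch of the two metrics, under one common convention: exact match counts a sample as correct only if all of its labels are predicted correctly, Hamming loss averages errors over individual label slots, and Hamming score is taken as 1 minus Hamming loss (other definitions exist, so state the one you use):

```python
import numpy as np

def exact_match_score(Y_true, Y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))

def hamming_loss(Y_true, Y_pred):
    """Fraction of individual (sample, label) entries predicted incorrectly."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return float(np.mean(Y_true != Y_pred))

def hamming_score(Y_true, Y_pred):
    return 1.0 - hamming_loss(Y_true, Y_pred)
```

Here each row of Y is one instance and each column is one of the three labels (Families, Genus, Species), encoded as class indices.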

ii. Train an SVM for each of the labels, using Gaussian kernels and one-versus-all classifiers. Determine the weight of the SVM penalty and the width of the Gaussian kernel using 10-fold cross-validation.^2 You are welcome to try to solve the problem with both normalized^3 and raw attributes and report the results. (15 pts)

iii. Repeat 6(b)ii with L_1-penalized SVMs.^4 Remember to normalize the attributes. (10 pts)

iv. Repeat 6(b)iii by using SMOTE or any other method you know to remedy class imbalance. Report your conclusions about the classifiers you trained. (10 pts)
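The `imbalanced-learn` package provides SMOTE directly; the core idea can also be sketched in a few lines of NumPy (a toy illustration under stated assumptions, not a replacement for the library): each synthetic minority sample is drawn uniformly on the segment between a minority point and one of its k nearest minority neighbors.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Toy SMOTE: interpolate between minority points and their k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                      # position along the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Oversampling is applied only to the training folds (never to the validation fold), for each rare class in each label, before refitting the SVMs of 6(b)iii.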

Extra Practice: Study the Classifier Chain method and apply it to the above problem.

Extra Practice: Research how confusion matrices, precision, recall, ROC, and AUC are defined for multi-label classification, and compute them for the classifiers you trained above.
^2 How to choose parameter ranges for SVMs? One can use wide ranges for the parameters and a fine grid (e.g. 1000 points) for cross-validation; however, this method may be computationally expensive. An alternative way is to train the SVM with very large and very small parameters on the whole training data and find very large and very small parameters for which the training accuracy is not below a threshold (e.g., 70%). Then one can select a fixed number of parameters (e.g., 20) between those points for cross-validation. For the penalty parameter C, usually one has to consider increments in log(C). For example, if one found that the accuracy of a support vector machine will not be below 70% for C = 10^{−3} and C = 10^{6}, one has to choose log(C) ∈ {−3, −2, ..., 4, 5, 6}. For the Gaussian kernel parameter σ, one usually chooses linear increments, e.g. σ ∈ {0.1, 0.2, ..., 2}. When both C and σ are to be chosen using cross-validation, combinations of very small and very large C's and σ's that keep the accuracy above a threshold (e.g. 70%) can be used to determine the ranges for C and σ. Please note that these are very rough rules of thumb, not general procedures.
^{3}It seems that this dataset is already normalized!
^4 The convention is to use the L_1 penalty with a linear kernel.