Description

In this problem, you will code up a linear regressor based on the MSE criterion function. You will also investigate different learning rate parameter schedules and their effect on convergence in iterative gradient descent (GD) optimization.
Coding yourself. As in previous homework assignments, in this problem you are required to write the code yourself; you may use only Python built-in functions, NumPy, and matplotlib; you may use the “random” module because it is part of the Python standard library; and you may use pandas only for reading and/or writing csv files.
Dataset. Throughout this problem, you will use a real dataset based on the “Combined Cycle Power Plant Data Set” [https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant] from the UCI Machine Learning Repository.
We have subsampled the data (to save on computation time) and preprocessed it, so be sure to use the dataset files h5w7_pr1_power_train.csv and h5w7_pr1_power_test.csv provided with this homework assignment. There are a total of 500 data points, with a split of 75% of points in the training set, and 25% of points in the test set.
The dataset has 4 real-valued features: temperature (T), ambient pressure (AP), relative humidity (RH), and exhaust vacuum (V). The output value y is the net hourly electrical energy output (EP). Note that the 5th column in the dataset files is the output value. We have normalized the input features already, so that each input feature value is in the range 0 \le x_{ij} \le 1.



This part does not require any coding. Do the convergence criteria given in Lecture 10, page 12 for MSE classification, also apply to MSE regression? Justify your answer.


Hint: compare the weight update formulas.
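As a hedged aid for the comparison the hint asks for (standard LMS algebra, not copied from the lecture slides), the sequential MSE weight update has the same algebraic form in both settings:

```latex
w^{(i+1)} = w^{(i)} + \eta(i)\,\bigl(y_n - w^{(i)\top} x_n\bigr)\, x_n
```

In MSE classification, $y_n$ is a target label; in MSE regression, $y_n$ is a real-valued output. The update rule itself is unchanged, which is the observation the hint points you toward.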







Answer the parts below as instructed, regardless of your answer to (a).
Please code the regressor as follows:

Code a linear MSE regressor that uses iterative GD for optimization (LMS algorithm); use basic sequential GD. The schedule to use for η(i) will be given below. Hints on how to perform the random shuffle are given in Homework 4.

For the initial weight vector, let each w_j be a random number drawn independent and identically distributed (i.i.d.) according to the uniform density function over [−0.1, +0.1]. Hint: use numpy.random.uniform().

Before the first epoch (at i = 0; call it epoch m = 0), and at the end of each epoch, compute and store the RMS error for epoch m as:

E_{RMS}^{(m)} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\hat{y}(x_n) - y_n\right)^2}


For the halting condition, halt when either one of these 2 conditions is met, and have the code output which halting condition was reached:

iv.1 E_{RMS}^{(m)} < 0.001\, E_{RMS}^{(0)}
iv.2 100 epochs have been completed.


When the iterations halt, store the final value of w as ŵ for each value of (A, B) (as described below), in a table.

You will also need a function that gives the output prediction ŷ(x) for any given input x, given the optimal ŵ from the learning algorithm.
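The spec above can be sketched as follows; all names here (lms_regressor, predict, the eta callable) are illustrative choices, not prescribed by the assignment:

```python
import numpy as np

def lms_regressor(X, y, eta, max_epochs=100, tol=1e-3, seed=0):
    """Sequential-GD linear MSE regressor (LMS). X: (N, d) normalized
    features, y: (N,) targets. eta(i) gives the learning rate at
    iteration i. Returns (w_hat, rms_history, halt_reason)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Xa = np.hstack([np.ones((N, 1)), X])          # augment with bias term
    w = rng.uniform(-0.1, 0.1, size=d + 1)        # i.i.d. uniform init
    rms = [np.sqrt(np.mean((Xa @ w - y) ** 2))]   # epoch-0 RMS error
    i = 0                                         # iteration index
    halt = "100 epochs completed"
    for epoch in range(1, max_epochs + 1):
        for n in rng.permutation(N):              # random shuffle each epoch
            w += eta(i) * (y[n] - Xa[n] @ w) * Xa[n]
            i += 1
        rms.append(np.sqrt(np.mean((Xa @ w - y) ** 2)))
        if rms[-1] < tol * rms[0]:                # condition iv.1
            halt = "error threshold reached"
            break
    return w, rms, halt

def predict(w, X):
    """Output prediction y_hat(x) = w0 + w . x for each row of X."""
    return w[0] + X @ w[1:]
```

This is only a sketch of one reasonable structure; the learning-rate schedule eta is passed in as a function so the grid search in the next part can reuse the same training loop.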


In this part you try a learning rate parameter of the form:

η(i) = A / (B + i)

in which i is the iteration index (increases by 1 with each successive data point). Try a grid search over the following values: A = 0.01, 0.1, 1, 10, 100, and B = 1, 10, 100, 1000 (i.e., use two nested loops to try all combinations of A and B). For each pair (A, B), plot a learning curve as E_{RMS}^{(m)} vs. m.
Tip: do 5 plots total, one plot for each value of A. Have each plot show 4 curves (one curve for each value of B).
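The grid search and plotting tip can be sketched as below; the train helper is a hypothetical stand-in for your own training routine (it should run the LMS regressor with the given eta and return the per-epoch RMS-error history):

```python
import numpy as np
import matplotlib.pyplot as plt

A_values = [0.01, 0.1, 1, 10, 100]
B_values = [1, 10, 100, 1000]

def run_grid(train):
    """Try every (A, B) pair; one figure per A, one curve per B.
    train(eta) is assumed to return a list of per-epoch RMS errors."""
    histories = {}
    for A in A_values:                            # 5 plots total
        fig, ax = plt.subplots()
        for B in B_values:                        # 4 curves per plot
            eta = lambda i, A=A, B=B: A / (B + i)  # eta(i) = A/(B+i)
            rms_history = train(eta)
            histories[(A, B)] = rms_history
            ax.plot(rms_history, label=f"B={B}")
        ax.set_xlabel("epoch m")
        ax.set_ylabel("RMS error")
        ax.set_title(f"A={A}")
        ax.legend()
    return histories
```

Note the A=A, B=B default arguments in the lambda: without them, all four curves in a figure would share the loop's final B value.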

Comment on the dependence of the learning curves on A and B.

Pick the resulting best pair (A, B) from (b) above (based on final E_{RMS}^{(m)} for each pair), and use its value of ŵ to calculate the E_{RMS} error on the test set.

As a comparison, consider a trivial regressor that always outputs ŷ(x) = ȳ = {mean value of the outputs over the training data}. Calculate the E_{RMS} of this trivial regressor on the test set. Is your regressor’s error from (d) substantially lower than the error of this trivial regressor?
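The trivial baseline is a one-liner; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def trivial_rms(y_train, y_test):
    """RMS test error of the constant regressor that always predicts
    the mean of the training outputs."""
    y_bar = np.mean(y_train)
    return np.sqrt(np.mean((y_test - y_bar) ** 2))
```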
In lecture we derived the solution weight vector ŵ^{(+)} for ridge regression, in which the entire augmented weight vector is regularized. Often the nonaugmented weight vector is regularized, and the bias term w_0 is used to help minimize the MSE but is not regularized. For this case, the criterion function becomes:
J(w) = \left\| X^{(+)} w^{(+)} - y \right\|_2^2 + \lambda \left\| w^{(0)} \right\|_2^2

in which w^{(0)} denotes the nonaugmented weight vector (w^{(+)} without the bias term w_0).
Derive the optimal weight vector solution ŵ^{(+)} for this case.
Hint: use augmented notation throughout, and make use of the diagonal matrix
I′ = diag{0, 1, 1, …, 1}.
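As a hedged starting point (standard ridge-regression algebra under the hint's notation, not the official solution), note that \| w^{(0)} \|_2^2 = w^{(+)\top} I' w^{(+)}, so setting the gradient of the criterion to zero gives:

```latex
\nabla_{w^{(+)}} J
  = 2\, X^{(+)\top}\!\left(X^{(+)} w^{(+)} - y\right) + 2\lambda\, I' w^{(+)} = 0
\;\Longrightarrow\;
\hat{w}^{(+)} = \left(X^{(+)\top} X^{(+)} + \lambda I'\right)^{-1} X^{(+)\top} y
```

You should still verify each step yourself; the zero in the top-left entry of I′ is what exempts w_0 from the penalty.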

In this problem, you will implement a nonlinear mapping from a 2D feature space to an expanded feature space, perform learning in the expanded space, then map the resulting decision regions into a more optimal 2D space, and separately into the original 2D feature space. Most of this you will do on computer; you will have help on the coding in the form of given functions and tips. You may use non-built-in functions from libraries like sklearn only where specifically stated as allowed or where you are given sklearn classes or functions to use. If you are not yet familiar with OOP in Python, this problem gives you a good opportunity to try it.
The file h5w7_pr3_helper_functions.py contains functions that will help you complete the problem and the file h5w7_pr3_data.csv contains the synthetic dataset. Each data point is represented by two features, x_1 and x_2, and it may belong to class 1 or class 2. You will use all the data points in every one of the following parts.


Plot the data points in (nonaugmented) feature space. Use different colors or different markers to indicate the class to which each point belongs. Without running any learning algorithm, state whether you think this data set is linearly separable in this feature space.



Use sklearn’s implementation of the perceptron to train a perceptron on this dataset. Obtain and report the classification accuracy.

Hints:



Sklearn’s perceptron documentation at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html contains an example of how to use this class.

You can either use the score method, which directly returns the mean accuracy, or the predict method, which returns the predicted labels for each sample.





This dataset is already centered, that is, the mean of each feature is zero.





Therefore, you should set the parameter fit_intercept, used for finding ŵ_0, to False. Use the default value for all other parameters.
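Putting the hints together, training and scoring might look like the sketch below (the function and variable names are illustrative, not required by the assignment):

```python
from sklearn.linear_model import Perceptron

def train_perceptron(X, y):
    """Fit sklearn's Perceptron on (X, y) with fit_intercept=False,
    since the data is already centered, and return the trained
    classifier and its mean classification accuracy."""
    clf = Perceptron(fit_intercept=False)  # all other params at defaults
    clf.fit(X, y)
    return clf, clf.score(X, y)
```

Here X would be the (N, 2) feature matrix and y the class labels read from h5w7_pr3_data.csv.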



Plot the learned decision boundaries from the perceptron in item (b).
We provide two functions to help with this task, but you can also code your own.




The function linear_decision_function implements the perceptron linear decision function. You do not need to call it in your code. It is called automatically by the plotting function.

The function plot_perceptron_boundaries takes as parameters, in addition to the training data, the learned weights and a decision function. Here’s an example of how to call it with linear_decision_function:

weights = classifier.coef_[0]  # get the perceptron weights
plot_perceptron_boundaries(X_train, y_train, weights, linear_decision_function)
in which classifier is a trained perceptron object.
Also note that the provided function returns a figure object, which can be used to programmatically save the plot.

Let’s try to use a quadratic feature space expansion to improve the result above, i.e.,

(x_1, x_2) → (x_1, x_2, x_1 x_2, x_1^2, x_2^2)
Code a function to apply the feature expansion procedure to the entire dataset. Then train a perceptron on the expanded dataset and report its classification accuracy. Is the dataset linearly separable in the expanded feature space?
Hint: remember that you do not need to add the bias term.
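The expansion function can be sketched as below (the name quadratic_expand is illustrative); per the hint, no bias column is added:

```python
import numpy as np

def quadratic_expand(X):
    """Map each row (x1, x2) of X to (x1, x2, x1*x2, x1**2, x2**2).
    No bias column is appended, since fit_intercept=False is used."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])
```

The expanded matrix can then be passed to the same perceptron training code used in part (b).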

[Extra credit] We want to plot the decision boundary and regions as a function of the 2 most relevant features of the expanded feature space. For this, follow these steps:


Provide the learned weight vector from item (d)



Which two features received the highest weights in absolute value? Note: your answer must be two elements from (x_1, x_2, x_1 x_2, x_1^2, x_2^2). Write the decision function only in terms of these two features and their weights.



Create a new feature matrix that contains only the two most relevant features found above. Create a new weight vector that contains only the two highest weights in absolute value.



Call the plotting function with the new feature matrix and new weight vector. Example:

plot_perceptron_boundaries(phi_X_best_2, y_train, weights_best_2, linear_decision_function)
Is the dataset linearly separable in this feature space?
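Steps ii and iii above can be sketched together as follows (function name and return convention are illustrative, assuming a weight vector over the 5 expanded features):

```python
import numpy as np

def top_two_features(phi_X, weights):
    """Keep the two columns of the expanded feature matrix whose
    learned weights are largest in absolute value, along with the
    matching two-element weight vector."""
    idx = np.argsort(np.abs(weights))[-2:]   # indices of the top-2 |w|
    return phi_X[:, idx], weights[idx]
```

The reduced matrix and weights can then be passed to plot_perceptron_boundaries as in the example call above.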

[Extra credit] Next, we want to plot the decision boundary and regions in the original feature space. To do this, you need to code the decision function in the original feature space. The skeleton of the nonlinear_decision_function is already provided. You can call the plotting function as:

plot_perceptron_boundaries(X_train, y_train, relevant_weights, nonlinear_decision_function)