Description

AML Problem 1.7a,b (p. 36), plus the following: For part (b), add: also plot for N = 60.
Hint: in this problem, tossing one coin N times corresponds to one draw of a dataset of size N. Doing this with 1000 coins independently corresponds to drawing 1000 different datasets, each of size N.

Suppose you have trained a model in a binary classification problem, and you want to estimate its error based on a test set.
You want to estimate this error empirically for the case where labeled data is expensive. So you decide to first do some simulations to see how your test-set error might vary with the “luck of the draw” of the test set.
Let the true probability of error of your model be E_{out}(h) = µ. Because µ is unknown to you, for purposes of simulation you will try different values of µ. Method: Conceptually, a test set D^{(N)} is created by drawing N data points randomly from the input space, with replacement. An expert could correctly label each data point, and then you could color each data point as “correct” or “incorrect” depending on the ML classifier’s prediction.
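Under this setup, each test point is “incorrect” independently with probability µ, so one draw of a test set amounts to N Bernoulli(µ) trials. A minimal sketch (function name and seed are illustrative choices, not part of the assignment):

```python
import numpy as np

def draw_test_error(mu, n, rng):
    """One draw of a size-n test set: each point is 'incorrect'
    independently with probability mu; return the error rate E_D(h)."""
    incorrect = rng.random(n) < mu
    return incorrect.mean()

rng = np.random.default_rng(0)
e10 = draw_test_error(0.3, 10, rng)   # one dataset of size N = 10
```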

Repeat the experiment of part (a)(i) 100 times. Keep a record of the error rate E_{D^{(10)}}(h) from each run (no need to turn in these 100 values). Give the following statistics and answer these questions:
(i) max{E_{D^{(10)}}(h)}, min{E_{D^{(10)}}(h)}, sample mean{E_{D^{(10)}}(h)}, sample standard deviation{E_{D^{(10)}}(h)}.
(ii) How many of your 100 runs had an error rate different than µ? Does this agree with the value of P[E_{D^{(10)}}(h) = µ] from item (a)(ii)?
(iii) From your results of the 100 runs, give an estimate for the probability
P[ |E_{D^{(N)}}(h) − µ| < 0.05 ].
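One way to generate the 100 runs and the requested statistics in a few lines (the variable names and the values µ = 0.3, N = 10 are illustrative; substitute your part-(a) parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, n, runs = 0.3, 10, 100

# each row is one independently drawn test set of size n
err = (rng.random((runs, n)) < mu).mean(axis=1)   # E_D(h) for each run

e_max, e_min = err.max(), err.min()
e_mean = err.mean()
e_std = err.std(ddof=1)                       # sample standard deviation
runs_neq_mu = int(np.sum(err != mu))          # runs with error rate != mu
p_close = np.mean(np.abs(err - mu) < 0.05)    # estimate of P[|E_D(h) - mu| < 0.05]
```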
(c) Repeat parts (a)(ii) and (b) for the following parameters:
(i) µ = 0.10, N = 10
    µ = 0.10, N = 100
(ii) µ = 0.30, N = 100
(iii) µ = 0.50, N = 10
      µ = 0.50, N = 100
Tip: present your results in a table (including values given in (a)) for easy interpretation. A sample table is as below.



 µ    N    max{E_D(h)}  min{E_D(h)}  sample mean  sample std  # runs E_D(h) ≠ µ  P(|E_D(h) − µ| < 0.05)
 0.1   10
 0.1  100
 0.3   10
 0.3  100
 0.5   10
 0.5  100



Comment on your results: how does the variability of E_{D^{(N)}}(h) vary with N? Hint: look at min, max, std, and P[ |E_{D^{(N)}}(h) − µ| < 0.05 ].



For the classifier of (c)(iii):

(i) Based on the given true error rate (P(incorrect)), did the classifier learn anything?

(ii) How many test datasets of your draws in (c)(iii) gave an error rate indicating that the classifier did learn something, assuming that E_{D^{(N)}}(h) ≤ 0.45 or E_{D^{(N)}}(h) ≥ 0.55 means it learned something?




Consider a hypothesis set ℋ consisting of all positive-interval hypotheses and all negative-interval hypotheses. (See AML pp. 43-44, Example 2.2, positive intervals, for a definition of positive-interval hypotheses.) Negative-interval hypotheses return h(x) = −1 within the interval and h(x) = +1 outside the interval. Thus, any hypothesis h in ℋ contains one interval, and that interval can be positive or negative.

Hint: before answering the questions below, you may find it helpful to read or review AML Example 2.2, cases 1 and 2, or Discussion 5 notes.


Find the growth function m_ℋ(N) as a function of N. Show your work or reasoning.

Tip: you may use the AML book’s result for negative intervals, and build on top of that.
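A brute-force check can be helpful here: enumerate every interval placement on N ordered points and count the distinct dichotomies, then compare against your closed-form m_ℋ(N). A sketch (the enumeration scheme is one possible choice, not the book's):

```python
from itertools import combinations

def count_dichotomies(n):
    """Count distinct dichotomies on n ordered points realizable by the
    union of positive-interval and negative-interval hypotheses."""
    seen = set()
    for sign in (+1, -1):          # +1: positive interval, -1: negative interval
        # interval endpoints fall in the n + 1 gaps around the points
        for lo, hi in combinations(range(n + 1), 2):
            seen.add(tuple(sign if lo <= i < hi else -sign for i in range(n)))
        seen.add((-sign,) * n)     # degenerate (empty) interval
    return len(seen)

growth = [count_dichotomies(n) for n in range(1, 7)]
```

Comparing `growth` against 2^N for each N also reveals the smallest break point empirically, which is useful for parts (b) and (c).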


Give the smallest break point k of ℋ. Briefly justify your answer.

Give the VC dimension d_VC of ℋ. Briefly justify your answer.


Consider the WiFi localization dataset used in Homework 2: it had N = 2000 data points and 7 features, split into a training set of 1500 data points and a test set of 500 data points. Suppose you intend to use a linear perceptron classifier on that data instead of random forest. In the parts below, for the tolerance δ in the VC generalization bound, use 0.1 (for a certainty of 0.9). The parts below have short answers.
Hint: You may use without proof the relation that if H is a linear perceptron classifier in D dimensions (D features), d_{VC} (H ) = D + 1 .

What is the VC dimension of the hypothesis set?

Expressing the upper bound on the out-of-sample error as

E_out(h_g) ≤ E_in(h_g) + ε_vc,

for E_in(h_g) measured on the training data, use d_vc from part (a) to get a value for ε_vc.
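As a sanity check on your numbers, the AML VC bound ε_vc = sqrt((8/N) ln(4 m_H(2N)/δ)) can be evaluated with the polynomial growth-function bound m_H(2N) ≤ (2N)^{d_vc} + 1. A sketch (function name is illustrative):

```python
import math

def eps_vc(n, d_vc, delta):
    """VC generalization bound: eps = sqrt((8/n) * ln(4 * m_H(2n) / delta)),
    using the bound m_H(2n) <= (2n)**d_vc + 1 on the growth function."""
    m = (2.0 * n) ** d_vc + 1.0
    return math.sqrt(8.0 / n * math.log(4.0 * m / delta))

eps_b = eps_vc(1500, 7 + 1, 0.1)   # N_tr = 1500, d_vc = D + 1 with D = 7 features
```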

To get a lower ε_vc, suppose you reduce the number of features to D = 4. Now what is ε_vc?

Suppose that you had control over the number of training samples N_Tr (by collecting more WiFi data). How many training samples would ensure a generalization error of ε_vc = 0.1, again with probability 0.9 (the same tolerance δ = 0.1), and using the reduced feature set (4 features)?
Hint: if you’re not sure how to solve for N, see an example in AML Sec. 2.2.1 (“Sample Complexity”, p. 57).
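The hint's approach can be implemented as a simple fixed-point iteration: start from a rough guess and repeatedly substitute N into the right-hand side of the implicit bound. A sketch under the assumptions ε = 0.1, δ = 0.1, and d_vc = 4 + 1:

```python
import math

def sample_complexity(eps, d_vc, delta, n=1000.0, iters=100):
    """Iterate n <- (8/eps**2) * ln(4*((2n)**d_vc + 1)/delta) until it
    settles; this solves the implicit VC bound for n (AML Sec. 2.2.1)."""
    for _ in range(iters):
        n = 8.0 / eps**2 * math.log(4.0 * ((2.0 * n) ** d_vc + 1.0) / delta)
    return math.ceil(n)

n_tr = sample_complexity(eps=0.1, d_vc=5, delta=0.1)
```

The iteration converges quickly because the right-hand side grows only logarithmically in n.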

Instead suppose you use the test set to measure E_{in} (h_{g} ), so let’s call it E_{test} (h_{g} ) . What is the hypothesis set now? What is its cardinality?

Continuing from part (e), use the bound:

E_out(h_g) ≤ E_test(h_g) + ε

Use the original feature set and the original test set, so that N_test = 500. Use the version of the bound that gives the tightest ε. Give the appropriate expression for ε and calculate it numerically.
Does the number of features have any effect on this ε?

You are given a regression problem in 1D. The target function is

f(x) = σ(3x) = e^{3x} / (1 + e^{3x}),

and the feature space extends over −1 ≤ x ≤ +1. p(x) is a uniform distribution over the feature space.
Each draw of the dataset D consists of N = 2 points:

D = {(x_1, y_1), (x_2, y_2)} = {(x_1, σ(3x_1)), (x_2, σ(3x_2))}.

The hypothesis set consists of all lines h(x) = ax + b.
In this problem you will explore bias and var numerically.
You may use builtin python functions, including functions in NumPy to draw random numbers, such as numpy.random.uniform and numpy.random.normal.

(a) Give an algebraic expression for the best hypothesis h_D(x), in terms of x_1, x_2, y_1, y_2. This is the hypothesis that minimizes the MSE on D.
Hint: no need to take derivatives and do a formal minimization.
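Since N = 2 and the hypothesis is a line, the minimizing line simply interpolates the two points (zero MSE on D). A sketch of that closed form (function name is illustrative):

```python
def best_line(x1, y1, x2, y2):
    """Line h(x) = a*x + b through the two points of D; with N = 2 this
    interpolating line achieves zero MSE on D."""
    a = (y2 - y1) / (x2 - x1)   # x1 != x2 with probability 1 under uniform p(x)
    b = y1 - a * x1
    return a, b

a, b = best_line(-1.0, 0.2, 1.0, 0.8)
```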

(b) Find the mean best hypothesis h̄_D(x) numerically. Tip: average over many draws of D. Give the resulting a and b. Draw a plot containing curves of f(x), h̄_D(x), and several h_D^{(i)}(x) over the range −1 ≤ x ≤ +1. You may choose the number of h_D^{(i)} curves for best visibility.

(c) Numerically compute bias and var. Also numerically compute E_D{ E_out[h_D] }.
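The Monte Carlo computation can be organized as below (draw counts and the seed are arbitrary choices; since there is no noise here, bias + var should match the E_out estimate up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return np.exp(3.0 * x) / (1.0 + np.exp(3.0 * x))   # target f(x) = sigma(3x)

n_draws = 5000
x = rng.uniform(-1.0, 1.0, size=(n_draws, 2))   # N = 2 points per dataset D
y = f(x)

# best line through the two points of each dataset
a = (y[:, 1] - y[:, 0]) / (x[:, 1] - x[:, 0])
b = y[:, 0] - a * x[:, 0]

xt = rng.uniform(-1.0, 1.0, size=2000)          # test points drawn from p(x)
preds = np.outer(a, xt) + b[:, None]            # h_D(xt) for every draw

h_bar = preds.mean(axis=0)                      # mean hypothesis at each test point
bias = np.mean((h_bar - f(xt)) ** 2)
var = np.mean((preds - h_bar) ** 2)
e_out = np.mean((preds - f(xt)) ** 2)           # estimate of E_D{E_out[h_D]}
```

The same code extends to parts (d) and (e) by adding the Gaussian term to y before fitting.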

(d) Now let there be sensor noise on the data, so that

y(x) = f(x) + ϵ_n,   ϵ_n ∼ N( ϵ_n | µ_n = 0, σ_n^2 = 0.004 ).

Repeat parts (a)-(c) for this noisy data. Tip: if there is no change in (a), just state so.
Is the var of the learned system very sensitive to σ_n? Conjecture a reason why or why not.
(e) Now let there also be some sensor bias on the data, so that

y(x) = f(x) + ϵ_n,   ϵ_n ∼ N( ϵ_n | µ_n = 0.1, σ_n^2 = 0.004 ).

Repeat parts (a)-(c) for this biased and noisy data. Tip: if there is no change in (a), just state so.
What is the effect of the sensor bias on the bias and var of the learned system (i.e., does it increase, decrease, or have no effect, on bias and on var)?