Description
Computational Learning Theory: boosting
a. Prove that e^{-x} ≥ 1 - x.
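A possible line of attack (a hint, not part of the original hand-out, and assuming the intended inequality is e^{-x} ≥ 1 - x for all real x): study the difference of the two sides.

```latex
f(x) = e^{-x} - (1 - x), \qquad f'(x) = 1 - e^{-x}.
% f'(x) < 0 for x < 0 and f'(x) > 0 for x > 0, so f attains its
% global minimum at x = 0, where f(0) = 0.  Hence f(x) >= 0, i.e.
% e^{-x} >= 1 - x for all x, with equality only at x = 0.
```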
Before you answer the following questions, you need to read the paper 'A decision-theoretic generalization of on-line learning and an application to boosting' by Y. Freund and R.E. Schapire, 1995 (also available from Blackboard, under Course Documents, Reading Material, Computational Learning Theory). You do not have to focus too much on sections 1, 2 and 3, but section 4 and 4.1 are important.
b. Implement a 'weak learner': the decision stump. The decision stump is a very simple classifier that chooses one feature f from a dataset and checks whether the feature value x_{f} is larger (or smaller) than a certain threshold θ. If x_{f} > θ (or x_{f} < θ) then the object is assigned to class ω_{1}, and otherwise to ω_{2}.
To find the optimal feature f and threshold θ, you have to do an exhaustive search on the training set. So, try all values of f and θ and both signs y (< or >), and remember those values f*, θ*, y* for which the classification error on the training set is minimal.
Make sure that the function accepts a dataset with labels as input, and that the function outputs the optimal f*, θ*, and y*.
Show your (Matlab) code.
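The hand-out asks for Matlab code; for readers not using Matlab/Prtools, the same exhaustive search can be sketched in Python/NumPy as follows (the function name and interface are my own choices, not prescribed by the assignment):

```python
import numpy as np

def train_stump(X, y):
    """Exhaustive search for the decision stump minimising the training error.

    X : (N, D) array of feature values; y : labels in {-1, +1}.
    Returns (f, theta, sign) such that an object is assigned to class +1
    when sign * (x[f] - theta) > 0, together with the training error.
    """
    N, D = X.shape
    best, best_err = (0, 0.0, 1), np.inf
    for f in range(D):
        vals = np.unique(X[:, f])                      # sorted unique values
        # candidate thresholds: midpoints, plus one below and one above the range
        thetas = np.concatenate(([vals[0] - 1.0],
                                 (vals[:-1] + vals[1:]) / 2.0,
                                 [vals[-1] + 1.0]))
        for theta in thetas:
            for sign in (+1, -1):                      # the '>' and '<' cases
                pred = np.where(sign * (X[:, f] - theta) > 0, 1, -1)
                err = np.mean(pred != y)
                if err < best_err:
                    best_err, best = err, (f, theta, sign)
    return best, best_err
```

Only midpoints between consecutive sorted feature values need to be tried, because the training error is constant for all thresholds between two consecutive values.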

c. Test the implementation on the dataset generated by gendats from Prtools. (If you don't want to use Prtools, just generate the two classes from two Gaussian distributions, where the means are μ_{1} = [0, 0]^{T} and μ_{2} = [2, 0]^{T}, and the covariance matrices are identity matrices.) Make a scatterplot of the data, and give the optimal parameters obtained by your decision stump.
Does the decision stump change if you rescale one of the features (for instance, when you multiply feature 2 by a factor of 10)?
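For those not using Prtools, the Gaussian alternative described above can be sketched like this (the random seed and the choice of 50 objects per class are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                    # objects per class (assumed)
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
X1 = rng.standard_normal((n, 2)) + mu1    # class omega_1, identity covariance
X2 = rng.standard_normal((n, 2)) + mu2    # class omega_2, identity covariance
X = np.vstack([X1, X2])
y = np.concatenate([-np.ones(n), np.ones(n)])
```

A scatterplot can then be made with, for example, matplotlib's `scatter`; rescaling feature 2 amounts to multiplying the second column of `X` by 10 before training.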

d. Test the implementation on the dataset optdigitsubset of last week (available from Blackboard). It consists of the pixel values of small 8×8 images, ordered in N = 1125 rows of 64 columns. The first 554 rows contain the pixel values of 8×8 images of zeros, while the remaining 571 rows contain the 64 pixel values of 571 ones. Use the first 50 objects of each class for training, and the rest for testing. What is the classification error on the test objects? How much does this vary when you take other random subsets of 50 for training? Show the mean and standard deviation of the error. Is the performance with training on the first 50 objects unexpectedly good or bad?
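The fixed train/test split described above can be sketched as follows, assuming the data has already been loaded into a (1125, 64) array (the loading step depends on the file format on Blackboard, so it is left out; the function name is my own):

```python
import numpy as np

def split_optdigit(data):
    """Fixed split: first 50 rows of each class for training, rest for testing.

    data : (1125, 64) array; rows 0..553 are zeros, rows 554..1124 are ones.
    """
    X0, X1 = data[:554], data[554:]
    X_train = np.vstack([X0[:50], X1[:50]])
    y_train = np.concatenate([np.zeros(50), np.ones(50)])
    X_test = np.vstack([X0[50:], X1[50:]])
    y_test = np.concatenate([np.zeros(len(X0) - 50), np.ones(len(X1) - 50)])
    return X_train, y_train, X_test, y_test
```

For the random-subset experiment, the fixed slices `[:50]` can be replaced by 50 indices drawn per class with a random generator's `choice`.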

e. Extend the implementation of the weak learner such that it accepts a weight per object. That is, next to a dataset and labels, the function should accept a weight w_{i} > 0 per object x_{i}. The weighted weak learner should now minimize the weighted classification error.
Test the code on a simple classification problem again, and convince yourself that the code is bug-free. (That means not only that 'it does not crash', but also show convincingly that objects with higher weight have a larger influence on the solution.)
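One possible weighted extension of the stump, again as a Python sketch with self-chosen names; the toy example at the bottom illustrates the requested sanity check: with uniform weights the best stump sacrifices the last object, but inflating that object's weight tenfold moves the decision so that it is classified correctly.

```python
import numpy as np

def train_weighted_stump(X, y, w):
    """Decision stump minimising the weighted error sum_i w_i [pred_i != y_i].

    X : (N, D) array; y : labels in {-1, +1}; w : positive weight per object.
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                                    # normalise the weights
    best, best_err = (0, 0.0, 1), np.inf
    for f in range(X.shape[1]):
        vals = np.unique(X[:, f])
        thetas = np.concatenate(([vals[0] - 1.0],
                                 (vals[:-1] + vals[1:]) / 2.0,
                                 [vals[-1] + 1.0]))
        for theta in thetas:
            for sign in (+1, -1):
                pred = np.where(sign * (X[:, f] - theta) > 0, 1, -1)
                err = w[pred != y].sum()               # weighted error
                if err < best_err:
                    best_err, best = err, (f, theta, sign)
    return best, best_err

# Sanity check: object x = 4 (label -1) is sacrificed under uniform weights,
# but respected once its weight dominates.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([-1, -1, 1, 1, -1])
(f, t, s), _ = train_weighted_stump(X, y, np.ones(5))
uniform_pred_last = 1 if s * (4.0 - t) > 0 else -1     # misclassified: +1
(f, t, s), _ = train_weighted_stump(X, y, np.array([1, 1, 1, 1, 10.0]))
weighted_pred_last = 1 if s * (4.0 - t) > 0 else -1    # now correct: -1
```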

f. Implement the algorithm that is described in Figure 2 of the paper: AdaBoost. Show the code.
Also implement the code to classify new and unseen data. Show the code.
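A sketch in the notation of Figure 2 of Freund and Schapire (labels y_{i} in {0, 1}, weak hypotheses h_{t} with weighted error ε_{t}, β_{t} = ε_{t}/(1 - ε_{t})). The weighted stump is repeated inside so the sketch is self-contained; all names, and the early-stopping choices for ε_{t} = 0 and ε_{t} ≥ 1/2, are my own.

```python
import numpy as np

def weighted_stump(X, y, p):
    """Weak learner: stump minimising sum_i p_i |h(x_i) - y_i|, with y in {0, 1}."""
    best, best_err = (0, 0.0, 1), np.inf
    for f in range(X.shape[1]):
        vals = np.unique(X[:, f])
        thetas = np.concatenate(([vals[0] - 1.0],
                                 (vals[:-1] + vals[1:]) / 2.0,
                                 [vals[-1] + 1.0]))
        for theta in thetas:
            for sign in (+1, -1):
                h = (sign * (X[:, f] - theta) > 0).astype(float)
                err = np.sum(p * np.abs(h - y))
                if err < best_err:
                    best_err, best = err, (f, theta, sign)
    return best

def stump_predict(params, X):
    f, theta, sign = params
    return (sign * (X[:, f] - theta) > 0).astype(float)

def adaboost_train(X, y, T=100):
    """AdaBoost as in Figure 2 of the paper; returns the stumps and their beta_t."""
    w = np.ones(len(y))                        # initial weight vector
    stumps, betas = [], []
    for t in range(T):
        p = w / w.sum()                        # 1. normalise to a distribution
        params = weighted_stump(X, y, p)       # 2. call the weak learner
        h = stump_predict(params, X)
        eps = np.sum(p * np.abs(h - y))        # 3. weighted error of h_t
        if eps == 0.0:                         # perfect stump: keep it, stop early
            stumps.append(params)
            betas.append(1e-10)                # tiny beta ~ very large vote weight
            break
        if eps >= 0.5:                         # weak learner no better than chance
            break
        beta = eps / (1.0 - eps)               # 4. beta_t
        stumps.append(params)
        betas.append(beta)
        w = w * beta ** (1.0 - np.abs(h - y))  # shrink weights of correct objects
    return stumps, betas

def adaboost_predict(stumps, betas, X):
    """Final hypothesis: output 1 iff the log(1/beta_t)-weighted vote is >= 1/2."""
    alphas = np.log(1.0 / np.array(betas))
    votes = sum(a * stump_predict(prm, X) for a, prm in zip(alphas, stumps))
    return (votes >= 0.5 * alphas.sum()).astype(float)
```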

g. Test your implementation on some datasets. Use not only a simple dataset like gendats, but also a more complicated dataset like gendatb. Find out which objects obtain a large weight w_{i}^{t}.^{1} Keep the number of iterations T fixed to, say, T = 100.
h. Test the implementation on the dataset optdigitsubset of last week. Use the first 50 objects of each class for training, and the rest for testing. What is the classification error on the test objects? How does this error depend on the number of iterations T? If you take the classifier with the optimal T, which training objects get a high weight? Show them!
^{1} And no, it is not sufficient to just say 'objects 13 and 7 have high weight'. Show me: do the simpler or the more difficult objects get high weight?