Problem 1 (10pt): [Independence and Uncorrelatedness]

(1) (5pt) Suppose X and Y are two continuous random variables. Show that if X and Y are independent, then they are uncorrelated.

(2) (5pt) Suppose X and Y are uncorrelated. Can we conclude that X and Y are independent? If so, prove it; otherwise, give a counterexample. (Hint: consider $X \sim \mathrm{Uniform}[-1, 1]$ and $Y = X^2$.)
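Not a proof, but the hinted counterexample can be checked numerically. The sketch below (assuming NumPy) estimates $\mathrm{Cov}(X, Y)$ for $X \sim \mathrm{Uniform}[-1, 1]$ and $Y = X^2$ by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of the hint: X ~ Uniform[-1, 1], Y = X^2.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x ** 2

# Sample covariance: close to 0, so X and Y are (nearly) uncorrelated...
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)

# ...yet Y is a deterministic function of X, so they are clearly dependent:
# e.g. knowing |X| <= 0.5 forces Y <= 0.25.
print(f"Cov(X, Y) ~ {cov_xy:.4f}")
```

The analytic version of the same computation ($E[X^3] = E[X] = 0$ for the symmetric uniform) is what the proof should formalize.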
Problem 2 (15pt): [Minimum Error Rate Decision] Let $\omega_{\max}(x)$ be the state of nature for which
\[
P(\omega_{\max} \mid x) \ge P(\omega_i \mid x) \quad \text{for all } i = 1, \ldots, c.
\]

(1) (5pt) Show that $P(\omega_{\max} \mid x) \ge \frac{1}{c}$.

(2) (5pt) Show that for the minimum-error-rate decision rule, the average probability of error is given by
\[
P(\text{error}) = 1 - \int P(\omega_{\max} \mid x)\, p(x)\, dx
\]
(3) (5pt) Show that $P(\text{error}) \le \frac{c-1}{c}$.
Problem 3 (10pt): [Likelihood Ratio] Consider a two-category classification problem in which the class conditionals are assumed to be Gaussian, i.e., $p(x \mid \omega_1) = N(4, 1)$ and $p(x \mid \omega_2) = N(8, 1)$. Based on prior knowledge, we have $P(\omega_2) = \frac{1}{4}$. We do not penalize correct classification, while for misclassification we assign a penalty of 1 unit for misclassifying $\omega_1$ as $\omega_2$ and 3 units for misclassifying $\omega_2$ as $\omega_1$. Derive the Bayesian decision rule using the likelihood ratio.
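As a reminder of the general form (standard Bayes decision theory, with $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$ denoting the loss for deciding $\omega_i$ when the true class is $\omega_j$), the two-category minimum-risk rule reduces to a likelihood-ratio test; the problem asks you to instantiate it with the given losses and prior:

```latex
\text{Decide } \omega_1 \text{ if } \quad
\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} >
\frac{(\lambda_{12} - \lambda_{22})\, P(\omega_2)}{(\lambda_{21} - \lambda_{11})\, P(\omega_1)},
\quad \text{otherwise decide } \omega_2.
```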
Problem 4 (15pt): [Minimum Risk, Reject Option] In many machine learning applications, one has the option either to assign the pattern to one of $c$ classes, or to reject it as being unrecognizable. If the cost of rejection is not too high, rejection may be a desirable action. Let
\[
\lambda(\alpha_i \mid \omega_j) =
\begin{cases}
0, & i = j \text{ and } i, j = 1, \ldots, c \\
\lambda_r, & i = c + 1 \\
\lambda_s, & \text{otherwise}
\end{cases}
\]
where $\lambda_r$ is the loss incurred for choosing the $(c+1)$th action, rejection, and $\lambda_s$ is the loss incurred for making any substitution error.

(1) (5pt) Derive the decision rule with minimum risk.

(2) (5pt) What happens if $\lambda_r = 0$?

(3) (5pt) What happens if $\lambda_r > \lambda_s$?
Problem 5 (25pt): [Maximum Likelihood Estimation (MLE)] A general representation of an exponential family is given by the following probability density:
\[
p(x \mid \theta) = h(x) \exp\{\theta^T T(x) - A(\theta)\}
\]
$\theta$ is the natural parameter. $h(x)$ is the base density, which ensures $x$ is in the right space. $T(x)$ is the sufficient statistic. $A(\theta)$ is the log normalizer, which is determined by $T(x)$ and $h(x)$. $\exp(\cdot)$ denotes the exponential function.


(1) (5pt) Write down the expression for $A(\theta)$ in terms of $T(x)$ and $h(x)$.



(2) (10pt) Show that $\frac{\partial}{\partial \theta} A(\theta) = E_\theta[T(x)]$, where $E_\theta(\cdot)$ is the expectation w.r.t. $p(x \mid \theta)$.



(3) (10pt) Suppose we have $n$ i.i.d. samples $x_1, x_2, \ldots, x_n$. Derive the maximum likelihood estimator for $\theta$. (You may use the result from part (2) to obtain your final answer.)
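As a concrete sanity check of this representation (an illustrative example, not part of the problem), the Bernoulli distribution with mean $\mu$ fits the template:

```latex
p(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}
= \exp\Big\{ x \log\tfrac{\mu}{1-\mu} + \log(1-\mu) \Big\},
\qquad x \in \{0, 1\},
```

so $h(x) = 1$, $T(x) = x$, $\theta = \log\frac{\mu}{1-\mu}$, and $A(\theta) = \log(1 + e^\theta)$; indeed $\frac{\partial A}{\partial \theta} = \frac{e^\theta}{1 + e^\theta} = \mu = E_\theta[T(x)]$, consistent with part (2).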

Problem 6 (25pt): [Logistic Regression, MLE] In this problem, you need to use MLE to derive and build a logistic regression classifier (suppose the target/response $y \in \{0, 1\}$):
(1) (5pt) Suppose the classifier is $y = x^T \theta$, where $\theta$ contains the weight as well as the bias parameters. The log-likelihood function is $LL(\theta)$; what is $\frac{\partial LL(\theta)}{\partial \theta}$?
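For concreteness, under the standard Bernoulli model assumed here, with $p(y = 1 \mid x) = \sigma(x^T \theta)$ and $\sigma(z) = 1/(1 + e^{-z})$, the log-likelihood over $n$ samples whose gradient is requested takes the form:

```latex
LL(\theta) = \sum_{i=1}^{n} \Big[ y_i \log \sigma(x_i^T \theta)
+ (1 - y_i) \log\big(1 - \sigma(x_i^T \theta)\big) \Big].
```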


(2) (20pt) Write code to build and train the classifier on the Iris plant dataset (https://archive.ics.uci.edu/ml/datasets/iris). The Iris dataset contains 150 samples with 4 features for 3 classes. To simplify the problem, we only consider: (a) two classes, i.e., virginica and non-virginica; (b) the first 2 features for training, i.e., sepal length and sepal width. Based on these simplified settings, train the model using gradient descent and show the classification results. (Notes: (1) you could split the Iris dataset into train/test sets; (2) you could visualize the results by showing the trained classifier overlaid on the train/test data; (3) you could tune several hyperparameters, e.g., learning rate, weight initialization method, etc., to see their effects. You may use sklearn or other packages to load and process the data, but you cannot use a package to train the model.)
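One possible starting-point sketch (not a full solution: no train/test split, no visualization, and plain batch gradient ascent on the log-likelihood; `sklearn.datasets.load_iris` is used only to load the data, as the problem permits):

```python
import numpy as np
from sklearn.datasets import load_iris

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Load the data: first 2 features (sepal length, sepal width),
# binary target: virginica (class 2) vs. non-virginica.
iris = load_iris()
X = iris.data[:, :2]
y = (iris.target == 2).astype(float)

# Standardize features so a fixed learning rate behaves well.
X = (X - X.mean(axis=0)) / X.std(axis=0)
# Append a constant column so theta holds weights and bias together.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Batch gradient ascent on the log-likelihood LL(theta).
theta = np.zeros(Xb.shape[1])
lr = 0.5
for _ in range(2000):
    p = sigmoid(Xb @ theta)                # p(y=1 | x) under current theta
    grad = Xb.T @ (y - p) / len(y)         # dLL/dtheta, averaged over samples
    theta += lr * grad

preds = sigmoid(Xb @ theta) >= 0.5
accuracy = (preds == (y == 1.0)).mean()
print(f"train accuracy: {accuracy:.2f}")
```

The learning rate, iteration count, and zero initialization here are arbitrary choices; the assignment asks you to experiment with them, add a train/test split, and visualize the resulting decision boundary.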