Description

Naive Bayes for Text Classification
In this question, you will implement and evaluate Naive Bayes for text classi – cation.
0 Points Download the spam/ham (ham is not spam) dataset available on myCourses. The data set is divided into two sets: training set and test set. The dataset was used in the Metsis et al. paper [1]. Each set has two directories: spam and ham. All les in the spam folders are spam messages and all les in the ham folder are legitimate (non spam) messages.
1
40 points Implement the multinomial Naive Bayes algorithm for text classi cation described here: http://nlp.stanford.edu/IRbook/pdf/13bayes.pdf (see Figure 13.2). Note that the algorithm uses addone laplace smoothing. Ignore punctuation and special characters and normalize words by converting them to lower case, converting plural words to singular (i.e., \Here” and \here” are the same word, \pens” and \pen” are the same word). Normalize words by stemming them using an online stemmer such as http://www.nltk.org/howto/stem.html. Make sure that you do all the calculations in logscale to avoid under ow. Use your algorithm to learn from the training set and report accuracy on the test set.

points Improve your Naive Bayes by throwing away (i.e., ltering out) stop words such as \the” \of” and \for” from all the documents. A list of stop words can be found here: http://www.ranks.nl/stopwords. Report accuracy for Naive Bayes for this ltered set. Does the accuracy improve? Explain why the accuracy improves or why it does not?

Point Estimation
20 points Derive maximum likelihood estimators for

parameter p, Bernoulli(p) sample of size n.

parameter p based on a Binomial(N, p) sample of size n. Compute your estimators if the observed sample is (3, 6, 2, 0, 0, 3) and N = 10.

parameters a and b based on a Uniform (a, b) sample of size n.

parameter based on a Normal( , ^{2}) sample of size n with known variance ^{2} and unknown mean .

parameter based on a Normal( , ^{2}) sample of size n with known mean and unknown variance ^{2}.

parameters ( , ^{2}) based on a Normal( , ^{2}) sample of size n with unknown mean and variance ^{2}.
30 points You are given a coin and a thumbtack and you perform the following experiment: toss both the thumbtack and the coin 100 times. You get 60 heads and 40 tails for the coin, 70 heads and 30 tails for the thumbtack. You put Beta priors Beta(1,1), Beta(40, 60), Beta(30, 70), Beta(100, 100), Beta(1000, 1000), and Beta(100,000, 100,000) on the coin and thumbtack, respectively. (For both the coin toss and thumbtack toss experiments, you put these three priors, respectively.)

Derive the MLE and MAP estimates for the coin and the thumbtack.

With the help of gures identify how the di erent priors a ect the estimated parameter values. Follow the point estimation example in the lectures and illustrate in a similar manner in your answer. For
2
full credit, a curve for each scenario should be shown along with an explanation of the curve in terms of the di erent values in the question (number of heads and tails recorded and parameters of the Beta prior). Record changes in the curve for the di erent combinations and include intuitive explanations for the changes.

True or False: As you have collect more data instances by tossing the coin/thumbtack, the MLE estimate will approach the MAP estimate. Explain. [Answers without explanation will receive no credit.]

True or False: The MLE of the coin and the thumbtack are di erent but their MAP estimates approach the same value when we use a larger prior (think Beta(100,000, 100,000) and larger). Explain. [Answers without explanation will receive no credit.]
What to Turn in
Your code
README le for compiling and executing your code. A detailed write up that contains:


The accuracy on the test set.



Detailed answers to point estimation questions showing the complete derivation (can be handwritten and scanned). No points will be given for answers that are incomplete.

References

V. Metsis, I. Androutsopoulos and G. Paliouras, \Spam Filtering with Naive Bayes – Which Naive Bayes?”. Proceedings of the 3rd Conference on Email and AntiSpam (CEAS 2006), Mountain View, CA, USA, 2006.
3