# na¨ıve bayes classiﬁcation | Computer Science homework help

In this problem, we will attempt to identify spam or ham SMS messages using the na¨ıve Bayes model. You will need to download the SMS dataset: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. Python users: scikit-learn is not allowed.

Consider an SMS message as a (case-insensitive) sequence of words (X1,··· ,XT). Ignore all other punctuation. Under the na¨ıve Bayes assumption, the probability of the words in each message factors as:

P(x1:T|y) =

T Y t=1

P(xi|y). (3)

When estimated from dataset D with pseudo-count prior of α, the model parameters are:

ˆ P(xi|y) =

CountD(xi,y) + α CountD(y) + Nα

, (4)

where: CountD(xi,y) and CountD(y) are the number of occurances of word xi in spam/ham messages y (from our sample D); and the number of words for label spam/ham words y (from our sample D) respectively; and N is the total number of dictionary words (including words not seen in D). Let us use N = 20,000 and

4

α = 0.1 in our experiments.

Note that the classes are heavily imbalanced. The number of spam messages is 747, while the number of ham messages is 4827. If a simple classiﬁer predicts that all messages are ham, it will get around 86% accuracy. In this case, accuracy is not a good measurement of the classiﬁer’s performance.

Instead of using accuracy, we can use confusion matrix to see the performance of our model. Below is the explanation of confusion matrix:

True condition Positive Negative

Predicted Condition

Positive True positive False positive Negative False negative True negative

Other important performance measurements are precision, recall, and F-score, deﬁned as:

precision =

true positive true positive + false positive

(5)

recall =

true positive true positive + false negative

(6)

F-score = 2·

precision·recall precision + recall

(7)

(a) Randomly split the messages into a training set D1 (80% of messages) and a testing set D2 (20% of messages). Calculate the testing accuracy, confusion matrix, precision, recall, and F-score of the Na¨ıve Bayes classiﬁer in determining whether a message is spam or ham. Submit your source code. Note: Let’s assume that spam is the positive class. (20 points)

(b) How does the change of α eﬀect the classiﬁer performance? Using random split above, evaluate the training and test accuracy and F-score under diﬀerent selections of α. The selection of α values are 2i where i = −5,··· ,0. Create two plots, the ﬁrst plot is for the accuracy measure and the second plot is for F-score. In each plot, x-axis represents i, and y-axis represents the performance measure (accuracy/F-score). Each plot contains two line chart, a line chart describing training accuracy/F-score measure, the other line chart is for test accuracy/F-score. Submit your source code. (20 points)

Hints: There are scripts in both Octave (nbayes.m) and Julia (nbayes.jl) for counting occurances of words from spam/ham messages. A few notes: (i) Octave can make use of Java data structures (e.g., the hashmap); this can be easier / more eﬃcient than using Octave structures. (ii) Julia has its own data structure library which includes many standard data structures e.g. Dict for hashmap. (iii) logxy = logx + logy.

## We've got everything to become your favourite writing service

### Money back guarantee

Your money is safe. Even if we fail to satisfy your expectations, you can always request a refund and get your money back.

### Confidentiality

We don’t share your private information with anyone. What happens on our website stays on our website.

### Our service is legit

We provide you with a sample paper on the topic you need, and this kind of academic assistance is perfectly legitimate.

### Get a plagiarism-free paper

We check every paper with our plagiarism-detection software, so you get a unique paper written for your particular purposes.

### We can help with urgent tasks

Need a paper tomorrow? We can write it even while you’re sleeping. Place an order now and get your paper in 8 hours.

### Pay a fair price

Our prices depend on urgency. If you want a cheap essay, place your order in advance. Our prices start from $11 per page.