na¨ıve bayes classiﬁcation | Computer Science homework help
In this problem, we will attempt to identify spam or ham SMS messages using the na¨ıve Bayes model. You will need to download the SMS dataset: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. Python users: scikit-learn is not allowed.
Consider an SMS message as a (case-insensitive) sequence of words (X1,··· ,XT). Ignore all other punctuation. Under the na¨ıve Bayes assumption, the probability of the words in each message factors as:
T Y t=1
When estimated from dataset D with pseudo-count prior of α, the model parameters are:
ˆ P(xi|y) =
CountD(xi,y) + α CountD(y) + Nα
where: CountD(xi,y) and CountD(y) are the number of occurances of word xi in spam/ham messages y (from our sample D); and the number of words for label spam/ham words y (from our sample D) respectively; and N is the total number of dictionary words (including words not seen in D). Let us use N = 20,000 and
α = 0.1 in our experiments.
Note that the classes are heavily imbalanced. The number of spam messages is 747, while the number of ham messages is 4827. If a simple classiﬁer predicts that all messages are ham, it will get around 86% accuracy. In this case, accuracy is not a good measurement of the classiﬁer’s performance.
Instead of using accuracy, we can use confusion matrix to see the performance of our model. Below is the explanation of confusion matrix:
True condition Positive Negative
Positive True positive False positive Negative False negative True negative
Other important performance measurements are precision, recall, and F-score, deﬁned as:
true positive true positive + false positive
true positive true positive + false negative
F-score = 2·
precision·recall precision + recall
(a) Randomly split the messages into a training set D1 (80% of messages) and a testing set D2 (20% of messages). Calculate the testing accuracy, confusion matrix, precision, recall, and F-score of the Na¨ıve Bayes classiﬁer in determining whether a message is spam or ham. Submit your source code. Note: Let’s assume that spam is the positive class. (20 points)
(b) How does the change of α eﬀect the classiﬁer performance? Using random split above, evaluate the training and test accuracy and F-score under diﬀerent selections of α. The selection of α values are 2i where i = −5,··· ,0. Create two plots, the ﬁrst plot is for the accuracy measure and the second plot is for F-score. In each plot, x-axis represents i, and y-axis represents the performance measure (accuracy/F-score). Each plot contains two line chart, a line chart describing training accuracy/F-score measure, the other line chart is for test accuracy/F-score. Submit your source code. (20 points)
Hints: There are scripts in both Octave (nbayes.m) and Julia (nbayes.jl) for counting occurances of words from spam/ham messages. A few notes: (i) Octave can make use of Java data structures (e.g., the hashmap); this can be easier / more eﬃcient than using Octave structures. (ii) Julia has its own data structure library which includes many standard data structures e.g. Dict for hashmap. (iii) logxy = logx + logy.