Resources:
Continuing from Lab 5, in this lab we will build a naive Bayesian spam classifier. Use your favorite programming language for the program.
Recall from Lab 5 and the discussion on Wednesday that the basic calculation for the conditional probability is based on the product of its parts:

P(Spam|WordList) ∝ P(Spam) * P(word1|Spam) * P(word2|Spam) * ... * P(wordn|Spam)
P(Not-Spam|WordList) ∝ P(Not-Spam) * P(word1|Not-Spam) * P(word2|Not-Spam) * ... * P(wordn|Not-Spam)

For this lab, use the Spambase dataset created at HP and archived at UC Irvine:
http://archive.ics.uci.edu/ml/datasets/Spambase
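With dozens of per-word factors, multiplying raw probabilities underflows toward zero, so the products above are usually computed as sums of log-probabilities. A minimal sketch in Python (the word probabilities and the prior here are made-up illustration values, not taken from Spambase):

```python
import math

# Hypothetical per-word conditional probabilities P(word_i|Spam)
word_probs = [0.20, 0.05, 0.65, 0.01]
p_spam = 0.39  # assumed prior P(Spam) for illustration

# Direct product -- fine for a few words, underflows for long word lists
score = p_spam
for p in word_probs:
    score *= p

# Equivalent log-space computation, numerically safe for many words
log_score = math.log(p_spam) + sum(math.log(p) for p in word_probs)

# The two agree up to floating-point rounding
assert abs(score - math.exp(log_score)) < 1e-12
```

Since only the larger of the two class scores matters, comparing log-scores gives the same label as comparing the raw products.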
This dataset contains comma-separated lines of numbers in the following format:

    word frequency list, punctuation frequency list, character characteristic list, class

where each part is as follows:
See spambase.names for the description of each field and the actual words in the list.
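The data file in the archive, spambase.data, holds one entry per line: 57 comma-separated feature values followed by the 0/1 class label (1 = spam). A minimal loader plus a holdout split might look like this in Python (function names, the test fraction, and the fixed seed are my own choices):

```python
import csv
import random

def load_spambase(path):
    """Return (features, labels): one list of floats per row, label 1 = spam."""
    features, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            *values, label = row
            features.append([float(v) for v in values])
            labels.append(int(label))
    return features, labels

def holdout_split(features, labels, test_fraction=0.33, seed=0):
    """Shuffle and split into (train, test) lists of (features, label) pairs."""
    data = list(zip(features, labels))
    random.Random(seed).shuffle(data)  # fixed seed keeps runs repeatable
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]
```

Shuffling before the split matters here because spambase.data groups all the spam rows together.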
Given: database of labeled values D
Output: accuracy, error rate, precision, and recall statistics, where "positive"
        is mapped to "spam" and "negative" is mapped to "not-spam"

Split D into a training set and a testing set using the holdout method
For each field in the dataset
    Calculate the average and standard deviation for the Spam training entries
    Calculate the average and standard deviation for the Not-Spam training entries
EndFor
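The statistics loop above can be sketched as follows, assuming the training set is a list of (features, label) pairs with label 1 for spam (the function name is my own; statistics is in the Python standard library):

```python
import statistics

def class_stats(train, label):
    """Per-field (average, standard deviation) over rows with the given label."""
    rows = [features for features, lab in train if lab == label]
    stats = []
    for field in zip(*rows):  # zip(*rows) iterates over columns (fields)
        stats.append((statistics.mean(field), statistics.stdev(field)))
    return stats

# spam_stats = class_stats(train, 1)
# ham_stats  = class_stats(train, 0)
```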
Set tp, fp, tn, fn all to 0
For each entry in the testing dataset
    Calculate P(Spam|entry) and P(Not-Spam|entry), using the continuous
    function in the book for each P(FieldInEntry|Spam) and P(FieldInEntry|Not-Spam)
    Label the entry as either Spam or Not-Spam
    Compare the guessed label to the actual label in the dataset:
        If guess == Spam and actual == Spam, increment tp
        If guess == Not-Spam and actual == Not-Spam, increment tn
        If guess == Spam and actual == Not-Spam, increment fp
        If guess == Not-Spam and actual == Spam, increment fn
EndFor
Output the accuracy, error rate, precision, and recall.
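Assuming the "continuous function in the book" is the Gaussian (normal) density, which is the usual choice for Spambase's real-valued fields, the classification and reporting steps can be sketched like this (log-probabilities avoid underflow; the epsilon guards fields whose standard deviation is zero; the function names and the epsilon value are my own):

```python
import math

EPS = 1e-9  # guard against zero standard deviation

def log_gaussian(x, mean, stdev):
    """Log of the normal density N(x; mean, stdev)."""
    stdev = max(stdev, EPS)
    return -math.log(stdev * math.sqrt(2 * math.pi)) - ((x - mean) ** 2) / (2 * stdev ** 2)

def classify(entry, spam_stats, ham_stats, p_spam):
    """Return 1 (Spam) or 0 (Not-Spam) given per-field (mean, stdev) per class."""
    spam_score = math.log(p_spam)
    ham_score = math.log(1 - p_spam)
    for x, (m, s) in zip(entry, spam_stats):
        spam_score += log_gaussian(x, m, s)
    for x, (m, s) in zip(entry, ham_stats):
        ham_score += log_gaussian(x, m, s)
    return 1 if spam_score > ham_score else 0

def report(tp, fp, tn, fn):
    """Accuracy, error rate, precision, and recall from the confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error rate": 1 - accuracy,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

The prior p_spam is estimated as the fraction of spam entries in the training set.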
Turn in your code by email to the instructor.