Lab 6 - Bayesian Spam Classifier

Due: Monday by 3:30pm

Resources:

Continuing from Lab 5, in this lab we will build a Bayesian spam classifier. You may write the program in the programming language of your choice.

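Recall from Lab 5 and Wednesday's discussion that, assuming the words are conditionally independent given the class, the posterior probability of each class is proportional to the class prior multiplied by the per-word conditional probabilities. Both classes share the same normalizing constant P(WordList), so comparing the two products is enough to choose a label:

   P(Spam|WordList) ∝ P(Spam) * P(word1|Spam) * P(word2|Spam) * ... * P(wordn|Spam)
   P(Not-Spam|WordList) ∝ P(Not-Spam) * P(word1|Not-Spam) * P(word2|Not-Spam) * ... * P(wordn|Not-Spam)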
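
Multiplying many small probabilities can underflow to zero in floating point, so one common approach is to add logarithms instead of multiplying. The sketch below shows that idea; the function name log_score and all of the numbers are made up for illustration and are not taken from Lab 5 or from any dataset. It assumes every probability is strictly greater than zero.

```python
import math

def log_score(prior, word_probs):
    """Return log(prior * product of word_probs), computed as a sum of logs."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# Illustrative numbers only (not taken from any dataset).
spam_score = log_score(0.4, [0.20, 0.05, 0.10])  # P(Spam), then each P(word|Spam)
ham_score = log_score(0.6, [0.01, 0.02, 0.10])   # P(Not-Spam), then each P(word|Not-Spam)

# The larger log score corresponds to the larger product.
label = "Spam" if spam_score > ham_score else "Not-Spam"
print(label)
```

Because the logarithm is monotonic, comparing the two sums picks the same label as comparing the two products. A probability of exactly zero would make math.log fail, so a real implementation would need some form of smoothing.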
For this lab, use the Spambase dataset created at HP and archived at UC Irvine: http://archive.ics.uci.edu/ml/datasets/Spambase

Each row of the dataset is a comma-separated list of numbers in the following format:

word frequency list, punctuation frequency list, character characteristic list, class
where each is as follows:
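
As a rough illustration of this layout, the sketch below splits one comma-separated row into its feature values and its class label. The function name parse_row is hypothetical, the sample row is made up and much shorter than a real row, and the encoding of the class column (1 for spam) is an assumption that should be checked against the dataset's own documentation.

```python
def parse_row(line):
    """Split one comma-separated row into (features, is_spam)."""
    values = [float(v) for v in line.strip().split(",")]
    features = values[:-1]        # every column except the last
    is_spam = values[-1] == 1.0   # assumption: the class column is 1 for spam
    return features, is_spam

# Made-up short row for illustration; real rows have many more columns.
features, is_spam = parse_row("0.21,0.0,1.5,3.756,61,278,1\n")
print(len(features), is_spam)
```

Keeping the label separate from the features makes the later per-class statistics easier to compute.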

What to Do For This Assignment

Using your favorite programming language, implement the following pseudocode:
Given: database of labeled values D
Output: Accuracy, error rate, precision and recall statistics where "positive"
        is mapped to "spam" and "negative" is mapped to "not-spam"

Split D into a training set and testing set using the Holdout method

For each field in the dataset
  Calculate the average and standard deviation for Spam training entries
  Calculate the average and standard deviation for Not-Spam training entries
EndFor

set tp, fp, tn, fn all to 0

For each entry in the testing dataset
   Calculate P(Spam|entry) and P(Not-Spam|entry) 
     Use the continuous function in the book for each P(FieldInEntry|Spam) and P(FieldInEntry|Not-Spam)
   Label the entry as either Spam or Not-Spam
   Compare the guessed label to the actual label in the dataset
     If guess == Spam and actual == Spam, increment tp
     If guess == Not-Spam and actual == Not-Spam, increment tn
     If guess == Spam and actual == Not-Spam, increment fp
     If guess == Not-Spam and actual == Spam, increment fn
EndFor

Output the accuracy, error rate, precision, and recall.
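
The pseudocode above could be sketched in Python roughly as follows. This is one possible reading, not a reference solution: it assumes the "continuous function in the book" is the Gaussian (normal) density, uses the population standard deviation with a small floor to avoid division by zero, and sums log probabilities instead of multiplying. All function names, the 30% test fraction, and the synthetic demo data are hypothetical choices made for illustration. It also assumes both classes appear in the training set and that the testing set is not empty.

```python
import math
import random

def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and hold out a fraction for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def mean_std(values):
    """Average and population standard deviation of one field."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(math.sqrt(var), 1e-6)  # floor avoids division by zero

def fit(train):
    """Per class: log prior plus a (mean, std) pair for every field."""
    model = {}
    for label in (True, False):  # True = Spam, False = Not-Spam
        rows = [features for features, y in train if y == label]
        stats = [mean_std(column) for column in zip(*rows)]
        model[label] = (math.log(len(rows) / len(train)), stats)
    return model

def log_gaussian(x, mean, std):
    """Log of the normal density with the given mean and std at x."""
    return -math.log(std * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * std ** 2)

def predict(model, features):
    """Return True (Spam) if the Spam log score is the larger one."""
    def score(label):
        log_prior, stats = model[label]
        return log_prior + sum(log_gaussian(x, m, s)
                               for x, (m, s) in zip(features, stats))
    return score(True) > score(False)

def evaluate(model, test):
    """Count tp, fp, tn, fn on the testing set and derive the four statistics."""
    tp = fp = tn = fn = 0
    for features, actual in test:
        guess = predict(model, features)
        if guess and actual:
            tp += 1
        elif guess and not actual:
            fp += 1
        elif not guess and actual:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Synthetic demo data (not Spambase): two well-separated clusters.
rng = random.Random(1)
data = [([rng.gauss(5, 1), rng.gauss(5, 1)], True) for _ in range(100)]
data += [([rng.gauss(0, 1), rng.gauss(0, 1)], False) for _ in range(100)]
train, test = holdout_split(data)
metrics = evaluate(fit(train), test)
print(metrics)
```

To run this on the real data, each row would first be converted to a (features, is_spam) pair in the same shape as the synthetic rows. Precision and recall fall back to 0.0 here when their denominators are zero, which is a convention chosen for the sketch rather than something the assignment specifies.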
Submit your code to the instructor by email.