Resources:
Continuing from Lab 5, in this lab we will build a naive Bayesian spam classifier. Use your favorite programming language for the program.
Recall from Lab 5 and the discussion on Wednesday that the basic calculation for the conditional probability is based on the product of its parts:

P(Spam|WordList) ∝ P(Spam) * P(word1|Spam) * P(word2|Spam) * ... * P(wordn|Spam)
P(Not-Spam|WordList) ∝ P(Not-Spam) * P(word1|Not-Spam) * P(word2|Not-Spam) * ... * P(wordn|Not-Spam)

For this lab, use the Spambase dataset created at HP and archived at UC Irvine:
http://archive.ics.uci.edu/ml/datasets/Spambase
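With dozens of per-word factors, multiplying raw probabilities underflows toward zero, so the products above are usually computed as sums of log-probabilities. A minimal sketch in Python (the word probabilities and the prior here are made-up illustration values, not taken from Spambase):

```python
import math

# Hypothetical per-word conditional probabilities P(word_i|Spam)
word_probs = [0.20, 0.05, 0.65, 0.01]
p_spam = 0.39  # assumed prior P(Spam) for illustration

# Direct product -- fine for a few words, underflows for long word lists
score = p_spam
for p in word_probs:
    score *= p

# Equivalent log-space computation, numerically safe for many words
log_score = math.log(p_spam) + sum(math.log(p) for p in word_probs)

# The two agree up to floating-point rounding
assert abs(score - math.exp(log_score)) < 1e-12
```

Since only the larger of the two class scores matters, comparing log-scores gives the same label as comparing the raw products.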
This dataset contains comma-separated lines of numbers in the following format:

    word frequency list, punctuation frequency list, character characteristic list, class

where each part is as follows:
See spambase.names for the description of each field and the actual words in the list.
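The data file in the archive, spambase.data, holds one entry per line: 57 comma-separated feature values followed by the 0/1 class label (1 = spam). A minimal loader plus a holdout split might look like this in Python (function names, the test fraction, and the fixed seed are my own choices):

```python
import csv
import random

def load_spambase(path):
    """Return (features, labels): one list of floats per row, label 1 = spam."""
    features, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            *values, label = row
            features.append([float(v) for v in values])
            labels.append(int(label))
    return features, labels

def holdout_split(features, labels, test_fraction=0.33, seed=0):
    """Shuffle and split into (train, test) lists of (features, label) pairs."""
    data = list(zip(features, labels))
    random.Random(seed).shuffle(data)  # fixed seed keeps runs repeatable
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]
```

Shuffling before the split matters here because spambase.data groups all the spam rows together.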
Given: database of labeled values D
Output: accuracy, error rate, precision, and recall statistics, where "positive"
        is mapped to "spam" and "negative" is mapped to "not-spam"

Split D into a training set and a testing set using the holdout method
For each field in the dataset
    Calculate the average and standard deviation for the Spam training entries
    Calculate the average and standard deviation for the Not-Spam training entries
EndFor
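The statistics loop above can be sketched as follows, assuming the training set is a list of (features, label) pairs with label 1 for spam (the function name is my own; statistics is in the Python standard library):

```python
import statistics

def class_stats(train, label):
    """Per-field (average, standard deviation) over rows with the given label."""
    rows = [features for features, lab in train if lab == label]
    stats = []
    for field in zip(*rows):  # zip(*rows) iterates over columns (fields)
        stats.append((statistics.mean(field), statistics.stdev(field)))
    return stats

# spam_stats = class_stats(train, 1)
# ham_stats  = class_stats(train, 0)
```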
Set tp, fp, tn, fn all to 0
For each entry in the testing dataset
    Calculate P(Spam|entry) and P(Not-Spam|entry), using the continuous
    function in the book for each P(FieldInEntry|Spam) and P(FieldInEntry|Not-Spam)
    Label the entry as either Spam or Not-Spam
    Compare the guessed label to the actual label in the dataset:
        If guess == Spam and actual == Spam, increment tp
        If guess == Not-Spam and actual == Not-Spam, increment tn
        If guess == Spam and actual == Not-Spam, increment fp
        If guess == Not-Spam and actual == Spam, increment fn
EndFor
Output the accuracy, error rate, precision, and recall.
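Assuming the "continuous function in the book" is the Gaussian (normal) density, which is the usual choice for Spambase's real-valued fields, the classification and reporting steps can be sketched like this (log-probabilities avoid underflow; the epsilon guards fields whose standard deviation is zero; the function names and the epsilon value are my own):

```python
import math

EPS = 1e-9  # guard against zero standard deviation

def log_gaussian(x, mean, stdev):
    """Log of the normal density N(x; mean, stdev)."""
    stdev = max(stdev, EPS)
    return -math.log(stdev * math.sqrt(2 * math.pi)) - ((x - mean) ** 2) / (2 * stdev ** 2)

def classify(entry, spam_stats, ham_stats, p_spam):
    """Return 1 (Spam) or 0 (Not-Spam) given per-field (mean, stdev) per class."""
    spam_score = math.log(p_spam)
    ham_score = math.log(1 - p_spam)
    for x, (m, s) in zip(entry, spam_stats):
        spam_score += log_gaussian(x, m, s)
    for x, (m, s) in zip(entry, ham_stats):
        ham_score += log_gaussian(x, m, s)
    return 1 if spam_score > ham_score else 0

def report(tp, fp, tn, fn):
    """Accuracy, error rate, precision, and recall from the confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error rate": 1 - accuracy,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

The prior p_spam is estimated as the fraction of spam entries in the training set.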
Turn in your code by email to the instructor.