Lab 5 - Bayesian Spam Classifier

Due: Friday by 3:30pm

Resources:

In this lab, we are going to explore using Bayesian classifiers for filtering spam email. We will begin the lab with a brief lecture on the techniques used.

Assume your training dataset for this lab contains 200 samples, 100 of which are spam and 100 of which are not-spam, and that it yields the following data:

Word                             Percent observed in Spam    Percent observed in Non-spam
sir                              50%                         8%
madam                            50%                         2%
beneficiary                      20%                         2%
opportunity                      65%                         50%
account                          38%                         20%
reputable                        40%                         19%
rare                             35%                         10%
important                        40%                         35%
confidential or confidentially   60%                         30%
business                         50%                         45%
urgent or urgently               50%                         50%
overseas or foreign              42%                         2%
million                          60%                         10%

To decide whether a message is spam or not-spam, we use the following steps:

  1. Create a word list of words seen in the email that are in the training data table.
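  2. Calculate P(WordList|Spam) and P(WordList|Not-Spam) as products of the per-word percentages from the table (for example, P(sir|Spam) = 0.50), treating the words as independent: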
         P(WordList|Spam) = P(word1|Spam) * P(word2|Spam) * ... * P(wordn|Spam)
         P(WordList|Not-Spam) = P(word1|Not-Spam) * P(word2|Not-Spam) * ... * P(wordn|Not-Spam)
    
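  3. Classify the email as spam if P(WordList|Spam) > P(WordList|Not-Spam). Because the training set has equal numbers of spam and not-spam samples, the priors P(Spam) and P(Not-Spam) are equal, so comparing these two values is equivalent to comparing P(Spam|WordList) and P(Not-Spam|WordList).

An example will be shown in class.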
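
The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not the method that will be shown in class. The names TRAINING, word_list, likelihoods, and classify are made up for this example, and so is the sample message. The sketch assumes a table word counts as "seen" only when it appears as a whole word (so "accounts" would not match "account"), that each table row is counted at most once per email, and that an email with no table words is classified as Not-Spam because both products are then 1.

```python
import re
from math import prod

# Training table from the lab.
# Key: the word variants listed in one table row (hypothetical representation).
# Value: (P(word|Spam), P(word|Not-Spam)) as fractions.
TRAINING = {
    ("sir",): (0.50, 0.08),
    ("madam",): (0.50, 0.02),
    ("beneficiary",): (0.20, 0.02),
    ("opportunity",): (0.65, 0.50),
    ("account",): (0.38, 0.20),
    ("reputable",): (0.40, 0.19),
    ("rare",): (0.35, 0.10),
    ("important",): (0.40, 0.35),
    ("confidential", "confidentially"): (0.60, 0.30),
    ("business",): (0.50, 0.45),
    ("urgent", "urgently"): (0.50, 0.50),
    ("overseas", "foreign"): (0.42, 0.02),
    ("million",): (0.60, 0.10),
}


def word_list(message):
    """Step 1: table rows that have at least one variant in the message."""
    # Assumption: lowercase the text and match whole alphabetic words only.
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    return [row for row in TRAINING if tokens & set(row)]


def likelihoods(rows):
    """Step 2: multiply the per-word probabilities for each class."""
    p_spam = prod(TRAINING[row][0] for row in rows)
    p_not_spam = prod(TRAINING[row][1] for row in rows)
    return p_spam, p_not_spam


def classify(message):
    """Step 3: Spam if P(WordList|Spam) > P(WordList|Not-Spam)."""
    p_spam, p_not_spam = likelihoods(word_list(message))
    return "Spam" if p_spam > p_not_spam else "Not-Spam"


# Made-up message for illustration only; it is not from the lab materials.
SAMPLE = ("Dear Sir, I have an urgent business opportunity involving "
          "ten million dollars in a foreign account.")

if __name__ == "__main__":
    rows = word_list(SAMPLE)
    print("Word list:", [" or ".join(row) for row in rows])
    print("P(WordList|Spam), P(WordList|Not-Spam):", likelihoods(rows))
    print("Classification:", classify(SAMPLE))
```

For the writeup, the word list and the two products should still be worked out and shown by hand; a script like this is only useful for checking the arithmetic.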

What to Do For This Assignment

Find an example of a Nigerian/419 scam email. Apply the above steps to it to determine whether this training set would classify it as Spam or Not-Spam. Submit a writeup that contains the message, the word list from step 1, the calculations from step 2, and the classification of the email from step 3.