Lab 7 - Support Vector Machines

Due: Friday by 3:30pm

In this lab, we will experiment with linearly separable and linearly inseparable data using an open-source implementation of support vector machines written in C.

We will use the open-source package SVMlight for this assignment. The website for the package is http://svmlight.joachims.org/. SVMlight can train SVMs on both linearly separable and linearly inseparable data. For linearly inseparable data, you can choose from the three kernel functions given in class.

Create a directory called svm_light and change into that directory. Use the wget command to download the current source code for SVMlight and follow the directions on the author's page to unpack the tarball and compile the code.

Create an additional directory called data and change into that directory. Copy the two programs that create datasets for this lab, svm_dataset.c and holdout.c, into that directory.

Compile and run them with the following commands:
gcc -o svm_dataset svm_dataset.c
gcc -o holdout holdout.c
./svm_dataset > plain_data
./holdout plain_data
./svm_dataset -i 50 > insep_data
./holdout insep_data

Make several datasets, both linearly separable and linearly inseparable. Copy the resulting *.train and *.test files up one directory to the svm_light directory.

Run the svm_learn program on the *.train datasets and the svm_classify program on the *.test datasets (do this one dataset at a time, not on all datasets at once). Note the accuracy, precision, and recall of svm_classify on the testing datasets.
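
svm_classify reports these statistics itself, but it can help to see how they follow from the true labels and the predictions. The sketch below recomputes them with paste and awk. It assumes that the first field of each line in the test file is the label (+1 or -1) and that the prediction file written by svm_classify holds one decision value per line, with a value greater than zero meaning the positive class. The function name score and the demo data are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# score TEST_FILE PRED_FILE
# Recompute accuracy, precision, and recall from true labels and predictions.
# Assumptions: field 1 of each TEST_FILE line is the label (+1 or -1), and
# PRED_FILE holds one decision value per line, where a value > 0 means +1.
score() {
    paste "$1" "$2" | awk '
        {
            actual = ($1 + 0 > 0)      # true class is positive
            pred   = ($NF + 0 > 0)     # predicted class is positive
            if (pred && actual)        tp++
            else if (pred && !actual)  fp++
            else if (!pred && actual)  fn++
            else                       tn++
        }
        END {
            n    = tp + fp + fn + tn
            acc  = n ? (tp + tn) / n : 0
            prec = (tp + fp) ? tp / (tp + fp) : 0
            rec  = (tp + fn) ? tp / (tp + fn) : 0
            printf "accuracy=%.2f precision=%.2f recall=%.2f\n", acc, prec, rec
        }'
}

# Hypothetical demo data (not produced by svm_dataset or svm_classify).
tmp=$(mktemp -d)
printf '%s\n' '+1 1:0.9 2:0.8' '+1 1:0.7 2:0.6' '-1 1:0.2 2:0.1' \
              '-1 1:0.4 2:0.3' '+1 1:0.5 2:0.5' > "$tmp/demo.test"
printf '%s\n' 1.2 -0.3 -0.8 0.4 0.9 > "$tmp/demo.pred"

score "$tmp/demo.test" "$tmp/demo.pred"
# prints: accuracy=0.60 precision=0.67 recall=0.67
rm -r "$tmp"
```

In the demo there are 2 true positives, 1 false positive, 1 false negative, and 1 true negative, so accuracy is 3/5 and both precision and recall are 2/3.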

Try different options for svm_learn, as specified on the program's website. In particular, see how the kernel selection affects the accuracy, precision, and recall on the testing datasets.
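
One way to compare kernels systematically is a small loop over the -t option of svm_learn. The sketch below is only an illustration: the function name sweep_kernels, the RUN variable, and the file names (insep_data.train, insep_data.test, and the .model and .pred names) are assumptions, so substitute whatever names holdout actually produced. Which of these kernel types correspond to the three kernels given in class is also left for you to match up. By default the commands are printed rather than executed, so you can check them first.

```shell
#!/bin/sh
# sweep_kernels BASENAME
# Train and classify once per SVMlight kernel type (-t):
#   0 = linear, 1 = polynomial, 2 = radial basis function, 3 = sigmoid tanh
# By default each command is only printed; call with RUN= (empty) to execute.
# File names below are assumed, not taken from the lab programs.
sweep_kernels() {
    base=$1
    for t in 0 1 2 3; do
        ${RUN-echo} ./svm_learn -t "$t" "$base.train" "$base.t$t.model"
        ${RUN-echo} ./svm_classify "$base.test" "$base.t$t.model" "$base.t$t.pred"
    done
}

# Dry run: prints the eight commands for a dataset called insep_data.
sweep_kernels insep_data
```

To actually run the commands from the svm_light directory, set RUN to the empty string first, for example RUN= followed by sweep_kernels insep_data on the next line. Kernel parameters such as -d (polynomial degree), -g (RBF gamma), and -c (trade-off between training error and margin) can be added to the svm_learn line in the same way.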

What to Submit for this Lab

Submit a writeup of what you have done for this lab. This writeup can include screencaps from running the programs. The writeup should specify the datasets created (e.g. command line options for svm_dataset), the svm_learn options used to generate model files, and the resulting statistics for svm_classify on those model files.

If you had a dataset that showed particular difficulty with classification, convert that dataset file into a CSV file using the following vi substitution command (create a copy of the file first and edit the copy, e.g. cp insep_data insep.csv):

:1,$s/ [0-9][0-9]*:/, /g
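
The same substitution can also be applied without opening an editor. The sketch below wraps the identical regular expression in sed; it replaces each " index:" prefix (a space, one or more digits, and a colon) with a comma and a space, leaving the label as the first column. The function name svm_to_csv and the two sample rows are hypothetical and are not real lab data.

```shell
#!/bin/sh
# Non-interactive equivalent of the vi substitution above.
# svm_to_csv reads SVMlight-style lines on stdin and writes CSV rows on stdout.
svm_to_csv() {
    sed 's/ [0-9][0-9]*:/, /g'
}

# Hypothetical two-feature sample rows, used only to show the transformation.
printf '%s\n' '1 1:0.25 2:0.75' '-1 1:3.5 2:-2.0' | svm_to_csv
# prints:
#   1, 0.25, 0.75
#   -1, 3.5, -2.0

# Typical use on a dataset from this lab (writes a new file, original untouched):
#   svm_to_csv < insep_data > insep.csv
```

Because sed writes to a new file here, the separate cp step is not needed with this approach.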

Use a simple visualization technique, such as an OpenOffice scatter plot, to visualize the overlap in the data. Here is an example of the visualization for a randomly generated linearly inseparable file: nonsep.pdf. Upload your visualization file and the dataset.