Lab 4 - Simple Data Analysis with Python

Due: Friday by 3:30pm

Many open-source tools can be used to manipulate datasets and perform data mining analyses from the command line. In this lab, we will look at using Python for simple data analysis from a set of files.

Resources:

Python is one of many tools that can be used to read data from a file and perform basic data manipulations. Other tools you may want to explore include R, Ruby, Perl, and similar open-source command-line tools. R is a good tool to learn if you wish to perform statistical analyses, since it provides many statistical functions and packages that implement common algorithms.

Python code can be run interactively from the Python interpreter or from a script file containing Python code. Scripts typically end in the extension .py and often contain import lines such as

import sys

at the top of the file.

There are many ways to get data into Python. One can use the built-in file tools to parse a file, import data from a datastore (e.g. JSON objects from MongoDB or similar databases), and so forth. For simplicity in this lab, we will take the US Baby Names example in "Python for Data Analysis" and modify it for another dataset.
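As a rough illustration of the "built-in file tools" route, the sketch below parses a small CSV with the standard csv module. The column names (day, reading) and the values are made up for this example and are not taken from any lab file; an in-memory string stands in for a real file so the snippet runs on its own.

```python
import csv
import io

# Hypothetical sample data; a real file would be opened with open(path).
sample = io.StringIO(
    "day,reading\n"
    "1,10\n"
    "2,25\n"
    "3,7\n"
)

# DictReader uses the header line to turn each row into a dictionary.
rows = list(csv.DictReader(sample))

# Every field is read as a string, so convert before doing arithmetic.
readings = [int(row["reading"]) for row in rows]

print(len(rows))
print(sum(readings))
```

The pandas approach used in the book handles the type conversion automatically, which is one reason it is more convenient for larger datasets.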

If you have Linux on your own laptop, go to the Preliminaries->Essential Python Libraries section of "Python for Data Analysis" and make sure you have all the indicated packages installed. The book also contains instructions for installing the necessary Python libraries on Windows and Mac OS X.

Update: Steve got python-pandas installed on Sleipnir, so you can use Sleipnir instead of the Debian VM for this assignment.

If you are using the lab machines, download the following virtual machine, which is a basic Debian 7.3.0 image with command-line Python installed: debian-vm.tar.bz2 (2.3GB)

Extract with the command

tar -xjf debian-vm.tar.bz2

Note that any files saved on the student account in Rm 311 will be deleted at the end of the day, so you might want to save the VM on a flash drive.

Use the daily weather records for Meadows Field in the following file: weather-data.csv

The data is in CSV format and covers 2004-01-01 to 2013-12-31 for Meadows Field airport in Bakersfield, CA. It contains the precipitation, temperature, sun, and wind data for each day in that range. See the NOAA description of the daily summary for the meaning of each field. Note in particular that the temperature fields are an integer representation of decimal data, e.g. if the original value was 8.9 °C, then the CSV file contains 89.
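The scaling described above can be undone by dividing by ten. The helper below is a minimal sketch; the function name tenths_to_celsius is hypothetical and chosen only for this example.

```python
def tenths_to_celsius(raw):
    """Convert an integer stored in tenths of a degree to degrees Celsius."""
    return raw / 10.0


# The example from the text: a stored value of 89 represents 8.9 degrees.
print(tenths_to_celsius(89))

# Negative readings scale the same way.
print(tenths_to_celsius(-11))
```

In pandas the same conversion can be applied to a whole column at once by dividing the column by 10.0.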

If you wish to download a different dataset, go to the NOAA National Climatic Data Center and search for "Meadows Field" as the weather station to get the data for the Bakersfield airport (for some reason, searching for Bakersfield as the city brings up Visalia and Canada before it brings up Meadows Field).

What To Do For This Lab

Using the above dataset and the examples in the "Python for Data Analysis" book, give the set of Python commands that you would need to run to perform the following analyses on the dataset:
  1. Use the describe functionality to display the overall statistics for the entire dataset
  2. Display the maximum and minimum temperature by year (either text or plot)
  3. Display the mean, mode, and median precipitation by year (either text or plot)
  4. Analyze the relationship between precipitation and the day of the year.
  5. Perform another analysis of your choice on the dataset.

Note that the date field in the CSV file is in YYYYMMDD format, so you will need to split it into year, month, and day to run the above analyses.
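One possible way to split a YYYYMMDD field with pandas is sketched below. The column names DATE and PRCP and the three sample rows are assumptions made for this example; check the header of weather-data.csv for the actual names.

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from pd.read_csv.
df = pd.DataFrame({
    "DATE": [20040101, 20040215, 20051231],
    "PRCP": [0, 25, 13],
})

# Treat the date as text so it can be sliced by character position.
date_str = df["DATE"].astype(str)

df["year"] = date_str.str[0:4].astype(int)
df["month"] = date_str.str[4:6].astype(int)
df["day"] = date_str.str[6:8].astype(int)

print(df)
```

Once the year is its own column, it can serve as the key for a groupby. An alternative is to parse the field with pd.to_datetime(date_str, format="%Y%m%d") and read the parts from the resulting dates.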

Create a writeup with the commands used to perform the above actions, INCLUDING any preparatory commands you used to load the data into Python. You can make this a Python script file (.py file) if you like. Submit the Python commands on Moodle as your writeup for this lab.