Homework 2 - Chapter 3

Due: Tuesday January 28, 2014 at 11:55pm

Questions are (mostly) from the book:

  1. 3.1: Data quality can be assessed in terms of several issues, including accuracy, completeness, and consistency. For each of the above three issues, discuss how data quality assessment can depend on the intended use of the data, giving examples. Propose two other dimensions of data quality.
  2. Discuss issues to consider during data cleaning. In your discussion, highlight how each of the data cleaning methods presented in the book handles specific issues.
  3. 3.4: Discuss issues to consider during data integration.
  4. 3.6: Use the following methods to normalize the dataset: 200, 300, 400, 600, 1000
    1. min-max normalization by setting min=0 and max=1
    2. z-score normalization
    3. z-score normalization using the mean absolute deviation instead of the standard deviation
    4. normalization by decimal scaling
  5. 3.8: Using the data for age and body fat given in Exercise 2.4:
        age    23    23    27    39    41    47    49    50    52    54    56    57    58    58    60    61
        %fat  9.5  26.5   7.8  31.4  25.9  27.4  27.2  31.2  34.6  28.8  33.4  30.2  34.1  32.9  41.2  35.7
        
    Answer the following:
    1. Normalize the two attributes based on z-score normalization.
    2. Calculate the correlation coefficient (Pearson's product moment coefficient).
    3. Compute the covariance.
  6. 3.13: Propose an algorithm, in pseudocode or in your favorite programming language, for the following:
    1. The automatic generation of a concept hierarchy for nominal data based on the number of distinct values of attributes in the given schema.
    2. The automatic generation of a concept hierarchy for numeric data based on the equal-width partitioning rule.
    3. The automatic generation of a concept hierarchy for numeric data based on the equal-frequency partitioning rule.
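
For reference, the formulas named in questions 4 and 5 can be sketched in Python as below. This is a minimal sketch, not a required solution format: the function names and the `sample` list are made up for illustration (the sample is deliberately not the assignment data), and the code assumes the population convention of dividing by n rather than n - 1. Check that against the convention used in the book before relying on it.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization onto [new_min, new_max].

    Assumes the values are not all equal (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Z-score normalization using the population standard deviation (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def z_score_mad(values):
    """Z-score variant that divides by the mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    mad = sum(abs(v - mean) for v in values) / n
    return [(v - mean) / mad for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def covariance(xs, ys):
    """Population covariance of two equal-length sequences (divide by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def pearson(xs, ys):
    """Pearson's product moment coefficient: cov(x, y) / (std_x * std_y)."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical sample, chosen only to illustrate the calls.
sample = [10, 20, 30, 40, 50]
print(min_max(sample))
print(z_score(sample))
print(z_score_mad(sample))
print(decimal_scaling(sample))
print(pearson([1, 2, 3], [2, 4, 6]))
```

Note that `covariance(xs, xs)` is the population variance, so `pearson` reuses it rather than computing the standard deviations separately.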