Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 

Data Science

 

Decision Tree Classification

 


The Pretender (YoutubeLink) is an American action drama television series created by Steven Long Mitchell and Craig W. Van Sickle, that aired on NBC from September 19, 1996 to May 13, 2000.

The series follows Jarod, a young man on the run who is a "Pretender": a genius impostor able to quickly master the complex skill sets necessary to impersonate a member of any profession.

In first season, episode 1, Jarod pretends to be a doctor to avenge a boy crippled by faulty surgery.

YoutubeLink (0:32~0:38)

If you are Jarod in S1E1, want to pretend to be a doctor. What to do?

Problem detailed discription

 

 




 

One example from the Department of Health & Human Services
Another example from The most widely read and highly cited peer-reviewed neurology journal


 

How can we build something similar?

 


How to build an better decision tree?

Information Theory (Entropy)!

Build the tree based on the Information Gain.

 




Iris flower data set

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.

 

Here is the TrainingData and TestingData.

 

Try to build different decision trees for the training data.

  • Try to build a “perfect” decision tree based on the training data, and see if it has over fitting problem

  •  Try to build an “over-pruning” decision tree and test the accuracy on the testing data.

  •  Find a good threshold to prune the decision, test the performance on the testing data. Compare the result with two methods above.


Test your answer by /home/fac/clei/checker/DT/irisChecker50 YourAns.txt
(SampleAns)