Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 

Data Science

 

Hierarchical Clustering

 



Coronavirus

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS coronavirus 2, or SARS-CoV-2), a virus closely related to the SARS virus. The disease was discovered and named during the 2019–20 coronavirus outbreak. Those affected may develop a fever, dry cough, fatigue, and shortness of breath.

(Note: based on an unconfirmed source.) Genetic analysis of SARS-CoV-2 sequences shows that their closest genetic relatives appear to be bat coronaviruses, with the pangolin possibly playing the role of intermediate species.

 

Why is there an "intermediate host"?

How genetically similar are humans and other humans?
How genetically similar are humans and gorillas?
How genetically similar are humans and mice?
How genetically similar are humans and bananas?

 


What is the difference between all mammals?






Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram

A tree-like diagram that records the sequence of merges or splits

 

Here are the methods to construct the hierarchical tree.
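
As a rough illustration of the bottom-up (agglomerative) way of building the tree, here is a minimal sketch. It assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest pair of points); the function names and the four sample points are made up for illustration, and the method you are asked to implement may use a different linkage.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of the same dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(points, c1, c2):
    # Assumption: cluster distance = distance of the closest pair of members.
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative_merges(points):
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    merges = []  # the merge history: this is what a dendrogram draws
    while len(clusters) > 1:
        best = None
        # Naive search for the closest pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical toy data: two tight pairs that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerative_merges(points)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

Each printed line is one merge: the two clusters that were joined and the distance at which they were joined. Stopping the loop early, when a chosen number of clusters remains, gives a flat clustering instead of the full tree.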




 

Try to implement the algorithm and test it with the following test cases.

In each data file, the first two columns are the x and y coordinates, and the third column is the group label.

DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
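
To test against files in this three-column layout, a loader and a comparison helper along the following lines may be useful. This is only a sketch: it assumes the columns are separated by whitespace or commas and that there is no header row, which may not match the actual files. The names `parse_labeled_points` and `same_partition` and the sample text are hypothetical.

```python
import re

def parse_labeled_points(text):
    # Each useful line: x, y, group label (delimiter assumed: spaces or commas).
    points, labels = [], []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        points.append((float(fields[0]), float(fields[1])))
        labels.append(fields[2])  # kept as a string; only equality matters
    return points, labels

def same_partition(labels_a, labels_b):
    # True if both labelings group the points identically, even when the
    # cluster names differ (e.g. "1"/"2" versus 0/1).
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

# Hypothetical sample in the assumed format (not taken from the real files).
sample = "1.0 2.0 1\n1.5 1.8 1\n8.0 8.0 2\n"
points, labels = parse_labeled_points(sample)
print(points, labels)
print(same_partition(labels, [7, 7, 3]))
```

Comparing partitions rather than raw labels matters because a clustering program has no way of knowing which number the data file assigned to each group.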

 




Iris flower data set

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.

 

Here is the Data.

Try to modify your program to solve this clustering problem.
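
The main changes for Iris are that each sample has four measurements instead of two coordinates, and that the merging should stop once three clusters (one per species) remain. The sketch below shows one possible shape for that change. It swaps in average linkage purely as an example; the linkage choice, the function names, the sample rows, and the output form are all assumptions, and the format the checker expects is not specified here.

```python
import math

def average_linkage(points, c1, c2):
    # Assumption: cluster distance = mean distance over all cross pairs.
    total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def cluster_into_k(points, k):
    # Works for points of any dimension because math.dist does.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        pairs = (
            (average_linkage(points, clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        _, a, b = min(pairs)  # closest pair of clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    # Turn the cluster lists into one cluster id per input row.
    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

# Hypothetical 4-feature rows in the style of Iris measurements
# (sepal length, sepal width, petal length, petal width).
rows = [
    (5.0, 3.4, 1.5, 0.2), (4.9, 3.1, 1.4, 0.1),
    (6.0, 2.8, 4.5, 1.4), (6.1, 2.9, 4.6, 1.5),
    (7.0, 3.1, 6.0, 2.2), (7.2, 3.0, 6.1, 2.3),
]
print(cluster_into_k(rows, 3))  # [0, 0, 1, 1, 2, 2]
```

`math.dist` requires Python 3.8 or later. The nested search is cubic or worse in the number of samples, which is slow but still workable for a data set of 150 rows.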

 



Test your answer by running: /home/fac/clei/checker/Hierarchical/irisCheckerClustering YourAns.txt
(SampleAns)