Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 

Data Science

 

K-means Clustering

 


Cellular Network

The currently deployed wireless networks such as GSM, CDMA, and LTE are known as cellular networks. In a cellular network, the entire coverage area is divided into smaller cells, and mobile subscribers in each cell connect over radio frequencies to receive voice and data services. Each cell houses one base station (i.e., a BTS, eNodeB, or eNB).

The base stations are interconnected in different topologies, such as star or mesh, and they interface with MSCs, the PSTN, and the PSDN in the backbone.

 





 

What is Cluster Analysis?

Detailed explanation.

 


 

Examples of different types of clusters.






What ideas can be borrowed from the cellular network?

K-means!!!
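The cellular picture maps onto K-means directly: each point "connects" to its nearest centroid, the way a subscriber connects to the nearest base station. Below is a minimal sketch of the standard Lloyd iteration, assuming NumPy and Euclidean distance. The function name `kmeans`, its parameters, and the toy points are illustrative choices, not part of the course material.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal Lloyd-style K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)
    for _ in range(max_iter):
        # Assignment step: every point joins its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Toy data (made up): two well-separated groups of three points.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(pts, k=2)
print(labels)
```

The cluster numbering (which group is called 0 and which 1) depends on the random start, so only the grouping itself is meaningful.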

 

How to evaluate K-means?

Evaluation method
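The outline does not say which evaluation method is meant. One commonly used measure for K-means is the sum of squared errors (SSE): the total squared distance from each point to its assigned centroid, where lower is better for a fixed K. The sketch below assumes NumPy; the function name `sse` and the toy values are hypothetical.

```python
import numpy as np

def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]  # one row per point
    return float(np.sum(diffs ** 2))

# Toy data (made up): two clusters, each point exactly 1 unit from its centroid.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
total = sse(points, centroids, labels)
print(total)
```

SSE always decreases as K grows, so it is better suited to comparing runs with the same K than to choosing K directly.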

 

Does K-means work for all?

Limitations







 

Try to implement the algorithm and test with the following test cases.

The first two columns are the x and y coordinates, and the third column is the group label.
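A file in this three-column layout can be read and compared against a clustering result roughly as follows. This is a sketch under assumptions: the columns are taken to be whitespace-separated (the actual delimiter is not stated), the sample rows are made up, and `purity` is a hypothetical helper that scores how well predicted clusters match the given group labels.

```python
import io
from collections import Counter
import numpy as np

# Made-up rows standing in for a real data file: x, y, group label.
sample = io.StringIO("1.0 2.0 1\n1.5 1.8 1\n8.0 8.0 2\n9.0 9.5 2\n")
data = np.loadtxt(sample)           # assumes whitespace-separated columns
xy = data[:, :2]                    # coordinates to feed to the clustering code
truth = data[:, 2].astype(int).tolist()

def purity(predicted, truth):
    """Fraction of points that carry the majority true label of their cluster."""
    total = 0
    for c in set(predicted):
        members = [t for p, t in zip(predicted, truth) if p == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(truth)

perfect = purity([0, 0, 1, 1], truth)  # hypothetical clustering output
flawed = purity([0, 0, 0, 1], truth)   # one point placed in the wrong cluster
print(perfect, flawed)
```

Purity ignores how the clusters are numbered, which matters because K-means assigns cluster indices arbitrarily.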

DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot
DataFile ClusterPlot

 




Iris flower data set

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.

 

Here is the Data.

Try to modify your program and solve this clustering problem.

 



Test your answer by running:
/home/fac/clei/checker/Kmeans/irisCheckerClustering YourAns.txt
(SampleAns)




  • Zachary's karate club

A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study a conflict arose between the administrator "John A" and instructor "Mr. Hi" (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate.

 

 

Data Download

Data Download (Matrix format)

Try to figure out a way to modify your program to handle this classic clustering problem, and display your result.
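The karate club data is a graph, not a set of coordinates, so the nodes need some numeric representation before a distance-based method applies. One possible route, offered only as a sketch and not necessarily the intended solution, is a spectral embedding: build the graph Laplacian from the adjacency matrix and use the eigenvector of its second-smallest eigenvalue (the Fiedler vector) as a one-dimensional coordinate per node. The toy graph below (two triangles joined by one edge) is made up, and for two groups a simple sign split stands in for running K-means on the embedding.

```python
import numpy as np

# Toy adjacency matrix (made up): triangle 0-1-2, triangle 3-4-5, bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Unnormalized graph Laplacian: degree matrix minus adjacency matrix.
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]  # eigenvector of the second-smallest eigenvalue

# With two groups, the sign of the Fiedler vector gives the split.
groups = (fiedler > 0).astype(int)
print(groups)
```

For more than two groups, the first few eigenvectors can be stacked as coordinates and passed to an ordinary K-means routine.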

 




  • Spotify Songs Dataset

Spotify is a Swedish audio streaming and media service provider founded on 23 April 2006 by Daniel Ek and Martin Lorentzon. It is one of the largest music streaming service providers, with over 602 million monthly active users, including 236 million paying subscribers, as of December 2023.

Spotify offers digital copyright-restricted recorded audio content, including more than 100 million songs and five million podcasts, from record labels and media companies.

 

Here is a dataset of Spotify tracks over a range of 125 different genres.
Each track has some audio features associated with it. The data is in CSV format, with a total of 114,000 songs.

Can you arrange these 114,000 songs into groups of similar songs?
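If the audio features sit on very different numeric scales, the feature with the largest range will dominate Euclidean distance, so rescaling before clustering is a common first step. The sketch below shows z-score standardization with NumPy. The two feature columns and their values are made up for illustration; the real column names and ranges should be checked in the CSV.

```python
import numpy as np

# Made-up feature matrix: column 0 on a scale of roughly 0-200,
# column 1 on a scale of 0-1 (hypothetical stand-ins for audio features).
features = np.array([[120.0, 0.80],
                     [ 90.0, 0.35],
                     [150.0, 0.60],
                     [ 60.0, 0.10]])

# Z-score standardization: each column gets mean 0 and standard deviation 1.
mean = features.mean(axis=0)
std = features.std(axis=0)
scaled = (features - mean) / std

print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

The scaled matrix can then be passed to a K-means routine in place of the raw features.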

 




Modified K-means style Algorithms

Choosing better initial centroid estimates: K-means++, Intelligent K-Means, Genetic K-Means

Choosing different representative prototypes for the clusters: K-Medoids, K-Medians, K-Modes

Applying feature transformation techniques: Weighted K-Means, Kernel K-Means
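As an illustration of the first family, here is a minimal sketch of K-means++ seeding, assuming NumPy. The first centroid is picked uniformly at random; each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far, which spreads the starting centroids out. The function name `kmeans_pp_init` and the toy points are hypothetical.

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    """Pick k starting centroids from the data using K-means++ seeding."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point.
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        # Far-away points are more likely to be chosen; chosen points have d2 = 0.
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)

# Toy data (made up): six distinct points.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
init = kmeans_pp_init(pts, k=3)
print(init)
```

These centroids would replace the purely random starting points in an ordinary K-means loop; the assignment and update steps stay the same.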