Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 

Data Science

 

K-Nearest Neighbors algorithm

 



“You are who you associate with. Look around at your five closest friends and that’s who you are. If you don’t want to be that person, you know what you gotta do.”

Will Smith

“When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.”

— Indiana poet James Whitcomb Riley

 




 k-NN classification

The output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.
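The plurality vote described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's reference solution: the function name `knn_classify`, the `(features, label)` data layout, and the toy points are all assumptions made for the example, and Euclidean distance is used via `math.dist` (Python 3.8+).

```python
import math
from collections import Counter

def knn_classify(known, query, k=3):
    """known: list of (features, label) pairs; query: a feature sequence."""
    # Order the known points from nearest to farthest from the query.
    by_distance = sorted(known, key=lambda pair: math.dist(pair[0], query))
    # Count the labels of the k closest points.
    votes = Counter(label for _, label in by_distance[:k])
    # Take the most common label; on a tie, the label seen first
    # (i.e. belonging to the nearer neighbor) wins.
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated classes.
known = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
print(knn_classify(known, (1.1, 1.0), k=3))  # prints "A"
```

With k=3 the query's neighbors are two "A" points and one "B" point, so the vote goes to "A".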

Detailed explanation  




Iris flower data set

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.

Here is the KnownData; make your predictions for the UnknownData.
Test your answer by running /home/fac/clei/checker/KNN/irisChecker20 YourAns.txt
(SampleAns)

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$$
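The distance formula above translates directly into code: a_r(x) is the r-th attribute of point x, and the sum runs over all n attributes. The function name `euclidean_distance` is an assumption for illustration.

```python
import math

def euclidean_distance(xi, xj):
    """Euclidean distance between two points with the same number of attributes."""
    if len(xi) != len(xj):
        raise ValueError("points must have the same number of attributes")
    # Sum the squared attribute differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean_distance((0, 0), (3, 4)))  # prints 5.0
```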

18/19




How to determine the best "K" and the best "Nearest"?

Exhaust all the possibilities!
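One way to read "exhaust all the possibilities" is a brute-force search: try every combination of k and distance function, score each on held-out data, and keep the best. The sketch below assumes that reading. The two candidate metrics (Euclidean and Manhattan), the helper names, and the toy data are illustrative assumptions, not part of the assignment.

```python
import math
from collections import Counter
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def predict(train, query, k, dist):
    # Plurality vote among the k nearest training points.
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, validation, k, dist):
    # Fraction of validation points whose predicted label is correct.
    hits = sum(predict(train, x, k, dist) == y for x, y in validation)
    return hits / len(validation)

def exhaustive_search(train, validation, ks, metrics):
    """Return (best_accuracy, k, metric_name) over every (k, metric) pair."""
    best = None
    for k, (name, dist) in product(ks, metrics.items()):
        score = accuracy(train, validation, k, dist)
        # Strict '>' keeps the first combination that reaches the best score.
        if best is None or score > best[0]:
            best = (score, k, name)
    return best

# Toy data (hypothetical): two well-separated clusters.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
validation = [((0.5, 0.5), "A"), ((5.5, 5.5), "B")]
metrics = {"euclidean": euclidean, "manhattan": manhattan}
print(exhaustive_search(train, validation, ks=[1, 3, 5], metrics=metrics))
```

The score is computed on data that was not used as neighbors; scoring on the training points themselves would always favor k=1.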

 

How to split the data?
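One common answer, offered here only as a sketch, is to shuffle the known data and hold out a fraction of it for validation, so that each candidate setting is scored on points it has not seen. The function name, the 20% default, and the fixed seed are assumptions chosen for the example.

```python
import random

def train_validation_split(data, validation_fraction=0.2, seed=0):
    """Shuffle a copy of data and return (train, validation) lists."""
    shuffled = list(data)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one record for validation.
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

train, validation = train_validation_split(list(range(10)))
print(len(train), len(validation))  # prints 8 2
```

Repeating the split with several seeds, or rotating the held-out part as in k-fold cross-validation, gives a less noisy estimate on small data sets.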

 




Here is a public dataset drawn from the U.S. Army Anthropometric Survey, from the University of Michigan.

Try your program on the following two datasets:

KnownData1    KnownData2
UnknownData1    UnknownData2

Test your answer (Sample Answer) by
/home/fac/clei/checker/KNN/armyChecker1 YourAnsForData1.txt
/home/fac/clei/checker/KNN/armyChecker2 YourAnsForData2.txt

Why?

How to fix?




 k-NN regression

The output is the property value for the object. This value is the average of the values of k nearest neighbors.
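A minimal sketch of this averaging step follows, assuming Euclidean distance and a list of `(features, value)` pairs. The name `knn_regress` and the toy numbers are illustrative assumptions only.

```python
import math

def knn_regress(known, query, k=3):
    """known: list of (features, value) pairs; returns the mean value of the k nearest."""
    # Keep the k points closest to the query.
    nearest = sorted(known, key=lambda pair: math.dist(pair[0], query))[:k]
    # The prediction is the plain (unweighted) average of their values.
    return sum(value for _, value in nearest) / len(nearest)

# Toy data (hypothetical): one feature, one target value.
known = [((1.0,), 100.0), ((2.0,), 120.0), ((3.0,), 140.0), ((10.0,), 400.0)]
print(knn_regress(known, (2.0,), k=3))  # prints 120.0
```

The distant point at 10.0 is not among the three nearest neighbors, so it does not affect the prediction.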

 

 

 Airbnb is an internet marketplace for short-term home and apartment rentals. It allows you to, for example, rent out your home for a week while you’re away, or rent out your spare bedroom to travelers.

Airbnb doesn’t release any data on the listings in its marketplace, but a separate group named Inside Airbnb has extracted data on a sample of the listings for many of the major cities on the website.

Here is an example using Amsterdam price data:
KnownData   UnknownData

Calculate the SSE (sum of squared errors) of your answer (SampleAns) with
/home/fac/clei/checker/KNN/airbnbChecker YourAns.txt
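For reference, the SSE is the sum over all predictions of the squared difference between the predicted and the true value. The sketch below computes it for two equal-length lists. It is not the checker's code, and how the checker reads its input is not assumed here.

```python
def sum_squared_error(predicted, actual):
    """SSE = sum of (predicted - actual)^2 over all items."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

# Errors of -10 and +5 give 100 + 25.
print(sum_squared_error([100.0, 150.0], [110.0, 145.0]))  # prints 125.0
```

A lower SSE means the predicted prices are closer to the true ones; because errors are squared, a few large misses dominate the total.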




How to evaluate accuracy?





Let's use the KNN idea to solve the handwriting recognition problem.

In our test case, we will use 32-by-32-pixel pictures in PNG format (Example)

Generate your own handwritten digits (Here is mine)

Use this Python code to convert each picture into matrix format. (Here is mine)

Change the very first digit in the matrix to the label (example)

 

 

Now, try to use the KNN method to build a classifier to recognize what digit the input (matrix format) picture represents.
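As a starting point, each picture can be flattened into one long vector and compared pixel by pixel. The sketch below assumes each image is already available as a 32-by-32 list of 0/1 pixel values with its label stored separately; the actual layout of the course's matrix files may differ. It counts differing pixels (Hamming distance), which on 0/1 data ranks neighbors the same way as Euclidean distance. All names and the two synthetic shapes are hypothetical.

```python
from collections import Counter

def flatten(matrix):
    """Turn a 2-D grid of pixels into one flat list, row by row."""
    return [pixel for row in matrix for pixel in row]

def hamming(a, b):
    """Number of positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def classify_digit(known, matrix, k=3):
    """known: list of (flat_pixels, label); matrix: 2-D grid to classify."""
    query = flatten(matrix)
    nearest = sorted(known, key=lambda pair: hamming(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Synthetic 32x32 shapes (hypothetical stand-ins for real handwriting).
SIZE = 32
bar = [[1 if c == 16 else 0 for c in range(SIZE)] for r in range(SIZE)]
box = [[1 if r in (0, SIZE - 1) or c in (0, SIZE - 1) else 0
        for c in range(SIZE)] for r in range(SIZE)]
known = [(flatten(bar), 1), (flatten(box), 0)]

# A bar shifted one column to the right is still closest to the "1" shape.
shifted_bar = [[1 if c == 17 else 0 for c in range(SIZE)] for r in range(SIZE)]
print(classify_digit(known, shifted_bar, k=1))  # prints 1
```

Real handwriting varies in position and stroke width, so several samples per digit are needed for the vote to be reliable.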

Download the data from here