Classify Data Based On K-Nearest Neighbor Algorithm In Machine Learning

Gul Md Ershad
6y

Introduction

K-Nearest Neighbour (KNN) is a basic classification algorithm in Machine Learning. It is a supervised learning method that is often used for classification problems in industry, in areas such as pattern recognition and data mining. It stores all the available cases from the training dataset and classifies each new case with a distance function: the new case gets the class that is most common among its k nearest stored cases.

I will explain the KNN algorithm with the help of the "Euclidean Distance" formula.

Euclidean Distance

The Euclidean distance formula measures the straight-line distance between two points in the plane. It is a very common way to get the distance between two points.

Let's say (x1, y1) and (x2, y2) are two points in 2-dimensional space. The distance between them follows from the Pythagorean formula.

Then, the Euclidean distance between (x1, y1) and (x2, y2) is,

d = √((x2 - x1)^2 + (y2 - y1)^2)
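As a quick check of the formula, here is a minimal sketch that computes the distance between two example points. The function name distance_2d and the points (1, 2) and (4, 6) are made up for illustration; the points are chosen so that the two legs are 3 and 4, which gives a distance of 5.

```python
from math import sqrt

def distance_2d(p1, p2):
    """Euclidean distance between two 2-D points given as (x, y) tuples."""
    (x1, y1), (x2, y2) = p1, p2
    return sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

# Hypothetical points: the differences are 3 in x and 4 in y.
d = distance_2d((1, 2), (4, 6))
print(d)  # 5.0
```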

So, in short form,

Data Classification based on Euclidean Distance Formula

We have two different kinds of data, described below.

Training Data

This set of data contains every point together with its classification.

Each row of the training data gives the x and y values and the classification for that point.

Test Data

This set of data contains only the values of x and y. Its classification type will be predicted based on the training data.

This set of test data doesn't contain a classification type, so it will be predicted.
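To make the two file layouts concrete, the sketch below writes a small training file and a small test file. The column layout (x, y, class for training; x, y for test; no header row) is inferred from how the implementation reads row[0], row[1] and row[2]. The values and the labels "A" and "B" are made up for illustration only and are not the data used in this article. Note that running it overwrites any existing files with these names in the current folder.

```python
import csv

# Hypothetical training rows: x, y, class label ("A"/"B" are assumed labels).
training_rows = [
    (1.0, 1.0, "A"),
    (1.5, 2.0, "A"),
    (2.0, 1.0, "A"),
    (6.0, 6.0, "B"),
    (7.0, 7.5, "B"),
    (6.5, 8.0, "B"),
]

# Hypothetical test rows: x, y only; the class is what KNN will predict.
test_rows = [(1.2, 1.4), (6.8, 7.0)]

# newline="" is the setting the csv module recommends for files it writes.
with open("training_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(training_rows)

with open("test_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(test_rows)
```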

Implementation

Import the libraries below.

import csv
import sys
from collections import Counter
from math import sqrt

Import training data set.

x = []
y = []
z = []
# Each row holds an x value, a y value and a class label.
with open('training_data.csv', 'rt') as f:
    reader = csv.reader(f)
    for row in reader:
        x.append(float(row[0]))
        y.append(float(row[1]))
        z.append(row[2])
coordinates = list(zip(x, y))
# Map each (x, y) point to its class label.
input_data = {coordinates[i]: z[i] for i in range(len(coordinates))}

Import test data set.

test_x = []
test_y = []
# Each row holds only an x value and a y value.
with open('test_data.csv', 'rt') as f:
    reader = csv.reader(f)
    for row in reader:
        test_x.append(float(row[0]))
        test_y.append(float(row[1]))
test_coordinates = list(zip(test_x, test_y))
print(test_coordinates)

Generate the Euclidean distance.

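def euclidean_distance(x, y):
    # Both points must have the same number of coordinates.
    if len(x) != len(y):
        raise ValueError("Error: try equal length vectors")
    # Square root of the sum of squared coordinate differences.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))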

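KNN classifier.

def knn_classifier(neighbors, input_data):
    # Look up the class label of each neighbor.
    knn = [input_data[i] for i in neighbors]
    # Count the labels and pick the one that occurs most often.
    knn = Counter(knn)
    classifier, _ = knn.most_common(1)[0]
    return classifier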
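The majority vote inside knn_classifier can be seen in isolation with a short sketch. The label lists here are hypothetical. The tie case relies on the documented behaviour of Counter.most_common in Python 3.7 and later, where elements with equal counts keep the order in which they were first encountered.

```python
from collections import Counter

# Hypothetical labels of the 3 nearest training points, nearest first.
nearest_labels = ["A", "B", "A"]

votes = Counter(nearest_labels)
label, count = votes.most_common(1)[0]
print(label, count)  # A 2

# With an even k a tie is possible; the label seen first wins.
tied_label, _ = Counter(["B", "A"]).most_common(1)[0]
print(tied_label)  # B
```

With two classes, an odd value of k (such as the 3 used in this article) avoids ties altogether.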

Generate Neighbours.

def neighbors(k, trained_points, new_point):
    # Distance from every training point to the new point.
    neighbor_distances = {}
    for point in trained_points:
        if point not in neighbor_distances:
            neighbor_distances[point] = euclidean_distance(point, new_point)
    # Sort by distance, nearest first, and keep the k closest points.
    nearest_first = sorted(neighbor_distances.items(), key=lambda x: x[1])
    k_nearest_neighbors = list(zip(*nearest_first[:k]))
    return list(k_nearest_neighbors[0])
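The neighbors function above sorts all the distances and then keeps the first k. As an alternative sketch (not the method used in this article), the standard library's heapq.nsmallest can pick the k closest points directly. The function name k_nearest and the sample points are hypothetical, and math.dist assumes Python 3.8 or later.

```python
import heapq
from math import dist  # available in Python 3.8+

def k_nearest(k, trained_points, new_point):
    """Return the k training points closest to new_point, nearest first."""
    return heapq.nsmallest(k, trained_points, key=lambda p: dist(p, new_point))

# Hypothetical training points and query point.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5)]
print(k_nearest(3, points, (1.2, 1.4)))
# [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
```

For small training sets the two approaches behave the same; nsmallest mainly helps when the training set is large and k is small.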

Print Results.

results = {}
for item in test_coordinates:
    # Classify each test point by its 3 nearest training points.
    results[item] = knn_classifier(neighbors(3, input_data.keys(), item), input_data)
print(results)

Output

Here, each test point (x, y) has been classified into one of the groups from the training data. I have attached the zipped Python code. Python 3 or above is required to run this code.

Conclusion

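The K-Nearest Neighbor algorithm is an important supervised learning algorithm in Machine Learning. It classifies a new point by finding the k training points nearest to it, here measured with the Euclidean distance, and assigning the class that is most common among them.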