Understanding K-Means Clustering using Python the easy way

k-Means Clustering using Python

In the previous article, we studied the k-NN. One thing that I believe is that if we can correlate anything with us or our lives, there are greater chances of understanding the concept. So I will try to explain everything by relating it to humans.

What is k-Means Clustering?

k-means algorithm is an iterative algorithm that tries to partition the dataset into 'k' pre-defined distinct non-overlapping subgroups or clusters where each data point belongs to only one group. It tries to make the inter-cluster data points as similar as possible while also keeping the clusters as different or as far as possible. It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid is at the minimum. The less variation we have within clusters, the more homogeneous the data points are within the same cluster.

Why k-Means Clustering?

  1. K-means converges in a finite number of iterations. Since the algorithm iterates a function whose domain is a finite set, the iteration must eventually converge.
  2. The computational cost of the k-means algorithm is O(k*n*d), where n is the number of data points, k the number of clusters, and d the number of attributes.
  3. Compared to other clustering methods, the k-means clustering technique is fast and efficient in terms of its computational cost.
  4. It’s difficult to predict the optimal number of clusters or the value of k. To find the number of clusters, we need to run the k-means clustering algorithm for a range of k values and compare the results.

What is meant by Clustering?

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups.

Types of Clustering

1. Centroid-based Clustering
Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to hierarchical clustering defined below. k-means is the most widely-used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. 
2. Density-based Clustering
Density-based clustering connects areas of high example density into clusters. This allows for arbitrary-shaped distributions as long as dense areas can be connected. These algorithms have difficulty with data of varying densities and high dimensions. Further, by design, these algorithms do not assign outliers to clusters.
3. Distribution-based Clustering
This clustering approach assumes data is composed of distributions, such as Gaussian distributions. The distribution-based algorithm clusters data into three Gaussian distributions. As the distance from the distribution's center increases, the probability that a point belongs to the distribution decreases. The bands show a decrease in probability. When you do not know the type of distribution in your data, you should use a different algorithm.
4. Hierarchical Clustering
Hierarchical clustering creates a tree of clusters. Hierarchical clustering, not surprisingly, is well suited to hierarchical data, such as taxonomies. See Comparison of 61 Sequenced Escherichia coli Genomes by Oksana Lukjancenko, Trudy Wassenaar & Dave Ussery for an example. In addition, another advantage is that any number of clusters can be chosen by cutting the tree at the right level.

Difference between Clustering and Regression

Type Supervised Learning Unsupervised Learning
Description Process of classifying the input instances based on their corresponding class labels Grouping the instances based on their similarity without the help of class labels
The necessity of Training and Testing Dataset It has labels so there is a need for training and testing dataset for verifying the model created There is no need for training and testing dataset
Complexity More complex as compared to clustering Less complex as compared to the classification
Example Algorithms Logistic regression, Naive Bayes classifier, Support vector machines, etc. k-means clustering algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM) clustering algorithm, etc.

Why is k-Means clustering nondeterministic?

The basic k-means clustering is based on a non-deterministic algorithm. This means that running the algorithm several times on the same data could give different results, i.e. it may be possible that the centroid value for a given class may come out to be different at different instances of time for the same dataset. 
For instance, at a given instance we took D1 and D2 to find the centroid and then we took D1 and D5, the value of centroid will change, hence the resulting model will change  

How does k-Means clustering work?


Step 1
Begin with a decision on the value of k = number of clusters.​
Step 2
Put any initial partition that classifies the data into k clusters. You may  assign the training samples randomly, or systematically as the following: ​
       1. Take the first k training sample as single- element clusters      ​
       2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster. ​
Step 3
Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroid of the cluster gaining the new sample and the cluster losing the sample. ​
Step 4
Repeat step 3 until convergence is achieved, that is until a pass through the training sample causes no new assignments. 

What are Centroids?

In mathematics and physics, the centroid or geometric center of a plane figure is the arithmetic mean position of all the points in the figure. Informally, it is the point at which a cutout of the shape could be perfectly balanced on the tip of a pin.

What are Centroids in k-Means?

The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters. Initially, the k number of so-called centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster. Each centroid is an existing data point in the given input data set, picked at random, such that all centroids are unique (that is, for all centroids ci and cj, ci ≠ cj). These centroids are used to train a classifier. The resulting classifier is used to classify (using k = 1) the data and thereby produces an initial randomized set of clusters. Each centroid is thereafter set to the arithmetic mean of the cluster it defines. The process of classification and centroid adjustment is repeated until the values of the centroids stabilize. The final centroids will be used to produce the final classification/clustering of the input data, effectively turning the set of initially anonymous data points into a set of data points, each with a class identity.

Advantages/Features of k-Means

  1. Easy to implement
  2. With a large number of variables, K-­‐Means may be computationally faster than hierarchical clustering (if K is small).
  3. k-­Means may produce tighter clusters than hierarchical clustering
  4. An instance can change cluster (move to another cluster) when the centroids are re-­computed 

Disadvantages/Shortcomings of k-Means

  1. Difficult to predict the number of clusters (K-­‐Value)​
  2. Initial seeds have a strong impact on the final results​
  3. The order of the data has an impact on the final results​
  4. Sensitive to scale: rescaling your datasets (normalization or standardization) will​ completely change results. While this itself is not bad, not realizing that you have to spend extra a4ention to scaling your data might be bad.​ 

Applications of k-Means

k-means can be applied to data that has a smaller number of dimensions, is numeric, and is continuous.
  1. Document Classification
  2. Delivery Store Optimization
  3. Identifying Crime Localities
  4. Customer Segmentation
  5. Fantasy League Stat Analysis
  6. Insurance Fraud Detection
  7. Rideshare Data Analysis
  8. Cyber-Profiling Criminals
  9. Call Record Detail Analysis
  10. Automatic Clustering of IT alerts

k-Means using an Example

Let's try to understand k-means using a mathematical example. Given the following dummy dataset with two clusters A and B.
Iteration 1
Data Points X axis Y axis
D1 2 0
D2 1 3
D3 3 5
D4 2 2
D5 4 6
Let's start by finding the Euclidean Distance given by the formula: √((x1-y1)² + (x2-y2)²
Distance between D1 and D2 : √((2-1)² + (0-3)²
= √((1)² + (3)²
= √(10)
= 3.17
Distance between D1 and D4 : √((2-2)² + (0-2)²
= √((0)² + (-2)²
= √(4)
= 2
Similarly, let us find the distance between other points
Data Points Distance between D2 and other data points Distance between D4 and other data points
D1 3.17 2.0
D3 2.83 3.17
D5 4.25 4.48
Seeing the above data, we can form the following two groups
Cluster 1: D1, D4
Cluster 2: D2, D3, D5
Now our next step is to find the mean values and the new centroid values with respect to the corresponding to the centroid values
Data Points Mean value of data points along the x-axis Distance between D4 and other data point
D1, D4 2.0 1.0
D2, D3, D5 2.67 4.67
Hence the centroid values are
Cluster 1: (2.0, 1.0)
Cluster 2: (2.67, 4.67)
Iteration 2
Based on the above data, we again calculate the euclidean distance for the new centroids.
Data Points Distance between D2 and other data points Distance between D4 and other data points
D1 1.0 4.72
D2 2.24 2.37
D3 4.13 0.47
D4 1 2.76
D5 5.39 1.89
From the above data, we infer the following
Cluster 1: (D1, D2, D4)
Cluster 2: (D3, D5)
Since the clusters have changed so we will now calculate the mean values again and the corresponding centroid values
Data Points Mean value of data points along the x-axis Distance between D4 and other data point
D1, D2, D4 1.67 1.67
D3, D5 3.5 5.5
So now the centroid values are:
Cluster 1: (1.67, 1.67)
Cluster 2: (3.5, 5,5)
The above process has to be repeated until we find a constant value for centroids and the latest cluster will be considered as the final cluster solution.

Python Implementation of K-Means Clustering

I have used Google Colab to run and test my code, you can use any of the IDEs or development environments. Feel free to use any dataset, there some very good datasets available on kaggle and with Google Colab.
Before we start with this, it is highly recommended you read the following tutorials
  1. Python Pandas
  2. Python Numpy
  3. Python Scikit Learn
  4. Python MatPlotLib
  5. Python Seaborn
  6. Python Tensorflow

1. Using functions

In this, I will be using some dummy data, you can use any data, that you want.
  1. # Importing the required libraries 
  2. import matplotlib.pyplot as plt  
  3. from matplotlib import style  
  4. style.use('ggplot')  
  5. import numpy as np  
  7. #Dummy data  
  8. X = np.array([[12],  
  9.               [1.51.8],  
  10.               [58 ],  
  11.               [88],  
  12.               [10.6],  
  13.               [9,11],  
  14.               [1,3],  
  15.               [8,9],  
  16.               [0,3],  
  17.               [5,4],  
  18.               [6,4],  
  19.               [1,2],  
  20.               [2,4],  
  21.               [11,5],  
  22.               [2.56,5.6],  
  23.               [22],  
  24.               [1.81.8],  
  25.               [88 ],  
  26.               [55],  
  27.               [0.60.6],  
  28.               [11,11],  
  29.               [3,3],  
  30.               [9,9],])  
  32. #Plot of the Dummy Data  
  33. plt.scatter(X[:,0], X[:,1], s=150)  
  34. plt.show()  
  36. #Defining the colors that can be used in plots  
  37. colors =["g","r","c","b","k"]  
  39. #Defination of the class K_Means  
  40. #Using this class we are implementing all the steps required  
  41. #to be performed to generate a k-means model  
  42. class K_Means:  
  43.     def __init__(self, k=5, tol=0.001, max_iter=300):  
  44.         self.k = k  
  45.         self.tol = tol  
  46.         self.max_iter = max_iter  
  48.     #Function to fit the dummy mode on to the model  
  49.     def fit(self,data):  
  51.         self.centroids = {}  
  53.         for i in range(self.k):  
  54.             self.centroids[i] = data[i]  
  56.         for i in range(self.max_iter):  
  57.             self.classifications = {}  
  59.             for i in range(self.k):  
  60.                 self.classifications[i] = []  
  62.             for featureset in data:  
  63.                 distances = [np.linalg.norm(featureset-self.centroids[centroid]) for centroid in self.centroids]  
  64.                 classification = distances.index(min(distances))  
  65.                 self.classifications[classification].append(featureset)  
  67.             prev_centroids = dict(self.centroids)  
  69.             for classification in self.classifications:  
  70.                 self.centroids[classification] = np.average(self.classifications[classification],axis=0)  
  72.             optimized = True  
  74.             for c in self.centroids:  
  75.                 original_centroid = prev_centroids[c]  
  76.                 current_centroid = self.centroids[c]  
  77.                 if np.sum((current_centroid-original_centroid)/original_centroid*100.0) > self.tol:  
  78.                     np.sum((current_centroid-original_centroid)/original_centroid*100.0)  
  79.                     optimized = False  
  81.             if optimized:  
  82.                 break  
  84.     #Function to generate prediction result  
  85.     def predict(self,data):  
  86.         distances = [np.linalg.norm(data-self.centroids[centroid]) for centroid in self.centroids]  
  87.         classification = distances.index(min(distances))  
  88.         return classification  
  90. clf = K_Means()  
  91. clf.fit(X)  
  93. #putting the centroid values on to the plot  
  94. for centroid in clf.centroids:  
  95.     plt.scatter(clf.centroids[centroid][0], clf.centroids[centroid][1],  
  96.                 marker="o", color="k", s=150, linewidths=5)  
  98. for classification in clf.classifications:  
  99.     color = colors[classification]  
  100.     for featureset in clf.classifications[classification]:  
  101.         plt.scatter(featureset[0], featureset[1], marker="x", color=color, s=150, linewidths=5)  
  103. #Priting the centroid values  
  104. for centroid in clf.centroids:  
  105.     print(clf.centroids[centroid][0],clf.centroids[centroid][1])  
  107. plt.show()  
The centroid values I am getting are:
1.0                            2.8
2.075                        2.15
4.712                        5.32
9.142857142857142 8.714285714285714
0.8                            0.6

2. Using Sklearn

Here I am using dummy data, you can use any dataset of your choice.
  1. # importing all the required libraries  
  2. import matplotlib.pyplot as plt    
  3. %matplotlib inline  
  4. import numpy as np    
  5. from sklearn.cluster import KMeans    
  7. #dummy data  
  8. X = np.array([[12],    
  9.               [1.51.8],    
  10.               [58 ],    
  11.               [88],    
  12.               [10.6],    
  13.               [9,11],    
  14.               [1,3],    
  15.               [8,9],    
  16.               [0,3],    
  17.               [5,4],    
  18.               [6,4],    
  19.               [1,2],    
  20.               [2,4],    
  21.               [11,5],    
  22.               [2.56,5.6],    
  23.               [22],    
  24.               [1.81.8],    
  25.               [88 ],    
  26.               [55],    
  27.               [0.60.6],    
  28.               [11,11],    
  29.               [3,3],    
  30.               [9,9],])   
  32. #plotting the dummy data  
  33. plt.scatter(X[:,0],X[:,1], label='True Position')   
  34. plt.show()   
  36. #creating the an object of sklearn.cluster.KMeans class  
  37. #Passed the number of clusers value to be 3, i.e. we are making a 3-Means clusering model  
  38. kmeans = KMeans(n_clusters=3)    
  40. #passing the data to the model  
  41. kmeans.fit(X)  
  43. #printing the centroid values  
  44. print(kmeans.cluster_centers_)  
  46. #Printing the predicted values  
  47. print(kmeans.labels_)  
  49. #Plotting the data based upon their predicted cluster  
  50. plt.scatter(X[:,0],X[:,1], c=kmeans.labels_, cmap='rainbow')  
  51. plt.show() 
The output that I got is
[[1.35454545 2.16363636]
[9.14285714 8.71428571]
[4.712 5.32 ]]
Points classified into classes:
[0 0 2 1 0 1 0 1 0 2 2 0 0 1 2 0 0 1 2 0 1 0 1]

3. Using TensorFlow

Here I am using dummy data, you can use any dataset that you wish to use.
  1. #importing the required libraries  
  2. import matplotlib.pyplot as plt  
  3. import numpy as np  
  4. import tensorflow as tf  
  6. points_n = 200 #number of data points  
  7. clusters_n = 3 #number of classes in which points needs to be classifies  
  8. iteration_n = 100 #epoch value  
  10. #initializing the placehlders  
  11. points = tf.constant(np.random.uniform(010, (points_n, 2)))  
  12. centroids = tf.Variable(tf.slice(tf.random_shuffle(points), [00], [clusters_n, -1]))  
  13. points_expanded = tf.expand_dims(points, 0)  
  14. centroids_expanded = tf.expand_dims(centroids, 1)  
  16. #Creation of tensor to hold the euclidean disatnce formula  
  17. distances = tf.reduce_sum(tf.square(tf.subtract(points_expanded, centroids_expanded)), 2)  
  18. assignments = tf.argmin(distances, 0)  
  20. #logic to generate the model  
  21. means = []  
  22. for c in range(clusters_n):  
  23.     means.append(tf.reduce_mean(  
  24.       tf.gather(points,   
  25.                 tf.reshape(  
  26.                   tf.where(  
  27.                     tf.equal(assignments, c)  
  28.                   ),[1,-1])  
  29.                ),reduction_indices=[1]))  
  31. new_centroids = tf.concat(means, 0)  
  33. update_centroids = tf.assign(centroids, new_centroids)  
  34. init = tf.global_variables_initializer()  
  36. #session to execute the logic and generate the model  
  37. with tf.Session() as sess:  
  38.   sess.run(init)  
  39.   for step in range(iteration_n):  
  40.     [_, centroid_values, points_values, assignment_values] = sess.run([update_centroids, centroids, points, assignments])  
  42.   #to print the centroid values    
  43.   print("centroids", centroid_values)  
  45. #plotting of the dummy data with centroid values  
  46. plt.scatter(points_values[:, 0], points_values[:, 1], c=assignment_values, s=50, alpha=0.5)  
  47. plt.plot(centroid_values[:, 0], centroid_values[:, 1], 'kx', markersize=15)  
  48. plt.show() 
The output that I am getting is:
[[3.81697836 2.2322892 ]
[2.72036385 6.65968922]
[8.123787 5.82232576]]


In this article, we studied what  k-means clustering is, why we use k-means clustering, what is meant by clustering, types of clustering, the difference between clustering and regression, how k-means clustering works, what are centroids, what are centroids in k-means, advantages of k-means, disadvantages of k-means, applications of k-means, k-means using an example and python implementation of the k-NN algorithm using functions, sklearn, and TensorFlow. Hope you were able to understand each and everything. For any doubts, please comment on your query.
In the next article, we will learn about the Support Vector Machine.
Congratulations!!! You have climbed your next step in becoming a successful ML Engineer.
Next Article In this Series >> Support Vector Machine