Machine Learning: K-Means Clustering


In the previous chapter, you studied Multiple Linear Regression.
In this chapter, we will study Multiple Linear Regression.
Note: if you can correlate anything with yourself or your life, there are greater chances of understanding the concept. So try to understand everything by relating it to humans.

What is k-Means Clustering?

k-means algorithm is an iterative algorithm that tries to partition the dataset into 'k' pre-defined distinct non-overlapping subgroups or clusters where each data point belongs to only one group. It tries to make the inter-cluster data points as similar as possible while also keeping the clusters as different or as far as possible. It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid is at the minimum. The less variation we have within clusters, the more homogeneous the data points are within the same cluster.

Why k-Means Clustering?

  1. K-means converges in a finite number of iterations. Since the algorithm iterates a function whose domain is a finite set, the iteration must eventually converge.
  2. The computational cost of the k-means algorithm is O(k*n*d), where n is the number of data points, k the number of clusters, and d the number of attributes.
  3. Compared to other clustering methods, the k-means clustering technique is fast and efficient in terms of its computational cost.
  4. It’s difficult to predict the optimal number of clusters or the value of k. To find the number of clusters, we need to run the k-means clustering algorithm for a range of k values and compare the results.

What is meant by Clustering?

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups.

Types of Clustering

1. Centroid-based Clustering
Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to hierarchical clustering defined below. k-means is the most widely-used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. 
2. Density-based Clustering
Density-based clustering connects areas of high example density into clusters. This allows for arbitrary-shaped distributions as long as dense areas can be connected. These algorithms have difficulty with data of varying densities and high dimensions. Further, by design, these algorithms do not assign outliers to clusters.
3. Distribution-based Clustering
This clustering approach assumes data is composed of distributions, such as Gaussian distributions. The distribution-based algorithm clusters data into three Gaussian distributions. As the distance from the distribution's center increases, the probability that a point belongs to the distribution decreases. The bands show a decrease in probability. When you do not know the type of distribution in your data, you should use a different algorithm.
4. Hierarchical Clustering
Hierarchical clustering creates a tree of clusters. Hierarchical clustering, not surprisingly, is well suited to hierarchical data, such as taxonomies. See Comparison of 61 Sequenced Escherichia coli Genomes by Oksana Lukjancenko, Trudy Wassenaar & Dave Ussery for an example. In addition, another advantage is that any number of clusters can be chosen by cutting the tree at the right level.

Difference between Clustering and Regression

Type Supervised Learning Unsupervised Learning
Description Process of classifying the input instances based on their corresponding class labels Grouping the instances based on their similarity without the help of class labels
The necessity of Training and Testing Dataset It has labels so there is a need for training and testing dataset for verifying the model created There is no need for training and testing dataset
Complexity More complex as compared to clustering Less complex as compared to the classification
Example Algorithms Logistic regression, Naive Bayes classifier, Support vector machines, etc. k-means clustering algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM) clustering algorithm, etc.

Why is k-Means clustering nondeterministic?

The basic k-means clustering is based on a non-deterministic algorithm. This means that running the algorithm several times on the same data could give different results, i.e. it may be possible that the centroid value for a given class may come out to be different at different instances of time for the same dataset. 
For instance, at a given instance we took D1 and D2 to find the centroid and then we took D1 and D5, the value of centroid will change, hence the resulting model will change 

How does k-Means clustering work?

Step 1
Begin with a decision on the value of k = number of clusters.​
Step 2
Put any initial partition that classifies the data into k clusters. You may  assign the training samples randomly, or systematically as the following: ​
       1. Take the first k training sample as single- element clusters      ​
      2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster. ​
Step 3
Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroid of the cluster gaining the new sample and the cluster losing the sample. ​
Step 4
Repeat step 3 until convergence is achieved, that is until a pass through the training sample causes no new assignments. 

What are Centroids?

In mathematics and physics, the centroid or geometric center of a plane figure is the arithmetic mean position of all the points in the figure. Informally, it is the point at which a cutout of the shape could be perfectly balanced on the tip of a pin.

What are Centroids in k-Means?

The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters.Initially, the k number of so-called centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster. Each centroid is an existing data point in the given input data set, picked at random, such that all centroids are unique (that is, for all centroids ci and cj, ci ≠ cj).
These centroids are used to train a classifier. The resulting classifier is used to classify (using k = 1) the data and thereby produces an initial randomized set of clusters. Each centroid is thereafter set to the arithmetic mean of the cluster it defines. The process of classification and centroid adjustment is repeated until the values of the centroids stabilize. The final centroids will be used to produce the final classification/clustering of the input data, effectively turning the set of initially anonymous data points into a set of data points, each with a class identity.

k-Means using an Example

Let's try to understand k-means using a mathematical example. Given the following dummy dataset with two clusters A and B.
Iteration 1
Data Points X-axis Y-axis
D1 2 0
D2 1 3
D3 3 5
D4 2 2
D5 4 6
Let's start by finding the Euclidean Distance given by the formula: √((x1-y1)² + (x2-y2)²
Distance between D1 and D2 : √((2-1)² + (0-3)²
= √((1)² + (3)²
= √(10)
= 3.17
Distance between D1 and D4 : √((2-2)² + (0-2)²
= √((0)² + (-2)²
= √(4)
= 2
Similarly, let us find the distance between other points
Data Points Distance between D2 and other data points Distance between D4 and other data points
D1 3.17 2.0
D3 2.83 3.17
D5 4.25 4.48
Seeing the above data, we can form the following two groups
Cluster 1: D1, D4
Cluster 2: D2, D3, D5
Now our next step is to find the mean values and the new centroid values with respect to the corresponding to the centroid values
Data Points Mean value of data points along the x-axis Distance between D4 and other data point
D1, D4 2.0 1.0
D2, D3, D5 2.67 4.67
Hence the centroid values are
Cluster 1: (2.0, 1.0)
Cluster 2: (2.67, 4.67)
Iteration 2
Based on the above data, we again calculate the euclidean distance for the new centroids.
Data Points Distance between D2 and other data points Distance between D4 and other data points
D1 1.0 4.72
D2 2.24 2.37
D3 4.13 0.47
D4 1 2.76
D5 5.39 1.89
From the above data, we infer the following
Cluster 1: (D1, D2, D4)
Cluster 2: (D3, D5)
Since the clusters have changed so we will now calculate the mean values again and the corresponding centroid values
Data Points Mean value of data points along the x-axis Distance between D4 and other data point
D1, D2, D4 1.67 1.67
D3, D5 3.5 5.5
So now the centroid values are:
Cluster 1: (1.67, 1.67)
Cluster 2: (3.5, 5,5)
The above process has to be repeated until we find a constant value for centroids and the latest cluster will be considered as the final cluster solution.

Python Implementation of K-Means Clustering

I have used Google Colab to run and test my code, you can use any of the IDEs or development environments. Feelfree to use any dataset,there some very good datasets available on kaggle and with Google Colab.

1. Using functions

In this, I will be using some dummy data, you can use any data, that you want.
  1. # Importing the required libraries 
  2. import matplotlib.pyplot as plt  
  3. from matplotlib import style  
  4. style.use('ggplot')  
  5. import numpy as np  
  7. #Dummy data  
  8. X = np.array([[12],  
  9.               [1.51.8],  
  10.               [58 ],  
  11.               [88],  
  12.               [10.6],  
  13.               [9,11],  
  14.               [1,3],  
  15.               [8,9],  
  16.               [0,3],  
  17.               [5,4],  
  18.               [6,4],  
  19.               [1,2],  
  20.               [2,4],  
  21.               [11,5],  
  22.               [2.56,5.6],  
  23.               [22],  
  24.               [1.81.8],  
  25.               [88 ],  
  26.               [55],  
  27.               [0.60.6],  
  28.               [11,11],  
  29.               [3,3],  
  30.               [9,9],])  
  32. #Plot of the Dummy Data  
  33. plt.scatter(X[:,0], X[:,1], s=150)  
  36. #Defining the colors that can be used in plots  
  37. colors =["g","r","c","b","k"]  
  39. #Defination of the class K_Means  
  40. #Using this class we are implementing all the steps required  
  41. #to be performed to generate a k-means model  
  42. class K_Means:  
  43.     def __init__(self, k=5, tol=0.001, max_iter=300):  
  44.         self.k = k  
  45.         self.tol = tol  
  46.         self.max_iter = max_iter  
  48.     #Function to fit the dummy mode on to the model  
  49.     def fit(self,data):  
  51.         self.centroids = {}  
  53.         for i in range(self.k):  
  54.             self.centroids[i] = data[i]  
  56.         for i in range(self.max_iter):  
  57.             self.classifications = {}  
  59.             for i in range(self.k):  
  60.                 self.classifications[i] = []  
  62.             for featureset in data:  
  63.                 distances = [np.linalg.norm(featureset-self.centroids[centroid]) for centroid in self.centroids]  
  64.                 classification = distances.index(min(distances))  
  65.                 self.classifications[classification].append(featureset)  
  67.             prev_centroids = dict(self.centroids)  
  69.             for classification in self.classifications:  
  70.                 self.centroids[classification] = np.average(self.classifications[classification],axis=0)  
  72.             optimized = True  
  74.             for c in self.centroids:  
  75.                 original_centroid = prev_centroids[c]  
  76.                 current_centroid = self.centroids[c]  
  77.                 if np.sum((current_centroid-original_centroid)/original_centroid*100.0) > self.tol:  
  78.                     np.sum((current_centroid-original_centroid)/original_centroid*100.0)  
  79.                     optimized = False  
  81.             if optimized:  
  82.                 break  
  84.     #Function to generate prediction result  
  85.     def predict(self,data):  
  86.         distances = [np.linalg.norm(data-self.centroids[centroid]) for centroid in self.centroids]  
  87.         classification = distances.index(min(distances))  
  88.         return classification  
  90. clf = K_Means()  
  93. #putting the centroid values on to the plot  
  94. for centroid in clf.centroids:  
  95.     plt.scatter(clf.centroids[centroid][0], clf.centroids[centroid][1],  
  96.                 marker="o", color="k", s=150, linewidths=5)  
  98. for classification in clf.classifications:  
  99.     color = colors[classification]  
  100.     for featureset in clf.classifications[classification]:  
  101.         plt.scatter(featureset[0], featureset[1], marker="x", color=color, s=150, linewidths=5)  
  103. #Priting the centroid values  
  104. for centroid in clf.centroids:  
  105.     print(clf.centroids[centroid][0],clf.centroids[centroid][1])  
The centroid values I am getting are:
1.0                            2.8
2.075                        2.15
4.712                        5.32
9.142857142857142 8.714285714285714
0.8                            0.6

2. Using Sklearn

Here I am using dummy data, you can use any dataset of your choice.
  1. # importing all the required libraries  
  2. import matplotlib.pyplot as plt    
  3. %matplotlib inline  
  4. import numpy as np    
  5. from sklearn.cluster import KMeans    
  7. #dummy data  
  8. X = np.array([[12],    
  9.               [1.51.8],    
  10.               [58 ],    
  11.               [88],    
  12.               [10.6],    
  13.               [9,11],    
  14.               [1,3],    
  15.               [8,9],    
  16.               [0,3],    
  17.               [5,4],    
  18.               [6,4],    
  19.               [1,2],    
  20.               [2,4],    
  21.               [11,5],    
  22.               [2.56,5.6],    
  23.               [22],    
  24.               [1.81.8],    
  25.               [88 ],    
  26.               [55],    
  27.               [0.60.6],    
  28.               [11,11],    
  29.               [3,3],    
  30.               [9,9],])   
  32. #plotting the dummy data  
  33. plt.scatter(X[:,0],X[:,1], label='True Position')   
  36. #creating the object of sklearn.cluster.KMeans class  
  37. #Passed the number of clusters value to be 3, i.e. we are making a 3-Means clustering model  
  38. kmeans = KMeans(n_clusters=3)    
  40. #passing the data to the model  
  43. #printing the centroid values  
  44. print(kmeans.cluster_centers_)  
  46. #Printing the predicted values  
  47. print(kmeans.labels_)  
  49. #Plotting the data based upon their predicted cluster  
  50. plt.scatter(X[:,0],X[:,1], c=kmeans.labels_, cmap='rainbow')  
The output that I got is
[[1.35454545 2.16363636]
[9.14285714 8.71428571]
[4.712 5.32 ]]
Points classified into classes:
[0 0 2 1 0 1 0 1 0 2 2 0 0 1 2 0 0 1 2 0 1 0 1]

3. Using TensorFlow

Here I am using dummy data, you can use any dataset that you wish to use.
  1. #importing the required libraries  
  2. import matplotlib.pyplot as plt  
  3. import numpy as np  
  4. import tensorflow as tf  
  6. points_n = 200 #number of data points  
  7. clusters_n = 3 #number of classes in which points needs to be classifies  
  8. iteration_n = 100 #epoch value  
  10. #initializing the placehlders  
  11. points = tf.constant(np.random.uniform(010, (points_n, 2)))  
  12. centroids = tf.Variable(tf.slice(tf.random_shuffle(points), [00], [clusters_n, -1]))  
  13. points_expanded = tf.expand_dims(points, 0)  
  14. centroids_expanded = tf.expand_dims(centroids, 1)  
  16. #Creation of tensor to hold the euclidean disatnce formula  
  17. distances = tf.reduce_sum(tf.square(tf.subtract(points_expanded, centroids_expanded)), 2)  
  18. assignments = tf.argmin(distances, 0)  
  20. #logic to generate the model  
  21. means = []  
  22. for c in range(clusters_n):  
  23.     means.append(tf.reduce_mean(  
  24.       tf.gather(points,   
  25.                 tf.reshape(  
  26.                   tf.where(  
  27.                     tf.equal(assignments, c)  
  28.                   ),[1,-1])  
  29.                ),reduction_indices=[1]))  
  31. new_centroids = tf.concat(means, 0)  
  33. update_centroids = tf.assign(centroids, new_centroids)  
  34. init = tf.global_variables_initializer()  
  36. #session to execute the logic and generate the model  
  37. with tf.Session() as sess:  
  39.   for step in range(iteration_n):  
  40.     [_, centroid_values, points_values, assignment_values] =[update_centroids, centroids, points, assignments])  
  42.   #to print the centroid values    
  43.   print("centroids", centroid_values)  
  45. #plotting of the dummy data with centroid values  
  46. plt.scatter(points_values[:, 0], points_values[:, 1], c=assignment_values, s=50, alpha=0.5)  
  47. plt.plot(centroid_values[:, 0], centroid_values[:, 1], 'kx', markersize=15)  
The output that I am getting is:
[[3.81697836 2.2322892 ]
[2.72036385 6.65968922]
[8.123787 5.82232576]]


In this chapter, we studied k-means clustering.
In the next chapter, we will study KNN i.e. k-nearest neighbors.
Rohit Gupta
139 17.4k 2.1m
Next » Machine Learning: KNN- K-Nearest Neighbors