What is K-Means Clustering in C#?
Clustering stands as a cornerstone of unsupervised machine learning, a method where data points are grouped into clusters based on inherent similarities in their data attributes or features. Unlike supervised learning, which relies on pre-existing labels to train a model, clustering operates in the realm of the unsupervised, where the primary task is to categorize data points into clusters based solely on the characteristics and features they exhibit.
In this paradigm, the label attached to each observation represents the cluster it belongs to, revealing meaningful patterns and relationships without the need for prior classification knowledge. In this article, we will delve into the world of K-Means Clustering and its practical implementation in C# using a real-world flower dataset, all while leveraging the power of Google Colab for insightful data visualization.
Getting Started
The data used in this article is based on the Clustering module of Microsoft Learn. In this article, we’ll go to implement the algorithm they describe and visualize the clusters.
The Flower Dataset
Suppose a botanist observes a sample of flowers and records the number of leaves and petals on each flower.
There are no known labels in the dataset, just two features. The goal is not to identify the different types (species) of flowers, just to group similar flowers together based on the number of leaves and petals.
# of Leaves (x1) |
# of Petals (x2) |
0 |
5 |
0 |
6 |
1 |
3 |
1 |
3 |
1 |
6 |
1 |
8 |
2 |
3 |
2 |
7 |
2 |
8 |
Training a Clustering Model
There are multiple algorithms you can use for clustering. One of the most commonly used algorithms is K-Meansclustering, which consists of the following steps:
- The feature (x) values are vectorized to define n-dimensional coordinates (where n is the number of features). In the flower example, we have two features: the number of leaves (x1) and the number of petals (x2). So, the feature vector has two coordinates that we can use to conceptually plot the data points in two-dimensional space ([x1,x2]**)
- You decide how many clusters you want to use to group the flowers - call this value k. For example, to create three clusters, you would use a k value of 3. Then, k points are plotted at random coordinates. These points become the center points for each cluster, so they're called centroids.
- Each data point (in this case, a flower) is assigned to its nearest centroid.
- Each centroid is moved to the center of the data points assigned to it based on the mean distance between the points.
- After the centroid is moved, the data points may now be closer to a different centroid, so the data points are reassigned to clusters based on the new closest centroid.
- The centroid movement and cluster reallocation steps are repeated until the clusters become stable or a predetermined maximum number of iterations is reached.
The following animation shows this process:
K-Means Clustering Algorithm Implementation in C#
using System;
using System.Collections.Generic;
using System.Linq;
class KMeans
{
public class Point
{
public double X { get; set; }
public double Y { get; set; }
public int Cluster { get; set; }
}
static Random random = new Random();
static void Main(string[] args)
{
// Flower dataset
List<Point> dataPoints = new List<Point>()
{
new Point { X = 0, Y = 5 },
new Point { X = 0, Y = 6 },
new Point { X = 1, Y = 3 },
new Point { X = 1, Y = 3 },
new Point { X = 1, Y = 6 },
new Point { X = 1, Y = 8 },
new Point { X = 2, Y = 3 },
new Point { X = 2, Y = 7 },
new Point { X = 2, Y = 8 }
};
int k = 3; // Number of clusters
List<Point> centroids = InitializeRandomCentroids(dataPoints, k);
int maxIterations = 100;
for (int iteration = 0; iteration < maxIterations; iteration++)
{
// Assign data points to the nearest cluster
foreach (Point dataPoint in dataPoints)
{
double minDistance = double.MaxValue;
int closestCluster = -1;
for (int i = 0; i < k; i++)
{
double distance = CalculateDistance(dataPoint, centroids[i]);
if (distance < minDistance)
{
minDistance = distance;
closestCluster = i;
}
}
dataPoint.Cluster = closestCluster;
}
// Update centroids
for (int i = 0; i < k; i++)
{
List<Point> clusterPoints = dataPoints.Where(p => p.Cluster == i).ToList();
if (clusterPoints.Count > 0)
{
double meanX = clusterPoints.Select(p => p.X).Average();
double meanY = clusterPoints.Select(p => p.Y).Average();
centroids[i] = new Point { X = meanX, Y = meanY };
}
}
}
// Print the final clusters
for (int i = 0; i < k; i++)
{
Console.WriteLine($"Cluster {i}:");
foreach (Point dataPoint in dataPoints.Where(p => p.Cluster == i))
{
Console.WriteLine($"X: {dataPoint.X}, Y: {dataPoint.Y}");
}
Console.WriteLine();
}
}
static List<Point> InitializeRandomCentroids(List<Point> dataPoints, int k)
{
List<Point> centroids = new List<Point>();
for (int i = 0; i < k; i++)
{
int randomIndex = random.Next(dataPoints.Count);
centroids.Add(new Point
{
X = dataPoints[randomIndex].X,
Y = dataPoints[randomIndex].Y,
Cluster = i
});
}
return centroids;
}
static double CalculateDistance(Point a, Point b)
{
double dx = a.X - b.X;
double dy = a.Y - b.Y;
return Math.Sqrt(dx * dx + dy * dy);
}
}
Clusters from Console Output
Cluster 0:
X: 0, Y: 5
X: 0, Y: 6
X: 1, Y: 6
Cluster 1:
X: 1, Y: 8
X: 2, Y: 7
X: 2, Y: 8
Cluster 2:
X: 1, Y: 3
X: 1, Y: 3
X: 2, Y: 3
Visualizing the Clusters in Google Colab
Google Colab is an excellent platform for data visualization and analysis. It provides a user-friendly interface for Python programming and integrates seamlessly with popular data visualization libraries like Matplotlib and Seaborn. In this section, we'll explore how to leverage Google Colab to visualize the clusters created using K-Means Clustering in C#.
Python Code for Visualization
import numpy as np
import matplotlib.pyplot as plt
clusters = {
0: [
(5, 0),
(6, 0),
(6, 1)
],
1: [
(3, 1),
(3, 1),
(3, 2)
],
2: [
(8, 1),
(7, 2),
(8, 2)
]
}
colors = ['b', 'g', 'r']
plt.figure(figsize=(8, 6)) # Set the size of the plot
for cluster_id, points in clusters.items():
x, y = zip(*points)
plt.scatter(x, y, c=colors[cluster_id - 1], label=f'Cluster {cluster_id + 1}')
plt.legend()
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('K-Means Clustering Visualization')
plt.grid(True)
plt.show()
Summary
In the realm of data analysis and machine learning, K-Means Clustering shines as a powerful and versatile tool for uncovering hidden patterns within datasets. In this article, we embarked on an exciting journey to explore K-Means Clustering in C#, using a real-world flower dataset as our guide. We not only delved into the intricacies of implementing the K-Means algorithm but also harnessed the capabilities of Google Colab for insightful data visualization.
Through our journey, we discovered that clustering, as an unsupervised machine learning technique, allows us to categorize data points into meaningful clusters without relying on prior label information. K-Means Clustering, in particular, has proven to be an invaluable technique for a wide range of applications, from customer segmentation to image compression.