K-Means Clustering in C# with Flower Data Using Google Colab

What is K-Means Clustering in C#?

Clustering is a cornerstone of unsupervised machine learning: a method where data points are grouped into clusters based on inherent similarities in their attributes or features. Unlike supervised learning, which relies on pre-existing labels to train a model, clustering has no labels to learn from; its task is to categorize data points based solely on the characteristics they exhibit.

In this paradigm, the label attached to each observation represents the cluster it belongs to, revealing meaningful patterns and relationships without the need for prior classification knowledge. In this article, we will delve into the world of K-Means Clustering and its practical implementation in C# using a real-world flower dataset, all while leveraging the power of Google Colab for insightful data visualization.

K-Means Clustering Visualization

Getting Started

The data used in this article is based on the Clustering module of Microsoft Learn. In this article, we'll implement the algorithm they describe and visualize the resulting clusters.

The Flower Dataset

Suppose a botanist observes a sample of flowers and records the number of leaves and petals on each flower.

Flower Dataset

There are no known labels in the dataset, just two features. The goal is not to identify the different types (species) of flowers, just to group similar flowers together based on the number of leaves and petals.

# of Leaves (x1)    # of Petals (x2)
0                   5
0                   6
1                   3
1                   3
1                   6
1                   8
2                   3
2                   7
2                   8


Training a Clustering Model

There are multiple algorithms you can use for clustering. One of the most commonly used is K-Means clustering, which consists of the following steps:

  1. The feature (x) values are vectorized to define n-dimensional coordinates (where n is the number of features). In the flower example, we have two features: the number of leaves (x1) and the number of petals (x2). So, the feature vector has two coordinates that we can use to conceptually plot the data points in two-dimensional space ([x1, x2]).
  2. You decide how many clusters you want to use to group the flowers - call this value k. For example, to create three clusters, you would use a k value of 3. Then, k points are plotted at random coordinates. These points become the center points for each cluster, so they're called centroids.
  3. Each data point (in this case, a flower) is assigned to its nearest centroid.
  4. Each centroid is moved to the mean position of the data points assigned to it (the average of the feature values of those points).
  5. After the centroid is moved, the data points may now be closer to a different centroid, so the data points are reassigned to clusters based on the new closest centroid.
  6. The centroid movement and cluster reallocation steps are repeated until the clusters become stable or a predetermined maximum number of iterations is reached.
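Steps 2 and 3 hinge on a distance measure; K-Means conventionally uses Euclidean distance. The following is a minimal Python sketch of assigning one flower to its nearest centroid (the centroid positions here are made-up values purely for illustration):

```python
import math

def euclidean(a, b):
    # Straight-line (Euclidean) distance between two 2-D points
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def nearest_centroid(point, centroids):
    # Index of the centroid closest to the point
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))

# Hypothetical centroids for k = 3
centroids = [(0.0, 5.0), (1.0, 8.0), (1.0, 3.0)]
flower = (2, 7)  # 2 leaves, 7 petals
print(nearest_centroid(flower, centroids))  # -> 1
```

The flower at (2, 7) is about 1.41 units from (1, 8) but 2.83 from (0, 5) and 4.12 from (1, 3), so it joins the second centroid's cluster.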

The following animation shows this process:

Animation of Process

K-Means Clustering Algorithm Implementation in C#

using System;
using System.Collections.Generic;
using System.Linq;

class KMeans
{
    public class Point
    {
        public double X { get; set; }
        public double Y { get; set; }
        public int Cluster { get; set; }
    }

    static Random random = new Random();

    static void Main(string[] args)
    {
        // Flower dataset
        List<Point> dataPoints = new List<Point>()
        {
            new Point { X = 0, Y = 5 },
            new Point { X = 0, Y = 6 },
            new Point { X = 1, Y = 3 },
            new Point { X = 1, Y = 3 },
            new Point { X = 1, Y = 6 },
            new Point { X = 1, Y = 8 },
            new Point { X = 2, Y = 3 },
            new Point { X = 2, Y = 7 },
            new Point { X = 2, Y = 8 }
        };

        int k = 3; // Number of clusters
        List<Point> centroids = InitializeRandomCentroids(dataPoints, k);

        int maxIterations = 100;
        for (int iteration = 0; iteration < maxIterations; iteration++)
        {
            // Assign data points to the nearest cluster
            foreach (Point dataPoint in dataPoints)
            {
                double minDistance = double.MaxValue;
                int closestCluster = -1;

                for (int i = 0; i < k; i++)
                {
                    double distance = CalculateDistance(dataPoint, centroids[i]);
                    if (distance < minDistance)
                    {
                        minDistance = distance;
                        closestCluster = i;
                    }
                }

                dataPoint.Cluster = closestCluster;
            }

            // Update centroids
            for (int i = 0; i < k; i++)
            {
                List<Point> clusterPoints = dataPoints.Where(p => p.Cluster == i).ToList();
                if (clusterPoints.Count > 0)
                {
                    double meanX = clusterPoints.Select(p => p.X).Average();
                    double meanY = clusterPoints.Select(p => p.Y).Average();
                    centroids[i] = new Point { X = meanX, Y = meanY };
                }
            }
        }

        // Print the final clusters
        for (int i = 0; i < k; i++)
        {
            Console.WriteLine($"Cluster {i}:");
            foreach (Point dataPoint in dataPoints.Where(p => p.Cluster == i))
            {
                Console.WriteLine($"X: {dataPoint.X}, Y: {dataPoint.Y}");
            }
            Console.WriteLine();
        }
    }

    static List<Point> InitializeRandomCentroids(List<Point> dataPoints, int k)
    {
        List<Point> centroids = new List<Point>();

        for (int i = 0; i < k; i++)
        {
            int randomIndex = random.Next(dataPoints.Count);
            centroids.Add(new Point
            {
                X = dataPoints[randomIndex].X,
                Y = dataPoints[randomIndex].Y,
                Cluster = i
            });
        }
        return centroids;
    }

    static double CalculateDistance(Point a, Point b)
    {
        double dx = a.X - b.X;
        double dy = a.Y - b.Y;
        return Math.Sqrt(dx * dx + dy * dy);
    }
}

Clusters from Console Output

Cluster 0:
X: 0, Y: 5
X: 0, Y: 6
X: 1, Y: 6

Cluster 1:
X: 1, Y: 8
X: 2, Y: 7
X: 2, Y: 8

Cluster 2:
X: 1, Y: 3
X: 1, Y: 3
X: 2, Y: 3
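As a sanity check, the same assign-and-update loop can be sketched in a few lines of Python. Random initialization can land in different local optima, so this sketch seeds the centroids at (0, 5), (1, 8), and (1, 3) instead of choosing them randomly; with that start it converges to the same three groups as the C# program:

```python
data = [(0, 5), (0, 6), (1, 3), (1, 3), (1, 6), (1, 8), (2, 3), (2, 7), (2, 8)]
centroids = [(0.0, 5.0), (1.0, 8.0), (1.0, 3.0)]  # fixed start instead of random

for _ in range(100):
    # Assignment step: index of the nearest centroid for each point
    labels = [min(range(3), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
              for p in data]
    # Update step: move each centroid to the mean of its assigned points
    # (with this seeding, no cluster ever ends up empty)
    new_centroids = []
    for i in range(3):
        members = [p for p, l in zip(data, labels) if l == i]
        new_centroids.append((sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members)))
    if new_centroids == centroids:  # converged: centroids stopped moving
        break
    centroids = new_centroids

for i in range(3):
    print(f"Cluster {i}:", [p for p, l in zip(data, labels) if l == i])
```

This prints the same three groupings as the console output above: {(0, 5), (0, 6), (1, 6)}, {(1, 8), (2, 7), (2, 8)}, and {(1, 3), (1, 3), (2, 3)}.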

Visualizing the Clusters in Google Colab

Google Colab is an excellent platform for data visualization and analysis. It provides a user-friendly interface for Python programming and integrates seamlessly with popular data visualization libraries like Matplotlib and Seaborn. In this section, we'll explore how to leverage Google Colab to visualize the clusters created using K-Means Clustering in C#.

K-Means Clustering Visualization

Python Code for Visualization

import matplotlib.pyplot as plt

# Cluster assignments from the C# console output: (X, Y) = (# of leaves, # of petals)
clusters = {
    0: [
        (0, 5),
        (0, 6),
        (1, 6)
    ],
    1: [
        (1, 8),
        (2, 7),
        (2, 8)
    ],
    2: [
        (1, 3),
        (1, 3),
        (2, 3)
    ]
}

colors = ['b', 'g', 'r']

plt.figure(figsize=(8, 6)) # Set the size of the plot

for cluster_id, points in clusters.items():
    x, y = zip(*points)
    plt.scatter(x, y, c=colors[cluster_id], label=f'Cluster {cluster_id}')

plt.legend()
plt.xlabel('# of Leaves (x1)')
plt.ylabel('# of Petals (x2)')
plt.title('K-Means Clustering Visualization')
plt.grid(True)

plt.show()
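The final centroids can also be marked on the same plot. They are just the mean position of each cluster, so the computation is plain Python; the cluster assignments below are taken from the C# console output, with coordinates as (X, Y) = (leaves, petals):

```python
# Cluster assignments from the C# console output: (X, Y) = (leaves, petals)
clusters = {
    0: [(0, 5), (0, 6), (1, 6)],
    1: [(1, 8), (2, 7), (2, 8)],
    2: [(1, 3), (1, 3), (2, 3)],
}

# The final centroid of each cluster is the mean of its points
centroids = {
    cid: (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
    for cid, pts in clusters.items()
}
for cid, (cx, cy) in centroids.items():
    print(f"Centroid {cid}: ({cx:.2f}, {cy:.2f})")
```

To overlay them on the scatter plot, one extra call such as `plt.scatter([c[0] for c in centroids.values()], [c[1] for c in centroids.values()], c='k', marker='x')` before `plt.show()` is enough.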

Summary

In the realm of data analysis and machine learning, K-Means Clustering shines as a powerful and versatile tool for uncovering hidden patterns within datasets. In this article, we embarked on an exciting journey to explore K-Means Clustering in C#, using a real-world flower dataset as our guide. We not only delved into the intricacies of implementing the K-Means algorithm but also harnessed the capabilities of Google Colab for insightful data visualization.

Through our journey, we discovered that clustering, as an unsupervised machine learning technique, allows us to categorize data points into meaningful clusters without relying on prior label information. K-Means Clustering, in particular, has proven to be an invaluable technique for a wide range of applications, from customer segmentation to image compression.
