Machine Learning  

Clustering With K-Means Practical Case

Introduction

This lesson and the next use what are known as unsupervised learning algorithms. Unsupervised algorithms don't use a target; instead, their purpose is to learn a property of the data, representing the structure of the features in a specific way. In the context of feature engineering for prediction, you could think of an unsupervised algorithm as a "feature discovery" technique.

Clustering simply means assigning data points to groups based on how similar the points are to each other. A clustering algorithm makes "birds of a feather flock together," so to speak.
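
To make the idea concrete, here is a minimal sketch of k-means grouping points purely by how close they are to each other. The toy blob data and the choice of three clusters are illustrative assumptions, not part of the housing example below:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset: 300 points scattered around 3 centers
points, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-means assigns each point to the nearest of 3 learned cluster centers;
# no target is involved, only distances between the points themselves
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
print(labels[:10])  # cluster ids such as [1 0 2 ...]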

I have already written an article on clustering with k-means; it walks through the steps with a practical example: https://www.c-sharpcorner.com/article/clustering-with-k-means/

Example - California Housing

As spatial features, the California Housing dataset's 'Latitude' and 'Longitude' make natural candidates for k-means clustering. In this example, we'll cluster them with 'MedInc' (median income) to create economic segments in different regions of California.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

# Set plot style and configuration
plt.style.use("seaborn-v0_8-whitegrid")  # named "seaborn-whitegrid" in matplotlib < 3.6
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)
# Load and preview data
df = pd.read_csv("housing.csv")
X = df.loc[:, ["MedInc", "Latitude", "Longitude"]]
X.head()

   MedInc  Latitude  Longitude
0  8.3252     37.88    -122.23
1  8.3014     37.86    -122.22
2  7.2574     37.85    -122.24
3  5.6431     37.85    -122.25
4  3.8462     37.85    -122.25

Since k-means clustering is sensitive to scale, it can be a good idea to rescale or normalize data with extreme values. Our features are already roughly on the same scale, so we'll leave them as-is.
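If the features were on very different scales (say, raw dollar incomes alongside latitude in degrees), rescaling before clustering would matter. A minimal sketch, assuming scikit-learn's StandardScaler (shown only for illustration, not used in this example):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance so that no
# single feature dominates the distance computation
X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X_scaled)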

# Create the cluster feature; setting n_init explicitly avoids the
# default-change warning in newer scikit-learn versions
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
X["Cluster"] = kmeans.fit_predict(X)
X["Cluster"] = X["Cluster"].astype("category")  # treat cluster ids as labels, not numbers

X.head()
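
As a quick sanity check (an extra step, not in the original walkthrough), you can look at how many districts landed in each cluster:

# Number of rows assigned to each of the six clusters
print(X["Cluster"].value_counts())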

Now, let's examine a couple of plots to see how effective this was. First, a scatter plot showing the geographic distribution of the clusters. The algorithm seems to have created separate segments for the higher-income areas along the coast.

# Scatter plot of the clusters over California's geography
sns.relplot(
    x="Longitude",
    y="Latitude",
    hue="Cluster",
    data=X,
    height=6,
)

The target in this dataset is MedHouseVal (median house value). These box plots show the distribution of the target within each cluster. If the clustering is informative, these distributions should, for the most part, separate across MedHouseVal, which is indeed what we see.

X["MedHouseVal"] = df["MedHouseVal"]
sns.catplot(
    x="MedHouseVal",
    y="Cluster",
    data=X,
    kind="boxen",
    height=6
)
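
To quantify that separation rather than just eyeballing it, one simple option (an illustrative extra, not part of the original walkthrough) is to summarize the target within each cluster:

# Median house value per cluster; well-separated values suggest the
# cluster feature carries real information about the target
print(X.groupby("Cluster", observed=True)["MedHouseVal"].median().sort_values())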


I will attach the zip file with the code and dataset, which you can download and use.
