K-Means Clustering#

  • Unlike traditional supervised machine learning algorithms, K-Means clustering is an unsupervised machine learning algorithm, so the data is not first trained with labels
  • Data points are grouped together (clustered) according to certain similarities
  • The idea is to set the parameter K, which defines the number of clusters in the dataset
  • In other words, K also defines the number of centroids (the center point of each cluster)
  • Each data point in the dataset is allocated to the nearest centroid, making it a member of that cluster
  • "Means" refers to the averaging of the data, which is how the centroid positions are calculated

K-Means algorithm description#

  1. First the data is loaded and values for the following attributes are set:
    • Number of clusters
    • Maximum iteration count
  2. Generate the centroids for each cluster using a random generator
  3. Allocate each data point from the loaded dataset to the closest centroid
  4. Reposition the centroids according to their member data points
  5. The process stops when the centroid positions no longer change or the maximum iteration count has been reached (see the sketch below)
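For illustration, here is a minimal NumPy sketch of the loop described above. This is a toy implementation (empty clusters and other edge cases are ignored), not the scikit-learn implementation used later in this course:
import numpy as np

def kmeans_sketch(data, k, max_iter = 100, seed = 42):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = data[rng.choice(len(data), size = k, replace = False)]
    for _ in range(max_iter):
        # Step 3: allocate each data point to the closest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis = 2)
        members = distances.argmin(axis = 1)
        # Step 4: reposition each centroid to the mean of its member points
        new_centroids = np.array([data[members == i].mean(axis = 0) for i in range(k)])
        # Step 5: stop when the centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return members, centroids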

[Figure: illustration of the K-Means clustering process]

Iris dataset description#

  • In this course we use the Iris dataset
  • The dataset has three classes, one for each flower species:
    • Iris versicolor
    • Iris virginica
    • Iris setosa
  • Each class has 50 samples, so the dataset has a total of 150 samples
  • The flowers have two measured parts:
    • sepals
    • petals
  • The data includes the width and length measurements for both of these
  • Below is an example picture describing these properties in practice

[Figure: sepal and petal of an iris flower]

  • Our data is in comma-separated format, so we will use the read_csv function from the pandas library
  • An example of the actual data is presented below

[Figure: example rows of the iris data]

Loading the dataset#

  • We will start by loading the dataset from the given CSV file (iris.csv)
  • At this point we only need the Pandas library for reading the data
# Import necessary libraries
import pandas as pd

# Read the dataset into a pandas dataframe (the CSV file already contains a header row with the column names)
file = "iris.csv"
dataset = pd.read_csv(file)
  • Now that we have our dataframe set, let's print sample rows for each iris flower class
dataset.iloc[:3]
   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
dataset.iloc[50:53]
    Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
50          51           7.0          3.2           4.7          1.4  versicolor
51          52           6.4          3.2           4.5          1.5  versicolor
52          53           6.9          3.1           4.9          1.5  versicolor
dataset.iloc[100:103]
     Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
100         101           6.3          3.3           6.0          2.5  virginica
101         102           5.8          2.7           5.1          1.9  virginica
102         103           7.1          3.0           5.9          2.1  virginica
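As a side note, the three iloc calls above could be replaced with a single groupby call; this optional shortcut is not part of the original material:
# Show the first row of each species in one call
dataset.groupby("Species").head(1)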

Running K-Means clustering for Iris dataset#

  • As we now have our dataset loaded into the dataframe, we will cluster the data using the K-Means algorithm
  • Pandas has already been loaded so let's start by importing the rest of the necessary libraries
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import metrics
import seaborn as sns
  • Previously we printed the content of our newly created dataframe
  • For clustering we only need the following information:
    • Sepal length and width
    • Petal length and width
  • So basically we have four columns containing the important data in index positions 1-4 (see the picture below)

[Figure: the four measurement columns of the dataframe]

  • However, we want to include the columns with the best resolution (in other words, the columns with the greatest differences in values between flower species)
  • In order to find this out, we will group the data by flower species and present the average (mean) values for each column
dataset.groupby("Species").mean()
            Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
Species
setosa            25.5         5.006        3.428         1.462        0.246
versicolor        75.5         5.936        2.770         4.260        1.326
virginica        125.5         6.588        2.974         5.552        2.026
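To back this up numerically, one could also compute how far apart the species means are in each column; this is an optional check, not part of the original material:
# Range of the per-species means: a larger range means the column separates
# the flower species better ("Unnamed: 0" is just a row index, so drop it)
means = dataset.groupby("Species").mean().drop(columns = "Unnamed: 0")
print(means.max() - means.min())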
  • As can be seen from the output above, the Petal.Length and Petal.Width columns have the greatest differences between flower species
  • Next we will filter the data so that only these columns are included (the columns in the third and fourth index positions)
x = dataset.loc[:,["Petal.Length","Petal.Width"]].values
x[:5]
array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])
  • Now we have all the necessary values stored in a two-dimensional array where each subarray represents the petal measurements of a single iris flower
  • In the next phase, random positions for the cluster centroids need to be generated
  • The following attributes will be used for KMeans:
    • n_clusters: number of clusters
    • init: initialization method ('random' picks the initial centroids randomly from the data points)
    • max_iter: maximum number of algorithm iterations
    • random_state: ensures that the same results can be reproduced on each run
# Apply K-Means to the dataset / create the K-Means model
# (n_init is set explicitly because its default changes in newer scikit-learn versions and omitting it raises a FutureWarning)
kmeans = KMeans(n_clusters = 3, init = 'random', n_init = 10, max_iter = 100, random_state = 42)
y_kmeans = kmeans.fit_predict(x)
y_kmeans

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
  • As can be seen from the prediction results, each of the 150 samples is assigned to one of three clusters (0, 1, or 2)
  • Important: these predicted labels do not correspond to the true labels in the order they appear in the source data (for example, setosa is not necessarily marked with label 0 even though it is the first species listed in the original data)!
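One way to verify the mapping programmatically (instead of reading it from a chart, as done in the next section) is to check the most common true species within each cluster; a small optional sketch using the dataset and y_kmeans variables from above:
# Print the most common species among the members of each cluster
for cluster in range(3):
    members = dataset["Species"][y_kmeans == cluster]
    print(f"cluster {cluster}: mostly {members.mode()[0]}")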

Visualising the results and cluster analysis#

  • In order to find out the correct labels, we first need to visualise the cluster data and then assign the labels by comparing the cluster positions to the average petal values of each species
  • For the data point insertion we use the following syntax for each of the three data point groups:
plt.scatter(x_axis_data, y_axis_data, s = point_size, c = color)
  • Since we have data points for three clusters, the axis data for each group is selected with the following expression:
x[y_kmeans == cluster_number, column_number]
  • In addition, the KMeans object also provides the cluster centers, which are stored in a separate NumPy array, as can be seen from the print below
  • So, to visualise the cluster centers, we select all rows and the first column of that array for the x-axis, and all rows and the second column for the y-axis
kmeans.cluster_centers_
array([[5.59583333, 2.0375    ],
       [4.26923077, 1.34230769],
       [1.462     , 0.246     ]])
# Cluster visualisation
plt.figure(figsize=(8,8))
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'cyan')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'darkgray')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green')

# Add centroids to each cluster
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 200, c = 'red', label = 'Centroids')
plt.legend()

[Figure: scatter plot of the three clusters with red centroid markers]
  • As can be seen from the scatter chart, all 150 points have been placed into three clusters
  • By comparing the chart with our earlier data averages, we can figure out the following:
    • Setosa has the lowest values, so it is the green cluster (label 2)
    • Versicolor has the next highest values, so it is the dark gray cluster (label 1)
    • Virginica has the highest values overall, so it is the cyan cluster (label 0)
  • Next we will redraw the scatter chart with the appropriate labels
# Cluster visualisation
plt.figure(figsize=(8,8))
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'cyan', label = 'Virginica')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'darkgray', label = 'Versicolor')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Setosa')

# Add centroids to each cluster
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 200, c = 'red', label = 'Centroids')
plt.legend()
[Figure: labelled scatter plot (Virginica, Versicolor, Setosa) with red centroid markers]

Result evaluation using confusion matrix#

  • Since this chart does not tell the whole truth about whether our predictions are accurate or not, we will also use a confusion matrix
  • The confusion matrix helps define how the predicted values correspond to the true values
  • First we convert the class labels to numerical values so that the true labels have the same format as the predicted labels
# Convert species names to the numeric cluster labels identified above
truth = []

for i in dataset["Species"]:
    if i == "setosa":
        truth.append(2)
    elif i == "versicolor":
        truth.append(1)
    else:
        truth.append(0)
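As a side note, the same conversion can be written more compactly with the pandas map method; this one-liner is equivalent to the loop above:
# Map each species name directly to its numeric cluster label
truth = dataset["Species"].map({"setosa": 2, "versicolor": 1, "virginica": 0}).tolist()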
  • Then we list the class labels for the confusion matrix axes; note that they must be in the same order as the numeric labels (0 = virginica, 1 = versicolor, 2 = setosa), so the original column order from dataset["Species"].unique() would mislabel the axes
# Class labels in the order of the numeric labels above
labels = ["virginica", "versicolor", "setosa"]
  • Finally we create the confusion matrix and input both true and predicted data
# Confusion matrix definition
confusion_matrix = metrics.confusion_matrix(truth, y_kmeans)
  • Now that we have the predictions placed in the confusion matrix, we could check the results before visualising.
confusion_matrix
array([[46,  4,  0],
       [ 2, 48,  0],
       [ 0,  0, 50]], dtype=int64)
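The diagonal already suggests a good result; optionally, the overall accuracy can be computed with the metrics module that was imported earlier:
# Fraction of samples whose predicted cluster matches the true label
print(metrics.accuracy_score(truth, y_kmeans))  # 0.96 (144 / 150 correct)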
  • Next we will visualise the results using the heatmap function from the Seaborn library
  • The following parameters will be given to the heatmap:
    • data: the actual confusion matrix data presented earlier
    • annot: enabled with the value True so that the value is shown in each cell
    • fmt: string formatting for the enabled annotations (annot). In this case "d" stands for integer. For example, the following could be used:
      • d → integers
      • f → floating point numbers
      • b → binary numbers
      • o → octal numbers
      • x → hexadecimal numbers
      • e → floating point numbers in exponent format
    • ax: the axes object the heatmap is drawn on
    • cmap: specifies the colormap used for the heatmap. For example, this is one list of supported colormaps: https://proplot.readthedocs.io/en/v0.4.1/_images/colormaps_4_1.svg
  • The rest of the configuration is related to the information around the heatmap visualisation, for example the title and the axis labels
# Visualise the confusion matrix
sns.set_theme(font_scale = 1.3)
plt.figure(figsize = (10,7))
ax = plt.axes()  # create the axes on the sized figure (creating them before the figure would leave a stray empty figure)
sns.heatmap(confusion_matrix, annot = True, fmt = "d", ax = ax, cmap = "magma")

# Add a descriptive title for the figure and names for the axes
ax.set_title('Confusion Matrix - K-means clustering')
ax.set_xlabel("Predicted label", fontsize = 15)
ax.set_xticklabels(labels)
ax.set_ylabel("True label", fontsize = 15)
ax.set_yticklabels(labels, rotation = 0)
plt.show()
[Figure: confusion matrix heatmap]
  • From the confusion matrix we can draw the following conclusions:
    • The predictions were mainly correct (Virginica 46/50, Versicolor 48/50, Setosa 50/50)
    • 4 of the Virginica samples were predicted as Versicolor
    • 2 of the Versicolor samples were predicted as Virginica