K-Means Clustering#

  • Unlike traditional supervised machine learning algorithms, K-Means clustering is an unsupervised machine learning algorithm, so the data is not first trained with labels
  • Data points are grouped together (clustered) according to certain similarities
  • The idea is to set the parameter K, which defines the number of clusters in the dataset
  • In other words, K also defines the number of centroids (the center point of each cluster)
  • Each data point in the dataset is allocated to the nearest centroid, making it a member of that cluster
  • "Means" refers to the averaging of the data, which is how the centroid positions are calculated

K-Means algorithm description#

  1. First the data is loaded and values for the following attributes are set:
    • Number of clusters
    • Maximum iteration count
  2. Generate the centroids for each cluster using a random generator
  3. Allocate each data point from the loaded dataset to the closest centroid
  4. Reposition the centroids according to their member data points
  5. The process stops when the centroid positions no longer change or the maximum iteration count has been reached (see the sketch below)
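For illustration, here is a minimal NumPy sketch of the loop described above. This is a toy implementation (empty clusters and other edge cases are ignored), not the scikit-learn implementation used later in this course:
import numpy as np

def kmeans_sketch(data, k, max_iter = 100, seed = 42):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = data[rng.choice(len(data), size = k, replace = False)]
    for _ in range(max_iter):
        # Step 3: allocate each data point to the closest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis = 2)
        members = distances.argmin(axis = 1)
        # Step 4: reposition each centroid to the mean of its member points
        new_centroids = np.array([data[members == i].mean(axis = 0) for i in range(k)])
        # Step 5: stop when the centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return members, centroids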

[Figure: illustration of the K-Means clustering process]

Iris dataset description#

  • In this course we use the Iris dataset
  • The dataset has three classes, one for each flower species:
    • Iris versicolor
    • Iris virginica
    • Iris setosa
  • Each class has 50 samples, so the dataset has a total of 150 samples
  • The flowers have two measured parts:
    • sepals
    • petals
  • The data includes the width and length measurements for both of these
  • Below is an example picture describing these properties in practice

[Figure: sepal and petal of an iris flower]

  • Our data is in comma-separated format, so we will use the read_csv function from the pandas library
  • An example of the actual data is presented below

[Figure: example rows of the iris data]

Loading the dataset#

  • We will start by loading the dataset from the given CSV file (iris.csv)
  • At this point we only need the Pandas library for reading the data
# Import necessary libraries
import pandas as pd

# Read the dataset into a pandas dataframe (the CSV file already contains a header row with the column names)
file = "iris.csv"
dataset = pd.read_csv(file)
  • Now that we have our dataframe set, let's print sample rows for each iris flower class
dataset.iloc[:3]
   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
dataset.iloc[50:53]
    Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
50          51           7.0          3.2           4.7          1.4  versicolor
51          52           6.4          3.2           4.5          1.5  versicolor
52          53           6.9          3.1           4.9          1.5  versicolor
dataset.iloc[100:103]
     Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
100         101           6.3          3.3           6.0          2.5  virginica
101         102           5.8          2.7           5.1          1.9  virginica
102         103           7.1          3.0           5.9          2.1  virginica
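As a side note, the three iloc calls above could be replaced with a single groupby call; this optional shortcut is not part of the original material:
# Show the first row of each species in one call
dataset.groupby("Species").head(1)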

Running K-Means clustering for Iris dataset#

  • As we now have our dataset loaded into the dataframe, we will cluster the data using the K-Means algorithm
  • Pandas has already been loaded so let's start by importing the rest of the necessary libraries
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import metrics
import seaborn as sns
  • Previously we printed the content of our newly created dataframe
  • For clustering we only need the following information:
    • Sepal length and width
    • Petal length and width
  • So basically we have four columns containing the important data in index positions 1-4 (see the picture below)

[Figure: the four measurement columns of the dataframe]

  • However, we want to include the columns with the best resolution (in other words, the columns with the greatest differences in values between flower species)
  • In order to find this out, we will group the data by flower species and present the average (mean) values for each column
dataset.groupby("Species").mean()
            Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
Species
setosa            25.5         5.006        3.428         1.462        0.246
versicolor        75.5         5.936        2.770         4.260        1.326
virginica        125.5         6.588        2.974         5.552        2.026
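To back this up numerically, one could also compute how far apart the species means are in each column; this is an optional check, not part of the original material:
# Range of the per-species means: a larger range means the column separates
# the flower species better ("Unnamed: 0" is just a row index, so drop it)
means = dataset.groupby("Species").mean().drop(columns = "Unnamed: 0")
print(means.max() - means.min())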
  • As can be seen from the output above, the Petal.Length and Petal.Width columns have the greatest differences between flower species
  • Next we will filter the data so that only these columns are included (the columns in the third and fourth index positions)
x = dataset.loc[:,["Petal.Length","Petal.Width"]].values
x[:5]
array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])
  • Now we have all the necessary values stored in a two-dimensional array where each subarray represents the petal measurements of a single iris flower
  • In the next phase, random positions for the cluster centroids need to be generated
  • The following attributes will be used for KMeans:
    • n_clusters: number of clusters
    • init: initialization method ('random' picks the initial centroids randomly from the data points)
    • max_iter: maximum number of algorithm iterations
    • random_state: ensures that the same results can be reproduced on each run
# Apply K-Means to the dataset / create the K-Means model
# (n_init is set explicitly because its default changes in newer scikit-learn versions and omitting it raises a FutureWarning)
kmeans = KMeans(n_clusters = 3, init = 'random', n_init = 10, max_iter = 100, random_state = 42)
y_kmeans = kmeans.fit_predict(x)
y_kmeans

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
  • As can be seen from the prediction results, each of the 150 samples is assigned to one of three clusters (0, 1, or 2)
  • Important: these predicted labels do not correspond to the true labels in the order they appear in the source data (for example, setosa is not necessarily marked with label 0 even though it is the first species listed in the original data)!
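One way to verify the mapping programmatically (instead of reading it from a chart, as done in the next section) is to check the most common true species within each cluster; a small optional sketch using the dataset and y_kmeans variables from above:
# Print the most common species among the members of each cluster
for cluster in range(3):
    members = dataset["Species"][y_kmeans == cluster]
    print(f"cluster {cluster}: mostly {members.mode()[0]}")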

Visualising the results and cluster analysis#

  • In order to find out the correct labels, we first need to visualise the cluster data and then assign the labels by comparing the cluster positions to the average petal values of each species
  • For the data point insertion we use the following syntax for each of the three data point groups:
plt.scatter(x_axis_data, y_axis_data, s = point_size, c = color)
  • Since we have data points for three clusters, the axis data for each group is selected with the following expression:
x[y_kmeans == cluster_number, column_number]
  • In addition, the KMeans object also provides the cluster centers, which are stored in a separate NumPy array, as can be seen from the print below
  • So, to visualise the cluster centers, we select all rows and the first column of that array for the x-axis, and all rows and the second column for the y-axis
kmeans.cluster_centers_
array([[5.59583333, 2.0375    ],
       [4.26923077, 1.34230769],
       [1.462     , 0.246     ]])
# Cluster visualisation
plt.figure(figsize=(8,8))
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'cyan')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'darkgray')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green')

# Add centroids to each cluster
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 200, c = 'red', label = 'Centroids')
plt.legend()

[Figure: scatter plot of the three clusters with red centroid markers]
  • As can be seen from the scatter chart, all 150 points have been placed into three clusters
  • By comparing the chart with our earlier data averages, we can figure out the following:
    • Setosa has the lowest values, so it is the green cluster (label 2)
    • Versicolor has the next highest values, so it is the dark gray cluster (label 1)
    • Virginica has the highest values overall, so it is the cyan cluster (label 0)
  • Next we will redraw the scatter chart with the appropriate labels
# Cluster visualisation
plt.figure(figsize=(8,8))
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'cyan', label = 'Virginica')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'darkgray', label = 'Versicolor')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Setosa')

# Add centroids to each cluster
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 200, c = 'red', label = 'Centroids')
plt.legend()
[Figure: labelled scatter plot (Virginica, Versicolor, Setosa) with red centroid markers]

Result evaluation using confusion matrix#

  • Since this chart does not tell the whole truth about whether our predictions are accurate or not, we will also use a confusion matrix
  • The confusion matrix helps define how the predicted values correspond to the true values
  • First we convert the class labels to numerical values so that the true labels have the same format as the predicted labels
# Convert species names to the numeric cluster labels identified above
truth = []

for i in dataset["Species"]:
    if i == "setosa":
        truth.append(2)
    elif i == "versicolor":
        truth.append(1)
    else:
        truth.append(0)
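As a side note, the same conversion can be written more compactly with the pandas map method; this one-liner is equivalent to the loop above:
# Map each species name directly to its numeric cluster label
truth = dataset["Species"].map({"setosa": 2, "versicolor": 1, "virginica": 0}).tolist()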
  • Then we list the class labels for the confusion matrix axes; note that they must be in the same order as the numeric labels (0 = virginica, 1 = versicolor, 2 = setosa), so the original column order from dataset["Species"].unique() would mislabel the axes
# Class labels in the order of the numeric labels above
labels = ["virginica", "versicolor", "setosa"]
  • Finally we create the confusion matrix and input both true and predicted data
# Confusion matrix definition
confusion_matrix = metrics.confusion_matrix(truth, y_kmeans)
  • Now that we have the predictions placed in the confusion matrix, we could check the results before visualising.
confusion_matrix
array([[46,  4,  0],
       [ 2, 48,  0],
       [ 0,  0, 50]], dtype=int64)
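The diagonal already suggests a good result; optionally, the overall accuracy can be computed with the metrics module that was imported earlier:
# Fraction of samples whose predicted cluster matches the true label
print(metrics.accuracy_score(truth, y_kmeans))  # 0.96 (144 / 150 correct)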
  • Next we will visualise the results using the heatmap function from the Seaborn library
  • The following parameters will be given to the heatmap:
    • data: the actual confusion matrix data presented earlier
    • annot: enabled with the value True so that the value is shown in each cell
    • fmt: string formatting for the enabled annotations (annot). In this case "d" stands for integer. For example, the following could be used:
      • d → integers
      • f → floating point numbers
      • b → binary numbers
      • o → octal numbers
      • x → hexadecimal numbers
      • e → floating point numbers in exponent format
    • ax: the axes object the heatmap is drawn on
    • cmap: specifies the colormap used for the heatmap. For example, this is one list of supported colormaps: https://proplot.readthedocs.io/en/v0.4.1/_images/colormaps_4_1.svg
  • The rest of the configuration is related to the information around the heatmap visualisation, for example the title and the axis labels
# Visualise the confusion matrix
sns.set_theme(font_scale = 1.3)
plt.figure(figsize = (10,7))
ax = plt.axes()  # create the axes on the sized figure (creating them before the figure would leave a stray empty figure)
sns.heatmap(confusion_matrix, annot = True, fmt = "d", ax = ax, cmap = "magma")

# Add a descriptive title for the figure and names for the axes
ax.set_title('Confusion Matrix - K-means clustering')
ax.set_xlabel("Predicted label", fontsize = 15)
ax.set_xticklabels(labels)
ax.set_ylabel("True label", fontsize = 15)
ax.set_yticklabels(labels, rotation = 0)
plt.show()
[Figure: confusion matrix heatmap]
  • From the confusion matrix we can draw the following conclusions:
    • The predictions were mainly correct (Virginica 46/50, Versicolor 48/50, Setosa 50/50)
    • 4 of the Virginica samples were predicted as Versicolor
    • 2 of the Versicolor samples were predicted as Virginica