K-Nearest Neighbors (KNN) algorithm#
- Previously we inspected how the K-Means algorithm works
- K-Means is an unsupervised algorithm, meaning that data can be fed to the algorithm without first training the model with labels
- In this section we take a look at how the supervised K-Nearest Neighbors algorithm works
- The idea in the KNN algorithm is to calculate the distance between points positioned on a graph
- The assumption is that similar points are found in close proximity to each other
KNN algorithm description#
- Load the data
- Set the value for K, which equals the number of neighbors
- Calculate the distance between the input sample and the training samples
- Add the distances to a data structure
- Sort the data structure containing all the distances in ascending order (smallest value first)
- Take the first K entries from the sorted distances
- Select the label according to the majority vote of the K nearest neighbors (a minimal sketch of this prediction step follows the list)
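- To make these steps concrete, below is a minimal from-scratch sketch of the prediction step; the function name knn_predict and the toy data are purely illustrative, and the Iris example later in this section uses scikit-learn instead

import numpy as np
from collections import Counter

def knn_predict(x_input, X_train, y_train, k=3):
    # Calculate the Euclidean distance between the input sample and every training sample
    distances = np.sqrt(((X_train - x_input) ** 2).sum(axis=1))
    # Sort the distances in ascending order and keep the indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Select the label according to the majority vote of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([1.1, 1.0]), X_train, y_train, k=3))  # -> 0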
Distance calculation#
- There are many metrics available for calculating the distance between two points in the KNN algorithm
- The most widely used metric is the Euclidean distance, which is also used in the Python implementation of KNN in this section
- The distance is measured between two points along a straight line in Euclidean space
- The following formula is utilised:
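$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

- Here p and q are the two points and n is the number of features (two in our case: sepal length and sepal width)
- As a quick illustrative check (not part of the classifier code below), the same distance can be computed in NumPy with np.linalg.norm:

import numpy as np
print(np.linalg.norm(np.array([5.0, 3.5]) - np.array([6.0, 3.0])))  # sqrt(1.25) ≈ 1.118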
Prepare the data for KNN algorithm#
- In this example we will focus on classification according to the three Iris classes (Setosa, Versicolor and Virginica)
- In addition, only the sepal length and sepal width will be used as features
- Let's start by importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors
from sklearn.model_selection import train_test_split
- We will suppress FutureWarnings, as these might appear with some parts of the sklearn library
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
- Next we will load the data from a CSV file into a dataframe
- In order to use the data, the dataframe needs to be filtered so that only the sepal length and sepal width columns are included
- Since three classes are used, the classification targets will be stored as numbers 0-2
- We will also set the h variable, which is used as the step parameter for NumPy's arange function (classification borders will be more accurate with lower values)
data = pd.read_csv("iris.csv")
# Prepare the data
X = data.to_numpy()[:, 1:3].astype(float)
y = data.replace({"Species":{"setosa":0, "versicolor":1, "virginica":2}})["Species"].to_numpy()
h = .01
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
Finding the optimal value for the number of neighbors#
- Choosing the optimal k value might not be easy
- Some guidelines usually used for selecting the optimal k value for the KNN algorithm are the following:
- The value should be odd
- The value should not be a multiple of the number of classes
- The value should not be too small or too large
- One good way of choosing the optimal value is to draw a chart where the error rate is compared against different k values
- Then we choose the k value that yields the minimum error rate within our range of tested values
- In this case we test values from 1 to 39
error_rate = []
for i in range(1, 40):
    knn = neighbors.KNeighborsClassifier(n_neighbors = i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize = (10,6))
plt.plot(range(1, 40), error_rate, color = 'blue', linestyle = 'dashed', marker = 'o', markerfacecolor = 'red', markersize = 10)
plt.title('Error Rate vs. k value')
plt.xlabel('k')
plt.ylabel('Error Rate')
# Find the k value with the lowest error rate (add one because list indices start at 0 while k values start at 1)
min_error = error_rate.index(min(error_rate)) + 1
print("Minimum error", min(error_rate), "is achieved with k value of", min_error)
Running KNN for Iris dataset#
- Now that we have found the k value with the lowest error rate, we can use it as an input value for our KNN classifier
- We give our classifier two parameters:
- the neighbor count
- the metric (Euclidean is used here)
n_neighbors = min_error
# Neighbors classifier definition. Euclidean will be used as a metric
clf = neighbors.KNeighborsClassifier(n_neighbors, metric = "euclidean")
# Input data and labels to classifier
clf.fit(X, y)
# Calculate minimum and maximum boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
- Predictions can now be run for every point of the grid within the given boundaries (all grid positions lie inside the defined area borders)
# Predict class using data and KNN classifier
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Visualising the results#
- The best way to present the classification results is to visualise the data using charts
- For this example we need two color maps with three colors each: one for the background regions and one for the points
- Colors will be set for the three classes (light colors for the backgrounds and darker colors for the points)
# Create color maps for data visualisation
bg_colors = ListedColormap(["#FFAAAA", "#AAFFAA", "#00AAFF"])
p_colors = ListedColormap(["#FF0000", "#00FF00", "#0884C2"])
- Draw a chart from the predicted data utilising the previously defined color maps
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize = (8,8))
plt.pcolormesh(xx, yy, Z, cmap = bg_colors)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = p_colors, edgecolors = 'black')
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Iris classification (k = %i)" % (n_neighbors))
plt.show()
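- As a quick sanity check, the fitted classifier can also predict the class of a single new measurement; the sample values below are made up purely for illustration

# Predict the class of a single illustrative sample (sepal length, sepal width)
sample = np.array([[5.0, 3.5]])
print("Predicted class:", clf.predict(sample)[0])  # 0 = setosa, 1 = versicolor, 2 = virginica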