K-Nearest Neighbors (KNN) algorithm#
- Previously we inspected how the K-Means algorithm works
- K-Means is an unsupervised algorithm, meaning that data can be fed to the algorithm without first training the model with labels
- In this section we take a look at how the supervised K-Nearest Neighbors algorithm works
- The idea in the KNN algorithm is to calculate the distance between points positioned on a graph
- The assumption is that similar points are found in close proximity to each other
KNN algorithm description#
- Load the data
- Set the value for K, which equals the number of neighbors
- Calculate the distance between the input sample and the training samples
- Add the distances to a data structure
- Sort the data structure containing all the distances in ascending order (smallest value first)
- Take the first K entries from the sorted distances
- Select the label according to the majority vote of the K nearest neighbors (a minimal sketch of this prediction step follows the list)
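- To make these steps concrete, below is a minimal from-scratch sketch of the prediction step; the function name knn_predict and the toy data are purely illustrative, and the Iris example later in this section uses scikit-learn instead

import numpy as np
from collections import Counter

def knn_predict(x_input, X_train, y_train, k=3):
    # Calculate the Euclidean distance between the input sample and every training sample
    distances = np.sqrt(((X_train - x_input) ** 2).sum(axis=1))
    # Sort the distances in ascending order and keep the indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Select the label according to the majority vote of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([1.1, 1.0]), X_train, y_train, k=3))  # -> 0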
Distance calculation#
- There are many metrics available for calculating the distance between two points in the KNN algorithm
- The most widely used metric is the Euclidean distance, which is also used in the Python implementation of KNN in this section
- The distance is measured between two points along a straight line in Euclidean space
- The following formula is utilised:
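$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

- Here p and q are the two points and n is the number of features (two in our case: sepal length and sepal width)
- As a quick illustrative check (not part of the classifier code below), the same distance can be computed in NumPy with np.linalg.norm:

import numpy as np
print(np.linalg.norm(np.array([5.0, 3.5]) - np.array([6.0, 3.0])))  # sqrt(1.25) ≈ 1.118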
Prepare the data for KNN algorithm#
- In this example we will focus on classification according to the three Iris classes (Setosa, Versicolor and Virginica)
- In addition, only the sepal length and sepal width will be used as features
- Let's start by importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors
from sklearn.model_selection import train_test_split
- We will suppress FutureWarnings, as these might appear with some parts of the sklearn library
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
- Next we will load the data from a CSV file into a dataframe
- In order to use the data, the dataframe needs to be filtered so that only the sepal length and sepal width columns are included
- Since three classes are used, the classification targets will be stored as numbers 0-2
- We will also set the h variable, which is used as the step parameter for NumPy's arange function (classification borders will be more accurate with lower values)
data = pd.read_csv("iris.csv")
# Prepare the data
X = data.to_numpy()[:, 1:3].astype(float)
y = data.replace({"Species":{"setosa":0, "versicolor":1, "virginica":2}})["Species"].to_numpy()
h = .01
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
Finding the optimal value for the number of neighbors#
- Choosing the optimal k value might not be easy
- Some guidelines usually used for selecting the optimal k value for the KNN algorithm are the following:
- The value should be odd
- The value should not be a multiple of the number of classes
- The value should not be too small or too large
- One good way of choosing the optimal value is to draw a chart where the error rate is compared against different k values
- Then we choose the k value that yields the minimum error rate within our range of tested values
- In this case we test values from 1 to 39
error_rate = []
for i in range(1, 40):
    knn = neighbors.KNeighborsClassifier(n_neighbors = i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize = (10,6))
plt.plot(range(1, 40), error_rate, color = 'blue', linestyle = 'dashed', marker = 'o', markerfacecolor = 'red', markersize = 10)
plt.title('Error Rate vs. k value')
plt.xlabel('k')
plt.ylabel('Error Rate')
# Find the k value with the lowest error rate (add one because list indices start at 0 while k values start at 1)
min_error = error_rate.index(min(error_rate)) + 1
print("Minimum error", min(error_rate), "is achieved with k value of", min_error)
Running KNN for Iris dataset#
- Now that we have found the k value with the lowest error rate, we can use it as an input value for our KNN classifier
- We give our classifier two parameters:
- the neighbor count
- the metric (Euclidean is used here)
n_neighbors = min_error
# Neighbors classifier definition. Euclidean will be used as a metric
clf = neighbors.KNeighborsClassifier(n_neighbors, metric = "euclidean")
# Input data and labels to classifier
clf.fit(X, y)
# Calculate minimum and maximum boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
- Predictions can now be run for every point of the grid within the given boundaries (all grid positions lie inside the defined area borders)
# Predict class using data and KNN classifier
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Visualising the results#
- The best way to present the classification results is to visualise the data using charts
- For this example we need two color maps with three colors each: one for the background regions and one for the points
- Colors will be set for the three classes (light colors for the backgrounds and darker colors for the points)
# Create color maps for data visualisation
bg_colors = ListedColormap(["#FFAAAA", "#AAFFAA", "#00AAFF"])
p_colors = ListedColormap(["#FF0000", "#00FF00", "#0884C2"])
- Draw a chart from the predicted data utilising the previously defined color maps
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize = (8,8))
plt.pcolormesh(xx, yy, Z, cmap = bg_colors)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = p_colors, edgecolors = 'black')
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Iris classification (k = %i)" % (n_neighbors))
plt.show()
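- As a quick sanity check, the fitted classifier can also predict the class of a single new measurement; the sample values below are made up purely for illustration

# Predict the class of a single illustrative sample (sepal length, sepal width)
sample = np.array([[5.0, 3.5]])
print("Predicted class:", clf.predict(sample)[0])  # 0 = setosa, 1 = versicolor, 2 = virginica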