K-Nearest Neighbors (KNN) algorithm#

  • Previously we inspected how the K-Means algorithm works
  • K-Means is an unsupervised algorithm, meaning the data can be fed to the algorithm without first training the model with labels
  • In this section we take a look at how the supervised K-Nearest Neighbors algorithm works
  • The idea in the KNN algorithm is to calculate the distance between points positioned on a graph
  • The assumption is that similar points can be found in close proximity to each other

KNN algorithm description#

  1. Load the data
  2. Set the value for K, which equals the number of neighbors
  3. Calculate the distance between the input sample and the training samples
  4. Add the distances to a data structure
  5. Sort the data structure containing all the distances in ascending order (smallest value first)
  6. Select the label according to the majority of the K nearest neighbors (a minimal sketch of these steps follows below)

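  • The steps above can be illustrated with a small, self-contained sketch; this is a minimal NumPy implementation written for illustration only (the helper name knn_predict and the toy data are assumptions), while the actual classifier used later in this section comes from sklearn
import numpy as np

def knn_predict(x_new, X_train, y_train, k=3):
    # 3. Calculate the Euclidean distance between the input sample and all training samples
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 4.-5. Collect the distances and sort them in ascending order, keeping the K nearest indices
    nearest = np.argsort(distances)[:k]
    # 6. Select the label according to the majority of the K nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# 1.-2. Load a toy dataset and set the value for K
X_train = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [6.5, 2.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([5.0, 3.4]), X_train, y_train, k=3))  # prints 0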

Distance calculation#

  • There are many metrics available for calculating the distance between two points in the KNN algorithm
  • The most widely used metric is the Euclidean distance, which is also used in the Python implementation of KNN in this section
  • The distance is measured between two points along a straight line in Euclidean space
  • The following formula is utilised:

$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$, where p and q are two points with n features
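  • As a quick illustration (an addition to this section), the formula maps directly to a couple of NumPy operations; the helper name euclidean below is an assumption for this example
import numpy as np

def euclidean(p, q):
    # Straight-line distance between two points in Euclidean space
    return np.sqrt(np.sum((np.asarray(q) - np.asarray(p)) ** 2))

print(euclidean([5.1, 3.5], [6.3, 3.3]))  # ~1.217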

Prepare the data for KNN algorithm#

  • In this example we will focus on classifying the data into three Iris classes (Setosa, Versicolor and Virginica)
  • In addition, only the sepal width and sepal length features will be used
  • Let's start by importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors
from sklearn.model_selection import train_test_split
  • We will suppress FutureWarnings, as these might appear with some parts of the sklearn library used here
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
  • Next we will load the data from the CSV file into a dataframe
  • In order to use the data, the dataframe needs to be filtered so that only sepal width and sepal length are included
  • Since three classes are used, the classification targets will be stored as numbers 0-2
  • We will also set the h variable, which is used as the step parameter for NumPy's arange function (classification borders will be more accurate with lower values)
data = pd.read_csv("iris.csv")

# Prepare the data
X = data.to_numpy()[:, 1:3].astype(float)
y = data.replace({"Species":{"setosa":0, "versicolor":1, "virginica":2}})["Species"].to_numpy()
h = .01

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
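  • As a quick sanity check (an addition to this section, assuming iris.csv contains the usual 150 samples), we can print the shapes of the arrays produced above
# Verify that the features, labels and splits have the expected shapes
print("X:", X.shape, "y:", y.shape)
print("train:", x_train.shape, "test:", x_test.shape)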

Finding the optimal value for the number of neighbors#

  • Choosing the optimal k value might not be easy
  • Some guidelines usually used for selecting the optimal k value for the KNN algorithm are the following:
    • Value should be odd
    • Value should not be a multiple of the number of classes
    • Value should not be too small or too large
  • One great way of choosing the optimal value is to draw a chart where the error rate is compared to k values
  • Then we choose the k value that gives the lowest error rate within our range of tested values
  • In this case we choose to test every value from 1 to 39; a small sketch of how the guidelines could filter the candidates is shown below
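  • The cell after this scans every value from 1 to 39; if one wanted to enforce the guidelines listed above, the candidates could be pre-filtered first, as in this small sketch (a hypothetical helper, not used in the rest of this section)
# Candidate k values that are odd and not multiples of the class count (3)
n_classes = 3
candidates = [k for k in range(1, 40) if k % 2 == 1 and k % n_classes != 0]
print(candidates)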
error_rate = []
for i in range(1,40):
    knn = neighbors.KNeighborsClassifier(n_neighbors = i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize = (10,6))
plt.plot(range(1,40), error_rate,color = 'blue', linestyle = 'dashed', marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. k value')
plt.xlabel('k')
plt.ylabel('Error Rate')

# Calculate the k value with the minimum error (add one because list indices start at 0 while k values start at 1)
min_error = error_rate.index(min(error_rate)) + 1
print("Minimum error", min(error_rate), "is achieved with k value of", min_error)
Minimum error 0.15555555555555556 is achieved with k value of 28

(Figure: Error Rate vs. k value chart)
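  • The same lookup could also be done with NumPy's argmin, for example (an equivalent alternative, not part of the original cell)
# Equivalent lookup of the best k value using NumPy
best_k = np.argmin(error_rate) + 1
print("Best k:", best_k)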

Running KNN for Iris dataset#

  • Now that we have found the k value with the lowest error rate, we can use it as an input value for our KNN classifier
  • We give our classifier two parameters:
    • neighbor count
    • metric (Euclidean is used here)
n_neighbors = min_error

# Neighbors classifier definition. Euclidean will be used as a metric
clf = neighbors.KNeighborsClassifier(n_neighbors, metric = "euclidean")

# Input data and labels to classifier
clf.fit(X, y)
KNeighborsClassifier(metric='euclidean', n_neighbors=28)
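  • As an optional check (an addition, not part of the original flow), the chosen k can also be evaluated on the held-out test split; note that the cell above fits on the full dataset for plotting purposes, so this sketch refits on the training split only
# Evaluate the chosen k on the held-out test split
clf_eval = neighbors.KNeighborsClassifier(n_neighbors, metric = "euclidean")
clf_eval.fit(x_train, y_train)
print("Test set accuracy:", clf_eval.score(x_test, y_test))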

# Calculate minimum and maximum boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
  • Predictions can now be run for every point of the grid (all grid points fall within the area defined by the boundaries above)
# Predict class using data and KNN classifier
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

Visualising the results#

  • The best way to present the classification results is data visualisation using charts
  • For this example we need two different color maps, one for the background regions and one for the points
  • Colors will be set for the three classes (light colors for the backgrounds and darker colors for the points)
# Create color maps for data visualisation
bg_colors = ListedColormap(["#FFAAAA", "#AAFFAA", "#00AAFF"])
p_colors = ListedColormap(["#FF0000", "#00FF00", "#0884C2"])
  • Draw a chart from the predicted data, utilising the previously defined colormaps
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize = (8,8))
plt.pcolormesh(xx, yy, Z, cmap = bg_colors)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = p_colors, edgecolors = 'black')
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Iris classification (k = %i)" % (n_neighbors))
plt.show()
(Figure: Iris classification decision regions, k = 28)
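  • Finally, the fitted classifier can be used to classify a new measurement; the sample values below are made up for illustration
# Predict the class of a new flower from its sepal length and sepal width
new_sample = np.array([[6.1, 2.9]])
prediction = clf.predict(new_sample)
class_names = ["setosa", "versicolor", "virginica"]
print("Predicted class:", class_names[int(prediction[0])])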