Decision tree#

  • A decision tree is a supervised machine learning method
  • A decision tree consists of nodes, each of which stores data in its attributes
  • The data is split at each node, which is why these nodes are also called "decision nodes"
  • At the bottom of the tree are leaves, which are the final outcomes of a decision path
  • Below is an example of a decision tree

Decision tree example 1

  • Let's use this structure for categorising animals into the following two classes (a minimal code sketch follows the example figure below):
    • Mammal
    • Non-mammal

Decision tree example 2
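  • The same idea can also be sketched as plain code: each if statement acts as a decision node and each return value is a leaf
  • Note that the split questions used below are illustrative assumptions only and are not taken from the example figures
# A minimal hand-written decision tree for the animal example above
# NOTE: the split questions (warm_blooded, gives_milk) are illustrative assumptions only
def classify_animal(animal):
    # Root decision node: is the animal warm-blooded?
    if animal.get("warm_blooded"):
        # Second decision node: does it give milk?
        if animal.get("gives_milk"):
            return "Mammal"      # leaf
        return "Non-mammal"      # leaf, e.g. birds
    return "Non-mammal"          # leaf, e.g. reptiles and fish

print(classify_animal({"warm_blooded": True, "gives_milk": True}))   # Mammal
print(classify_animal({"warm_blooded": False}))                      # Non-mammal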

Preparing the dataset#

  • As in the two previous examples, we are going to use the Iris dataset for our decision tree implementation
  • First we import all required libraries
import pandas as pd
from sklearn import metrics, tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
  • As in the previous examples, let's load the data from our CSV file into a dataframe
data = pd.read_csv("iris.csv")
  • From the dataset we then separate the features and labels into two variables
# Save features and labels in variables
x = data[["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]]
y = data["Species"]
  • The dataset will be divided into the following variables:
    • x_train: Data used for training the model (60 %)
    • x_test: Test data used for validating the model (40 %)
    • y_train: Labels for the training data
    • y_test: Labels for the test data
  • The test_size parameter defines the proportion of the dataset reserved for testing, and fixing random_state ensures that the same split is reproduced on each run
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 42)

Tree classifier definition#

  • Next we define the classifier for the decision tree
  • The max_depth parameter controls the number of levels in the tree structure
  • Too low a value for max_depth can lead to underfitting, whereas too high a value can result in overfitting
  • With a value of 3 the tree cannot grow beyond three levels, which helps to keep the model from overfitting; an optional way to compare different depths is sketched after the classifier output below
  • The training data and labels are then passed to the classifier
clf = DecisionTreeClassifier(max_depth = 3, random_state = 42)
clf.fit(x_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=42)
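  • As a side note, the effect of max_depth can be checked empirically with cross-validation; the following sketch is an optional addition, not part of the original workflow, and assumes x_train and y_train from the split above
# Optional sketch: compare a few max_depth values with 5-fold cross-validation
from sklearn.model_selection import cross_val_score

for depth in [1, 2, 3, 5, 10]:
    candidate = DecisionTreeClassifier(max_depth = depth, random_state = 42)
    scores = cross_val_score(candidate, x_train, y_train, cv = 5)
    print(f"max_depth={depth}: mean CV accuracy {scores.mean():.3f}")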

Tree structure visualisation#

  • Now we can draw the visualisation of our decision tree
  • The following parameters are given to the plot_tree method:

    • decision tree: the classifier trained above
    • feature_names: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width
    • class_names: setosa, versicolor and virginica
  • In addition, the feature names and labels are stored in two separate variables, which are used in the tree visualisation; an optional text-based view of the tree is also shown after the plot

# Save the feature name and target variables
feature_names = x.columns
labels = y.unique()
labels
array(['setosa', 'versicolor', 'virginica'], dtype=object)
# Visualise the tree structure
plt.figure(figsize = (30,10), facecolor = 'lightgray')

tree.plot_tree(clf, feature_names = feature_names, class_names = labels, fontsize = 18)
plt.show()
Decision tree structure visualisation
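  • If the plotted figure is not available, the same structure can also be printed as plain text; the snippet below is an optional sketch using scikit-learn's export_text
# Optional: print the trained tree as plain text instead of a plot
from sklearn.tree import export_text

print(export_text(clf, feature_names = list(feature_names)))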

Running predictions for the test data#

  • Our tree has now been trained and we can check how it performs on our test data
  • We use the predict method on the test data and save the result to a variable
test_pred_decision_tree = clf.predict(x_test)
  • Predictions have now been run on our test data, which was 40 % of the original dataset (60 rows)
  • The predicted classes are stored in a NumPy array, as can be seen from the output below
test_pred_decision_tree
array(['versicolor', 'setosa', 'virginica', 'versicolor', 'versicolor',
       'setosa', 'versicolor', 'virginica', 'versicolor', 'versicolor',
       'virginica', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'virginica', 'setosa',
       'virginica', 'setosa', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'setosa', 'setosa', 'setosa', 'setosa',
       'versicolor', 'setosa', 'setosa', 'virginica', 'versicolor',
       'setosa', 'setosa', 'setosa', 'virginica', 'versicolor',
       'versicolor', 'setosa', 'setosa', 'versicolor', 'versicolor',
       'virginica', 'versicolor', 'virginica', 'versicolor', 'virginica',
       'versicolor', 'setosa', 'virginica', 'versicolor', 'setosa',
       'setosa', 'setosa', 'versicolor'], dtype=object)

Result evaluation using confusion matrix#

  • In order to evaluate the results we draw a confusion matrix
  • The confusion matrix has the true labels on one axis and the predicted labels on the other
  • A good result is one where the counts concentrate on the cells where the true and predicted label belong to the same class (for example true and predicted setosa)
# Confusion matrix definition
confusion_matrix = metrics.confusion_matrix(y_test, test_pred_decision_tree)

# Draw the confusion matrix as a heatmap
sns.set_theme(font_scale = 1.3)
fig, ax = plt.subplots(figsize = (10, 7))
sns.heatmap(confusion_matrix, annot = True, fmt = "g", ax = ax, cmap = "magma")

# Add descriptive title for the figure and names for axles
ax.set_title('Confusion Matrix - Decision Tree')
ax.set_xlabel("Predicted label", fontsize = 15)
ax.set_xticklabels(labels)
ax.set_ylabel("True Label", fontsize = 15)
ax.set_yticklabels(list(labels), rotation = 0)
plt.show()
Confusion Matrix - Decision Tree
  • The test set that was split earlier is summarised in the confusion matrix (40 % of the data, i.e. 60 of 150 rows)
  • As can be seen from the chart, our tree performed quite well
  • Only one sample was misclassified (a virginica was predicted as versicolor); a quick numeric summary of the same result is shown below
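  • Beyond the confusion matrix, the overall accuracy and per-class scores can be printed as a quick numeric check; the snippet below is an optional addition using the already imported metrics module
# Optional numeric check alongside the confusion matrix
print("Accuracy:", metrics.accuracy_score(y_test, test_pred_decision_tree))
print(metrics.classification_report(y_test, test_pred_decision_tree))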