Python Implementation of Decision Trees Using Entropy – Step-by-Step Guide
Decision Trees are one of the most popular and powerful algorithms used for classification tasks in machine learning. Entropy is a key concept in Decision Trees, which is used to measure the impurity or disorder of a dataset. The goal is to minimize entropy to create the best possible splits at each node in the tree.
In this step-by-step guide, we will implement a Decision Tree classifier using entropy as the splitting criterion. We will use scikit-learn to implement the Decision Tree algorithm and evaluate it.
1. Introduction to Decision Trees and Entropy
A Decision Tree is a flowchart-like tree structure where each internal node represents a feature or attribute, the branches represent the decision rules, and each leaf node represents the outcome or class label.
Entropy is a measure of disorder or impurity in the dataset. The formula for entropy is:
Entropy(S) = - Σ (i = 1 to m) p_i * log2(p_i)
where:
- p_i is the probability of class i in the dataset S,
- m is the number of distinct classes in S.
Information Gain is the reduction in entropy obtained by splitting the dataset on a feature: the entropy of the parent node minus the weighted average entropy of the child nodes. The feature with the highest information gain is selected for the split, as sketched in the code below.
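To make these definitions concrete, here is a small, self-contained sketch (plain NumPy, not part of scikit-learn) that computes the entropy of a label array and the information gain of a candidate split:
# Illustrative helper functions for entropy and information gain
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(parent_labels, left_labels, right_labels):
    # Entropy of the parent minus the weighted entropy of the two children
    n = len(parent_labels)
    weighted_child_entropy = (
        len(left_labels) / n * entropy(left_labels)
        + len(right_labels) / n * entropy(right_labels)
    )
    return entropy(parent_labels) - weighted_child_entropy

# Example: a perfectly separating split achieves the maximum possible gain
parent = np.array([0, 0, 1, 1])
print(entropy(parent))                                   # 1.0 for two balanced classes
print(information_gain(parent, parent[:2], parent[2:]))  # 1.0 (pure children)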
2. Step-by-Step Decision Tree Implementation
Step 1: Import Required Libraries
We begin by importing the necessary libraries, including scikit-learn for implementing the Decision Tree algorithm, pandas for handling datasets, and matplotlib for visualization.
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Step 2: Load and Explore the Dataset
In this step, we load the data from an open repository (such as Kaggle or the UCI Machine Learning Repository) and explore it to understand the features and their importance. In this example, we will use the famous Iris dataset, which ships with scikit-learn. The dataset contains measurements of three different species of iris flowers.
# Load the Iris dataset from scikit-learn
from sklearn import datasets
iris = datasets.load_iris()
# Features and target
X = iris.data # Features
y = iris.target # Target (labels)
# Display the first few rows of the dataset
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
print("First 5 rows of the data:\n", X[:5])
OUTPUT:
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
First 5 rows of the data:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
Step 3: Data Preprocessing
Before building the model, we need to preprocess the data. In this preprocessing step, we focus on the following tasks, where needed:
- Missing value imputation: fill missing values in a feature with its median or mode (or drop the affected rows).
- Drop columns that do not influence the target.
- Visualize the relationships between features to check whether any of them are highly correlated with each other.
- Convert any categorical features to numerical ones, for example by applying one-hot encoding (OHE).
Finally, after completing the steps above, we split the dataset into training and testing sets (for example a 70-30 or 80-20 split, i.e. most of the data is used for training the model and the rest for testing it) and, if required, bring all features onto the same scale using methods such as MinMaxScaler or StandardScaler. A generic sketch of these steps is shown below.
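The following sketch applies the steps above to a small, made-up pandas DataFrame; the column names and values are purely illustrative and not part of the Iris example.
# Generic preprocessing sketch on a toy, made-up DataFrame
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'age':    [25, 32, None, 41, 29, 35],
    'income': [40000, 52000, 61000, None, 45000, 58000],
    'city':   ['NY', 'LA', 'NY', 'SF', 'LA', 'SF'],
    'target': [0, 1, 0, 1, 0, 1],
})

# 1. Impute missing numeric values with each column's median
df = df.fillna(df.median(numeric_only=True))

# 2. One-hot encode the categorical feature ('city')
df = pd.get_dummies(df, columns=['city'], drop_first=True)

# 3. Separate features and target, then split 70-30
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 4. Scale the features (fit on the training set only, to avoid data leakage);
#    note that tree-based models do not strictly require feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)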
For this example, no major preprocessing is needed. We'll simply ensure that the dataset is ready for training. Next, we split the data into training and testing sets.
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Display the shape of training and testing data
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
OUTPUT:
Training data shape: (120, 4)
Testing data shape: (30, 4)
Step 4: Build and Train the Decision Tree Model
We've already split the dataset in the previous step. We will now create a Decision Tree model using entropy as the criterion for the best split and train it using the training data.
# Create a Decision Tree Classifier with 'entropy' as the criterion
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=42)
# Train the model on the training data
dt_model.fit(X_train, y_train)
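Optionally, we can inspect the fitted tree using standard scikit-learn attributes such as get_depth(), get_n_leaves(), and feature_importances_ (the exact values will depend on the training split):
# Optional: inspect the fitted tree
print("Tree depth:", dt_model.get_depth())
print("Number of leaves:", dt_model.get_n_leaves())
print("Feature importances:", dict(zip(iris.feature_names, dt_model.feature_importances_)))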
Step 5: Make Predictions on the Test Set
Once the model is trained, we can make predictions on the test set.
# Predict the class labels for the test data
y_pred = dt_model.predict(X_test)
# Display the predicted labels
print("Predicted Labels:", y_pred)
OUTPUT:
Predicted Labels: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Step 6: Evaluate the Model
Once we have the predictions, we can now evaluate the performance of the model by calculating accuracy and other evaluation metrics.
# Confusion Matrix and Accuracy
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
# Evaluate the model's accuracy on the test data
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Decision Tree model: {accuracy * 100:.2f}%")
# Display a detailed classification report
print("Classification Report:\n", classification_report(y_test, y_pred))
OUTPUT:
Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
Accuracy of the Decision Tree model: 100.00%
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
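A perfect score on a 30-sample test set is encouraging but can be optimistic. As a quick, optional sanity check, we can estimate accuracy with 5-fold cross-validation using scikit-learn's cross_val_score (the exact scores will depend on the folds):
# Optional sanity check: 5-fold cross-validated accuracy on the full dataset
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    DecisionTreeClassifier(criterion='entropy', random_state=42),
    X, y, cv=5, scoring='accuracy'
)
print(f"Cross-validated accuracy: {cv_scores.mean() * 100:.2f}% (+/- {cv_scores.std() * 100:.2f}%)")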
Step 7: Visualize the Decision Tree
To better understand how the decision tree model works, we can visualize it. scikit-learn provides a convenient method, plot_tree, to plot the tree structure. By visualizing the decision tree, we can see how the tree splits the data at each node based on features. Each node shows the feature used for the split, the threshold, and the class distribution.
# Visualize the Decision Tree
plt.figure(figsize=(12, 8))
plot_tree(dt_model, feature_names=iris.feature_names, class_names=iris.target_names, filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree Visualization")
plt.show()
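If a graphical plot is inconvenient, scikit-learn's export_text provides a plain-text view of the same learned rules:
# Alternative: a plain-text view of the learned decision rules
from sklearn.tree import export_text

print(export_text(dt_model, feature_names=iris.feature_names))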

3. Conclusion
In this guide, we have implemented a Decision Tree classifier using entropy as the criterion for splitting the data. The steps covered include:
- Loading and exploring the dataset.
- Preprocessing the data (splitting into training and testing sets).
- Building and training the Decision Tree model.
- Making predictions and evaluating the model.
- Visualizing the Decision Tree structure.
Decision Trees are easy to interpret and can handle both numerical and categorical data. While they are highly interpretable, they can also suffer from overfitting. To prevent overfitting, we can tune hyperparameters such as tree depth, minimum samples per leaf, and others.
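For example, a more constrained tree could be configured as follows; the specific values here are illustrative, not tuned:
# Illustrative only: constrain the tree to reduce overfitting
pruned_model = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=3,            # limit the depth of the tree
    min_samples_leaf=5,     # require at least 5 samples in each leaf
    ccp_alpha=0.01,         # cost-complexity pruning strength
    random_state=42
)
pruned_model.fit(X_train, y_train)
print("Pruned tree accuracy:", accuracy_score(y_test, pruned_model.predict(X_test)))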
You can experiment with other datasets and parameters to improve the model and explore the power of Decision Trees with entropy!