Random Forest for Regression
We can also use Random Forest for regression tasks. Below is an example using the California Housing dataset.
Step 1: Import Required Libraries
Before implementing Random Forest, import the necessary Python libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
Step 2: Load and Explore the Dataset
In this step we load the data, which in practice might come from an open repository such as Kaggle or the UCI Machine Learning Repository, and explore it to understand the features and their importance.
We will use the California Housing dataset, which ships with scikit-learn, as the regression example.
housing = fetch_california_housing()
df_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
df_housing['target'] = housing.target
df_housing.head()  # print the top 5 rows of the dataset
OUTPUT:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude target
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
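If you want a quick feel for the data before modeling, a minimal exploration sketch such as the following can help (describe() and isnull() are standard pandas methods on the df_housing frame created above):
# Quick exploration sketch: summary statistics and a missing-value check
print(df_housing.describe())        # count, mean, std, min/max per column
print(df_housing.isnull().sum())    # the California Housing data has no missing values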
Step 3: Split the Data
In this step we split the dataset into training and testing sets. Feature scaling is not required here, since tree-based models such as Random Forest are insensitive to the scale of the features.
# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    df_housing.drop(columns=['target']),
    df_housing['target'],
    test_size=0.2,
    random_state=42
)
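As an optional sanity check, you can print the shapes of the resulting splits; with the 20,640-row California Housing data an 80/20 split should give roughly 16,512 training and 4,128 test rows.
# Optional sanity check on the split sizes
print(X_train.shape, X_test.shape)   # expected: (16512, 8) and (4128, 8)
print(y_train.shape, y_test.shape)   # expected: (16512,) and (4128,)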
Step 4: Initialize Random Forest Regressor
We create an instance of RandomForestRegressor with 100 trees.
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
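Beyond n_estimators and random_state, RandomForestRegressor exposes several other hyperparameters that are often worth tuning. The values below are illustrative choices only, not tuned recommendations, and the rf_regressor_tuned name is introduced just for this sketch.
# Illustrative (not tuned) hyperparameter choices for RandomForestRegressor
rf_regressor_tuned = RandomForestRegressor(
    n_estimators=200,       # number of trees in the forest
    max_depth=None,         # grow each tree until its leaves are pure
    min_samples_leaf=2,     # require at least 2 samples per leaf to smooth predictions
    n_jobs=-1,              # build trees in parallel on all CPU cores
    random_state=42
)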
Step 5: Train the Model
Train the model using the training data (X_train, y_train).
rf_regressor.fit(X_train, y_train)
OUTPUT:
RandomForestRegressor(random_state=42)
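Random Forest can also estimate its generalization error without a separate validation set by using out-of-bag (OOB) samples. The sketch below refits a copy of the model with oob_score=True; rf_oob is a name introduced only for this example.
# Sketch: out-of-bag (OOB) estimate of generalization performance
rf_oob = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print(f'OOB R^2 score: {rf_oob.oob_score_:.3f}')   # R^2 computed on out-of-bag samples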
Step 6: Predict on Test Data
Now that the model is trained, we use it to make predictions on the test features (X_test) and compare them with the true values (y_test).
y_pred_reg = rf_regressor.predict(X_test)
print(y_pred_reg)
OUTPUT:
[0.5095 0.74161 4.9232571 ... 4.7582187 0.71409 1.65083 ]
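Since matplotlib is already imported in Step 1, an optional predicted-versus-actual scatter plot is a quick way to see how closely the predictions track the true values; this visualization is a sketch, not part of the original pipeline.
# Optional visualization: predicted vs. actual target values
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred_reg, alpha=0.3)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')   # perfect-prediction line
plt.xlabel('Actual median house value')
plt.ylabel('Predicted median house value')
plt.title('Random Forest: predicted vs. actual')
plt.show()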
Step 7: Evaluate Performance
We evaluate the regressor with Mean Squared Error (MSE), the average of the squared differences between predicted and actual values; lower values indicate better predictions.
mse = mean_squared_error(y_test, y_pred_reg)
print(f'Mean Squared Error: {mse:.2f}')
OUTPUT:
Mean Squared Error: 0.26
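MSE is only one way to summarize regression error. A sketch of a few complementary scikit-learn metrics (RMSE, MAE and R^2) is shown below; the extra names come from sklearn.metrics.
from sklearn.metrics import mean_absolute_error, r2_score

rmse = mean_squared_error(y_test, y_pred_reg) ** 0.5   # root mean squared error, in target units
mae = mean_absolute_error(y_test, y_pred_reg)          # average absolute prediction error
r2 = r2_score(y_test, y_pred_reg)                      # fraction of target variance explained
print(f'RMSE: {rmse:.2f}, MAE: {mae:.2f}, R^2: {r2:.2f}')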
Key Takeaways
- Random Forest Regressor is an ensemble method using multiple decision trees.
- Works well for continuous data prediction like house prices.
- Reduces overfitting by averaging multiple trees.
- Mean Squared Error (MSE) quantifies the average squared prediction error; lower values mean better predictions.
- Scalable & robust for real-world regression tasks.
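As a closing illustration, the trained forest's feature_importances_ attribute shows which features drive the predictions; the short sketch below plots them with the seaborn import from Step 1 (this plot is an addition, not part of the original walkthrough).
# Sketch: impurity-based feature importances of the trained forest
importances = pd.Series(rf_regressor.feature_importances_, index=housing.feature_names).sort_values(ascending=False)
sns.barplot(x=importances.values, y=importances.index)
plt.xlabel('Importance')
plt.title('Random Forest feature importances')
plt.show()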
Next Blog: Gradient Boosting in Machine Learning