How Random Forest Works Internally: A Step-by-Step Guide with Examples
Random Forest is one of the most popular and powerful machine learning algorithms. It’s often used in classification tasks (e.g., determining if a student will pass an exam based on study habits) but it can also be used for regression. But how does it work internally? In this article, we’ll break down the inner workings of Random Forests in a simple and detailed way, showing you how each decision tree is built, how predictions are made, and how the final decision is reached.
We will use a small dataset, and at each step, we will print out the datasets used, the predictions made by individual trees, and the final combined result.
By the end of this article, you will understand:
- How Random Forest creates decision trees.
- How predictions are made by individual trees.
- How the final prediction is made using majority voting.
What is a Random Forest?
A Random Forest is a collection (or “forest”) of decision trees. Unlike a single decision tree, where predictions are made from one model, Random Forest aggregates the results from many trees to make more accurate predictions.
Each tree in the forest is trained on a different random sample of the data, which helps prevent overfitting and improves the model’s ability to generalize to new data.
Step-by-Step Example: Predicting Whether a Student Will Pass an Exam
Let’s imagine we have a simple dataset (kept deliberately small so that each step is easy to follow) that tracks the study habits of students, and we want to predict whether they will pass their exam based on three factors:
- Hours Studied: Number of hours a student studied.
- Sleep Hours: Hours of sleep the student gets.
- Attendance: The student’s attendance percentage.
Our goal is to predict whether the student will pass the exam (target variable).
Step 1: Create the Dataset
Here are the first five rows of our sample dataset (the code in Step 2 builds the full dataset of 10 rows):
Hours_Studied | Sleep_Hours | Attendance | Pass_Exam |
---|---|---|---|
2 | 7 | 70 | 0 |
4 | 6 | 85 | 0 |
6 | 8 | 80 | 1 |
8 | 5 | 60 | 1 |
10 | 7 | 90 | 1 |
Explanation: Pass_Exam is our target column, where 1 means the student passed and 0 means the student failed.
Step 2: Split the Data into Training and Test Sets
Before building the Random Forest model, we split our data into training and test sets. The training set is used to teach the model, while the test set is used to evaluate how well the model performs on unseen data.
import pandas as pd
from sklearn.model_selection import train_test_split
# Create sample small dataset
data = {
'Hours_Studied': [2, 4, 6, 8, 10, 1, 2, 3, 12, 14],
'Sleep_Hours': [7, 6, 8, 5, 7, 10, 12, 11, 6, 5],
'Attendance': [70, 85, 80, 60, 90, 40, 50, 60, 90, 95],
'Pass_Exam': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
}
df = pd.DataFrame(data)
df
# Features (X) and target (y)
X = df[['Hours_Studied', 'Sleep_Hours', 'Attendance']]
y = df['Pass_Exam']
# Split into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('\nX_Train:\n', X_train, '\n\nX_test:\n', X_test, '\n\ny_train:\n', y_train, '\n\ny_test:\n', y_test)
Step 3: Build the Random Forest
Now, let’s create a Random Forest with 2 decision trees (for simplicity) and train it on our dataset.
from sklearn.ensemble import RandomForestClassifier
# Create Random Forest with 2 trees
n_estimators = 2
rf_model = RandomForestClassifier(n_estimators=n_estimators, random_state=42, bootstrap=True)
rf_model.fit(X_train, y_train)
Step 4: How Random Forest Works Internally
4.1. Bootstrap Sampling: Randomly Selecting Data for Each Tree
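Each tree in the Random Forest is trained on a bootstrap sample: rows drawn at random, with replacement, from the training data. scikit-learn does not store which rows each tree received, but the sample can be regenerated from the seed saved on each tree. Let’s print the sampled data used to train each tree:
from sklearn.ensemble._forest import _generate_sample_indices
# Regenerate the bootstrap sample of each tree from its stored seed.
# Note: _generate_sample_indices is a private scikit-learn helper and may change between versions.
n_samples = len(X_train)
for i, tree in enumerate(rf_model.estimators_):
    # Row positions (drawn with replacement) that were used to train this tree
    bootstrap_indices = _generate_sample_indices(tree.random_state, n_samples, n_samples)
    sampled_data = X_train.iloc[bootstrap_indices]
    print(f"\nTree {i + 1} used the following sampled data:")
    print(sampled_data)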
- The Random Forest model randomly selects rows from the training set for each tree, which makes each tree slightly different from the others. Some records appear more than once because the rows are drawn with replacement. This is expected at any dataset size: on average, a bootstrap sample contains only about 63% of the distinct training rows, and the remaining rows are left out of that tree’s training data.
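To see why repeated rows are normal, here is a small sketch of the arithmetic behind bootstrap sampling. When n rows are drawn with replacement from n rows, the expected share of distinct rows is 1 - (1 - 1/n)^n. The helper name `expected_unique_fraction` is made up for this illustration, and n = 8 is chosen because our training set has 8 rows.

```python
import numpy as np

def expected_unique_fraction(n: int) -> float:
    """Expected share of distinct rows when n rows are drawn with replacement from n rows."""
    return 1 - (1 - 1 / n) ** n

# The share stays close to 63% as the dataset grows; duplicates never go away
for n in (8, 100, 10_000):
    print(n, round(expected_unique_fraction(n), 3))

# Empirical check for n = 8: simulate many bootstrap samples and count distinct rows
rng = np.random.default_rng(0)
n = 8
fractions = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(5000)]
empirical = float(np.mean(fractions))
print("simulated share for n = 8:", round(empirical, 3))
```

The simulated share should land close to the value given by the formula.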
4.2. Subtree Forecasting: Predictions by Individual Trees
Each tree now makes predictions using the rules it learned from its own bootstrap sample. Let’s print how each individual tree predicts the test data.
# Show predictions from each tree
for i, tree in enumerate(rf_model.estimators_):
    # The forest fits its trees on plain arrays, so pass an array to avoid a feature-name warning
    tree_prediction = tree.predict(X_test.to_numpy())
    print(f"\nTree {i + 1} predicts: {tree_prediction}")
- Each decision tree independently predicts the outcome for the test data. These predictions may vary because each tree is trained on different data.
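To look inside each tree rather than only at its predictions, one option is scikit-learn’s `export_text`, which prints the learned split rules as plain text. The sketch below rebuilds the same dataset and model so that it runs on its own; the variable names `feature_names` and `tree_texts` are chosen here for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text

# Same data, split, and model settings as in the steps above
df = pd.DataFrame({
    "Hours_Studied": [2, 4, 6, 8, 10, 1, 2, 3, 12, 14],
    "Sleep_Hours": [7, 6, 8, 5, 7, 10, 12, 11, 6, 5],
    "Attendance": [70, 85, 80, 60, 90, 40, 50, 60, 90, 95],
    "Pass_Exam": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1],
})
feature_names = ["Hours_Studied", "Sleep_Hours", "Attendance"]
X, y = df[feature_names], df["Pass_Exam"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_model = RandomForestClassifier(n_estimators=2, random_state=42).fit(X_train, y_train)

# One text description of the split rules per tree
tree_texts = [export_text(tree, feature_names=feature_names) for tree in rf_model.estimators_]
for i, text in enumerate(tree_texts):
    print(f"\nTree {i + 1} rules:\n{text}")
```

Comparing the printed rules makes the differences between the trees visible: each one may split on different features and thresholds.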
4.3. Final Prediction: Majority Voting
The Random Forest takes the predictions from all trees and combines them using majority voting. In this case, if the majority of trees predict a student will pass, that will be the final prediction.
# Show the final prediction (majority voting)
final_prediction = rf_model.predict(X_test)
print("\nFinal prediction from Random Forest (majority voting):", final_prediction)
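- In classification, the final decision is made by majority voting: each tree gets one vote, and the most common prediction wins. (scikit-learn implements this by averaging the class probabilities predicted by the trees and picking the class with the highest average. With fully grown trees, whose leaves usually contain a single class, this matches a plain vote count except when the votes are tied.)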
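The combination step can be reproduced by hand. The sketch below assumes scikit-learn’s documented behaviour of averaging the per-tree class probabilities, then checks the manual result against `rf_model.predict`. It rebuilds the dataset and model so that it runs on its own; `tree_probas`, `avg_proba`, and `manual_prediction` are names introduced for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same data, split, and model settings as in the steps above
df = pd.DataFrame({
    "Hours_Studied": [2, 4, 6, 8, 10, 1, 2, 3, 12, 14],
    "Sleep_Hours": [7, 6, 8, 5, 7, 10, 12, 11, 6, 5],
    "Attendance": [70, 85, 80, 60, 90, 40, 50, 60, 90, 95],
    "Pass_Exam": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1],
})
X = df[["Hours_Studied", "Sleep_Hours", "Attendance"]]
y = df["Pass_Exam"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_model = RandomForestClassifier(n_estimators=2, random_state=42).fit(X_train, y_train)

# Class probabilities from every tree: shape (n_trees, n_test_rows, n_classes)
tree_probas = np.stack([tree.predict_proba(X_test.to_numpy()) for tree in rf_model.estimators_])

# Average over the trees, then pick the class with the highest mean probability
avg_proba = tree_probas.mean(axis=0)
manual_prediction = rf_model.classes_[avg_proba.argmax(axis=1)]

print("Manual aggregation:", manual_prediction)
print("rf_model.predict:  ", rf_model.predict(X_test))
```

With only two trees a tie is possible (one vote each); in that case the averaged probabilities are equal and `argmax` returns the first class. Using an odd number of trees avoids this.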
This process of building multiple trees, each trained on a different bootstrap sample of the data, helps Random Forests achieve better accuracy and robustness than a single decision tree. Because the trees make partly independent errors, combining them reduces the variance of the predictions, which is why the method resists overfitting and tends to generalize well to unseen data.
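The whole procedure can be sketched from scratch with plain decision trees. This is a simplified illustration, not scikit-learn’s actual implementation: it draws a bootstrap sample per tree, fits a `DecisionTreeClassifier`, and takes a hard majority vote. Two details are assumptions added here: `max_features="sqrt"` makes each split consider a random subset of features (as Random Forests usually do), and the two `new_students` rows are invented inputs.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Full 10-row dataset: Hours_Studied, Sleep_Hours, Attendance
X = np.array([[2, 7, 70], [4, 6, 85], [6, 8, 80], [8, 5, 60], [10, 7, 90],
              [1, 10, 40], [2, 12, 50], [3, 11, 60], [12, 6, 90], [14, 5, 95]])
y = np.array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1])

rng = np.random.default_rng(0)
trees = []
for seed in range(5):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw len(X) row positions with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X[idx], y[idx]))

# Hypothetical new students to classify
new_students = np.array([[9, 7, 88], [1, 9, 45]])

# One row of votes per tree, then a majority vote per student
votes = np.array([tree.predict(new_students) for tree in trees])
forest_prediction = (votes.mean(axis=0) > 0.5).astype(int)

print("Votes per tree:\n", votes)
print("Majority vote:", forest_prediction)
```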
Now, let’s extend the example with model evaluation. We’ll use common metrics such as accuracy, precision, recall, and F1-score, and we’ll also plot the ROC curve and compute the AUC to evaluate our Random Forest model thoroughly.
Step 6: Evaluating the Random Forest Model
After building and training the Random Forest model, it’s important to evaluate how well the model performs on the test data. We’ll use a variety of metrics and tools to get a clear picture of its performance.
6.1. Evaluating with Accuracy, Precision, Recall, and F1-Score
Let’s begin by calculating these basic evaluation metrics:
Accuracy: The ratio of correct predictions to the total number of predictions.
Precision: The ratio of correctly predicted positive observations to all predicted positives.
Recall: The ratio of correctly predicted positive observations to all actual positives.
F1-Score: The harmonic mean of precision and recall.
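For reference, the same definitions can be written as formulas. The abbreviations TP, FP, TN, and FN (true positives, false positives, true negatives, false negatives) are the standard ones and are introduced here for illustration:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN}
\end{align*}
```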
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Predicting on the test set
y_pred = rf_model.predict(X_test)
# Calculating evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
- We use the test data to make predictions (y_pred) and then calculate the evaluation metrics by comparing these predictions with the actual labels (y_test). Because our dataset is very small and simple, every metric comes out high, but with only two test rows these scores are not a reliable estimate. On a large, complex dataset you should expect lower scores.
6.2. Confusion Matrix
The confusion matrix is a table that helps visualize the performance of the classification model by showing the number of true positives, false positives, true negatives, and false negatives.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Fail', 'Pass'], yticklabels=['Fail', 'Pass'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
- A confusion matrix provides detailed information on how many predictions were correct and where the model made errors.
Example Output: A heatmap will be displayed showing the number of correct and incorrect predictions for each class.
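The four cells of a binary confusion matrix can also be read out as numbers and used to compute the metrics by hand. The sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not the article’s test set) so that every cell is non-zero, and cross-checks the results against scikit-learn’s metric functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels, for illustration only (1 = pass, 0 = fail)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, rows are actual classes and columns are predicted classes,
# so ravel() returns the cells in the order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Cross-check against scikit-learn's own implementations
print(accuracy_score(y_true_demo, y_pred_demo), precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo), f1_score(y_true_demo, y_pred_demo))
```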
6.3. ROC Curve and AUC Score
The ROC (Receiver Operating Characteristic) curve shows the trade-off between true positive rate (sensitivity/recall) and false positive rate. The AUC (Area Under the Curve) score gives a single value to summarize the model’s performance. A higher AUC means the model is better at distinguishing between classes.
from sklearn.metrics import roc_curve, auc
# Get probability scores for the positive class
y_prob = rf_model.predict_proba(X_test)[:, 1]
# Calculate the ROC curve and AUC
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='grey', linestyle='--') # Dashed diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
# Print AUC score
print("AUC Score:", roc_auc)
Explanation:
- predict_proba gives the probability that a data point belongs to the positive class (in this case, Pass_Exam = 1).
- The ROC curve helps visualize how well the model can distinguish between students who pass and students who fail.
- The AUC score summarizes the model’s performance: 1.0 is a perfect score, while 0.5 represents random guessing.
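One way to build intuition for the AUC is its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks this on made-up scores (`y_true_demo` and `y_score_demo` are hypothetical values, not output from the model above) against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.9])

pos_scores = y_score_demo[y_true_demo == 1]
neg_scores = y_score_demo[y_true_demo == 0]

# Compare every positive score with every negative score
wins = (pos_scores[:, None] > neg_scores[None, :]).sum()
ties = (pos_scores[:, None] == neg_scores[None, :]).sum()
manual_auc = (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

print("AUC from pairwise ranking:", manual_auc)
print("AUC from roc_auc_score:   ", roc_auc_score(y_true_demo, y_score_demo))
```

Here 7 of the 9 positive-negative pairs are ranked correctly, so both computations give 7/9.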
Example Output:
- The ROC curve will be displayed.
- The AUC score is printed after the plot.
Step 7: Final Summary of Evaluation
With all the metrics calculated, we can now summarize the performance of the Random Forest model.
- Accuracy tells us how often the model was correct overall.
- Precision focuses on how many of the model’s predicted positives were actually positive.
- Recall measures how many actual positives were identified by the model.
- F1-Score gives us a balance between precision and recall.
- The ROC curve and AUC score show the model’s ability to distinguish between classes.
These metrics help us understand different aspects of the model’s performance and identify where improvements may be necessary.
Final Thoughts
By evaluating the Random Forest using multiple metrics, we can get a comprehensive view of the model’s strengths and weaknesses. In this example, since we used a small dataset, the model performed perfectly, but in real-world applications, we might see more nuanced results.
With this understanding, you can now confidently evaluate Random Forest models in your own machine learning projects.
Future Work
To extend this knowledge further, you can experiment with:
- Increasing the number of trees to see how it impacts accuracy.
- Trying Random Forest on larger, more complex datasets.
- Exploring how Random Forests handle regression tasks.
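As a starting point for the regression experiment, here is a minimal sketch using `RandomForestRegressor`. Instead of voting, a regression forest averages the numeric predictions of its trees. The `exam_score` values and the `new_student` row are invented for this illustration; they are not part of the original dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same features as before: Hours_Studied, Sleep_Hours, Attendance
X = np.array([[2, 7, 70], [4, 6, 85], [6, 8, 80], [8, 5, 60], [10, 7, 90],
              [1, 10, 40], [2, 12, 50], [3, 11, 60], [12, 6, 90], [14, 5, 95]])
# Hypothetical numeric target (exam score out of 100), made up for this example
exam_score = np.array([35, 48, 62, 70, 85, 20, 30, 38, 90, 96])

reg = RandomForestRegressor(n_estimators=5, random_state=42).fit(X, exam_score)

# Hypothetical new student
new_student = np.array([[7, 7, 75]])

# Each tree predicts a number; the forest returns their mean
tree_preds = np.array([tree.predict(new_student)[0] for tree in reg.estimators_])
forest_pred = reg.predict(new_student)[0]

print("Per-tree predictions:", tree_preds)
print("Mean of the trees:   ", tree_preds.mean())
print("Forest prediction:   ", forest_pred)
```

Changing `n_estimators` in this sketch is also a quick way to try the first suggestion and see how the prediction stabilizes as trees are added.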