Heart Disease Classification Project: A Deep Dive into AdaBoost

What is AdaBoost?

AdaBoost (Adaptive Boosting) is one of the first boosting algorithms developed for binary classification problems. It works by combining multiple weak learners (typically decision trees) to form a strong learner. Each weak learner focuses on the mistakes of the previous one, adapting to improve the overall model performance.

Why AdaBoost?

AdaBoost’s strength lies in its simplicity and efficiency. It has proven to perform well in both binary and multi-class classification tasks, and in practice it is often fairly resistant to overfitting, although it can be sensitive to noisy data and outliers. It’s a great choice when you need a model that balances interpretability and accuracy.


How AdaBoost Works

AdaBoost works by building a sequence of weak learners, where each learner attempts to correct the errors made by its predecessor. The steps include:

  1. Initialize Weights: Initially, each data point is assigned an equal weight. These weights indicate how much influence each example has in the learning process.
  2. Train Weak Learner: A weak learner (such as a decision tree stump) is trained on the data, considering the weights of the training samples.
  3. Update Weights: After the weak learner is trained, the weights are updated. Misclassified samples receive higher weights, while correctly classified samples receive lower weights. This process ensures that the next weak learner focuses more on the difficult examples.
  4. Calculate Learner Strength: Each weak learner is assigned a weight (alpha) based on its accuracy. A more accurate learner gets a higher alpha value, meaning it has a greater influence on the final prediction (a numeric sketch of steps 3–4 follows this list).
  5. Repeat: Steps 2-4 are repeated for a pre-defined number of iterations, or until the model reaches a certain performance threshold.
  6. Final Model: The final prediction is made by combining the predictions of all weak learners, weighted by their individual alphas.
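
To make the weight update and learner strength concrete, here is a minimal numeric sketch of one boosting round. It uses the classic two-class AdaBoost formulas (scikit-learn’s SAMME variant differs slightly in the alpha scaling), and all values here are purely illustrative:

import numpy as np

# Step 1: five samples, equal initial weights
w = np.full(5, 1 / 5)

# Step 2: suppose the weak learner misclassifies samples 1 and 3
misclassified = np.array([False, True, False, True, False])

# Weighted error rate of this learner
err = np.sum(w[misclassified])            # 0.4 here

# Step 4: learner strength; lower error means larger alpha
alpha = 0.5 * np.log((1 - err) / err)     # ~0.20

# Step 3: boost misclassified weights, shrink correct ones, renormalize
w = w * np.exp(np.where(misclassified, alpha, -alpha))
w = w / w.sum()

print(alpha)   # more accurate learners get more say in the final vote
print(w)       # misclassified samples now carry more weight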

Heart Disease Dataset Overview

Heart disease is one of the most prevalent health issues worldwide, and early detection is key to improving patient outcomes. In this article, we’ll explore how to use a powerful ensemble technique, AdaBoost, to build a predictive model for heart disease classification. We will walk through the basic workings of AdaBoost, implement it using the Heart Disease Dataset and extract useful insights into the model’s internal workings.

We’ll be using the Heart Disease Dataset from the UCI Machine Learning Repository. This dataset contains 303 instances and 14 attributes, including:

  • age: Age of the patient.
  • sex: Gender of the patient (1 = male, 0 = female).
  • cp: Chest pain type (4 types).
  • trestbps: Resting blood pressure (mm Hg).
  • chol: Serum cholesterol (mg/dl).
  • fbs: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false).
  • target: Diagnosis of heart disease (0 = no heart disease; in the raw UCI data, values 1-4 indicate increasing severity of disease).

For simplicity, we’ll transform this into a binary classification problem, where 1 means the presence of heart disease, and 0 means no heart disease.


AdaBoost Step-by-Step Implementation

Here’s how to implement AdaBoost with the Heart Disease dataset in Python. We’ll go through each step, from loading the dataset to printing the internal workings of AdaBoost.

Step 1: Import Libraries

First, let’s import the necessary libraries:

import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Silence library warnings for cleaner output
warnings.filterwarnings('ignore')

Step 2: Load and Preprocess the Dataset

Next, we load the dataset and handle missing values:

# URL of the Heart Disease dataset
url = "https://raw.githubusercontent.com/goradbj1/dataairevolution/refs/heads/main/datasets/heart_disease_data.csv"

# Load the dataset from the URL
df = pd.read_csv(url, index_col=0)
df.head()

# Select only a few relevant columns
df = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'target']]

# Replace missing values ('?') with NaN and drop them for simplicity
df = df.replace('?', np.nan).dropna()

# Convert target into binary classification (0 = no disease, 1 = disease)
df['target'] = df['target'].apply(lambda x: 1 if x > 0 else 0)
df
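
Depending on how pandas parses the '?' entries, some columns may be left with an object (string) dtype even after the replacement. A small defensive cast, assuming all remaining values are numeric:

# Ensure every column is numeric; this raises if anything non-numeric slipped through
df = df.apply(pd.to_numeric)
print(df.dtypes)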

Step 3: Splitting the Data

We split the dataset into features (X) and labels (y), and then into training and testing sets:

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
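
With only around 300 rows, it can also be worth preserving the class balance in both splits. A stratified variant of the same call needs just one extra argument:

# Optional: a stratified split keeps the disease/no-disease ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)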

Step 4: Train the AdaBoost Classifier

We initialize and train the AdaBoost classifier with DecisionTreeClassifier (a depth-1 decision stump) as the weak learner. We will create only 3 weak learners, but you can experiment with this number.

# Initialize AdaBoostClassifier with a decision stump as the weak learner
# (in scikit-learn versions before 1.2, this parameter is named base_estimator)
ada_classifier = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=3)

# Fit the model
ada_classifier.fit(X_train, y_train)

Step 5: Understanding Internal Workings

Let’s extract and print the internal workings of the AdaBoost algorithm, including the number of weak learners, alpha values (weights), and the predictions made by each weak learner:

# Extract information
print("Number of weak learners:", len(ada_classifier.estimators_))

# Print the weight (alpha) assigned to each weak learner
print("\nAlpha values (weights) of weak learners:")
print(ada_classifier.estimator_weights_)
# Go through each weak learner (Decision Tree stump)
for i, tree in enumerate(ada_classifier.estimators_):
    # Print weak learner i
    print(f"\nWeak Learner {i + 1}:")
    
    # Get predictions from the weak learner on the training data
    train_pred = tree.predict(X_train)
    print(f"Weak Learner Predictions (Training Data): {train_pred}")
    
    # Check how well it performed on the training data
    accuracy = accuracy_score(y_train, train_pred)
    print(f"Weak Learner Training Accuracy: {accuracy:.2f}")

Step 6: Final Predictions and Model Performance

We make predictions on the test set and evaluate the model’s performance:

# Final predictions of the ensemble on test data
final_pred = ada_classifier.predict(X_test)
print("\nFinal Predictions of the AdaBoost model (Test Data):", final_pred)

# Calculate final accuracy
final_accuracy = accuracy_score(y_test, final_pred)
print(f"Final Test Accuracy: {final_accuracy:.2f}")

Output:
Final Predictions of the AdaBoost model (Test Data): [0 1 1 1 0 1 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1]

Final Test Accuracy: 0.79
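
Accuracy alone can hide clinically important errors such as missed diagnoses (false negatives). A per-class breakdown is cheap to add:

from sklearn.metrics import classification_report, confusion_matrix

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, final_pred))
print(classification_report(y_test, final_pred, target_names=['no disease', 'disease']))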

To save the trained AdaBoost model, you can use the joblib library or the pickle module in Python. Below is the code for both methods.

Using joblib

First, ensure you have joblib installed: pip install joblib

import joblib

# Save the model
joblib.dump(ada_classifier, 'ada_boost_model.pkl')
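
Loading the model back later is symmetric: loaded_model = joblib.load('ada_boost_model.pkl').

Using pickle

The standard-library pickle module works the same way, with explicit file handles:

import pickle

# Save the model
with open('ada_boost_model.pkl', 'wb') as f:
    pickle.dump(ada_classifier, f)

# Load it back later
with open('ada_boost_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)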

Key Insights from AdaBoost’s Internal Working

From the internal workings, we observe:

  • Alpha Values: These represent the influence of each weak learner. Learners with higher alpha values contribute more to the final prediction (the snippet after this list pairs each alpha with its learner’s error).
  • Weak Learner Predictions: Each weak learner makes predictions on the training data, with varying accuracy.
  • Final Accuracy: Combining just three decision stumps already yields a reasonable test accuracy (79%), illustrating how boosting turns weak learners into a stronger classifier.
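
scikit-learn also exposes each learner’s weighted training error via estimator_errors_, which makes the inverse relationship between error and alpha easy to see:

# Each weak learner's weighted error alongside the alpha it earned
for err, alpha in zip(ada_classifier.estimator_errors_, ada_classifier.estimator_weights_):
    print(f"weighted error = {err:.3f}  ->  alpha = {alpha:.3f}")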

Conclusion

AdaBoost is a powerful ensemble technique that can significantly improve the performance of weak classifiers like decision trees. By focusing on the misclassified instances from previous models, AdaBoost creates a robust final model. In this tutorial, we successfully implemented AdaBoost for heart disease classification and examined its internal workings.

Boosting techniques like AdaBoost are widely used in industry, particularly in medical research, where accurate classification is essential. By using such models, we can help clinicians make better decisions, ultimately improving patient outcomes.


Future Enhancements

  1. Tuning Hyperparameters: Experiment with the number of weak learners (n_estimators) or other AdaBoost hyperparameters to improve accuracy.
  2. Advanced Weak Learners: Use more complex base learners (like deeper decision trees) and compare results.
  3. Cross-Validation: Apply cross-validation to ensure the model generalizes well on unseen data (see the sketch after this list).
  4. Real-world Applications: Extend this approach to other health datasets or industries that need reliable predictive models.
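
As a starting point for the cross-validation idea above, here is a minimal sketch using the same model configuration (cross_val_score clones and refits the estimator internally, so the earlier fit is untouched):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the same AdaBoost configuration
scores = cross_val_score(ada_classifier, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")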
