Bank Loan Approval Prediction Project: Why a Voting Classifier Outperforms Individual Models

Voting is one of the simplest ensemble methods. In voting, multiple models are trained separately, and the final prediction is made by averaging the results (for regression) or by taking the majority vote (for classification).

  • How it works: Each model makes a prediction, and these predictions are combined through majority voting (classification) or averaging (regression). Voting differs from bagging in two ways: a voting ensemble usually combines different base models, each trained on the complete dataset, whereas bagging usually repeats the same base model (e.g., a decision tree), with each copy trained on a different subset of the data.
  • Why it’s useful: Voting ensembles are easy to implement and can be a good starting point for improving prediction accuracy.
  • Example: A voting classifier might combine predictions from decision tree, SVM, and k-nearest neighbors (KNN) models, as sketched below.
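
For instance, here is a minimal sketch of this idea in scikit-learn, using the built-in iris dataset purely for illustration (not the loan data used later in this article):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Toy data for illustration only
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard voting: each base model casts one vote; the majority class wins
ensemble = VotingClassifier(
    estimators=[('dt', DecisionTreeClassifier()),
                ('svm', SVC()),
                ('knn', KNeighborsClassifier())],
    voting='hard')
ensemble.fit(X_train, y_train)
print(f"Ensemble accuracy: {ensemble.score(X_test, y_test):.4f}")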

In the banking sector, determining whether a loan should be approved or not is crucial. Machine learning can help automate this decision-making process. In this article, we will demonstrate how to use a Voting Classifier to predict whether a bank loan will be approved or rejected. A Voting Classifier combines the predictions of multiple machine learning models, enhancing accuracy by leveraging the strengths of different algorithms.

We’ll build individual classifiers like Logistic Regression, Decision Trees, and K-Nearest Neighbors (KNN), then combine them to make more accurate predictions.


Objectives:

  • Understand what a Voting Classifier is and why it’s useful.
  • Preprocess the data and build individual models.
  • Combine models into a Voting Classifier to improve performance.

Dataset:

We will be using the following dataset, which contains various features related to customers and whether their loan applications were approved.

Data Features:

  • age: Customer’s age.
  • job: Type of job (admin, technician, etc.).
  • marital: Marital status (single, married, etc.).
  • education: Education level (primary, secondary, tertiary).
  • default: Whether the customer has credit in default (yes or no).
  • balance: Account balance.
  • housing: Does the customer have a housing loan (yes or no).
  • loan: Does the customer have a personal loan (yes or no).
  • contact: Type of communication contact (cellular, telephone, etc.).
  • day: Day of the month when last contacted.
  • month: Month of the year.
  • duration: Last contact duration (in seconds).
  • campaign: Number of contacts during this campaign.
  • pdays: Number of days since the customer was last contacted.
  • previous: Number of previous contacts.
  • outcome: Outcome of the previous campaign.
  • loan_approve: Target variable (yes for loan approved, no for loan rejected).

Step-by-Step Guide:

Step 1: Import Libraries and Load Dataset

We start by importing the necessary libraries and loading the dataset.

import warnings
import pandas as pd
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the dataset
url = "https://raw.githubusercontent.com/goradbj1/dataairevolution/refs/heads/main/datasets/bank_loan_data.csv"
df = pd.read_csv(url, index_col=0)
df

Step 2: Data Preprocessing

Before building the model, we need to convert categorical columns to numerical format using Label Encoding and ensure all columns are ready for training.

# Convert categorical columns to numeric using LabelEncoder
labelencoder = LabelEncoder()

# Apply LabelEncoder to categorical columns
for column in ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'outcome', 'loan_approve']:
    df[column] = labelencoder.fit_transform(df[column])

# Convert other columns (age, balance, day, duration, campaign, pdays, previous) to numeric if they are in object type
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['balance'] = pd.to_numeric(df['balance'], errors='coerce')
df['day'] = pd.to_numeric(df['day'], errors='coerce')
df['duration'] = pd.to_numeric(df['duration'], errors='coerce')
df['campaign'] = pd.to_numeric(df['campaign'], errors='coerce')
df['pdays'] = pd.to_numeric(df['pdays'], errors='coerce')
df['previous'] = pd.to_numeric(df['previous'], errors='coerce')

# Remove rows with missing values
df.dropna(inplace=True)

# Features and target
X = df.drop('loan_approve', axis=1)  # Features
y = df['loan_approve']  # Target: loan approval (1 for approved, 0 for rejected)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((31647, 16), (13564, 16), (31647,), (13564,))
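
One caveat worth knowing: LabelEncoder assigns an arbitrary numeric order to categories, which is harmless for tree-based models but can mislead distance- and coefficient-based models such as KNN and Logistic Regression. A common alternative is one-hot encoding (a sketch, assuming it is applied to the raw categorical columns instead of the label-encoding loop above):

# One-hot encode the categorical columns instead of label encoding them
categorical_cols = ['job', 'marital', 'education', 'default', 'housing',
                    'loan', 'contact', 'month', 'outcome']
df_onehot = pd.get_dummies(df, columns=categorical_cols)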


Step 3: Build Individual Classifiers

We now train three individual classifiers: Logistic Regression, Decision Tree, and K-Nearest Neighbors (KNN). Each model will be evaluated separately to check its performance.

# Create individual classifiers
log_clf = LogisticRegression()
dt_clf = DecisionTreeClassifier()
knn_clf = KNeighborsClassifier()

# Train and evaluate each classifier
models = [log_clf, dt_clf, knn_clf]
model_names = ['Logistic Regression', 'Decision Tree', 'KNN']

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy:.4f}")

Step 4: Build and Evaluate the Voting Classifier

Once the individual models are built, we combine them into a Voting Classifier. This classifier can use either hard voting, where each model casts a vote and the majority vote determines the final prediction, or soft voting, where the models' predicted class probabilities are averaged and the class with the highest average probability is chosen. We use soft voting here.

# Voting Classifier (soft voting)
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('dt', dt_clf), ('knn', knn_clf)], voting='soft')

# Train the voting classifier
voting_clf.fit(X_train, y_train)

# Predict using the voting classifier
y_pred_voting = voting_clf.predict(X_test)

# Show the accuracy of the Voting Classifier
voting_accuracy = accuracy_score(y_test, y_pred_voting)
print(f"Voting Classifier Accuracy: {voting_accuracy:.4f}")

Results:

Let’s compare the accuracy of each individual model with that of the combined Voting Classifier:

Model                  Accuracy
Logistic Regression    0.8864
Decision Tree          0.8744
K-Nearest Neighbors    0.8795
Voting Classifier      0.8896

Here the improvement in accuracy is modest, but if you experiment with different algorithms, their parameters, hyperparameter tuning, and other datasets, you may see a larger improvement in the results.

In this scenario, the Voting Classifier outperforms every individual model, demonstrating the benefit of combining classifiers.


Conclusion:

In this article, we explored the Voting Classifier and how it can be used to improve predictions for bank loan approvals. By combining Logistic Regression, Decision Tree, and K-Nearest Neighbors, the Voting Classifier aggregated the strengths of these models, resulting in better overall accuracy.

Voting classifiers are particularly useful in real-world scenarios where different algorithms excel at different parts of the data. They help create more robust predictions by balancing the strengths and weaknesses of individual models.


Future Work:

  1. Experiment with hard voting: it can improve accuracy in some cases (see the hard-voting snippet in Step 4).
  2. Add more classifiers: Try other models such as Random Forest or XGBoost to further boost accuracy.
  3. Hyperparameter tuning: Use techniques like GridSearchCV to optimize each model’s parameters and improve overall performance; a starting sketch follows.
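
As a starting point for item 3, a grid search over the KNN neighbor count might look like this (the parameter grid is illustrative, not tuned):

from sklearn.model_selection import GridSearchCV

# Search a small, illustrative grid of neighbor counts with 5-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [3, 5, 7, 9, 11]},
                    cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, f"CV accuracy: {grid.best_score_:.4f}")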
