How to Split the Datasets: Vertically and Horizontally

In data science, we often work with large datasets, and it is sometimes necessary to divide or split them into multiple parts: to analyze a subset of the data more closely, or, in machine learning, to train models and evaluate their performance on separate data.

To make machine learning or deep learning models more accurate and reliable, it’s crucial to split the dataset into different parts. This tutorial will show why we divide the dataset, where it’s required, and how to do it step by step using a dataset on customer churn.

Why Divide the Dataset?

  1. Training and Testing: We split the dataset into training and testing sets to evaluate how well our model performs on unseen data. The training set is used to build the model, while the testing set is used to check its accuracy.
  2. Vertical Splitting: Sometimes, we may want to split the dataset vertically to separate different types of information. For example, we might want to separate customer information from transaction data. Additionally, we might need to divide the dataset into input features (independent variables) and output features (dependent variables).

Where is it Required?

  • Model Evaluation: Splitting data into training and testing sets helps in evaluating the performance of a model.
  • Data Preprocessing: Vertical splitting helps in focusing on specific features or columns that are more relevant for analysis and preprocessing tasks.
  • Feature Selection: It allows us to manage and analyze specific parts of the data more efficiently.

Required Libraries

To perform these tasks, we need to use some specific libraries in Python. Here’s how you can install them:

pip install pandas scikit-learn
  • pandas: Used for data manipulation and analysis.
  • scikit-learn: Provides tools for machine learning and statistical modeling, including data splitting.

Step-by-Step Process

Let’s start with a dataset on customer churn; we will load it directly from a URL in the code below.

1. Import Libraries and Load Dataset

First, we need to import the necessary libraries and load the dataset.

import pandas as pd

# Load dataset
url = "https://raw.githubusercontent.com/goradbj1/dataairevolution/main/datasets/customer_churn_dataset.csv"
df = pd.read_csv(url)

# Display the first few rows of the dataset
df.head()

2. Vertical Splitting

Let’s split the dataset into two vertical sections; you can adjust the column boundaries to suit your requirements.

# Assuming the first few columns are customer information
df_first = df.iloc[:, :4] # First 4 columns
df_second = df.iloc[:, 4:] # All columns after the first 4

# Display the first few rows of each split
print("df_first:")
print(df_first.head())
print("\ndf_second:")
print(df_second.head())

As a second example, let’s split the dataset into two parts, df_input and df_output: df_input (X) holds all columns except the last one (Churn), and df_output (y) holds only the last column (Churn).

# Separate the features and the target variable
X = df.drop('Churn', axis=1) # Input features (all columns except the target)
y = df['Churn'] # Output feature (target variable)

# Display the first few rows of input features and the target variable
print("\nInput Features:")
print(X.head())
print("\nOutput Feature:")
print(y.head())

3. Splitting into Train and Test

First, check the shape of the dataset:

df.shape            # this will print (100000, 7), i.e. 100000 rows and 7 columns

Now, we will split the dataset into training and testing sets. Typically, 70–80% of the data is used for training and the rest for testing; you can change the ratio to suit your requirements.

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Display the shapes of the splits
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Conclusion

Dividing the dataset is an essential step in building reliable machine learning models and for effective data preprocessing. It ensures that our model is trained on one part of the data and tested on another, giving us an unbiased evaluation of its performance. Vertical splitting helps in managing and analyzing specific parts of the data more efficiently.

Future Enhancements

  1. Stratified Splitting: Ensure that the training and testing sets have the same proportion of target classes.
  2. Cross-Validation: Use cross-validation techniques to get a more robust estimate of model performance.
  3. Feature Engineering: After splitting, focus on creating new features that can improve model accuracy.
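The first two enhancements above can be sketched with scikit-learn directly. The snippet below uses a small synthetic frame standing in for the churn data (the column names and values are illustrative, not the real dataset): `stratify=y` preserves the class proportions in both splits, and `cross_val_score` evaluates a model over several train/test partitions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Small synthetic frame standing in for the churn data (illustrative only)
df = pd.DataFrame({
    "Age":    [25, 32, 47, 51, 38, 29, 44, 36, 52, 41, 27, 33],
    "Tenure": [1, 3, 8, 10, 5, 2, 7, 4, 9, 6, 1, 3],
    "Churn":  [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})
X = df.drop("Churn", axis=1)
y = df["Churn"]

# Stratified split: train and test keep (approximately) the same churn ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
print("Train churn rate:", y_train.mean())
print("Test churn rate:", y_test.mean())

# Cross-validation: accuracy estimated over 3 different partitions
scores = cross_val_score(LogisticRegression(), X, y, cv=3)
print("CV accuracy per fold:", scores)
```

With real data you would pass your full X and y; the mechanics are identical.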

Additional Details

  • Data Preprocessing: Before splitting, always preprocess the data (e.g., handling missing values, encoding categorical variables).
  • Random State: Use a random state for reproducibility, ensuring the same split every time you run the code.
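As a rough sketch of the preprocessing point above, here is a minimal example of handling missing values and encoding a categorical column with pandas before any splitting (the columns here are made up for illustration, not the actual churn schema):

```python
import pandas as pd

# Toy frame with a missing value and a categorical column (illustrative schema)
df = pd.DataFrame({
    "Tenure":   [1.0, None, 8.0, 5.0],
    "Contract": ["Monthly", "Yearly", "Monthly", "Yearly"],
    "Churn":    [1, 0, 0, 1],
})

# Fill missing numeric values with the column median
df["Tenure"] = df["Tenure"].fillna(df["Tenure"].median())

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["Contract"], drop_first=True)

print(df)
```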

There are also other ways and libraries that can do this work, and you are encouraged to explore them. By following these steps, you can effectively split your dataset both vertically and into training and testing sets, paving the way for accurate, reliable machine learning models and efficient data preprocessing.
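One such alternative is a simple random split using pandas alone, without scikit-learn: sample a fraction of the rows for training and drop those rows to obtain the test set.

```python
import pandas as pd

# Toy frame (any DataFrame works the same way)
df = pd.DataFrame({"a": range(10), "Churn": [0, 1] * 5})

# Sample 70% of rows for training; the remaining rows form the test set
train = df.sample(frac=0.7, random_state=42)
test = df.drop(train.index)

print(train.shape, test.shape)
```

This approach does not stratify by class, so for imbalanced targets the scikit-learn route with `stratify=y` is usually safer.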