Data Cleaning and Preprocessing with Pandas
Welcome to the fourth tutorial in our series on data analysis with Python! In this article, we’ll explore data cleaning and preprocessing with Pandas. Cleaning and preprocessing is a crucial step in any data analysis project, because it makes the data accurate, consistent, and ready for analysis. We’ll use real-world business examples to demonstrate these concepts.
Importing Pandas
Before we start, let’s import the Pandas library:
import pandas as pd
Example 1: Cleaning Sales Data
Imagine you are a data analyst at a retail company. You have a dataset of monthly sales that contains missing values, duplicates, and incorrect data types. Let’s clean and preprocess this dataset.
Step 1: Loading the Data
First, let’s create a DataFrame to represent the sales data:
data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'Sales': [250, 300, None, 400, 500, 350, 420, None, 270, 310, 450, 390]
}
sales_df = pd.DataFrame(data)
print(sales_df)
Step 2: Handling Missing Values
Missing values can skew your analysis results. Let’s handle the missing values in the sales data.
1. Identifying Missing Values:
missing_values = sales_df.isnull().sum()
print("Missing values in each column:\n", missing_values)
2. Filling Missing Values:
We can fill missing values with the mean, median, or a fixed value. Here, we’ll use the median:
median_sales = sales_df['Sales'].median()
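# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
sales_df['Sales'] = sales_df['Sales'].fillna(median_sales)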
print(sales_df)
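Which fill value fits best depends on the data. As a rough sketch, the snippet below compares the three options on a small hypothetical series (the values and the `fills` name are illustrative and not part of the sales dataset):

```python
import pandas as pd

# Hypothetical series with one missing value and one unusually large value
s = pd.Series([100.0, 200.0, None, 300.0, 1000.0])

fills = {
    'mean': s.fillna(s.mean()),      # pulled upward by the large value
    'median': s.fillna(s.median()),  # more robust to extreme values
    'fixed': s.fillna(0),            # a constant chosen by the analyst
}

for name, filled in fills.items():
    print(name, filled.tolist())
```

Here the mean fill is 400 while the median fill is 250, which illustrates why the median is often the safer default when a column contains extreme values.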
Step 3: Removing Duplicates
Duplicates can distort your analysis. Let’s add some duplicate rows and remove them.
# Adding duplicate rows for demonstration
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
sales_df = pd.concat([sales_df, sales_df.iloc[[0, 1]]], ignore_index=True)
print("Data with duplicates:\n", sales_df)
# Removing duplicates
sales_df.drop_duplicates(inplace=True)
print("Data after removing duplicates:\n", sales_df)
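By default, `drop_duplicates` only removes rows that match in every column. In real data, duplicates often differ in one field, so it may help to compare on a key column instead. The following is a minimal sketch using hypothetical data, with the `subset` and `keep` parameters of `drop_duplicates`:

```python
import pandas as pd

# Hypothetical data: 'Jan' appears twice with different sales figures
df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Jan'],
    'Sales': [250.0, 300.0, 260.0],
})

# Full-row comparison finds no duplicates because the Sales values differ
print(df.duplicated().sum())

# Compare on 'Month' only and keep the last occurrence of each month
deduped = df.drop_duplicates(subset=['Month'], keep='last').reset_index(drop=True)
print(deduped)
```

Whether to keep the first or the last occurrence is a business decision; `keep='last'` here simply assumes that later rows are corrections.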
Step 4: Correcting Data Types
Ensuring that each column has the correct data type is essential for accurate analysis. Let’s check and correct data types if needed.
print("Data types before correction:\n", sales_df.dtypes)
# If necessary, convert data types
# For example, if 'Sales' was read as an object, we could convert it to float
sales_df['Sales'] = sales_df['Sales'].astype(float)
print("Data types after correction:\n", sales_df.dtypes)
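In this example 'Sales' is already numeric, so `astype(float)` changes nothing. When a column really has been read as text and contains entries that cannot be parsed, `astype(float)` raises an error. One possible alternative, sketched here on a hypothetical series, is `pd.to_numeric` with `errors='coerce'`:

```python
import pandas as pd

# Hypothetical column read as strings, with one unparseable entry
raw = pd.Series(['250', '300', 'n/a', '400'])

# errors='coerce' turns values that cannot be parsed into NaN
sales = pd.to_numeric(raw, errors='coerce')
print(sales.dtype)
print(sales.isna().sum())
```

The coerced NaN values can then be handled with the same missing-value techniques used in Step 2.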
Example 2: Preprocessing Customer Data
Imagine you are working with customer data at an e-commerce company. The dataset contains missing values, inconsistent formatting, and outliers. Let’s clean and preprocess this data.
Step 1: Loading the Data
First, let’s create a DataFrame to represent the customer data:
data = {
    'CustomerID': [1, 2, 3, 4, 5, 6],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
    'Age': [25, None, 35, 40, 29, -5],
    'Email': ['alice@example.com', 'bob@example.com', None, 'david@example', 'eva@example.com', 'frank@example.com']
}
customer_df = pd.DataFrame(data)
print(customer_df)
Step 2: Handling Missing Values
1. Identifying Missing Values:
missing_values = customer_df.isnull().sum()
print("Missing values in each column:\n", missing_values)
2. Filling Missing Values:
We’ll fill the missing age values with the median age and remove rows with missing email addresses:
median_age = customer_df['Age'].median()
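# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
customer_df['Age'] = customer_df['Age'].fillna(median_age)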
customer_df.dropna(subset=['Email'], inplace=True)
print(customer_df)
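One caveat: the median above is computed over every non-missing age, including the invalid value -5. A possible variation, sketched on a standalone copy of the same Age values, masks impossible ages first so they do not influence the fill value. Note that this variation also fills the invalid age instead of dropping its row, which is a different choice from the one made in Step 4.

```python
import pandas as pd

ages = pd.Series([25, None, 35, 40, 29, -5], name='Age')

# Treat negative ages as missing before computing the median
valid_ages = ages.where(ages >= 0)
median_age = valid_ages.median()

filled = valid_ages.fillna(median_age)
print(median_age)
print(filled.tolist())
```

With the negative value excluded, the median comes from the four plausible ages only.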
Step 3: Correcting Inconsistent Formatting
1. Correcting Email Addresses:
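We’ll flag malformed addresses by checking for the “@” symbol and for a dot in the domain part, since a check for “@” alone would let 'david@example' through:
def correct_email(email):
    # Require an "@" plus a dot in the domain part of the address
    _, sep, domain = email.partition('@')
    if not sep or '.' not in domain:
        return None
    return email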
customer_df['Email'] = customer_df['Email'].apply(correct_email)
customer_df.dropna(subset=['Email'], inplace=True)
print(customer_df)
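For larger datasets, the same kind of check can be written without `apply` by using the vectorized string methods. The sketch below uses `Series.str.match` with a deliberately simplified pattern; the pattern is an assumption for illustration and is far from a complete email validator.

```python
import pandas as pd

emails = pd.Series(['alice@example.com', 'david@example', 'no-at-sign.com'])

# Simplified, assumed pattern: something@something.something, no spaces
pattern = r'^[^@\s]+@[^@\s]+\.[^@\s]+$'
is_valid = emails.str.match(pattern)
print(emails[is_valid].tolist())
```

Only the first address passes, because the other two lack either a dot in the domain or the “@” symbol.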
Step 4: Handling Outliers
1. Identifying Outliers:
Outliers and invalid values can significantly affect the results of your analysis. A negative age is clearly invalid, so let’s find the rows that contain one:
outliers = customer_df[customer_df['Age'] < 0]
print("Outliers:\n", outliers)
2. Removing Outliers:
We’ll remove the row with the negative age value:
customer_df = customer_df[customer_df['Age'] >= 0]
print(customer_df)
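A negative age is easy to spot with a fixed rule. For values that are merely unusual, a common heuristic is the interquartile range (IQR) rule. The following sketch applies it to a hypothetical list of ages; the 1.5 multiplier is a convention, not something the customer dataset prescribes.

```python
import pandas as pd

# Hypothetical ages with one implausibly large value
ages = pd.Series([25, 29, 32, 35, 40, 120])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

Flagged values still deserve a manual look before removal, since an extreme value can be a genuine observation.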
Example 3: Cleaning Product Data
Imagine you are an analyst at a manufacturing company, and you have data on various products, including their prices and availability status. Let’s clean and preprocess this data.
Step 1: Loading the Data
First, let’s create a DataFrame to represent the product data:
data = {
    'ProductID': [101, 102, 103, 104, 105],
    'ProductName': ['Laptop', 'Tablet', 'Smartphone', 'Monitor', 'Keyboard'],
    'Price': ['$1200', '$450', '$800', '400', '70'],
    'InStock': ['Yes', 'No', 'Yes', 'No', 'Yes']
}
product_df = pd.DataFrame(data)
print(product_df)
Step 2: Cleaning Price Column
The Price column contains values with inconsistent formatting. Let’s clean this column.
1. Removing Currency Symbols:
# Use a raw string so that \$ is passed to the regex engine unchanged
product_df['Price'] = product_df['Price'].replace({r'\$': ''}, regex=True).astype(float)
print(product_df)
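Real price columns often carry more than a currency symbol. As a hedged sketch, the snippet below handles thousands separators, stray whitespace, and an unparseable entry on a hypothetical series (the sample values are illustrative only):

```python
import pandas as pd

# Hypothetical prices with currency symbols, commas, and stray spaces
prices = pd.Series(['$1,200', ' $450 ', '800', 'unknown'])

cleaned = (
    prices.str.strip()
          .str.replace(r'[$,]', '', regex=True)
)
# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(cleaned, errors='coerce')
print(numeric.tolist())
```

Using `pd.to_numeric` with `errors='coerce'` keeps the conversion from failing on a single bad row, at the cost of introducing NaN values that must then be handled.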
Step 3: Converting Boolean Columns
The InStock column contains string values (‘Yes’/‘No’). Let’s convert them to boolean values:
product_df['InStock'] = product_df['InStock'].map({'Yes': True, 'No': False})
print(product_df)
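Keep in mind that `map` returns NaN for any value that is missing from the mapping. The sketch below, on a hypothetical column with inconsistent labels, normalizes the text first and then lists whatever still failed to map:

```python
import pandas as pd

# Hypothetical column with inconsistent casing and an unexpected label
in_stock = pd.Series(['Yes', 'no', ' YES ', 'Maybe'])

# Normalize text before mapping so 'no' and ' YES ' are recognized
normalized = in_stock.str.strip().str.lower()
mapped = normalized.map({'yes': True, 'no': False})

# Values outside the mapping come back as NaN and need a decision
unmapped = in_stock[mapped.isna()]
print(unmapped.tolist())
```

Inspecting the unmapped values before filling or dropping them helps catch labels the mapping did not anticipate.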
Conclusion
In this tutorial, we’ve demonstrated how to clean and preprocess data using Pandas with real-world business examples. We’ve handled missing values, removed duplicates, corrected data types, fixed inconsistent formatting, and dealt with outliers. These steps are essential for ensuring that your data is accurate and ready for analysis.
In the next tutorial, we’ll dive into data visualization with Matplotlib, where you’ll learn how to create basic plots to visualize your data effectively. Stay tuned and keep exploring!