Titanic Data Analysis Project: Who Survived the Titanic?

Introduction

In this project, we will analyze the famous Titanic dataset using Pandas, Matplotlib, and Seaborn. We will answer a few interesting questions step by step, providing code and insights along the way. By the end of this data analysis project, you’ll understand how to combine data cleaning, summary statistics, and visualization to derive meaningful insights.

Step 1: Setting Up Your Environment

Description: We’ll start by setting up our environment and loading the Titanic dataset.

Code:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')
df.head()

The first glimpse of the Titanic dataset shows us a variety of features like age, sex, class, and survival status, setting the stage for our analysis.

Step 2: Understanding Missing Data

Description: Identify and handle missing values in the dataset.

Code:

# Check for missing values
missing_data = df.isnull().sum()
missing_data

We notice that ‘age’, ‘deck’, ‘embarked’, and ‘embark_town’ have missing values. Understanding where data is missing helps us decide how to handle these gaps for accurate analysis.
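
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up toy frame (the values are hypothetical, not taken from the Titanic data) that ranks columns by the percentage of missing entries:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (all values are made up).
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, None],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() yields True/False per cell; the column mean of those flags
# is the fraction of missing entries, which we scale to a percentage.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

A column that is mostly empty is usually a candidate for dropping, while a column with only a few gaps is a candidate for imputation.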

Step 3: Handling Missing Values

Description: Impute missing values in the ‘age’ and ‘embarked’ columns and drop the columns we will not use.

Code:

# Fill missing ages with the median (numeric column)
df['age'] = df['age'].fillna(df['age'].median())
# Fill missing embarkation ports with the most frequent value (categorical column)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Drop 'deck' (mostly missing) and 'embark_town' (duplicates 'embarked')
df = df.drop(columns=['deck', 'embark_town'])

# Confirm that no missing values remain
missing_data = df.isnull().sum()
missing_data

By filling in the missing ‘age’ and ‘embarked’ values and removing the columns we cannot use, we clean our dataset, making it more reliable for analysis.
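
The choice of fill value depends on the column type: a median suits a numeric column, while a text column such as ‘embarked’ needs a category, typically the most frequent one. Here is a small sketch on a made-up toy frame (hypothetical values) showing both:

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up values).
toy = pd.DataFrame({
    "age": [20.0, 30.0, None, 40.0],
    "embarked": ["S", "S", None, "C"],
})

# Numeric column: the median is robust to outliers.
toy["age"] = toy["age"].fillna(toy["age"].median())

# Categorical column: mode() returns a Series, so take its first entry.
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(toy)
```

Filling a text column with a number would leave a value that matches none of the real port codes, so it would silently turn into a missing value again as soon as the column is mapped or grouped.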

Step 4: Analyzing Age Distribution

Description: Analyze the age distribution of passengers.

Code:

# Plot age distribution
plt.figure(figsize=(8, 4))
sns.histplot(df['age'], bins=30, kde=True)
plt.xlabel('Age')
plt.title('Age Distribution of Titanic Passengers')
plt.show()

Insight: The age distribution reveals that most passengers were young adults, with a small number of children and elderly passengers.
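
One caveat when reading this histogram: the median fill from Step 3 stacks every previously missing age onto a single value, which can create an artificial spike. The sketch below uses made-up ages (not the real data) to show how the count at the median grows after imputation:

```python
import pandas as pd

# Made-up ages with two gaps.
ages = pd.Series([22.0, 28.0, None, 35.0, None, 28.0, 40.0])

median_age = ages.median()          # median of the five known ages
filled = ages.fillna(median_age)

# Count how many entries sit exactly at the median before and after filling.
at_median_before = int((ages == median_age).sum())
at_median_after = int((filled == median_age).sum())
print(median_age, at_median_before, at_median_after)
```

If the spike is distracting, one option is to plot only the originally observed ages and keep the imputed column for later steps.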

Step 5: Gender Distribution

Description: Examine the gender distribution among passengers.

Code:

# Plot gender distribution
plt.figure(figsize=(8, 4))
gender_counts = df['sex'].value_counts()
gender_counts.plot(kind='bar')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Gender Distribution of Titanic Passengers')
plt.show()

Insight: The gender distribution shows a higher number of male passengers compared to female passengers.
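
To put numbers next to the bars, value_counts can also return proportions. A minimal sketch on a made-up series (hypothetical values):

```python
import pandas as pd

# Made-up gender labels for four passengers.
sex = pd.Series(["male", "male", "female", "male"])

counts = sex.value_counts()                 # absolute counts
shares = sex.value_counts(normalize=True)   # fractions that sum to 1
print(counts)
print(shares)
```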

Step 6: Class Distribution

Description: Analyze the distribution of passengers across different classes.

Code:

# Plot class distribution
plt.figure(figsize=(8, 4))
class_counts = df['class'].value_counts().sort_index()  # keep First, Second, Third order
class_counts.plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Count')
plt.title('Class Distribution of Titanic Passengers')
plt.show()

Insight: Most passengers traveled in third class, with fewer passengers in first and second classes.

Step 7: Survival Rate Analysis

Description: Calculate the overall survival rate of passengers.

Code:

# Calculate survival rate
survival_rate = df['survived'].mean()
print(f"Overall Survival Rate: {survival_rate:.2%}")

Overall Survival Rate: 38.38%

Insight: The overall survival rate (38.38%) provides a baseline for further analysis on what factors influenced survival.
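
Taking the mean works because ‘survived’ is coded as 0 or 1, so the mean equals the number of survivors divided by the number of passengers. The sketch below checks this on five made-up flags, then reproduces the figure above under the assumption that the dataset has 891 rows with 342 survivors:

```python
import pandas as pd

# Hypothetical 0/1 survival flags for five passengers.
survived = pd.Series([1, 0, 0, 1, 0])

rate_from_mean = survived.mean()
rate_from_counts = survived.sum() / len(survived)
print(f"{rate_from_mean:.2%}")

# Assumed totals for the full dataset: 342 survivors out of 891 passengers.
full_rate = f"{342 / 891:.2%}"
print(full_rate)
```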

Step 8: Survival Rate by Gender

Description: Compare survival rates between male and female passengers.

Code:

# Plot survival rate by gender
plt.figure(figsize=(8, 4))
sns.barplot(x='sex', y='survived', data=df)
plt.xlabel('Gender')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Gender')
plt.show()

Insight: Females had a significantly higher survival rate compared to males, reflecting the “women and children first” policy.
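
By default, sns.barplot draws the mean of the y variable for each category, so the bar heights correspond to a group-wise mean. A pandas sketch of the same calculation on made-up rows (hypothetical values):

```python
import pandas as pd

# Made-up passengers: sex and a 0/1 survival flag.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "survived": [0, 1, 0, 1, 1],
})

# Mean of the 0/1 flag within each group = survival rate per group.
rate_by_sex = toy.groupby("sex")["survived"].mean()
print(rate_by_sex)
```

Printing the table alongside the chart gives exact rates rather than values read off an axis.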

Step 9: Survival Rate by Class

Description: Analyze survival rates across different passenger classes.

Code:

# Plot survival rate by class
plt.figure(figsize=(8, 4))
sns.barplot(x='class', y='survived', data=df)
plt.xlabel('Class')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Class')
plt.show()

Insight: First-class passengers had the highest survival rate, followed by second-class and third-class passengers.
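
Gender and class can also be examined together. The following sketch builds a two-way table of survival rates with pivot_table; the rows are made up for illustration and are not results from the Titanic data:

```python
import pandas as pd

# Made-up passengers.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "class": ["First", "Third", "First", "Third", "Third", "Third"],
    "survived": [1, 1, 1, 0, 0, 0],
})

# Rows: sex, columns: class, cells: mean of the 0/1 survival flag.
table = toy.pivot_table(index="sex", columns="class",
                        values="survived", aggfunc="mean")
print(table)
```

A table like this helps separate the two effects, for example by comparing women and men within the same class.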

Step 10: Survival Rate by Age Group

Description: Group passengers by age and analyze their survival rates.

Code:

# Create age groups and plot survival rates
plt.figure(figsize=(8, 4))
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 30, 50, 100], labels=['0-18', '19-30', '31-50', '51+'])
sns.barplot(x='age_group', y='survived', data=df)
plt.xlabel('Age Group')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Age Group')
plt.show()

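Insight: Children (0-18) had the highest survival rate, consistent with the priority given to children during evacuation. Compare the remaining groups with caution: every age imputed in Step 3 sits at the median, so all of those passengers land in a single age group.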
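
It helps to know how pd.cut treats the bin edges: by default each bin includes its right edge and excludes its left edge, so an age of exactly 18 falls into ‘0-18’ and an age of exactly 0 would fall outside every bin. A short sketch with made-up ages:

```python
import pandas as pd

bins = [0, 18, 30, 50, 100]
labels = ["0-18", "19-30", "31-50", "51+"]

# Made-up ages chosen to sit on or near the bin edges.
ages = pd.Series([0.42, 18, 18.5, 30, 51, 80])
groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())

# The left edge of the first bin is excluded, so 0 gets no group (NaN).
edge_case = pd.cut(pd.Series([0.0]), bins=bins, labels=labels)
print(edge_case.isnull().tolist())
```

Passing include_lowest=True to pd.cut would place a value of exactly 0 in the first bin.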

Step 11: Analyzing Fare Distribution

Description: Examine the distribution of fare paid by passengers.

Code:

# Plot fare distribution
plt.figure(figsize=(8, 4))
sns.histplot(df['fare'], bins=30, kde=True)
plt.xlabel('Fare')
plt.title('Fare Distribution of Titanic Passengers')
plt.show()

Insight: The fare distribution shows a wide range, with most passengers paying lower fares, likely reflecting third-class ticket prices.

Step 12: Correlation Analysis

Description: Explore correlations between numerical features. Note that correlation can only be calculated between numerical values, so we need to convert categorical variables into numerical ones or exclude them from the correlation matrix.

Code:

# Convert categorical variables to numerical ones
df_encoded = df.copy()
df_encoded['sex'] = df_encoded['sex'].map({'male': 0, 'female': 1})
df_encoded['embarked'] = df_encoded['embarked'].map({'C': 0, 'Q': 1, 'S': 2})
# 'class' is a categorical column, so cast the mapped result to int to keep it numeric
df_encoded['class'] = df_encoded['class'].map({'First': 1, 'Second': 2, 'Third': 3}).astype(int)
df_encoded['who'] = df_encoded['who'].map({'man': 0, 'woman': 1, 'child': 2})
df_encoded['adult_male'] = df_encoded['adult_male'].astype(int)
df_encoded['alone'] = df_encoded['alone'].astype(int)

# Keep only numeric columns (any integer or float width)
df_numeric = df_encoded.select_dtypes(include='number')

# Plot heatmap of correlations
plt.figure(figsize=(10, 8))  # larger canvas so the annotations stay readable
sns.heatmap(df_numeric.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Insight: The correlation matrix reveals relationships between features, such as a strong positive correlation between ‘survived’ and ‘sex’ (female is encoded as 1). The ‘who’ column largely duplicates ‘sex’, so you can drop one of them if required.
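
Mapping a category without a natural order, such as ‘embarked’, to the integers 0, 1, 2 imposes an arbitrary ranking that affects the correlation coefficient. One alternative, sketched here on made-up rows, is one-hot encoding with pd.get_dummies, which gives each port its own 0/1 column:

```python
import pandas as pd

# Made-up passengers: port code and a 0/1 survival flag.
toy = pd.DataFrame({
    "embarked": ["S", "C", "Q", "S"],
    "survived": [0, 1, 1, 0],
})

# One 0/1 indicator column per port, named embarked_C, embarked_Q, embarked_S.
dummies = pd.get_dummies(toy["embarked"], prefix="embarked", dtype=int)
encoded = pd.concat([toy[["survived"]], dummies], axis=1)

corr = encoded.corr()
print(corr["survived"])
```

Each indicator column then has a direct interpretation: its correlation with ‘survived’ describes that port against all other ports, independent of how the codes happen to be numbered.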

Step 13: Passenger Class and Embarked Port

Description: Analyze the relationship between passenger class and port of embarkation.

Code:

# Plot class distribution by embarked port
plt.figure(figsize=(8, 4))
sns.countplot(x='embarked', hue='class', data=df)
plt.xlabel('Port of Embarkation')
plt.ylabel('Count')
plt.title('Passenger Class by Port of Embarkation')
plt.show()

Insight: Most passengers from Southampton were in third class, while Cherbourg had a higher proportion of first-class passengers.
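
Because the ports differ greatly in passenger numbers, proportions are easier to compare than raw counts. The sketch below uses pd.crosstab with normalize='index' on made-up rows so that each port's row sums to 1:

```python
import pandas as pd

# Made-up passengers: port code and class.
toy = pd.DataFrame({
    "embarked": ["S", "S", "S", "S", "C", "C"],
    "class": ["Third", "Third", "Third", "First", "First", "Second"],
})

# normalize="index" divides each row by its total, giving class shares per port.
shares = pd.crosstab(toy["embarked"], toy["class"], normalize="index")
print(shares)
```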

Step 14: Age and Fare Relationship

Description: Investigate the relationship between age and fare.

Code:

# Scatter plot of age vs. fare
plt.figure(figsize=(8, 4))
sns.scatterplot(x='age', y='fare', data=df)
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs. Fare')
plt.grid()
plt.show()

Insight: There is no clear relationship between age and fare, suggesting ticket prices were not age-dependent.
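
A scatter plot can be backed up with a number. Series.corr returns the Pearson correlation coefficient by default; a value near 0 indicates little linear relationship. A minimal sketch on made-up values (not the Titanic data):

```python
import pandas as pd

# Made-up ages and fares.
toy = pd.DataFrame({
    "age": [10, 20, 30, 40],
    "fare": [8.0, 30.0, 7.0, 31.0],
})

# Pearson correlation between the two columns, always in [-1, 1].
r = toy["age"].corr(toy["fare"])
print(round(r, 3))
```

Pearson correlation only captures linear association, so the plot remains useful for spotting outliers and non-linear patterns.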

Step 15: Survival by Family Size

Description: Analyze the impact of family size on survival rates.

Code:

# Create family size feature and plot survival rates
plt.figure(figsize=(8, 4))
df['family_size'] = df['sibsp'] + df['parch'] + 1
sns.barplot(x='family_size', y='survived', data=df)
plt.xlabel('Family Size')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Family Size')
plt.grid()
plt.show()

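Insight: Passengers traveling in small families (roughly two to four people) had higher survival rates than both solo travelers and members of large families, potentially due to easier evacuation logistics. Few passengers belong to the largest family sizes, so those bars carry wide uncertainty.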
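
Some family sizes contain only a handful of passengers, so it is worth checking the group sizes behind each bar. The sketch below, on made-up rows, reports the survival rate and the passenger count side by side:

```python
import pandas as pd

# Made-up passengers: siblings/spouses, parents/children, survival flag.
toy = pd.DataFrame({
    "sibsp": [0, 1, 1, 0, 4],
    "parch": [0, 0, 1, 0, 2],
    "survived": [0, 1, 1, 1, 0],
})

# Family size counts the passenger plus all relatives aboard.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# mean = survival rate, count = number of passengers behind that rate.
summary = toy.groupby("family_size")["survived"].agg(["mean", "count"])
print(summary)
```

A rate based on one or two passengers should be read as an anecdote rather than a trend.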

Conclusion

Through this step-by-step analysis, we’ve uncovered valuable insights about the Titanic passengers. From understanding demographic distributions to examining survival factors, each step added a piece to the puzzle. This project showcases the power of combining Pandas, Matplotlib, and Seaborn for data analysis.

Future Work

To extend this analysis, consider exploring: