Titanic Data Analysis Project: Who Survived the Titanic?
Introduction
In this project, we will go on an exciting journey to analyze the famous Titanic dataset using Pandas, Matplotlib, and Seaborn. We will answer a few interesting business questions step by step, providing code and insights along the way. By the end of this data analysis project, you’ll have a comprehensive understanding of how to combine data analysis techniques to derive meaningful insights.
Step 1: Setting Up Your Environment
Description: We’ll start by setting up our environment and loading the Titanic dataset.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Titanic dataset
df = sns.load_dataset('titanic')
df.head()
The first glimpse of the Titanic dataset shows us a variety of features like age, sex, class, and survival status, setting the stage for our analysis.
Step 2: Understanding Missing Data
Description: Identify and handle missing values in the dataset.
Code:
# Check for missing values
missing_data = df.isnull().sum()
missing_data
We notice that ‘age’, ‘deck’, ‘embarked’ and ‘embark_town’ have missing values. Understanding where data is missing helps us decide how to handle these gaps for accurate analysis.
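Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (`toy` is a stand-in, not the real dataset) to show one way of reporting the fraction of missing entries per column; the same expression should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() gives True/False per cell; the mean of booleans is the
# fraction of True values, i.e. the share of missing entries per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns that are mostly empty are candidates for dropping, while columns with only a few gaps are usually better filled.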
Step 3: Handling Missing Values
Description: Impute missing values in the ‘age’ and ‘embarked’ columns and drop columns with excessive missing values.
Code:
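# Fill missing ages with the median age
df['age'] = df['age'].fillna(df['age'].median())
# Fill missing embarked values with the most frequent port
# (the column holds port codes, so a mean does not apply)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
# Drop 'deck' (mostly missing) and 'embark_town' (same information as 'embarked')
df = df.drop(columns=['deck', 'embark_town'])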
# Check for missing values
missing_data = df.isnull().sum()
missing_data
By filling in the missing ‘age’ and ‘embarked’ values and removing columns with excessive missing data, we clean our dataset, making it more reliable for analysis.
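As a minimal sketch of the two imputation styles, the example below fills a numeric column with its median and a text column with its most frequent value. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical gap (made-up values)
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "embarked": ["S", "S", None, "C"],
})

# Numeric column: the median is less sensitive to outliers than the mean
toy["age"] = toy["age"].fillna(toy["age"].median())

# Categorical column: mode() returns a Series, so take its first entry
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(toy)
```

A mean or median only makes sense for numeric columns; for codes such as ports, the most frequent value is one common choice.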
Step 4: Analyzing Age Distribution
Description: Analyze the age distribution of passengers.
Code:
# Plot age distribution
plt.figure(figsize=(8, 4))
sns.histplot(df['age'], bins=30, kde=True)
plt.xlabel('Age')
plt.title('Age Distribution of Titanic Passengers')
plt.show()
Insight: The age distribution reveals that most passengers were young adults, with a small number of children and elderly passengers.
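When reading this histogram, keep in mind that filling gaps with a single value piles passengers onto that value and narrows the spread. The sketch below illustrates the effect on a made-up series (`age` here is a toy stand-in, not the dataset column).

```python
import numpy as np
import pandas as pd

# Toy ages with gaps (made-up values, not the real dataset)
age = pd.Series([2, 20, 28, 36, 70, np.nan, np.nan, np.nan])

median_age = age.median()          # computed from the known values only
filled = age.fillna(median_age)

std_before = age.std()             # NaN values are skipped by default
std_after = filled.std()
share_at_median = (filled == median_age).mean()

print(f"std before: {std_before:.1f}, after: {std_after:.1f}")
print(f"share of values equal to the median: {share_at_median:.0%}")
```

A tall bar at the median age in the plot is therefore partly an artifact of imputation rather than a feature of the passengers.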
Step 5: Gender Distribution
Description: Examine the gender distribution among passengers.
Code:
# Plot gender distribution
plt.figure(figsize=(8, 4))
gender_counts = df['sex'].value_counts()
gender_counts.plot(kind='bar')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Gender Distribution of Titanic Passengers')
plt.show()
Insight: The gender distribution shows a higher number of male passengers compared to female passengers.
Step 6: Class Distribution
Description: Analyze the distribution of passengers across different classes.
Code:
# Plot class distribution
plt.figure(figsize=(8, 4))
class_counts = df['class'].value_counts()
class_counts.plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Count')
plt.title('Class Distribution of Titanic Passengers')
plt.show()
Insight: Most passengers traveled in third class, with fewer passengers in first and second classes.
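One detail worth knowing: `value_counts()` sorts by frequency, so the bars appear in order of count rather than First, Second, Third. The sketch below, on a made-up series, shows how `sort_index()` restores the class order, assuming the column is a pandas categorical with its categories listed in that order.

```python
import pandas as pd

# Toy class column as an explicit categorical (made-up values)
cls = pd.Series(pd.Categorical(
    ["Third", "First", "Third", "Second", "Third", "First"],
    categories=["First", "Second", "Third"],
))

by_count = cls.value_counts()               # most frequent class first
by_class = cls.value_counts().sort_index()  # category order: First, Second, Third

print(by_count)
print(by_class)
```

Plotting `by_class` instead of `by_count` keeps the x-axis in a natural reading order.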
Step 7: Survival Rate Analysis
Description: Calculate the overall survival rate of passengers.
Code:
# Calculate survival rate
survival_rate = df['survived'].mean()
print(f"Overall Survival Rate: {survival_rate:.2%}")
Overall Survival Rate: 38.38%
Insight: The overall survival rate (38.38%) provides a baseline for further analysis on what factors influenced survival.
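The mean works as a rate here because ‘survived’ is coded as 0 or 1, so the mean equals survivors divided by passengers (342 of 891 in this dataset). A small check on made-up values:

```python
import pandas as pd

# Toy 0/1 outcome column (made-up values)
survived = pd.Series([1, 0, 0, 1, 0])

rate_from_mean = survived.mean()
rate_from_counts = survived.sum() / len(survived)

# Both routes give the same number: 2 survivors out of 5
print(f"{rate_from_mean:.2%}")
```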
Step 8: Survival Rate by Gender
Description: Compare survival rates between male and female passengers.
Code:
# Plot survival rate by gender
plt.figure(figsize=(8, 4))
sns.barplot(x='sex', y='survived', data=df)
plt.xlabel('Gender')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Gender')
plt.show()
Insight: Females had a significantly higher survival rate compared to males, reflecting the “women and children first” policy.
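`sns.barplot` draws the mean of the y variable for each category, with error bars showing a confidence interval by default. To see the numbers behind the bars, one option is a `groupby` that reports the rate and the group size together. The frame `toy` below is made up for illustration.

```python
import pandas as pd

# Toy frame (made-up values, not the real dataset)
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female"],
    "survived": [0, 0, 1, 1, 1],
})

# Named aggregation: survival rate and number of passengers per group
by_sex = toy.groupby("sex")["survived"].agg(rate="mean", n="count")
print(by_sex)
```

Reporting the count next to the rate makes it clear how many passengers each bar is based on.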
Step 9: Survival Rate by Class
Description: Analyze survival rates across different passenger classes.
Code:
# Plot survival rate by class
plt.figure(figsize=(8, 4))
sns.barplot(x='class', y='survived', data=df)
plt.xlabel('Class')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Class')
plt.show()
Insight: First-class passengers had the highest survival rate, followed by second-class and third-class passengers.
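Class and gender can also be looked at together. The sketch below builds a two-way table of survival rates with `pivot_table` on a made-up frame; it uses the integer `pclass` column, which the seaborn copy of the dataset carries alongside ‘class’.

```python
import pandas as pd

# Toy frame (made-up values, not the real dataset)
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": [1, 3, 1, 3, 3, 3],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Rows: sex, columns: class, cells: mean of the 0/1 survived flag
rates = toy.pivot_table(index="sex", columns="pclass",
                        values="survived", aggfunc="mean")
print(rates)
```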
Step 10: Survival Rate by Age Group
Description: Group passengers by age and analyze their survival rates.
Code:
# Create age groups and plot survival rates
plt.figure(figsize=(8, 4))
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 30, 50, 100], labels=['0-18', '19-30', '31-50', '51+'])
sns.barplot(x='age_group', y='survived', data=df)
plt.xlabel('Age Group')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Age Group')
plt.show()
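Insight: Children and teenagers (0-18) had the highest survival rate, consistent with the priority given to children during evacuation. Note that the ages filled with the median in Step 3 all fall into the 19-30 group, so that bar partly reflects passengers whose true age is unknown.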
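It helps to know exactly where `pd.cut` places boundary values. By default each bin includes its right edge and excludes its left edge, so the bins above are (0, 18], (18, 30], (30, 50] and (50, 100]. The sketch below checks this on a few made-up ages.

```python
import pandas as pd

# Made-up ages chosen to sit on or near the bin edges
ages = pd.Series([0.5, 18, 18.5, 30, 31, 80])
groups = pd.cut(ages, bins=[0, 18, 30, 50, 100],
                labels=['0-18', '19-30', '31-50', '51+'])
print(groups.tolist())

# An age of exactly 0 falls outside (0, 18] and becomes NaN
# unless include_lowest=True is passed
zero_default = pd.cut(pd.Series([0.0]), bins=[0, 18, 30, 50, 100])
zero_included = pd.cut(pd.Series([0.0]), bins=[0, 18, 30, 50, 100],
                       include_lowest=True)
```

So an age of 18.5 lands in the group labelled ‘19-30’, and a value of exactly 0 would be dropped unless `include_lowest=True` is set.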
Step 11: Analyzing Fare Distribution
Description: Examine the distribution of fare paid by passengers.
Code:
# Plot fare distribution
plt.figure(figsize=(8, 4))
sns.histplot(df['fare'], bins=30, kde=True)
plt.xlabel('Fare')
plt.title('Fare Distribution of Titanic Passengers')
plt.show()
Insight: The fare distribution shows a wide range, with most passengers paying lower fares, likely reflecting third-class ticket prices.
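A long right tail like this can be summarized numerically: the mean sits well above the median, and a log transform pulls the tail in. The sketch below uses a handful of made-up fares to illustrate; `np.log1p` (log of 1 + x) is used so that a fare of 0 would not cause problems.

```python
import numpy as np
import pandas as pd

# Made-up fares: mostly cheap tickets plus one very expensive one
fare = pd.Series([7.25, 7.90, 8.05, 13.00, 26.00, 512.33])

mean_fare = fare.mean()
median_fare = fare.median()

raw_skew = fare.skew()              # positive for a right-skewed column
log_skew = np.log1p(fare).skew()    # skewness after the log transform

print(f"mean {mean_fare:.2f} vs median {median_fare:.2f}")
print(f"skew raw {raw_skew:.2f} vs log {log_skew:.2f}")
```

Plotting `np.log1p(df['fare'])` instead of the raw fare is one option when the histogram is dominated by a few large values.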
Step 12: Correlation Analysis
Description: Explore correlations between numerical features. Note that correlation can only be calculated between numerical values, so we need to convert categorical variables into numerical ones or exclude them from the correlation matrix.
Code:
# Convert categorical variables to numerical ones
df_encoded = df.copy()
df_encoded['sex'] = df_encoded['sex'].map({'male': 0, 'female': 1})
df_encoded['embarked'] = df_encoded['embarked'].map({'C': 0, 'Q': 1, 'S': 2})
# Mapping a categorical column returns a categorical, so cast the result to int
df_encoded['class'] = df_encoded['class'].map({'First': 1, 'Second': 2, 'Third': 3}).astype(int)
df_encoded['who'] = df_encoded['who'].map({'man': 0, 'woman': 1, 'child': 2})
df_encoded['adult_male'] = df_encoded['adult_male'].astype(int)
df_encoded['alone'] = df_encoded['alone'].astype(int)
# Keep only numeric columns (any integer or float dtype)
df_numeric = df_encoded.select_dtypes(include='number')
# Plot heatmap of correlations
plt.figure(figsize=(8, 4))
sns.heatmap(df_numeric.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Insight: The correlation matrix reveals relationships between features, such as a strong positive correlation between ‘survived’ and ‘sex’ (encoded so that female = 1). The ‘who’ column largely reflects gender as well, so you can drop one of the two if required.
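One caveat on the encoding: integer codes such as C = 0, Q = 1, S = 2 impose an order that the ports do not have, so the resulting correlation is hard to interpret. A possible alternative, sketched here on made-up data, is one-hot encoding with `pd.get_dummies`, which gives each port its own 0/1 column.

```python
import pandas as pd

# Toy frame (made-up values, not the real dataset)
toy = pd.DataFrame({
    "embarked": ["C", "C", "S", "S", "Q", "S"],
    "survived": [1, 1, 0, 0, 1, 0],
})

# One 0/1 indicator column per port instead of a single ordered code
encoded = pd.get_dummies(toy, columns=["embarked"], dtype=int)

# Correlation of each port indicator with the survived flag
corr_with_survived = encoded.corr()["survived"].drop("survived")
print(corr_with_survived)
```

Each coefficient then answers a clear question: is embarking at this port, versus any other, associated with survival?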
Step 13: Passenger Class and Embarked Port
Description: Analyze the relationship between passenger class and port of embarkation.
Code:
# Plot class distribution by embarked port
plt.figure(figsize=(8, 4))
sns.countplot(x='embarked', hue='class', data=df)
plt.xlabel('Port of Embarkation')
plt.ylabel('Count')
plt.title('Passenger Class by Port of Embarkation')
plt.show()
Insight: Most passengers from Southampton were in third class, while Cherbourg had a higher proportion of first-class passengers.
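Because the ports differ a lot in passenger numbers, proportions are easier to compare than raw counts. The sketch below uses `pd.crosstab` with `normalize='index'` on a made-up frame so that each port's row sums to 1.

```python
import pandas as pd

# Toy frame (made-up values, not the real dataset)
toy = pd.DataFrame({
    "embarked": ["S", "S", "S", "C", "C", "Q"],
    "pclass": [3, 3, 1, 1, 1, 3],
})

# Share of each class within each port (rows sum to 1)
shares = pd.crosstab(toy["embarked"], toy["pclass"], normalize="index")
print(shares)
```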
Step 14: Age and Fare Relationship
Description: Investigate the relationship between age and fare.
Code:
# Scatter plot of age vs. fare
plt.figure(figsize=(8, 4))
sns.scatterplot(x='age', y='fare', data=df)
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs. Fare')
plt.grid()
plt.show()
Insight: There is no clear relationship between age and fare, suggesting ticket prices were not age-dependent.
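A scatter plot can be backed up with a number. The sketch below computes the Pearson correlation and a rank-based (Spearman-style) correlation on made-up values; the rank version is obtained here by ranking both columns first and then applying the default Pearson method.

```python
import pandas as pd

# Toy frame built so that fare does not trend with age (made-up values)
toy = pd.DataFrame({
    "age": [10, 20, 30, 40, 50],
    "fare": [20, 5, 50, 5, 20],
})

# Pearson: strength of a linear relationship
pearson = toy["age"].corr(toy["fare"])

# Rank correlation: Pearson on the ranks, which captures any monotonic trend
spearman = toy["age"].rank().corr(toy["fare"].rank())

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

Values near 0 for both measures would support the reading that fare does not rise or fall systematically with age.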
Step 15: Survival by Family Size
Description: Analyze the impact of family size on survival rates.
Code:
# Create family size feature and plot survival rates
plt.figure(figsize=(8, 4))
df['family_size'] = df['sibsp'] + df['parch'] + 1
sns.barplot(x='family_size', y='survived', data=df)
plt.xlabel('Family Size')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Family Size')
plt.grid()
plt.show()
Insight: Passengers in small families (2 to 4 members) had higher survival rates than both solo travelers and large families, potentially due to easier evacuation logistics for small groups.
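The bars for the largest family sizes rest on very few passengers, so one option is to group sizes into a few buckets before comparing rates. The cut points and labels below are an assumption chosen for illustration, and `toy` is a made-up frame.

```python
import pandas as pd

# Toy frame (made-up values, not the real dataset)
toy = pd.DataFrame({
    "sibsp": [0, 1, 1, 4, 0, 5],
    "parch": [0, 0, 2, 2, 0, 2],
    "survived": [0, 1, 1, 0, 1, 0],
})

# Family size counts the passenger plus siblings/spouses and parents/children
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# Hypothetical buckets: 1 = solo, 2-4 = small, 5 and up = large
toy["family_group"] = pd.cut(toy["family_size"], bins=[0, 1, 4, 20],
                             labels=["solo", "small (2-4)", "large (5+)"])

rates = toy.groupby("family_group", observed=True)["survived"].mean()
print(rates)
```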
Conclusion
Through this step-by-step analysis, we’ve uncovered valuable insights about the Titanic passengers. From understanding demographic distributions to examining survival factors, each step added a piece to the puzzle. This project showcases the power of combining Pandas, Matplotlib, and Seaborn for data analysis.
Future Work
To extend this analysis, consider exploring:
- More detailed interactions between features.
- Predictive modeling to estimate survival probabilities.
- Comparison with other historical datasets to uncover broader patterns.
By continuing to explore and analyze data, we can gain deeper insights and make more informed decisions.