Mastering Multiple Linear Regression: Car Price Prediction Project
Multiple Linear Regression is an extension of simple linear regression. It is used when we want to predict the value of a dependent variable based on the values of two or more independent variables.
Definition: Multiple Linear Regression is a supervised learning algorithm used for predicting a continuous output variable based on multiple input variables.
In this tutorial, we’ll walk through the steps of creating a multiple linear regression model using a practical example.
Example: Predicting Car Prices Based on Age, Kilometers Driven, and Engine Size
Problem Statement: We want to predict the price of a car based on its age, kilometers driven, and engine size.
Sample Dataset:
Age (years) | Kilometers Driven (km) | Engine Size (cc) | Price (INR) |
---|---|---|---|
3 | 40000 | 1500 | 500000 |
5 | 60000 | 1800 | 400000 |
2 | 20000 | 1300 | 550000 |
8 | 80000 | 2000 | 300000 |
1 | 10000 | 1200 | 600000 |
Mathematical Equation:
The multiple linear regression equation is:
Price = β0 + β1 × Age + β2 × Kilometers Driven + β3 × Engine Size
Where:
- β0 is the intercept.
- β1, β2, β3 are the coefficients for the independent variables.
Example Calculation:
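Assume β0 = 700000, β1 = −10000, β2 = −0.05, and β3 = 200. These values are illustrative only; they are not fitted to the sample table above.
For a car that is 4 years old, has been driven 50000 km, and has a 1600 cc engine:
Price = 700000 + (−10000 × 4) + (−0.05 × 50000) + (200 × 1600)
Price = 700000 − 40000 − 2500 + 320000
Price = 977500 INR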
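The same calculation can be checked in code. The following is a minimal sketch that plugs the assumed coefficients into the regression equation; the variable names are illustrative and nothing here is fitted to data.

```python
import numpy as np

# Assumed coefficients from the example above (not fitted to any data)
intercept = 700000                       # β0
coefs = np.array([-10000, -0.05, 200])   # β1 (Age), β2 (Kilometers Driven), β3 (Engine Size)

# One car: 4 years old, 50000 km driven, 1600 cc engine
car = np.array([4, 50000, 1600])

# Price = β0 + β1 × Age + β2 × Kilometers Driven + β3 × Engine Size
price = intercept + coefs @ car
print(f'Predicted price: {price:.0f} INR')
```

Writing the equation as a dot product is also how the model evaluates many cars at once: stacking several rows into a 2D array and multiplying by the coefficient vector gives one price per row.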
Step-by-Step Implementation
Step 1: Loading the Dataset
We’ll load a dataset of car prices from a CSV file. The dataset is available at: https://raw.githubusercontent.com/goradbj1/dataairevolution/main/datasets/cars-price.csv
import pandas as pd
# Load the dataset
url = 'https://raw.githubusercontent.com/goradbj1/dataairevolution/main/datasets/cars-price.csv'
data = pd.read_csv(url, index_col=0)
# Display the first few rows of the dataset
data.head()
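To see what index_col=0 does without downloading anything, here is a small sketch that reads a few rows from an in-memory CSV string. The rows are taken from the sample table earlier in this tutorial; the unnamed leading index column and the exact header names are assumptions that mirror how the columns are referenced in the later steps.

```python
import io
import pandas as pd

# Stand-in for the real CSV file (assumed layout: unnamed index column first)
csv_text = """,Age,Kms_Driven,Engine_Size,Price
0,3,40000,1500,500000
1,5,60000,1800,400000
2,2,20000,1300,550000
3,8,80000,2000,300000
4,1,10000,1200,600000
"""

# index_col=0 uses the first column as the row index instead of treating it as a feature
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.head())
print(sample.shape)  # (rows, columns)
```

Without index_col=0, the first column would be loaded as an extra column (named "Unnamed: 0" by pandas) and would show up in later summaries and correlations.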
Step 2: Preprocessing the Data
Preprocessing involves checking for missing values, summarizing data, visualizing relationships, and scaling if necessary.
# Basic information about the dataset
print(data.info())
# Summary statistics of the dataset
print(data.describe())
# Check for missing values
print(data.isnull().sum())
Insights:
- The `info()` method gives us an overview of the dataset, including data types and non-null counts.
- The `describe()` method provides summary statistics for the numerical columns.
- Checking for missing values shows us that our dataset is complete with no missing values.
Data Visualization
Let’s visualize the relationships between the input variables and the output variable (Price).
import matplotlib.pyplot as plt
import seaborn as sns
# Individual plots
plt.figure(figsize=(16, 8))
# Age vs Price
plt.subplot(1, 3, 1)
sns.scatterplot(x='Age', y='Price', data=data)
plt.title('Age vs Price')
# Kilometers Driven vs Price
plt.subplot(1, 3, 2)
sns.scatterplot(x='Kms_Driven', y='Price', data=data)
plt.title('Kilometers Driven vs Price')
# Engine Size vs Price
plt.subplot(1, 3, 3)
sns.scatterplot(x='Engine_Size', y='Price', data=data)
plt.title('Engine Size vs Price')
plt.tight_layout()
plt.show()
Insights:
From the scatter plots, we can observe:
- Age vs Price: There is a negative relationship; as the age of the car increases, the price tends to decrease.
- Kilometers Driven vs Price: There is a negative relationship; as the kilometers driven increases, the price tends to decrease.
- Engine Size vs Price: There is a positive relationship; as the engine size increases, the price tends to increase.
Correlation Matrix
Let’s look at the correlation matrix to understand the linear relationships between the variables.
# Correlation matrix
corr_matrix = data.corr()
print(corr_matrix)
# Heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Insights:
The correlation matrix shows that:
- Age has a negative correlation with Price.
- Kilometers Driven has a negative correlation with Price.
- Engine Size has a positive correlation with Price.
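To read these relationships off more quickly than from a heatmap, it can help to pull out only the Price column of the correlation matrix and sort it. The sketch below does this on synthetic data generated purely for illustration (the ranges, coefficients, and noise level are made up and are not the real dataset), so the printed values will differ from those of the actual file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 500
age = rng.integers(1, 11, size=n)                   # years
kms = age * 12000 + rng.normal(0, 5000, size=n)     # older cars tend to have more km
engine = rng.integers(1000, 2501, size=n)           # cc
price = 700000 - 10000 * age - 0.05 * kms + 200 * engine + rng.normal(0, 10000, size=n)

demo = pd.DataFrame({'Age': age, 'Kms_Driven': kms, 'Engine_Size': engine, 'Price': price})

# Pearson correlation of each feature with Price, from most negative to most positive
price_corr = demo.corr()['Price'].drop('Price').sort_values()
print(price_corr)
```

A correlation only measures the pairwise linear relationship; the regression coefficients fitted later describe each feature's effect while holding the other features fixed, so the two need not have the same magnitude.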
Scaling the Data
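For ordinary least squares regression, scaling the features is not required, because it does not change the model's predictions. It is still common practice: it puts the coefficients on a comparable scale and is needed by algorithms that are sensitive to feature scale, such as regularized or distance-based models. Note that the scaler below is fitted on the full dataset for simplicity; fitting it on the training split only avoids leaking test-set statistics into preprocessing.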
from sklearn.preprocessing import StandardScaler
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data[['Age', 'Kms_Driven', 'Engine_Size']])
X_scaled
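StandardScaler standardizes each column by subtracting its mean and dividing by its standard deviation (the population standard deviation, that is, ddof=0). The following minimal sketch checks this by hand on the feature values from the sample table; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature rows from the sample table: Age, Kilometers Driven, Engine Size
features = np.array([
    [3, 40000, 1500],
    [5, 60000, 1800],
    [2, 20000, 1300],
    [8, 80000, 2000],
    [1, 10000, 1200],
], dtype=float)

# What StandardScaler computes, column by column: z = (x - mean) / std
# np.std uses ddof=0 by default, which matches StandardScaler
manual = (features - features.mean(axis=0)) / features.std(axis=0)

scaled = StandardScaler().fit_transform(features)

# True if the manual formula and StandardScaler agree
print(np.allclose(manual, scaled))
```

After scaling, every column has mean 0 and standard deviation 1, so a coefficient fitted on scaled data describes the price change per one standard deviation of that feature.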
Step 3: Splitting the Dataset
We will split the dataset into a training set (80% of the rows) and a testing set (20%).
from sklearn.model_selection import train_test_split
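# Define the unscaled features and the target
X = data[['Age', 'Kms_Driven', 'Engine_Size']]
y = data['Price']
# Split the scaled features and the target: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)
# Check the resulting shapes
data.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape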
((200000, 4), (160000, 3), (40000, 3), (160000,), (40000,))
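A stricter variant of this workflow splits the raw data first and fits the scaler on the training rows only, so that no test-set statistics enter preprocessing. The sketch below illustrates that order of operations on synthetic data (hypothetical values, not the real dataset); all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (Age, Kms, Engine) and target, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform([1, 5000, 1000], [10, 150000, 2500], size=(1000, 3))
y_demo = 700000 - 10000 * X_demo[:, 0] - 0.05 * X_demo[:, 1] + 200 * X_demo[:, 2]

# 1) Split the raw (unscaled) data first
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# 2) Fit the scaler on the training rows only
demo_scaler = StandardScaler().fit(X_tr)

# 3) Apply the same fitted transformation to both splits
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

For plain linear regression the predictions come out the same either way, but keeping this order is a useful habit once scale-sensitive models or cross-validation are involved.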
Step 4: Building the Multiple Linear Regression Model
- Import the necessary libraries:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib
- Train the model:
# Create the model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
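After fitting, the learned intercept (β0) and coefficients (β1, β2, β3) are available as model.intercept_ and model.coef_. As a sanity check on what fitting does, the sketch below generates noise-free synthetic data from the assumed coefficients used in the example calculation and confirms that LinearRegression recovers them. The data is hypothetical and unscaled, so the recovered coefficients are in original units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data built from the assumed example coefficients
rng = np.random.default_rng(0)
X_demo = rng.uniform([1, 5000, 1000], [10, 150000, 2500], size=(200, 3))  # Age, Kms, Engine
true_intercept = 700000
true_coefs = np.array([-10000, -0.05, 200])
y_demo = true_intercept + X_demo @ true_coefs

demo_model = LinearRegression().fit(X_demo, y_demo)

# Learned parameters should match the ones used to generate the data
print('Intercept:', round(demo_model.intercept_, 2))
for name, coef in zip(['Age', 'Kms_Driven', 'Engine_Size'], demo_model.coef_):
    print(f'{name}: {coef:.4f}')
```

In the tutorial's own model the features were standardized before training, so each entry of model.coef_ there is the price change per one standard deviation of that feature rather than per year, kilometer, or cc.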
Step 5: Evaluating the Model
We will evaluate the model using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the R² score.
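# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
# RMSE is the square root of the MSE (the squared=False argument was deprecated and later removed in scikit-learn)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
# Store the result under a new name so the r2_score function is not overwritten
r2 = r2_score(y_test, y_pred)
print(f'Mean Absolute Error: {mae:.2f}')
print(f'Root Mean Squared Error: {rmse:.2f}')
print(f'R2_score: {r2:.2f}')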
Mean Absolute Error: 46601.40
Root Mean Squared Error: 61757.52
R2_score: 0.96
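For reference, the three metrics can be computed directly from their definitions. The sketch below uses the five prices from the sample table as true values together with a set of made-up predictions (hypothetical, for illustration only); the variable names are illustrative.

```python
import numpy as np

y_true = np.array([500000, 400000, 550000, 300000, 600000], dtype=float)  # sample table prices
y_hat = np.array([480000, 420000, 560000, 330000, 590000], dtype=float)   # made-up predictions

errors = y_true - y_hat

mae_manual = np.mean(np.abs(errors))            # average absolute error
rmse_manual = np.sqrt(np.mean(errors ** 2))     # square root of the mean squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot                 # share of variance explained

print(f'MAE: {mae_manual:.2f}')    # MAE: 18000.00
print(f'RMSE: {rmse_manual:.2f}')  # RMSE: 19493.59
print(f'R2: {r2_manual:.4f}')      # R2: 0.9672
```

MAE and RMSE are in the same unit as the target (INR here), and RMSE is never smaller than MAE because squaring weights large errors more heavily. R² is unitless and compares the model against always predicting the mean price.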
Step 6: Making Predictions
We will use the trained model to make predictions for new data.
import numpy as np
# New cars to price: [age in years, km driven, engine size in cc]
new_data = np.array([[4, 50000, 1600], [2, 30000, 1800]])
# Apply the same scaling used for training, then predict
new_data_scaled = scaler.transform(new_data)
predictions = model.predict(new_data_scaled)
for i, pred in enumerate(predictions):
    print(f'Predicted price for a car with age {new_data[i][0]} years, {new_data[i][1]} km driven, and {new_data[i][2]} cc engine: {pred:.2f} INR')
Predicted price for a car with age 4 years, 50000 km driven, and 1600 cc engine: 730072.45 INR
Predicted price for a car with age 2 years, 30000 km driven, and 1800 cc engine: 973428.57 INR
Step 7: Saving the Model
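We will save the model to a file for future use. Since the model expects scaled inputs, the fitted scaler is needed again at prediction time and should be saved alongside it.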
# Save the model to a file
joblib.dump(model, 'multiple_linear_regression_model.pkl')
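Because the model was trained on scaled inputs, reusing it later requires the same fitted scaler. One way to keep the two together, sketched here under assumptions, is to wrap them in a scikit-learn Pipeline and save that single object. The synthetic data, the temporary directory, and the file name car_price_pipeline.pkl are all illustrative choices, not part of the original project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in training data (hypothetical, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform([1, 5000, 1000], [10, 150000, 2500], size=(200, 3))
y_demo = 700000 - 10000 * X_demo[:, 0] - 0.05 * X_demo[:, 1] + 200 * X_demo[:, 2]

# Scaler and regression bundled into one object
pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'car_price_pipeline.pkl')  # illustrative file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline scales raw inputs itself before predicting
new_cars = np.array([[4, 50000, 1600], [2, 30000, 1800]], dtype=float)
restored_predictions = restored.predict(new_cars)
print(restored_predictions.round(2))
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.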
Conclusion
Multiple Linear Regression is a simple and interpretable technique for predicting a continuous dependent variable from several independent variables. In this project, a model using a car's age, kilometers driven, and engine size reached an R² of 0.96 on the held-out test set, and its coefficients show how each feature relates to the price.
What’s Next?
To enhance your understanding of multiple linear regression, try the following practice questions:
- Practice Question 1: Use a dataset of house prices with multiple features (e.g., size, location, number of bedrooms, age of the house) to create a multiple linear regression model to predict house prices.
- Practice Question 2: Apply multiple linear regression to a real-world dataset from sources like Kaggle or government databases and practice making predictions.
By practicing with different datasets and scenarios, you’ll become proficient in applying multiple linear regression to solve real-world problems.