Mastering Multiple Linear Regression: Car Price Prediction Project

Multiple Linear Regression is an extension of simple linear regression. It is used when we want to predict the value of a dependent variable based on the values of two or more independent variables.

Definition: Multiple Linear Regression is a supervised learning algorithm used for predicting a continuous output variable based on multiple input variables.

In this tutorial, we’ll walk through the steps of creating a multiple linear regression model using a practical example.

Example: Predicting Car Prices Based on Age, Kilometers Driven, and Engine Size

Problem Statement: We want to predict the price of a car based on its age, kilometers driven, and engine size.

Sample Dataset:

Age (years)	Kilometers Driven (km)	Engine Size (cc)	Price (INR)
3	40000	1500	500000
5	60000	1800	400000
2	20000	1300	550000
8	80000	2000	300000
1	10000	1200	600000

Mathematical Equation:

The multiple linear regression equation is:

Price = β0 + β1 × Age + β2 × Kilometers Driven + β3 × Engine Size

Where:

β0 is the intercept.
β1, β2, β3 are the coefficients for the independent variables.

Example Calculation:

Assume β0 = 700000, β1 = −10000, β2 = −0.05, and β3= 200.

For a car with an age of 4 years, 50000 km driven, and 1600 cc engine:

Price = 700000 + (−10000 × 4) + (−0.05 × 50000) + (200 × 1600)
Price = 700000 − 40000 − 2500 + 320000
Price = 977500 INR
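
The same arithmetic can be checked in a few lines of Python. This is a minimal sketch: the function name predict_price and the constant names are illustrative, and the coefficients are the assumed values from the example above, not fitted ones.

```python
# Assumed coefficients from the worked example (not fitted values)
BETA_0 = 700000      # intercept, in INR
BETA_AGE = -10000    # INR per year of age
BETA_KMS = -0.05     # INR per kilometer driven
BETA_ENGINE = 200    # INR per cc of engine size


def predict_price(age, kms_driven, engine_size):
    """Evaluate Price = β0 + β1 × Age + β2 × Kilometers Driven + β3 × Engine Size."""
    return (BETA_0
            + BETA_AGE * age
            + BETA_KMS * kms_driven
            + BETA_ENGINE * engine_size)


# Car from the example: 4 years old, 50000 km driven, 1600 cc engine
example_price = predict_price(age=4, kms_driven=50000, engine_size=1600)
print(f'Price = {example_price:.0f} INR')
```

Writing the equation as a function makes it easy to see how much each term contributes: change one argument at a time and compare the results.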

Step-by-Step Implementation

Step 1: Loading the Dataset

We’ll load a dataset of car prices from a CSV file. The dataset is available at https://raw.githubusercontent.com/goradbj1/dataairevolution/main/datasets/cars-price.csv.

import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/goradbj1/dataairevolution/main/datasets/cars-price.csv'
data = pd.read_csv(url, index_col = 0)

# Display the first few rows of the dataset
data.head()

Step 2: Preprocessing the Data

Preprocessing involves checking for missing values, summarizing data, visualizing relationships, and scaling if necessary.

# Basic information about the dataset
print(data.info())

# Summary statistics of the dataset
print(data.describe())

# Check for missing values
print(data.isnull().sum())

Insights:

The info() method gives us an overview of the dataset, including data types and non-null counts.
The describe() method provides summary statistics for the numerical columns.
Checking for missing values shows us that our dataset is complete with no missing values.

Data Visualization

Let’s visualize the relationships between the input variables and the output variable (Price).

import matplotlib.pyplot as plt
import seaborn as sns

# Individual plots
plt.figure(figsize=(16, 8))

# Age vs Price
plt.subplot(1, 3, 1)
sns.scatterplot(x='Age', y='Price', data=data)
plt.title('Age vs Price')

# Kilometers Driven vs Price
plt.subplot(1, 3, 2)
sns.scatterplot(x='Kms_Driven', y='Price', data=data)
plt.title('Kilometers Driven vs Price')

# Engine Size vs Price
plt.subplot(1, 3, 3)
sns.scatterplot(x='Engine_Size', y='Price', data=data)
plt.title('Engine Size vs Price')

plt.tight_layout()
plt.show()

Insights:

From the scatter plots, we can observe:
- Age vs Price: There is a negative relationship; as the age of the car increases, the price tends to decrease.
- Kilometers Driven vs Price: There is a negative relationship; as the kilometers driven increases, the price tends to decrease.
- Engine Size vs Price: There is a positive relationship; as the engine size increases, the price tends to increase.

Correlation Matrix

Let’s look at the correlation matrix to understand the linear relationships between the variables.

# Correlation matrix
corr_matrix = data.corr()
print(corr_matrix)

# Heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Insights:

The correlation matrix shows that:
- Age has a negative correlation with Price.
- Kilometers Driven has a negative correlation with Price.
- Engine Size has a positive correlation with Price.
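
To make the numbers in the matrix less abstract, here is a minimal sketch of the Pearson correlation, which is the default method used by data.corr(). It uses only the Age and Price columns of the five-row sample dataset shown earlier, so the value will differ from the one computed on the full CSV. The helper name pearson_r is illustrative.

```python
import numpy as np

# Age and Price columns from the five-row sample dataset shown earlier
sample_age = np.array([3, 5, 2, 8, 1], dtype=float)
sample_price = np.array([500000, 400000, 550000, 300000, 600000], dtype=float)


def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))


r_age_price = pearson_r(sample_age, sample_price)
print(f'Age vs Price: r = {r_age_price:.3f}')

# Cross-check against NumPy's built-in implementation
print(np.isclose(r_age_price, np.corrcoef(sample_age, sample_price)[0, 1]))
```

A value close to −1 means the points lie almost on a downward-sloping line, a value close to +1 means an upward-sloping line, and a value near 0 means no linear relationship.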

Scaling the Data

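Ordinary least squares predictions do not depend on the scale of the features, so scaling is not required for multiple linear regression. It is still good practice: after standardization each coefficient is the price change per one standard deviation of its feature, which makes the coefficients comparable, and scaling becomes necessary if you later switch to scale-sensitive methods such as regularized regression or gradient-based training.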

from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data[['Age', 'Kms_Driven', 'Engine_Size']])
X_scaled
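
The claim that scaling does not change the predictions of a plain linear regression can be checked directly. The sketch below uses synthetic data as a stand-in for the car dataset; the value ranges, the noise level, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data (assumed ranges, not the real dataset)
rng = np.random.default_rng(0)
demo_X = np.column_stack([
    rng.uniform(1, 10, 200),          # age in years
    rng.uniform(5000, 150000, 200),   # kilometers driven
    rng.uniform(1000, 2500, 200),     # engine size in cc
])
demo_y = (700000 - 10000 * demo_X[:, 0] - 0.05 * demo_X[:, 1]
          + 200 * demo_X[:, 2] + rng.normal(0, 5000, 200))

# Standardize: each column ends up with mean 0 and standard deviation 1
demo_X_scaled = StandardScaler().fit_transform(demo_X)

# Fit one model on the raw features and one on the scaled features
pred_raw = LinearRegression().fit(demo_X, demo_y).predict(demo_X)
pred_scaled = LinearRegression().fit(demo_X_scaled, demo_y).predict(demo_X_scaled)

# Ordinary least squares gives the same predictions either way
print(np.allclose(pred_raw, pred_scaled))
```

The coefficients of the two models do differ, because they are expressed in different units, but the fitted prices are the same.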

Step 3: Splitting the Dataset

We will split the dataset into training and testing sets.

from sklearn.model_selection import train_test_split

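# Separate the features and the target
X = data[['Age', 'Kms_Driven', 'Engine_Size']]
y = data['Price']

# Split the scaled features (X_scaled from the previous step) 80/20 into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)

# Compare the shapes before and after the split
data.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape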

((200000, 4), (160000, 3), (40000, 3), (160000,), (40000,))
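
In this tutorial the scaler is fitted on the full dataset before the split. A common alternative, sketched here, is to split first and fit the scaler on the training rows only, so that the test rows have no influence on the preprocessing. The data is synthetic and every variable name is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features (age, km driven, engine cc) and prices, for illustration only
rng = np.random.default_rng(1)
split_X = rng.uniform([1, 5000, 1000], [10, 150000, 2500], size=(100, 3))
split_y = rng.uniform(200000, 900000, size=100)

# Split first ...
split_X_train, split_X_test, split_y_train, split_y_test = train_test_split(
    split_X, split_y, test_size=0.2, random_state=0)

# ... then learn the mean and standard deviation from the training rows only
train_scaler = StandardScaler().fit(split_X_train)
split_X_train_scaled = train_scaler.transform(split_X_train)
split_X_test_scaled = train_scaler.transform(split_X_test)  # reuses training statistics

print(split_X_train_scaled.shape, split_X_test_scaled.shape)
```

With a dataset as large as the one used here, the difference between the two approaches is likely to be small, but splitting first keeps the test set strictly unseen.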

Step 4: Building the Multiple Linear Regression Model

Import the necessary libraries:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

Train the model:

# Create the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)
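
After fitting, the learned intercept (β0) and coefficients (β1, β2, β3) are available as model.intercept_ and model.coef_. The sketch below shows the idea on synthetic, noise-free data generated from the assumed coefficients of the earlier worked example, so the fit should recover them almost exactly. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, unscaled features: age (years), kilometers driven, engine size (cc)
rng = np.random.default_rng(42)
coef_X = np.column_stack([
    rng.uniform(1, 10, 500),
    rng.uniform(5000, 150000, 500),
    rng.uniform(1000, 2500, 500),
])

# Prices generated exactly from the assumed equation (no noise added)
true_intercept = 700000
true_coefs = np.array([-10000, -0.05, 200])
coef_y = true_intercept + coef_X @ true_coefs

coef_model = LinearRegression().fit(coef_X, coef_y)

print('Intercept (β0):', coef_model.intercept_)
print('Coefficients (β1, β2, β3):', coef_model.coef_)
```

Note that the model in this tutorial is trained on standardized features, so its coef_ values are in INR per one standard deviation of each feature rather than per year, per kilometer, or per cc.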

Step 5: Evaluating the Model

We will evaluate the model using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the R² score.

# Predict on the test set
y_pred = model.predict(X_test)

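# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
# Take the square root of the MSE; the squared=False argument has been removed from recent scikit-learn releases
rmse = mean_squared_error(y_test, y_pred) ** 0.5
# Store the score under a new name so the r2_score function is not overwritten
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error: {mae:.2f}')
print(f'Root Mean Squared Error: {rmse:.2f}')
print(f'R2_score: {r2:.2f}')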

Mean Absolute Error: 46601.40
Root Mean Squared Error: 61757.52
R2_score: 0.96
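
To see what these three numbers measure, here is a minimal sketch that computes them by hand with NumPy on four made-up values. The arrays and variable names are illustrative and unrelated to the car dataset.

```python
import numpy as np

# Four made-up true values and predictions, for illustration only
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

demo_errors = y_true_demo - y_pred_demo

# MAE: average size of the errors, ignoring their sign
mae_demo = np.mean(np.abs(demo_errors))

# RMSE: square root of the average squared error; penalizes large errors more
rmse_demo = np.sqrt(np.mean(demo_errors ** 2))

# R²: 1 minus (error left after the model / error of always predicting the mean)
ss_res = np.sum(demo_errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f'MAE: {mae_demo:.4f}')    # MAE: 1.0000
print(f'RMSE: {rmse_demo:.4f}')  # RMSE: 1.2247
print(f'R2: {r2_demo:.4f}')      # R2: 0.5932
```

MAE and RMSE are in the same unit as the target (INR in this tutorial), while R² is unitless and measures the share of the variance in the target that the model explains.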

Step 6: Making Predictions

We will use the trained model to make predictions for new data.

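import pandas as pd

# New cars to price; the column names match the ones the scaler was fitted on
new_data = pd.DataFrame([[4, 50000, 1600], [2, 30000, 1800]],
                        columns=['Age', 'Kms_Driven', 'Engine_Size'])

# Apply the same scaling as during training, then predict
new_data_scaled = scaler.transform(new_data)
predictions = model.predict(new_data_scaled)

for car, pred in zip(new_data.itertuples(index=False), predictions):
    print(f'Predicted price for a car with age {car.Age} years, {car.Kms_Driven} km driven, and {car.Engine_Size} cc engine: {pred:.2f} INR')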

Predicted price for a car with age 4 years, 50000 km driven, and 1600 cc engine: 730072.45 INR
Predicted price for a car with age 2 years, 30000 km driven, and 1800 cc engine: 973428.57 INR

Step 7: Saving the Model

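We will save the model to a file for future use. Because the model was trained on scaled features, the fitted scaler is saved as well; new data must be transformed with it before prediction.

# Save the model to a file
joblib.dump(model, 'multiple_linear_regression_model.pkl')

# Save the fitted scaler too (file name chosen here for illustration)
joblib.dump(scaler, 'standard_scaler.pkl')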
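
A saved model is only useful together with the preprocessing it was trained with. One way to keep the two together, sketched here as an alternative to the approach above, is to wrap the scaler and the regression in a single scikit-learn Pipeline and save that object. The data is synthetic, and the file name car_price_pipeline.pkl and all variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (age, km driven, engine cc), for illustration only
rng = np.random.default_rng(7)
save_X = rng.uniform([1, 5000, 1000], [10, 150000, 2500], size=(50, 3))
save_y = 700000 - 10000 * save_X[:, 0] - 0.05 * save_X[:, 1] + 200 * save_X[:, 2]

# Scaler and model bundled into one object, fitted in one call
price_pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(save_X, save_y)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = os.path.join(tmp_dir, 'car_price_pipeline.pkl')  # hypothetical name
    joblib.dump(price_pipeline, pipeline_path)
    restored_pipeline = joblib.load(pipeline_path)

# The restored pipeline scales and predicts in a single step
new_cars = np.array([[4, 50000, 1600]])
print(np.allclose(price_pipeline.predict(new_cars), restored_pipeline.predict(new_cars)))
```

Files written by joblib should only be loaded from sources you trust, and ideally with the same scikit-learn version that created them.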

Conclusion

Multiple Linear Regression is a powerful technique for predicting a dependent variable based on multiple independent variables. By understanding the relationships between these variables, we can make accurate predictions and gain valuable insights.

What’s Next?

To enhance your understanding of multiple linear regression, try the following practice questions:

Practice Question 1: Use a dataset of house prices with multiple features (e.g., size, location, number of bedrooms, age of the house) to create a multiple linear regression model to predict house prices.
Practice Question 2: Apply multiple linear regression to a real-world dataset from sources like Kaggle or government databases and practice making predictions.

By practicing with different datasets and scenarios, you’ll become proficient in applying multiple linear regression to solve real-world problems.
