Simple Linear Regression: Build an ML Model to Predict Electricity Consumption
Linear Regression is a fundamental and widely-used type of regression analysis. It models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
Definition: Linear Regression is a supervised learning algorithm used for predicting a continuous output variable based on one or more input variables.
Mathematical Equation:
The simple linear regression equation is:
Consumption = β0 + β1 × Appliances
Where:
- β0 is the intercept (the value of Consumption when Appliances = 0)
- β1 is the slope (the change in Consumption for a one-unit change in Appliances)
Example Calculation:
Assume β0 = 50 and β1 = 20.
For 12 appliances:
Consumption = 50 + 20 × 12 = 290 kWh
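As a quick sanity check, here is a minimal snippet that evaluates the same equation in Python; the coefficient values are the assumed ones from the example above:
# Assumed example coefficients: beta0 (intercept) and beta1 (slope)
beta0, beta1 = 50, 20
appliances = 12
consumption = beta0 + beta1 * appliances
print(consumption)  # 290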
In this tutorial, we’ll walk through the steps of creating a linear regression model using a simple example.
Example: Predicting Monthly Electricity Consumption
Problem Statement: We want to predict the monthly electricity consumption (in kWh) based on the number of electrical appliances used.
Dataset:
| Appliances (count) | Consumption (kWh) |
|---|---|
| 5 | 150 |
| 8 | 220 |
| 10 | 300 |
| 15 | 450 |
In real-world scenarios, datasets are often available in formats such as CSV. You can download the dataset from this link.
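If you start from a CSV file, a minimal sketch of loading it with pandas could look like this; the filename below is only an assumption, so use the name of the file you downloaded:
import pandas as pd
# Load the dataset from a CSV file (replace the filename with your own)
data = pd.read_csv('electricity_consumption.csv')
print(data.head())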
Step-by-Step Implementation
Step 1: Creating the Dataset
First, you can either download the dataset or generate your own using the following script. Here we generate a synthetic dataset with 50,000 records, but you can change the size as per your requirements. Also ensure that the numpy, pandas, and scikit-learn modules are installed.
To install the packages, you can use pip install numpy pandas scikit-learn
import numpy as np
import pandas as pd
# Set the seed for reproducibility
np.random.seed(0)
# Generate random data
appliances = np.random.randint(1, 21, 50000)
consumption = 50 + 20 * appliances + np.random.normal(0, 10, 50000)
# Create a DataFrame
data = pd.DataFrame({
'Appliances': appliances,
'Consumption': consumption
})
# Display the dataset
data
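Optionally, you can write the synthetic dataset to a CSV file so it can be reused later; the filename here is just an example:
# Save the generated dataset to a CSV file (example filename)
data.to_csv('electricity_consumption.csv', index=False)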
Step 2: Preprocessing the Data
Preprocessing involves checking for missing values and scaling the data if necessary. For this example, our dataset is clean and does not require scaling. Also, you can try different preprocessing techniques based on your requirements.
# Check for missing values
print(data.isnull().sum())
# No missing values, so we proceed
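If your own dataset did require scaling, a minimal sketch using scikit-learn's StandardScaler might look like the following. It is not needed for this example, and in practice the scaler should be fitted on the training split only:
from sklearn.preprocessing import StandardScaler
# Illustrative only: standardize the input feature to zero mean and unit variance
scaler = StandardScaler()
appliances_scaled = scaler.fit_transform(data[['Appliances']])
print(appliances_scaled[:5])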
Step 3: Splitting the Dataset
We will split the dataset into input (X) and output (y), and then into training (80%) and testing (20%) sets.
from sklearn.model_selection import train_test_split
# Split the data
X = data[['Appliances']]
y = data['Consumption']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
data.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape
# Output
((50000, 2), (40000, 1), (10000, 1), (40000,), (10000,))
Step 4: Building the Linear Regression Model
- Import the necessary libraries:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import joblib
- Train the model:
# Create the model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
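After training, you can inspect the learned intercept and slope. Because the synthetic data was generated with β0 = 50 and β1 = 20 (plus noise), the fitted values should come out close to those numbers:
# Inspect the learned parameters
print(f'Intercept (beta0): {model.intercept_:.2f}')
print(f'Slope (beta1): {model.coef_[0]:.2f}')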
Step 5: Evaluating the Model
We will evaluate the model using Mean Squared Error (MSE) and R-squared on both the training and test data.
# Predict on the train set
y_train_pred = model.predict(X_train)
# Evaluate the model on train data
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)
print(f'Train Mean Squared Error: {train_mse:.2f}')
print(f'Train R-squared: {train_r2:.2f}')
Train Mean Squared Error: 98.89
Train R-squared: 0.99
# Predict on the test set
y_test_pred = model.predict(X_test)
# Evaluate the model on test data
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
print(f'Test Mean Squared Error: {test_mse:.2f}')
print(f'Test R-squared: {test_r2:.2f}')
Test Mean Squared Error: 100.23
Test R-squared: 0.99
To visualize the results of a linear regression model, you can plot the training data together with the fitted regression line, and the test data together with the predicted line. Here's how to do this using Python and the matplotlib library.
import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
# Plot the training data
plt.scatter(X_train, y_train, color='blue', label='Training Data')
# Plot the regression line
plt.plot(X_train, y_train_pred, color='red', linewidth=2, label='Regression Line')
# Adding titles and labels
plt.title('Linear Regression: Training Data')
plt.xlabel('Number of Appliances')
plt.ylabel('Electricity Consumption (kWh)')
# Adding a legend
plt.legend()
plt.grid()
# Show the plot
plt.show()
import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
# Plot the test data
plt.scatter(X_test, y_test, color='green', label='Test Data')
# Plot the regression line for test data
plt.plot(X_test, y_test_pred, color='orange', linestyle='--', linewidth=2, label='Prediction Line')
# Adding titles and labels
plt.title('Linear Regression: Test Data')
plt.xlabel('Number of Appliances')
plt.ylabel('Electricity Consumption (kWh)')
# Adding a legend
plt.legend()
plt.grid()
# Show the plot
plt.show()
Step 6: Making Predictions
We will use the trained model to make predictions for new data.
# Make predictions
new_data = pd.DataFrame({'Appliances': [12, 18]})
predictions = model.predict(new_data)
for i, pred in enumerate(predictions):
    print(f'Predicted consumption for {new_data["Appliances"][i]} appliances: {pred:.2f} kWh')
Predicted consumption for 12 appliances: 289.99 kWh
Predicted consumption for 18 appliances: 410.01 kWh
Step 7: Saving the Model
We will save the model to a file for future use.
# Save the model to a file
joblib.dump(model, 'linear_regression_model.pkl')
['linear_regression_model.pkl']  # the model file is saved on the local machine
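To reuse the saved model later (for example, in another script), you can load it back with joblib and predict as usual; a short sketch:
# Load the saved model and make a prediction
loaded_model = joblib.load('linear_regression_model.pkl')
print(loaded_model.predict(pd.DataFrame({'Appliances': [12]})))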
Besides MSE and the R-squared score used above, you can also evaluate the model with other metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
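As a sketch, here is how MAE and RMSE could be computed for the test set with scikit-learn and numpy:
from sklearn.metrics import mean_absolute_error
# Additional evaluation metrics on the test set
test_mae = mean_absolute_error(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f'Test MAE: {test_mae:.2f}')
print(f'Test RMSE: {test_rmse:.2f}')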
Conclusion
Linear regression is a powerful and straightforward method for predictive modeling. By understanding the relationship between input and output variables, we can make accurate predictions and gain valuable insights.
What’s Next?
To enhance your understanding of linear regression, try the following practice questions:
- Practice Question 1: Use a dataset of car prices and a single feature (e.g., age) to create a simple linear regression model that predicts car prices.
- Practice Question 2: Collect data on house prices and create a simple linear regression model that predicts prices based on a feature such as size.
- Practice Question 3: Apply simple linear regression to a real-world dataset from sources like Kaggle or government databases and practice making predictions.
By practicing with different datasets and scenarios, you’ll become more proficient in using linear regression for predictive modeling.