
How to Build a Linear Regression Model in Python Using Scikit-Learn

Introduction

Linear Regression is one of the most fundamental machine learning algorithms used for predicting continuous values. It establishes a relationship between independent variables (features) and a dependent variable (target). In Python, Scikit-Learn provides a simple and efficient way to build and train a linear regression model.

This article explains how to build a linear regression model using Python and Scikit-Learn with step-by-step implementation, real-world examples, and best practices.

What is Linear Regression?

Linear regression is a supervised learning algorithm that models the relationship between input variables (X) and an output variable (y) by fitting a straight line.

The equation of linear regression is:

y = mx + b

Where:

  • y = predicted value

  • x = input feature

  • m = slope (coefficient)

  • b = intercept
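
The equation above can be checked with a few lines of Python. The slope and intercept below are hypothetical values chosen only to illustrate the formula:

```python
# Hypothetical slope and intercept for illustration
m = 2.0  # slope (coefficient)
b = 1.0  # intercept

def predict(x):
    # y = mx + b
    return m * x + b

print(predict(3.0))  # 2 * 3 + 1 = 7.0
```

With m = 2 and b = 1, an input of 3 produces a prediction of 7, exactly as the formula dictates. Training a model amounts to finding the m and b that best fit the data.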

Why Use Linear Regression?

Linear regression is widely used because:

  • It is simple and easy to understand

  • It works well for linear relationships

  • It is computationally efficient

  • It provides interpretable results

Prerequisites

Before starting, make sure you have:

  • Python installed

  • Basic understanding of Python

  • Libraries: numpy, pandas, matplotlib, scikit-learn

Install required libraries:

pip install numpy pandas matplotlib scikit-learn

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Step 2: Load Dataset

data = pd.read_csv("data.csv")  # assumes a CSV with 'Feature' and 'Target' columns

X = data[['Feature']]  # double brackets keep X two-dimensional, as Scikit-Learn expects
y = data['Target']

Step 3: Split the Dataset

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

This step ensures that we train and test the model on different data.

Step 4: Create and Train the Model

model = LinearRegression()
model.fit(X_train, y_train)

The model learns the relationship between input and output.
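
Concretely, the learned slope and intercept are stored in the fitted model's coef_ and intercept_ attributes. The following self-contained sketch uses a small synthetic dataset that follows y = 3x + 2 exactly (the numbers are illustrative, not from a real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3x + 2 (illustrative values)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.0, 8.0, 11.0, 14.0])

model = LinearRegression()
model.fit(X, y)

print(model.coef_[0])    # learned slope, approximately 3.0
print(model.intercept_)  # learned intercept, approximately 2.0
```

Because the data is perfectly linear, the model recovers the slope and intercept almost exactly; with noisy real-world data, the fit is the best straight line in the least-squares sense.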

Step 5: Make Predictions

y_pred = model.predict(X_test)

Step 6: Evaluate the Model

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R2 Score:", r2)

Step 7: Visualize the Results

plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred, color='red')
plt.xlabel("Feature")
plt.ylabel("Target")
plt.title("Linear Regression Model")
plt.show()

Real-World Example

A common real-world example is predicting house prices based on area:

  • Input (X): Size of the house

  • Output (y): Price of the house

The model learns how price changes with size and predicts values for new houses.
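
The house-price scenario can be sketched end to end with made-up numbers. The sizes and prices below are purely illustrative (they follow price = 150 × size), not real market data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: house size (sq ft) vs. price, following price = 150 * size
sizes = np.array([[800], [1000], [1200], [1500], [1800]])
prices = np.array([120000, 150000, 180000, 225000, 270000])

model = LinearRegression()
model.fit(sizes, prices)

# Predict the price of a new 1,300 sq ft house
predicted = model.predict(np.array([[1300]]))[0]
print(round(predicted))  # about 195000 for this synthetic data
```

Since the training data is exactly linear, the prediction for 1,300 sq ft lands on the line at roughly 195,000; real listings would add noise around such a trend.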

Difference Between Linear Regression and Other Models

Feature          | Linear Regression    | Decision Tree
Model Type       | Linear               | Non-linear
Interpretability | High                 | Medium
Performance      | Good for linear data | Better for complex data
Speed            | Fast                 | Moderate

Best Practices

  • Normalize or scale data if needed

  • Remove outliers for better accuracy

  • Use multiple features for better predictions

  • Validate model using cross-validation
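
Cross-validation, the last practice above, can be done in one call with cross_val_score. This sketch evaluates a linear model with 5-fold cross-validation on synthetic data (the trend y = 2.5x + 1 and the noise level are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a linear trend y = 2.5x + 1 plus small Gaussian noise (illustrative)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X.ravel() + 1.0 + rng.normal(0, 0.5, size=100)

# 5-fold cross-validation; each fold reports an R^2 score
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())  # close to 1.0, since the data is nearly linear
```

Averaging scores across folds gives a more reliable estimate of generalization than a single train/test split, because every observation is used for testing exactly once.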

Common Mistakes

  • Using linear regression for non-linear data

  • Ignoring feature correlation

  • Not splitting data into train and test sets

Summary

Linear regression in Python using Scikit-Learn provides a simple and effective way to build predictive models for continuous data. By following a structured approach of loading data, splitting it, training the model, and evaluating performance, developers can quickly create reliable machine learning solutions. With proper data preprocessing and validation, linear regression becomes a powerful tool for solving real-world problems like price prediction, trend analysis, and forecasting.