Table of Contents
Introduction
What Is a Regression Line?
Why Regression on X vs. Y Matters
Real-World Scenario: Predicting EV Charging Duration from Battery Level
Methods to Compute Regression Lines
Complete Implementation with Test Cases
Best Practices and Common Pitfalls
Conclusion
Introduction
Linear regression is one of the most widely used tools in data science—but did you know there are two ways to fit a line depending on which variable you treat as the predictor? Most tutorials only show regression of Y on X (predicting Y from X), but what if you need to predict X from Y instead?
This article demystifies both approaches using a timely, real-world example from the electric vehicle (EV) industry and provides clean, tested Python code you can use immediately.
What Is a Regression Line?
A regression line models the linear relationship between two variables. There are two distinct types:
Regression of Y on X: Minimizes vertical errors. Used when X is the input (e.g., time) and Y is the output (e.g., demand).
Regression of X on Y: Minimizes horizontal errors. Used when you want to infer the input (X) from an observed output (Y).
Mathematically, they produce different lines unless the correlation is perfect (±1).
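A quick way to see this: the product of the two slopes is exactly r², so the two lines can coincide only when |r| = 1. The snippet below is a stdlib-only check on made-up noisy data (the values are illustrative, not from the article's dataset):

```python
import math

# Made-up noisy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
var_x = sum((a - mean_x) ** 2 for a in x) / n
var_y = sum((b - mean_y) ** 2 for b in y) / n

slope_y_on_x = cov / var_x   # slope of the Y-on-X line
slope_x_on_y = cov / var_y   # slope of the X-on-Y line
r = cov / math.sqrt(var_x * var_y)

# The product of the two slopes equals r^2 (= 1 only under perfect correlation)
print(abs(slope_y_on_x * slope_x_on_y - r ** 2) < 1e-12)  # True
```

Since slope_y_on_x · slope_x_on_y = cov²/(σ²ₓσ²ᵧ) = r², the lines overlap exactly when r = ±1 and diverge as the correlation weakens.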
Why Regression on X vs. Y Matters
Choosing the wrong regression direction leads to biased predictions. For example, naively inverting a Y-on-X line to estimate X systematically over- or under-shoots whenever the correlation is imperfect, because the inverted slope (1/b_yx, in units of X per Y) is always steeper than the true X-on-Y slope. Confusing the two can result in significant errors in real applications.
Real-World Scenario: Predicting EV Charging Duration from Battery Level
Imagine you're building a smart parking system for an EV charging station. Cameras detect a car’s current battery percentage (e.g., 68%), but you don’t know how long it’s been plugged in. You do have historical data: for thousands of sessions, you recorded charging time (minutes) and resulting battery level (%).
Your goal: Estimate charging duration from observed battery level.
This requires regression of X (time) on Y (battery %)—the less common but critical approach.
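To see the difference concretely, compare the proper X-on-Y fit against a naive inversion of the Y-on-X line. The session data below are made up for illustration (not the article's historical dataset); with imperfect correlation the two estimates disagree, and the gap grows the farther the observed battery level sits from its mean:

```python
# Hypothetical sessions: charging time (min) and resulting battery level (%)
time = [10, 20, 30, 40, 50, 60]
batt = [22, 35, 58, 66, 85, 88]

n = len(time)
mean_t, mean_b = sum(time) / n, sum(batt) / n
cov = sum((t - mean_t) * (b - mean_b) for t, b in zip(time, batt)) / n
var_t = sum((t - mean_t) ** 2 for t in time) / n
var_b = sum((b - mean_b) ** 2 for b in batt) / n

# Y-on-X line (battery = a*time + c), then naively inverted to recover time
a = cov / var_t
c = mean_b - a * mean_t
naive_time = (90 - c) / a          # invert battery = a*time + c at battery = 90%

# X-on-Y line (time = a2*battery + c2): the correct direction
a2 = cov / var_b
c2 = mean_t - a2 * mean_b
proper_time = a2 * 90 + c2

# Above the mean battery level, the naive inversion over-predicts
# relative to the X-on-Y fit.
print(round(naive_time, 1), round(proper_time, 1))  # 57.2 56.5
```

Both lines pass through the point of means, so the two estimates agree only at the average battery level; everywhere else they differ.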
Methods to Compute Regression Lines
We’ll compute both lines using basic statistics—no external libraries needed.
Key Formulas
Given data points (x_i, y_i), i = 1..n, with means x̄, ȳ:

Regression of Y on X: y = b_yx · x + a, where b_yx = r · (σ_y / σ_x) and a = ȳ − b_yx · x̄.

Regression of X on Y: x = b_xy · y + a′, where b_xy = r · (σ_x / σ_y) and a′ = x̄ − b_xy · ȳ.

Here r is Pearson's correlation coefficient, and σ_x, σ_y denote the standard deviations of X and Y.
Complete Implementation with Test Cases
```python
import math
from typing import List
import unittest


def compute_regression_lines(x: List[float], y: List[float]) -> dict:
    """
    Computes both regression lines:
    - Y on X: y = a1 * x + b1
    - X on Y: x = a2 * y + b2 → y = (x - b2) / a2 (if needed in y = mx + c form)
    Returns coefficients in a dictionary.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("Input lists must be non-empty and of equal length")

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Compute variances and covariance
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

    # Avoid division by zero
    if var_x == 0 or var_y == 0:
        raise ValueError("Variance of X or Y is zero; regression undefined")

    # Correlation
    r = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

    # Regression of Y on X
    slope_y_on_x = r * (math.sqrt(var_y) / math.sqrt(var_x))
    intercept_y_on_x = mean_y - slope_y_on_x * mean_x

    # Regression of X on Y
    slope_x_on_y = r * (math.sqrt(var_x) / math.sqrt(var_y))
    intercept_x_on_y = mean_x - slope_x_on_y * mean_y

    return {
        'y_on_x': {'slope': slope_y_on_x, 'intercept': intercept_y_on_x},
        'x_on_y': {'slope': slope_x_on_y, 'intercept': intercept_x_on_y}
    }


def predict_x_from_y(y_val: float, slope: float, intercept: float) -> float:
    """Use X-on-Y regression to predict X from Y."""
    return slope * y_val + intercept


class TestRegressionLines(unittest.TestCase):
    def test_known_case(self):
        # Perfect linear relationship: y = 2x + 1
        x = [1, 2, 3, 4, 5]
        y = [3, 5, 7, 9, 11]
        result = compute_regression_lines(x, y)
        # Both regressions should be identical when r = ±1
        self.assertAlmostEqual(result['y_on_x']['slope'], 2.0)
        self.assertAlmostEqual(result['y_on_x']['intercept'], 1.0)
        self.assertAlmostEqual(result['x_on_y']['slope'], 0.5)  # because x = 0.5*(y - 1)
        self.assertAlmostEqual(result['x_on_y']['intercept'], -0.5)
        # Predict time (x) from battery level (y=7) → should be 3
        pred_x = predict_x_from_y(7, result['x_on_y']['slope'], result['x_on_y']['intercept'])
        self.assertAlmostEqual(pred_x, 3.0)

    def test_ev_scenario(self):
        # Simulated EV data: time (min) vs battery (%)
        charging_time = [10, 20, 30, 40, 50, 60]
        battery_level = [20, 38, 55, 70, 82, 90]
        result = compute_regression_lines(charging_time, battery_level)
        # Predict how long a car has charged if battery = 60%
        estimated_time = predict_x_from_y(60,
                                          result['x_on_y']['slope'],
                                          result['x_on_y']['intercept'])
        # Should be roughly between 30 and 40 minutes
        self.assertGreater(estimated_time, 30)
        self.assertLess(estimated_time, 40)


if __name__ == "__main__":
    # Example: EV charging data
    time_minutes = [15, 25, 35, 45, 55]
    battery_pct = [25, 45, 60, 75, 85]

    models = compute_regression_lines(time_minutes, battery_pct)
    print("=== EV Charging Regression ===")
    print(f"Y-on-X (predict battery from time): y = {models['y_on_x']['slope']:.2f}x + {models['y_on_x']['intercept']:.2f}")
    print(f"X-on-Y (predict time from battery): x = {models['x_on_y']['slope']:.2f}y + {models['x_on_y']['intercept']:.2f}")

    # Predict charging time for 70% battery
    est_time = predict_x_from_y(70, models['x_on_y']['slope'], models['x_on_y']['intercept'])
    print(f"\nEstimated charging time for 70% battery: {est_time:.1f} minutes")

    # Run tests
    unittest.main(argv=[''], exit=False, verbosity=2)
```
Best Practices and Common Pitfalls
Always ask: “Which variable is the input, and which is the output?”
Never assume Y-on-X works for inverse prediction—it doesn’t.
Check correlation strength: if |r| < 0.5, linear regression may not be appropriate.
For X-on-Y, remember the line is expressed as X = mY + c—don’t force it into Y = mX + c unless you algebraically rearrange it.
Validate with real data: synthetic examples can hide edge cases.
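The rearrangement mentioned in the X-on-Y bullet above is simple but easy to get wrong. A minimal sketch, using the coefficients from the article's perfect-fit test (x = 0.5y − 0.5, where the underlying relationship is y = 2x + 1):

```python
# Rearranging an X-on-Y fit x = m*y + c into y = (x - c) / m (requires m != 0)
m, c = 0.5, -0.5  # X-on-Y coefficients for the perfect-fit example: x = 0.5y - 0.5


def x_on_y_as_y_of_x(x_val: float) -> float:
    """Express the X-on-Y line in y = f(x) form for plotting: y = (x - c) / m."""
    return (x_val - c) / m


print(x_on_y_as_y_of_x(3.0))  # 7.0, matching y = 2x + 1 at x = 3
```

This is useful when you want to draw both regression lines on the same x-y axes; the X-on-Y line keeps its own slope and intercept, only the algebraic form changes.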
Conclusion
Whether you're estimating EV charging duration from battery levels, inferring user engagement time from activity scores, or back-calculating dosage from biomarker readings—knowing how to regress X on Y is a superpower. By understanding both regression directions and applying them correctly, you avoid systematic prediction errors and build more trustworthy systems. The next time you fit a line, ask: “Am I predicting Y from X—or X from Y?” The answer changes everything.