
How to Find a Regression Line on X or Y in Python

Table of Contents

  • Introduction

  • What Is a Regression Line?

  • Why Regression on X vs. Y Matters

  • Real-World Scenario: Predicting EV Charging Duration from Battery Level

  • Methods to Compute Regression Lines

  • Complete Implementation with Test Cases

  • Best Practices and Common Pitfalls

  • Conclusion

Introduction

Linear regression is one of the most widely used tools in data science—but did you know there are two ways to fit a line depending on which variable you treat as the predictor? Most tutorials only show regression of Y on X (predicting Y from X), but what if you need to predict X from Y instead?

This article demystifies both approaches using a timely, real-world example from the electric vehicle (EV) industry and provides clean, tested Python code you can use immediately.

What Is a Regression Line?

A regression line models the linear relationship between two variables. There are two distinct types:

  • Regression of Y on X: Minimizes the sum of squared vertical errors (the residuals in Y). Used when X is the input (e.g., time) and Y is the output (e.g., demand).

  • Regression of X on Y: Minimizes the sum of squared horizontal errors (the residuals in X). Used when you want to infer the input (X) from an observed output (Y).

Mathematically, they produce different lines unless the correlation is perfect (±1).
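To make the difference concrete, here is a minimal sketch. The `fit` helper and the sample values are illustrative assumptions, not part of the implementation later in this article. It fits both lines and expresses each one as a slope in the same y-versus-x plane, first for noisy data and then for a perfectly linear data set.

```python
def fit(u, v):
    """Least-squares regression of v on u: returns (slope, intercept) of v = slope * u + intercept."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    slope = cov / var_u
    return slope, mean_v - slope * mean_u

# Hypothetical noisy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

slope_yx, _ = fit(x, y)  # Y on X: y = slope_yx * x + ...
slope_xy, _ = fit(y, x)  # X on Y: x = slope_xy * y + ...

# Rewrite the X-on-Y line in the y-versus-x plane: its slope there is 1 / slope_xy
slope_xy_in_xy_plane = 1 / slope_xy
print("noisy data:  ", slope_yx, slope_xy_in_xy_plane)

# Perfectly linear data (y = 2x + 1): the two lines coincide
x_perfect = [1, 2, 3, 4, 5]
y_perfect = [3, 5, 7, 9, 11]
perfect_yx, _ = fit(x_perfect, y_perfect)
perfect_xy, _ = fit(y_perfect, x_perfect)
print("perfect data:", perfect_yx, 1 / perfect_xy)
```

With the noisy data the two slopes disagree noticeably; with the perfect data they match.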

Why Regression on X vs. Y Matters

Choosing the wrong regression direction leads to biased predictions. For example:

  • Predicting battery charge level (Y) from charging time (X) → use Y-on-X.

  • Estimating how long a car has been charging (X) from its current battery level (Y) → use X-on-Y.

Confusing the two can result in significant errors in real applications.
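As a rough illustration of how large the gap can be, the sketch below estimates charging time from an observed battery level in two ways: directly with an X-on-Y fit, and by fitting Y-on-X and solving that line for x. The data are deliberately noisy, made-up values, and `fit` is a hypothetical helper written for this example.

```python
def fit(u, v):
    """Least-squares regression of v on u: returns (slope, intercept)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    slope = cov / var_u
    return slope, mean_v - slope * mean_u

# Hypothetical, noisy sessions (not real measurements)
charging_time = [10, 20, 30, 40, 50, 60]  # X, minutes
battery_level = [30, 25, 60, 50, 85, 70]  # Y, percent

observed_battery = 85

# X on Y: regress time on battery level, then evaluate the line
m_xy, c_xy = fit(battery_level, charging_time)
time_direct = m_xy * observed_battery + c_xy

# Shortcut: regress battery level on time (Y on X), then solve for x
m_yx, c_yx = fit(charging_time, battery_level)
time_inverted = (observed_battery - c_yx) / m_yx

print(f"X-on-Y estimate:          {time_direct:.1f} min")
print(f"Inverted Y-on-X estimate: {time_inverted:.1f} min")
```

On this sample the two estimates differ by several minutes. The gap shrinks as the correlation approaches ±1 and as the observed value moves toward the mean.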

Real-World Scenario: Predicting EV Charging Duration from Battery Level

Imagine you're building a smart parking system for an EV charging station. Cameras detect a car’s current battery percentage (e.g., 68%), but you don’t know how long it’s been plugged in. You do have historical data: for thousands of sessions, you recorded charging time (minutes) and resulting battery level (%).

Your goal: Estimate charging duration from observed battery level.
This requires regression of X (time) on Y (battery %)—the less common but critical approach.

Methods to Compute Regression Lines

We’ll compute both lines using basic statistics—no external libraries needed.

Key Formulas

Given data points (x_i, y_i):

  • Y on X:
    Slope = r * (σ_y / σ_x)
    Intercept = mean(y) - slope * mean(x)

  • X on Y:
    Slope = r * (σ_x / σ_y)
    Intercept = mean(x) - slope * mean(y)

Where r is Pearson’s correlation coefficient, and σ denotes standard deviation.
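Since r = cov(x, y) / (σ_x σ_y), both slopes can also be written without computing r explicitly. The short derivation below uses b_yx and b_xy as shorthand for the Y-on-X and X-on-Y slopes; that notation is introduced here only for illustration.

```latex
b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{\operatorname{cov}(x, y)}{\sigma_x^{2}},
\qquad
b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{\operatorname{cov}(x, y)}{\sigma_y^{2}},
\qquad
b_{yx}\, b_{xy} = r^{2}
```

The last identity shows why the two lines coincide only when r = ±1: only then is b_yx equal to 1 / b_xy. Both lines always pass through the point (mean(x), mean(y)).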

Complete Implementation with Test Cases

import math
from typing import List
import unittest

def compute_regression_lines(x: List[float], y: List[float]) -> dict:
    """
    Computes both regression lines:
    - Y on X: y = a1 * x + b1
    - X on Y: x = a2 * y + b2  →  y = (x - b2) / a2  (if needed in y = mx + c form)
    
    Returns coefficients in a dictionary.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("Input lists must be non-empty and of equal length")
    
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    
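    # Compute population variances and covariance (dividing by n);
    # the 1/n factors cancel in the slopes, so n - 1 would give the same lines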
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    
    # Avoid division by zero
    if var_x == 0 or var_y == 0:
        raise ValueError("Variance of X or Y is zero; regression undefined")
    
    # Correlation
    r = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))
    
    # Regression of Y on X
    slope_y_on_x = r * (math.sqrt(var_y) / math.sqrt(var_x))
    intercept_y_on_x = mean_y - slope_y_on_x * mean_x
    
    # Regression of X on Y
    slope_x_on_y = r * (math.sqrt(var_x) / math.sqrt(var_y))
    intercept_x_on_y = mean_x - slope_x_on_y * mean_y
    
    return {
        'y_on_x': {'slope': slope_y_on_x, 'intercept': intercept_y_on_x},
        'x_on_y': {'slope': slope_x_on_y, 'intercept': intercept_x_on_y}
    }

def predict_x_from_y(y_val: float, slope: float, intercept: float) -> float:
    """Use X-on-Y regression to predict X from Y."""
    return slope * y_val + intercept

class TestRegressionLines(unittest.TestCase):
    def test_known_case(self):
        # Perfect linear relationship: y = 2x + 1
        x = [1, 2, 3, 4, 5]
        y = [3, 5, 7, 9, 11]
        
        result = compute_regression_lines(x, y)
        
        # Both regressions should be identical when r = ±1
        self.assertAlmostEqual(result['y_on_x']['slope'], 2.0)
        self.assertAlmostEqual(result['y_on_x']['intercept'], 1.0)
        self.assertAlmostEqual(result['x_on_y']['slope'], 0.5)  # because x = 0.5*(y - 1)
        self.assertAlmostEqual(result['x_on_y']['intercept'], -0.5)
        
        # Predict time (x) from battery level (y=7) → should be 3
        pred_x = predict_x_from_y(7, result['x_on_y']['slope'], result['x_on_y']['intercept'])
        self.assertAlmostEqual(pred_x, 3.0)

    def test_ev_scenario(self):
        # Simulated EV data: time (min) vs battery (%)
        charging_time = [10, 20, 30, 40, 50, 60]
        battery_level = [20, 38, 55, 70, 82, 90]
        
        result = compute_regression_lines(charging_time, battery_level)
        
        # Predict how long a car has charged if battery = 60%
        estimated_time = predict_x_from_y(60, 
                                         result['x_on_y']['slope'], 
                                         result['x_on_y']['intercept'])
        
        # Should be roughly between 30 and 40 minutes
        self.assertGreater(estimated_time, 30)
        self.assertLess(estimated_time, 40)

if __name__ == "__main__":
    # Example: EV charging data
    time_minutes = [15, 25, 35, 45, 55]
    battery_pct = [25, 45, 60, 75, 85]
    
    models = compute_regression_lines(time_minutes, battery_pct)
    
    print("=== EV Charging Regression ===")
    # The :+.2f format prints the intercept with its sign, avoiding output like "+ -3.16"
    print(f"Y-on-X (predict battery from time): y = {models['y_on_x']['slope']:.2f}x {models['y_on_x']['intercept']:+.2f}")
    print(f"X-on-Y (predict time from battery): x = {models['x_on_y']['slope']:.2f}y {models['x_on_y']['intercept']:+.2f}")
    
    # Predict charging time for 70% battery
    est_time = predict_x_from_y(70, models['x_on_y']['slope'], models['x_on_y']['intercept'])
    print(f"\nEstimated charging time for 70% battery: {est_time:.1f} minutes")
    
    # Run tests
    unittest.main(argv=[''], exit=False, verbosity=2)

Best Practices and Common Pitfalls

  • Always ask: “Which variable is the input, and which is the output?”

  • Never assume Y-on-X works for inverse prediction. Solving the Y-on-X line for X gives the same answer as X-on-Y only when |r| = 1.

  • Check correlation strength: as a rough rule of thumb, if |r| < 0.5 a straight line explains little of the variation, and linear regression may not be appropriate.

  • For X-on-Y, remember the line is expressed as X = mY + c. Don't force it into Y = mX + c form unless you rearrange it algebraically.

  • Validate with real data: synthetic examples can hide edge cases.
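The correlation check from the list above can be automated before fitting. The sketch below is one possible approach, not a definitive test: `pearson_r` and `check_linear_fit` are hypothetical helper names, the 0.5 threshold is the rule of thumb mentioned above, the weakly correlated data set is made up, and both inputs are assumed to have non-zero variance.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient (assumes non-zero variance in x and y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def check_linear_fit(x, y, min_abs_r=0.5):
    """Return (r, ok); ok is False when |r| falls below the rule-of-thumb threshold."""
    r = pearson_r(x, y)
    return r, abs(r) >= min_abs_r

# Sample EV data from the script in this article: strongly correlated
r_strong, ok_strong = check_linear_fit([15, 25, 35, 45, 55], [25, 45, 60, 75, 85])

# Hypothetical weakly correlated data, for illustration only
r_weak, ok_weak = check_linear_fit([1, 2, 3, 4, 5], [3, 1, 4, 1, 5])

print(f"strong: r = {r_strong:.3f}, ok = {ok_strong}")
print(f"weak:   r = {r_weak:.3f}, ok = {ok_weak}")
```

A failed check is a prompt to plot the data and reconsider the model, not proof that no relationship exists.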

Conclusion

Whether you're estimating EV charging duration from battery levels, inferring user engagement time from activity scores, or back-calculating dosage from biomarker readings—knowing how to regress X on Y is a superpower. By understanding both regression directions and applying them correctly, you avoid systematic prediction errors and build more trustworthy systems. The next time you fit a line, ask: “Am I predicting Y from X—or X from Y?” The answer changes everything.