Table of Contents
Introduction
What Is a Regression Line?
Why Regression on X vs. Y Matters
Real-World Scenario: Predicting EV Charging Duration from Battery Level
Methods to Compute Regression Lines
Complete Implementation with Test Cases
Best Practices and Common Pitfalls
Conclusion
Introduction
Linear regression is one of the most widely used tools in data science—but did you know there are two ways to fit a line depending on which variable you treat as the predictor? Most tutorials only show regression of Y on X (predicting Y from X), but what if you need to predict X from Y instead?
This article demystifies both approaches using a timely, real-world example from the electric vehicle (EV) industry and provides clean, tested Python code you can use immediately.
What Is a Regression Line?
A regression line models the linear relationship between two variables. There are two distinct types:
Regression of Y on X: Minimizes vertical errors. Used when X is the input (e.g., time) and Y is the output (e.g., demand).
Regression of X on Y: Minimizes horizontal errors. Used when you want to infer the input (X) from an observed output (Y).
Mathematically, they produce different lines unless the correlation is perfect (±1).
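A quick way to see this: the product of the two slopes is exactly r², so the two lines can coincide only when |r| = 1. The snippet below is a stdlib-only check on made-up noisy data (the values are illustrative, not from the article's dataset):

```python
import math

# Made-up noisy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
var_x = sum((a - mean_x) ** 2 for a in x) / n
var_y = sum((b - mean_y) ** 2 for b in y) / n

slope_y_on_x = cov / var_x   # slope of the Y-on-X line
slope_x_on_y = cov / var_y   # slope of the X-on-Y line
r = cov / math.sqrt(var_x * var_y)

# The product of the two slopes equals r^2 (= 1 only under perfect correlation)
print(abs(slope_y_on_x * slope_x_on_y - r ** 2) < 1e-12)  # True
```

Since slope_y_on_x · slope_x_on_y = cov²/(σ²ₓσ²ᵧ) = r², the lines overlap exactly when r = ±1 and diverge as the correlation weakens.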
Why Regression on X vs. Y Matters
Choosing the wrong regression direction leads to biased predictions. For example, naively inverting a Y-on-X line to estimate X systematically over- or under-shoots whenever the correlation is imperfect, because the inverted slope (1/b_yx, in units of X per Y) is always steeper than the true X-on-Y slope. Confusing the two can result in significant errors in real applications.
Real-World Scenario: Predicting EV Charging Duration from Battery Level
Imagine you're building a smart parking system for an EV charging station. Cameras detect a car’s current battery percentage (e.g., 68%), but you don’t know how long it’s been plugged in. You do have historical data: for thousands of sessions, you recorded charging time (minutes) and resulting battery level (%).
Your goal: Estimate charging duration from observed battery level.
This requires regression of X (time) on Y (battery %)—the less common but critical approach.
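To see the difference concretely, compare the proper X-on-Y fit against a naive inversion of the Y-on-X line. The session data below are made up for illustration (not the article's historical dataset); with imperfect correlation the two estimates disagree, and the gap grows the farther the observed battery level sits from its mean:

```python
# Hypothetical sessions: charging time (min) and resulting battery level (%)
time = [10, 20, 30, 40, 50, 60]
batt = [22, 35, 58, 66, 85, 88]

n = len(time)
mean_t, mean_b = sum(time) / n, sum(batt) / n
cov = sum((t - mean_t) * (b - mean_b) for t, b in zip(time, batt)) / n
var_t = sum((t - mean_t) ** 2 for t in time) / n
var_b = sum((b - mean_b) ** 2 for b in batt) / n

# Y-on-X line (battery = a*time + c), then naively inverted to recover time
a = cov / var_t
c = mean_b - a * mean_t
naive_time = (90 - c) / a          # invert battery = a*time + c at battery = 90%

# X-on-Y line (time = a2*battery + c2): the correct direction
a2 = cov / var_b
c2 = mean_t - a2 * mean_b
proper_time = a2 * 90 + c2

# Above the mean battery level, the naive inversion over-predicts
# relative to the X-on-Y fit.
print(round(naive_time, 1), round(proper_time, 1))  # 57.2 56.5
```

Both lines pass through the point of means, so the two estimates agree only at the average battery level; everywhere else they differ.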
Methods to Compute Regression Lines
We’ll compute both lines using basic statistics—no external libraries needed.
Key Formulas
Given data points (x_i, y_i), i = 1..n, with means x̄, ȳ:

Regression of Y on X: y = b_yx · x + a, where b_yx = r · (σ_y / σ_x) and a = ȳ − b_yx · x̄.

Regression of X on Y: x = b_xy · y + a′, where b_xy = r · (σ_x / σ_y) and a′ = x̄ − b_xy · ȳ.

Here r is Pearson's correlation coefficient, and σ_x, σ_y denote the standard deviations of X and Y.
Complete Implementation with Test Cases
```python
import math
from typing import List
import unittest


def compute_regression_lines(x: List[float], y: List[float]) -> dict:
    """
    Computes both regression lines:
    - Y on X: y = a1 * x + b1
    - X on Y: x = a2 * y + b2 → y = (x - b2) / a2 (if needed in y = mx + c form)
    Returns coefficients in a dictionary.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("Input lists must be non-empty and of equal length")

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Compute variances and covariance
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

    # Avoid division by zero
    if var_x == 0 or var_y == 0:
        raise ValueError("Variance of X or Y is zero; regression undefined")

    # Correlation
    r = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

    # Regression of Y on X
    slope_y_on_x = r * (math.sqrt(var_y) / math.sqrt(var_x))
    intercept_y_on_x = mean_y - slope_y_on_x * mean_x

    # Regression of X on Y
    slope_x_on_y = r * (math.sqrt(var_x) / math.sqrt(var_y))
    intercept_x_on_y = mean_x - slope_x_on_y * mean_y

    return {
        'y_on_x': {'slope': slope_y_on_x, 'intercept': intercept_y_on_x},
        'x_on_y': {'slope': slope_x_on_y, 'intercept': intercept_x_on_y}
    }


def predict_x_from_y(y_val: float, slope: float, intercept: float) -> float:
    """Use X-on-Y regression to predict X from Y."""
    return slope * y_val + intercept


class TestRegressionLines(unittest.TestCase):
    def test_known_case(self):
        # Perfect linear relationship: y = 2x + 1
        x = [1, 2, 3, 4, 5]
        y = [3, 5, 7, 9, 11]
        result = compute_regression_lines(x, y)
        # Both regressions should be identical when r = ±1
        self.assertAlmostEqual(result['y_on_x']['slope'], 2.0)
        self.assertAlmostEqual(result['y_on_x']['intercept'], 1.0)
        self.assertAlmostEqual(result['x_on_y']['slope'], 0.5)  # because x = 0.5*(y - 1)
        self.assertAlmostEqual(result['x_on_y']['intercept'], -0.5)
        # Predict time (x) from battery level (y=7) → should be 3
        pred_x = predict_x_from_y(7, result['x_on_y']['slope'], result['x_on_y']['intercept'])
        self.assertAlmostEqual(pred_x, 3.0)

    def test_ev_scenario(self):
        # Simulated EV data: time (min) vs battery (%)
        charging_time = [10, 20, 30, 40, 50, 60]
        battery_level = [20, 38, 55, 70, 82, 90]
        result = compute_regression_lines(charging_time, battery_level)
        # Predict how long a car has charged if battery = 60%
        estimated_time = predict_x_from_y(60,
                                          result['x_on_y']['slope'],
                                          result['x_on_y']['intercept'])
        # Should be roughly between 30 and 40 minutes
        self.assertGreater(estimated_time, 30)
        self.assertLess(estimated_time, 40)


if __name__ == "__main__":
    # Example: EV charging data
    time_minutes = [15, 25, 35, 45, 55]
    battery_pct = [25, 45, 60, 75, 85]

    models = compute_regression_lines(time_minutes, battery_pct)
    print("=== EV Charging Regression ===")
    print(f"Y-on-X (predict battery from time): y = {models['y_on_x']['slope']:.2f}x + {models['y_on_x']['intercept']:.2f}")
    print(f"X-on-Y (predict time from battery): x = {models['x_on_y']['slope']:.2f}y + {models['x_on_y']['intercept']:.2f}")

    # Predict charging time for 70% battery
    est_time = predict_x_from_y(70, models['x_on_y']['slope'], models['x_on_y']['intercept'])
    print(f"\nEstimated charging time for 70% battery: {est_time:.1f} minutes")

    # Run tests
    unittest.main(argv=[''], exit=False, verbosity=2)
```
Best Practices and Common Pitfalls
Always ask: “Which variable is the input, and which is the output?”
Never assume Y-on-X works for inverse prediction—it doesn’t.
Check correlation strength: if |r| < 0.5, linear regression may not be appropriate.
For X-on-Y, remember the line is expressed as X = mY + c—don’t force it into Y = mX + c unless you algebraically rearrange it.
Validate with real data: synthetic examples can hide edge cases.
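The rearrangement mentioned in the X-on-Y bullet above is simple but easy to get wrong. A minimal sketch, using the coefficients from the article's perfect-fit test (x = 0.5y − 0.5, where the underlying relationship is y = 2x + 1):

```python
# Rearranging an X-on-Y fit x = m*y + c into y = (x - c) / m (requires m != 0)
m, c = 0.5, -0.5  # X-on-Y coefficients for the perfect-fit example: x = 0.5y - 0.5


def x_on_y_as_y_of_x(x_val: float) -> float:
    """Express the X-on-Y line in y = f(x) form for plotting: y = (x - c) / m."""
    return (x_val - c) / m


print(x_on_y_as_y_of_x(3.0))  # 7.0, matching y = 2x + 1 at x = 3
```

This is useful when you want to draw both regression lines on the same x-y axes; the X-on-Y line keeps its own slope and intercept, only the algebraic form changes.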
Conclusion
Whether you're estimating EV charging duration from battery levels, inferring user engagement time from activity scores, or back-calculating dosage from biomarker readings—knowing how to regress X on Y is a superpower. By understanding both regression directions and applying them correctly, you avoid systematic prediction errors and build more trustworthy systems. The next time you fit a line, ask: “Am I predicting Y from X—or X from Y?” The answer changes everything.