Implementing a Simple Linear Regression Model in a Fabric Notebook

What is the Simple Linear Regression Model?

Simple Linear Regression is a statistical method that helps to model and analyze the relationship between two continuous variables. In simple linear regression, we have a dependent variable (also known as the response or target variable) and an independent variable (also known as the predictor or explanatory variable).

The relationship between the variables is represented by the equation of a straight line.

  • y=mx+b

Where: y is the dependent variable (response), x is the independent variable (predictor), m is the slope of the line (the change in y for a unit change in x), and b is the y-intercept (the value of y when x is 0).

The main goal of simple linear regression is to find the best-fitting line through the data points that minimizes the sum of the squared differences between the observed values (actual data points) and the values predicted by the line. This process is often called "fitting" the model.

The equation of the line is determined during the training phase using a method called the least squares method, which minimizes the sum of the squared vertical distances (residuals) between the observed and predicted values.
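To make the least squares idea concrete, here is a minimal sketch that computes the slope and intercept by hand with NumPy. The five data points are made up for illustration and are not taken from the salary dataset used later; the formulas are the standard closed-form estimates for a single predictor.

```python
import numpy as np

# Hypothetical toy data (not the article's dataset):
# x = years of experience, y = salary
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40000.0, 45000.0, 52000.0, 58000.0, 65000.0])

# Closed-form least squares estimates for y = m*x + b
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

# Residuals are the vertical distances between observed and predicted values;
# their squared sum is the quantity that least squares minimizes
residuals = y - (m * x + b)
sse = np.sum(residuals ** 2)

print(f"slope m = {m:.2f}, intercept b = {b:.2f}, SSE = {sse:.2f}")
```

Any other choice of m and b would give a larger sum of squared residuals on these points, which is what "best-fitting" means here.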

Implementing Simple Linear Regression in a Fabric Notebook

In this article, we are going to use a salary dataset downloaded from kaggle.com. It records the salary of employees against their years of experience and contains two columns: YearsExperience and Salary.

Step 1. Import Necessary Python Library

Implementing simple linear regression in Python relies on a handful of libraries. Pandas is used for data manipulation, NumPy for numerical calculations, and Matplotlib and Seaborn for visualizations. For the machine learning steps, scikit-learn (sklearn) provides the train/test split and the regression model. The snippet below imports everything used in the rest of this walkthrough.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


Step 2. Read Data into DataFrame

The salary_dataset is initially uploaded into Simple_Linear_Regression Lakehouse in Microsoft Fabric and then loaded into a Table. To read the parquet file into a DataFrame and display the top 5 records, I executed the code below.

df = pd.read_parquet("abfss://b67a4b8d-a01e-4a4e-925f-d922133adb48@onelake.dfs.fabric.microsoft.com/1c9bba50-92b9-4222-a438-1ed171369403/Tables/salary_data_for_simple_linear_regression_model")
display(df.head(5))

Step 3. Exploratory Data Analysis

Next, I executed the code below to produce descriptive statistics for the dataset using the describe() method.

df.describe()


Based on the descriptive analysis, we have 30 records; the lowest salary is 37731 and the highest salary is 122391.
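The figures reported by describe() can also be read off individually, and it is worth checking for missing values before modelling. The sketch below uses a small hypothetical DataFrame (df_demo) as a stand-in so that it runs on its own; in the notebook, the df loaded in Step 2 would be used instead.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, `df` from Step 2 would be used instead
df_demo = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.9],
    "Salary": [39000, 43500, 54000, 61000, 81000],
})

n_rows = len(df_demo)                      # number of records
min_salary = df_demo["Salary"].min()       # lowest salary
max_salary = df_demo["Salary"].max()       # highest salary
missing_per_column = df_demo.isna().sum()  # count of missing values per column

print(f"rows: {n_rows}, min salary: {min_salary}, max salary: {max_salary}")
print(missing_per_column)
```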

In addition, I explored the relationship between years of experience and salary with a scatter plot, using this code.

plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(df['YearsExperience'], df['Salary'])
plt.title('Scatter Plot')
plt.xlabel('YearsExperience')
plt.ylabel('Salary')
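A scatter plot shows the relationship visually; a Pearson correlation coefficient puts a number on how linear it is. The sketch below is one way to compute it with pandas, again on hypothetical stand-in rows (df_demo) rather than the real table.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, `df` from Step 2 would be used instead
df_demo = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.9],
    "Salary": [39000, 43500, 54000, 61000, 81000],
})

# Series.corr uses the Pearson method by default; values near 1 indicate
# a strong positive linear relationship
r = df_demo["YearsExperience"].corr(df_demo["Salary"])
print(f"Pearson correlation: {r:.3f}")
```

A value close to 1 suggests that a straight line is a reasonable model for the data.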


I also visualized the distribution of Salary with a histogram by executing this code.

# Histogram
plt.subplot(1, 3, 3)
plt.hist(df['Salary'], bins=10, color='skyblue', edgecolor='black')
plt.title('Salary Histogram')
plt.xlabel('Salary')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


Step 4. Split the dataset into dependent and independent variables

  • Experience (X) is the independent variable
  • Salary (y) is the dependent variable

# Splitting the data into independent (X) and dependent (y) variables
X = df.iloc[:, :1]  
y = df.iloc[:, 1:]  
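Positional slicing with iloc depends on the column order. As an alternative sketch, the same split can be done by column name, which keeps working if the column order changes. The column names come from the dataset description above; the rows in df_demo are hypothetical. Note that this variant makes y a 1-D Series, whereas the iloc slice above keeps y as a one-column DataFrame. scikit-learn accepts both, but the shapes of coef_ and intercept_ differ between the two.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, `df` from Step 2 would be used instead
df_demo = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.9],
    "Salary": [39000, 43500, 54000, 61000, 81000],
})

# Double brackets keep X as a 2-D DataFrame, which scikit-learn expects for features
X = df_demo[["YearsExperience"]]
# Single brackets give a 1-D Series for the target
y = df_demo["Salary"]

print(X.shape, y.shape)
```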


Step 5. Split data into Training and Testing Sets

The data was split into training (80%) and testing (20%) sets using the train_test_split function. Setting random_state fixes the shuffle, so the same rows land in each set on every run:

# Splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
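To see what the split produces, the sketch below applies the same call to 30 synthetic rows (the same count as the salary table, but made-up values) and checks the resulting sizes and the effect of random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 30 synthetic rows, standing in for the 30 records of the salary table
X_demo = np.arange(30).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))
```

With 30 rows and test_size = 0.2, 24 rows go to training and 6 to testing.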


Step 6. Train the regression model

Next, I initialized an instance of the LinearRegression() model and fitted it to the training set by executing the code below.

# Initialization of the Regressor model instance & Fitting the Training Set 
regressor = LinearRegression()
regressor.fit(X_train, y_train)


Step 7. Prediction

With the model trained, we can now predict any y-value (Salary) from X (Experience) using the regressor.predict method. The code below generates predictions for both the training and test sets:

y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)  
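Predictions on the test set can also be scored numerically. The sketch below is one possible check and not part of the original workflow: it uses r2_score and mean_squared_error from sklearn.metrics on synthetic data generated for illustration (the slope, intercept, and noise level are assumptions, not values from the salary table).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: a linear trend plus random noise
rng = np.random.default_rng(0)
X_demo = np.linspace(1, 10, 30).reshape(-1, 1)
y_demo = 25000 + 9500 * X_demo.ravel() + rng.normal(0, 3000, size=30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_tr, y_tr)
pred_te = model.predict(X_te)

# R^2: share of variance in y explained by the model (1.0 is a perfect fit)
r2 = r2_score(y_te, pred_te)
# RMSE: typical prediction error, in the same units as y
rmse = np.sqrt(mean_squared_error(y_te, pred_te))

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.2f}")
```

In the notebook, the same two calls could be applied to y_test and y_pred_test.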


Step 8. Plot the training and test results

Now, it's time to validate the predictions through visual representation.

I begin with the training set: the observed points (X_train, y_train) are drawn as a scatter plot, and the regression line is drawn by plotting X_train against the predicted values regressor.predict(X_train).

# Prediction on training set
# Label each artist directly so the legend entries cannot be mismatched
plt.scatter(X_train, y_train, color = 'skyblue', label = 'Training data (X_train, y_train)')
plt.plot(X_train, y_pred_train, color = 'darkorange', label = 'Regression line (X_train, y_pred_train)')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()


Next, I plotted the test set (X_test, y_test) as a scatter plot and overlaid the same regression line, drawn from X_train and the predicted values regressor.predict(X_train).

# Prediction on the test set
# Label each artist directly so the legend entries cannot be mismatched
plt.scatter(X_test, y_test, color = 'skyblue', label = 'Test data (X_test, y_test)')
plt.plot(X_train, y_pred_train, color = 'darkorange', label = 'Regression line (X_train, y_pred_train)')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()


Observe that the same regression line runs through the data points in both plots: it was fitted on the training data and is then compared against the test data.

Additionally, it is possible to draw the line using the predicted values of y_test (regressor.predict(X_test)) instead. The regression line would be the same, however, because its slope and intercept were determined once, from the training data.

Recall the linear equation y = mx + b from the start of this article. We can extract the y-intercept (b) and the slope (m, the coefficient) directly from the regressor model.

Step 9. Display the Regressor coefficients and intercept

# Regressor coefficients and intercept
print(f'Coefficient: {regressor.coef_}')
print(f'Intercept: {regressor.intercept_}')
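As a sanity check, the coefficient and intercept can be plugged into y = mx + b by hand and compared with regressor.predict. The sketch below does this on a tiny hypothetical dataset with a 1-D target. With a one-column DataFrame as the target, as in Step 4, coef_ is a 2-D array and intercept_ a 1-D array, so the indexing would need to be adjusted accordingly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (not the article's dataset)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([40000.0, 45000.0, 52000.0, 58000.0, 65000.0])

model = LinearRegression().fit(X_demo, y_demo)

m = model.coef_[0]    # slope: change in salary per additional year
b = model.intercept_  # intercept: predicted salary at 0 years

# Predict the salary for 6 years of experience in two ways
years = 6.0
manual = m * years + b
via_predict = model.predict(np.array([[years]]))[0]

print(f"manual: {manual:.2f}, predict(): {via_predict:.2f}")
```

Both values agree, which confirms that predict is simply evaluating the fitted line.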


Conclusion

I trust that this post has provided a useful introduction to Simple Linear Regression and to constructing and evaluating a linear model. Put simply, linear regression is a supervised machine learning technique that models a linear relationship between two variables and uses it to predict one from the other.

