# Multiple Linear Regression using Python

In the previous article, we studied Logistic Regression. I believe that if we can relate a concept to ourselves or our daily lives, we have a much better chance of understanding it, so I will try to explain everything by relating it to humans.

## When should we use Multiple Linear Regression?

Multiple Linear Regression is an extended version of simple linear regression, the most important difference being the number of features it can handle: Multiple Linear Regression can handle more than one feature. So we should use Multiple Linear Regression when the dataset has a roughly linear relationship and more than one feature to process.

## How do we calculate Multiple Linear Regression?

The formula of linear regression doesn't fundamentally change; it generalizes from y = m*X + b to y = b0 + b1*x1 + b2*x2 + … + bn*xn, so only the number of coefficients increases.
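With two features, for instance, the fitted "line" becomes a plane. A minimal sketch of the generalized formula, using made-up coefficients purely for illustration:

```python
import numpy as np

# Hypothetical coefficients and data, just to illustrate the shape of the formula.
b0 = 1.0                      # intercept, the "b" of simple regression
b = np.array([2.0, 3.0])      # one slope per feature instead of a single m
x = np.array([10.0, 20.0])    # one sample with two features

y = b0 + b @ x                # y = b0 + b1*x1 + b2*x2
print(y)                      # 1 + 2*10 + 3*20 = 81.0
```

The dot product `b @ x` is exactly the sum b1*x1 + b2*x2 written in vector form, which is how the libraries below compute predictions internally.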

## Advantages/Features of Multiple Linear Regression

1. The chances of getting a better fit increase, as the generated model depends on more than one feature.
2. The residuals of the fitted model can be used to flag outliers and anomalies effectively.

## Disadvantages/Shortcomings of Multiple Linear Regression

1. The problem of overfitting is very prevalent here: since we can use all the features to generate the model, the model can start "memorizing" the training values.
2. Accuracy decreases as the linearity of the dataset decreases.

## Multiple Linear Regression

Multiple linear regression (MLR), or simply multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression (MLR) is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.

In essence, multiple regression is the extension of ordinary least-squares (OLS) regression that involves more than one explanatory variable.

Simple linear regression is a method that allows an analyst or statistician to make predictions about one variable based on the information that is known about another variable. Simple linear regression can only be used when one has two continuous variables: an independent variable and a dependent variable. The independent variable is the parameter that is used to calculate the dependent variable or outcome. A multiple regression model extends this to several explanatory variables.

The multiple regression model is based on the following assumptions:
1. Linearity: There is a linear relationship between the dependent variable and the independent variables.
2. Correlation: The independent variables are not too highly correlated with each other.
3. Independence: The yi observations are selected independently and randomly from the population.
4. Normal Distribution: Residuals should be normally distributed with a mean of 0 and constant variance σ².

When interpreting the results of multiple regression, each beta coefficient is valid while holding all other variables constant ("all else equal"). The output from a multiple regression can be displayed horizontally as an equation, or vertically in table form.
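As a quick sketch (using synthetic data, not the article's dataset), assumptions 2 and 4 can be checked numerically: the correlation between predictors should not be too high, and the residuals of an ordinary least-squares fit should average to zero:

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=100)

# Assumption 2: predictors should not be too highly correlated
# (a common rule of thumb flags |r| > 0.8).
r = df["x1"].corr(df["x2"])
print(abs(r) < 0.8)

# Assumption 4: after an OLS fit (with an intercept), residuals have mean ~0.
A = np.column_stack([np.ones(len(df)), df["x1"], df["x2"]])
beta, *_ = np.linalg.lstsq(A, df["y"].to_numpy(), rcond=None)
residuals = df["y"].to_numpy() - A @ beta
print(abs(residuals.mean()) < 1e-8)
```

Normality of the residuals can additionally be inspected with a histogram or a Q-Q plot.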

## Multiple Linear Regression Example

Let's work through an example. The code below defines a small stock market dataset inline, but feel free to use any dataset: you can import datasets directly from the sklearn dataset repository, and there are some very good datasets available on Kaggle and with Google Colab.

### 1. Using SkLearn

```python
from pandas import DataFrame
from sklearn import linear_model
import statsmodels.api as sm
```

In the above code, we import the required Python libraries.
```python
Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                'Month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
                }
```

In the above code, we define our data.
```python
df = DataFrame(Stock_Market, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

X = df[['Interest_Rate','Unemployment_Rate']]
Y = df['Stock_Index_Price']
```

In the above code, we pre-process the data.
```python
regr = linear_model.LinearRegression()
regr.fit(X, Y)
```

In the above code, we generate the model.
```python
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
```

In the above code, we print the parameters of the generated model.

The output that I am getting is:

```
Intercept: 1798.4039776258546
Coefficients: [ 345.54008701 -250.14657137]
```
```python
# prediction with sklearn
New_Interest_Rate = 2.75
New_Unemployment_Rate = 5.3
print('Predicted Stock Index Price: \n', regr.predict([[New_Interest_Rate, New_Unemployment_Rate]]))
```

In the above code, we predict the stock price corresponding to the given feature values.
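As a quick sanity check (an addition, not part of the original walkthrough), the prediction can be reproduced by hand from the printed intercept and coefficients, which is also how the "all else equal" betas are interpreted:

```python
# Plugging the printed parameters into y = b0 + b1*x1 + b2*x2 should
# reproduce sklearn's prediction for Interest_Rate=2.75, Unemployment_Rate=5.3.
intercept = 1798.4039776258546          # regr.intercept_ from above
coefs = [345.54008701, -250.14657137]   # regr.coef_ from above

pred = intercept + coefs[0] * 2.75 + coefs[1] * 5.3
print(round(pred, 2))                   # matches the ~1422.86 predicted above
```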

MLR_SkLearn.py

```python
from pandas import DataFrame
from sklearn import linear_model
import statsmodels.api as sm

Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                'Month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
                }

df = DataFrame(Stock_Market, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

# here we have 2 variables for multiple regression. If you just want to use one
# variable for simple linear regression, then use X = df[['Interest_Rate']] for
# example. Alternatively, you may add additional variables within the brackets.
X = df[['Interest_Rate','Unemployment_Rate']]
Y = df['Stock_Index_Price']

# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

# prediction with sklearn
New_Interest_Rate = 2.75
New_Unemployment_Rate = 5.3
print('Predicted Stock Index Price: \n', regr.predict([[New_Interest_Rate, New_Unemployment_Rate]]))
```
Output

```
Intercept: 1798.4039776258546
Coefficients: [ 345.54008701 -250.14657137]
Predicted Stock Index Price: [1422.86238865]
```
The statsmodels library can produce a detailed summary of the same regression. The `model` object below is fitted with ordinary least squares on the same X and Y:

```python
# with statsmodels
X2 = sm.add_constant(X)  # statsmodels does not add an intercept by default
model = sm.OLS(Y, X2).fit()

print_model = model.summary()
print(print_model)
```

The output is a regression summary table showing the coefficients along with their standard errors, p-values, and the R-squared of the fit.

### 2. Using NumPy

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

In the above code, we import the necessary libraries.
Next, we import and pre-process the data. You can download the "home.txt" file from the article.

```python
# column names are assumed from the dataset used in the article
my_data = pd.read_csv('home.txt', names=['size', 'bedrooms', 'price'])

# we need to normalize the features using mean normalization
my_data = (my_data - my_data.mean()) / my_data.std()

# setting the matrices
X = my_data.iloc[:, 0:2]
ones = np.ones([X.shape[0], 1])
X = np.concatenate((ones, X), axis=1)

y = my_data.iloc[:, 2:3].values  # .values converts it from pandas.core.frame.DataFrame to numpy.ndarray
theta = np.zeros([1, 3])
```

In the above code, we pre-process the data.
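To see what the mean normalization step does, here is a small sketch with toy values (the column names and numbers are made up):

```python
import pandas as pd

# Toy frame standing in for my_data.
data = pd.DataFrame({"size": [2104.0, 1600.0, 2400.0, 1416.0],
                     "bedrooms": [3.0, 3.0, 3.0, 2.0]})

normalized = (data - data.mean()) / data.std()

# Each column now has mean ~0 and (sample) standard deviation 1,
# which keeps all features on the same scale for gradient descent.
print(normalized.mean().abs().max() < 1e-12)
print((normalized.std() - 1).abs().max() < 1e-12)
```

Without this step, features with large raw scales (like house size) would dominate the gradient updates.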
```python
sns.heatmap(X)
```

Let us visualize the data using a heatmap.

```python
def computeCost(X, y, theta):
    tobesummed = np.power(((X @ theta.T) - y), 2)
    return np.sum(tobesummed) / (2 * len(X))

def gradientDescent(X, y, theta, iters, alpha):
    cost = np.zeros(iters)
    for i in range(iters):
        theta = theta - (alpha/len(X)) * np.sum(X * (X @ theta.T - y), axis=0)
        cost[i] = computeCost(X, y, theta)

    return theta, cost
```

In the above code, we define the methods for computing the cost and for gradient descent.
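As a cross-check on gradient descent (an addition, not from the original walkthrough), the same θ can be computed in closed form with the normal equation θ = (XᵀX)⁻¹Xᵀy. On a noiseless synthetic problem the closed form recovers the true coefficients exactly:

```python
import numpy as np

# Synthetic design matrix with a leading bias column of ones,
# mirroring the preprocessing above.
rng = np.random.default_rng(1)
features = rng.normal(size=(50, 2))
X = np.concatenate([np.ones((50, 1)), features], axis=1)

true_theta = np.array([0.5, 2.0, -1.0])
y = X @ true_theta                      # noiseless targets

# Normal equation: solve (X^T X) theta = X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_closed, true_theta))
```

Gradient descent approaches this same solution iteratively, which is preferable when the number of features is large and inverting XᵀX becomes expensive.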
```python
# set hyperparameters
alpha = 0.01
iters = 1000
```

In the above code, we set the values of the hyperparameters.
```python
g, cost = gradientDescent(X, y, theta, iters, alpha)
print(g)

finalCost = computeCost(X, y, g)
print(finalCost)
```

In the above code, we call the methods that fit the model.

The output that I am getting is:

```
[[-1.10868761e-16  8.78503652e-01 -4.69166570e-02]]
0.13070336960771892
```
```python
fig, ax = plt.subplots()
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
```

In the above code, we generate the graph of error vs. training epochs.

MLR_NumPy.py

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# column names are assumed from the dataset used in the article
my_data = pd.read_csv('home.txt', names=['size', 'bedrooms', 'price'])

# we need to normalize the features using mean normalization
my_data = (my_data - my_data.mean()) / my_data.std()

# setting the matrices
X = my_data.iloc[:, 0:2]
ones = np.ones([X.shape[0], 1])
X = np.concatenate((ones, X), axis=1)

y = my_data.iloc[:, 2:3].values  # .values converts it from pandas.core.frame.DataFrame to numpy.ndarray
theta = np.zeros([1, 3])

sns.heatmap(X)

# compute cost
def computeCost(X, y, theta):
    tobesummed = np.power(((X @ theta.T) - y), 2)
    return np.sum(tobesummed) / (2 * len(X))

# gradient descent
def gradientDescent(X, y, theta, iters, alpha):
    cost = np.zeros(iters)
    for i in range(iters):
        theta = theta - (alpha/len(X)) * np.sum(X * (X @ theta.T - y), axis=0)
        cost[i] = computeCost(X, y, theta)

    return theta, cost

# set hyperparameters
alpha = 0.01
iters = 1000

g, cost = gradientDescent(X, y, theta, iters, alpha)
print(g)

finalCost = computeCost(X, y, g)
print(finalCost)

fig, ax = plt.subplots()
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
```
### 3. Using TensorFlow
```python
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.contrib.learn as skflow
from sklearn.utils import shuffle
import numpy as np
import pandas as pd
import seaborn as sns
```

In the above code, we import the required libraries. Note that this example uses the TensorFlow 1.x API (`tf.placeholder`, `tf.Session`, and `tensorflow.contrib` were removed in TensorFlow 2.x).
```python
df = pd.read_csv('boston.csv', header=0)  # filename assumed; use the path of your downloaded dataset
print(df.describe())
```

In the above code, we import the dataset. You can download the Boston housing dataset from Kaggle.
```python
sns.heatmap(df)
```

Let us visualize the data.

```python
f, ax1 = plt.subplots()

y = df['MEDV']

for i in range(1, 8):
    number = 420 + i            # subplot positions 421..427 in a 4x2 grid
    ax1.locator_params(nbins=3)
    ax1 = plt.subplot(number)
    plt.title(list(df)[i])
    ax1.scatter(df[df.columns[i]], y)  # plot a scatter draw of the datapoints

plt.show()
```

Let us visualize each dataset column separately.

```python
X = tf.placeholder("float", name="X")  # create symbolic variables
Y = tf.placeholder("float", name="Y")
```

In the above code, we define the input placeholders for the features and the target.
```python
with tf.name_scope("Model"):

    w = tf.Variable(tf.random_normal([2], stddev=0.01), name="b0")  # create a shared variable
    b = tf.Variable(tf.random_normal([2], stddev=0.01), name="b1")  # create a shared variable

    def model(X, w, b):
        return tf.multiply(X, w) + b  # we just define the line as X*w + b

    y_model = model(X, w, b)
```

In the above code, we define the model.
```python
with tf.name_scope("CostFunction"):
    cost = tf.reduce_mean(tf.pow(Y - y_model, 2))  # use squared error for the cost function

train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost)  # learning rate assumed; tune for your data
```

In the above code, we define the cost function and the cost optimizer.
```python
sess = tf.Session()
init = tf.initialize_all_variables()
tf.train.write_graph(sess.graph, '/home/bonnin/linear2', 'graph.pbtxt')
cost_op = tf.summary.scalar("loss", cost)
merged = tf.summary.merge_all()
sess.run(init)
writer = tf.summary.FileWriter('/home/bonnin/linear2', sess.graph)
```

In the above code, we create the graph file, which can be used to visualize the model on TensorBoard.
```python
xvalues = df[[df.columns[2], df.columns[4]]].values.astype(float)
yvalues = df[df.columns[12]].values.astype(float)
b0temp = b.eval(session=sess)
b1temp = w.eval(session=sess)
```

In the above code, we make sure that the values are accessible to us even after the session ends.
```python
for a in range(1, 50):
    cost1 = 0.0
    for i, j in zip(xvalues, yvalues):
        sess.run(train_op, feed_dict={X: i, Y: j})
        cost1 += sess.run(cost, feed_dict={X: i, Y: j}) / 506.00
    xvalues, yvalues = shuffle(xvalues, yvalues)
    print("Cost over iterations", cost1)
    b0temp = b.eval(session=sess)
    b1temp = w.eval(session=sess)
```

In the above code, we train the model.
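The per-sample update that this loop performs can be sketched in plain NumPy (the data and learning rate here are made up, and the sketch uses the standard dot-product form of the model rather than the element-wise multiply above):

```python
import numpy as np

# Synthetic linear data: y = 1.5*x1 - 0.5*x2 + 2.
rng = np.random.default_rng(0)
xvalues = rng.normal(size=(20, 2))
yvalues = xvalues @ np.array([1.5, -0.5]) + 2.0

w = np.zeros(2)
b = 0.0
lr = 0.01  # learning rate is an assumption

for _ in range(300):                    # epochs
    for x, t in zip(xvalues, yvalues):  # one gradient step per sample
        err = (np.dot(x, w) + b) - t
        w -= lr * 2 * err * x           # d(err^2)/dw
        b -= lr * 2 * err               # d(err^2)/db

print(np.allclose(w, [1.5, -0.5], atol=1e-2), abs(b - 2.0) < 1e-2)
```

Each `sess.run(train_op, ...)` call in the TensorFlow loop applies exactly this kind of single-sample gradient step, with TensorFlow computing the derivatives automatically.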
```python
print("the final equation comes out to be", b0temp, "+", b1temp, "*X", "\n Cost :", cost1)
```

In the above code, we print the model parameters and the final cost.

The output that I am getting is:

```
the final equation comes out to be [4.7545404 7.7991614] + [1.0045488 7.807921 ] *X
Cost: 75.29625831573846
```

MLR_TensorFlow.py

```python
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.contrib.learn as skflow
from sklearn.utils import shuffle
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.read_csv('boston.csv', header=0)  # filename assumed; use the path of your downloaded dataset
print(df.describe())

f, ax1 = plt.subplots()
sns.heatmap(df)

y = df['MEDV']

for i in range(1, 8):
    number = 420 + i            # subplot positions 421..427 in a 4x2 grid
    ax1.locator_params(nbins=3)
    ax1 = plt.subplot(number)
    plt.title(list(df)[i])
    ax1.scatter(df[df.columns[i]], y)  # plot a scatter draw of the datapoints

plt.show()

X = tf.placeholder("float", name="X")  # create symbolic variables
Y = tf.placeholder("float", name="Y")

with tf.name_scope("Model"):

    w = tf.Variable(tf.random_normal([2], stddev=0.01), name="b0")  # create a shared variable
    b = tf.Variable(tf.random_normal([2], stddev=0.01), name="b1")  # create a shared variable

    def model(X, w, b):
        return tf.multiply(X, w) + b  # we just define the line as X*w + b

    y_model = model(X, w, b)

with tf.name_scope("CostFunction"):
    cost = tf.reduce_mean(tf.pow(Y - y_model, 2))  # use squared error for the cost function

train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost)  # learning rate assumed

sess = tf.Session()
init = tf.initialize_all_variables()
tf.train.write_graph(sess.graph, '/home/bonnin/linear2', 'graph.pbtxt')
cost_op = tf.summary.scalar("loss", cost)
merged = tf.summary.merge_all()
sess.run(init)
writer = tf.summary.FileWriter('/home/bonnin/linear2', sess.graph)

xvalues = df[[df.columns[2], df.columns[4]]].values.astype(float)
yvalues = df[df.columns[12]].values.astype(float)
b0temp = b.eval(session=sess)
b1temp = w.eval(session=sess)

for a in range(1, 50):
    cost1 = 0.0
    for i, j in zip(xvalues, yvalues):
        sess.run(train_op, feed_dict={X: i, Y: j})
        cost1 += sess.run(cost, feed_dict={X: i, Y: j}) / 506.00
    xvalues, yvalues = shuffle(xvalues, yvalues)
    print("Cost over iterations", cost1)
    b0temp = b.eval(session=sess)
    b1temp = w.eval(session=sess)

print("the final equation comes out to be", b0temp, "+", b1temp, "*X")
```

## Conclusion

In this article, we studied what regression is, when and why we should use multiple linear regression, how to calculate it, its advantages and disadvantages, and worked through multiple linear regression examples using sklearn, NumPy, and TensorFlow. I hope you were able to understand everything. If you have any doubts, please leave a comment with your query.

In the next article, we will learn about the Decision Tree.

Congratulations!!! You have climbed your next step in becoming a successful ML Engineer.

Next Article In this Series >> Decision Tree

C# Corner
MVP Program Director