Linear Regression In R

Introduction

I am writing this article to demonstrate how to make predictions with Linear Regression analysis in R.

Linear Regression is a statistical technique, widely used in machine learning, that predicts the value of a dependent variable (also known as the response variable) from an independent variable (also known as the predictor variable). It fits a straight line that minimizes the sum of squared differences between the actual and predicted values of the dependent variable. Businesses commonly use it to evaluate trends and to make estimates or forecasts. For demonstration purposes, I will show how to predict the fare to be paid based on the distance traveled. The fare and distance values are stored in a CSV file, which serves as the input to our program.

The equation of linear regression can be expressed as Y = a + bX, where X is the independent variable and Y is the dependent variable. The term b represents the slope of the line and a represents the intercept, which is the value of Y when X is zero.
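To make the equation concrete, here is a minimal sketch that computes a and b by hand from the least-squares formulas, using the sample distance and fare values from this article. The variable names are my own choice for illustration.

```r
# Sample values used throughout this article
distance <- 1:10
fare <- c(5, 27, 38, 42, 60, 69, 77, 87, 99, 100)

# Slope b: sum of cross-deviations divided by the sum of squared deviations of X
b <- sum((distance - mean(distance)) * (fare - mean(fare))) /
  sum((distance - mean(distance))^2)

# Intercept a: the fitted line passes through the point (mean(X), mean(Y))
a <- mean(fare) - b * mean(distance)

print(c(intercept = a, slope = b))
```

For this data the intercept comes out at roughly 3.13 and the slope at roughly 10.41, so each extra unit of distance adds about 10.41 to the predicted fare.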

Using the code

Assume that we have a CSV file as follows,

distance,fare
1,5
2,27
3,38
4,42
5,60
6,69
7,77
8,87
9,99
10,100

The first column of the above CSV file is the distance and the second column is the actual fare.

Now we want to predict the fare using Linear Regression.

The following code can be used to read the data from the CSV file and extract it into the distance and fare variables.

# Read the CSV file into a data frame
data <- read.csv("faredata.csv")
# Extract the two columns as numeric vectors
distance <- data$distance
fare <- data$fare
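Before fitting a model, it can help to confirm that the file was parsed as expected. The following is a minimal sketch, assuming the same column names as the sample file. To keep it runnable without faredata.csv, it reads an inline copy of the data through the text argument of read.csv(); the csv_text variable is hypothetical and only stands in for the file.

```r
# Hypothetical inline copy of the CSV, so the sketch runs without faredata.csv
csv_text <- "distance,fare
1,5
2,27
3,38
4,42
5,60
6,69
7,77
8,87
9,99
10,100"

data <- read.csv(text = csv_text)

# Show the column names, types and first few values
str(data)

# Stop early if a column is missing or was not read as numbers
stopifnot(all(c("distance", "fare") %in% names(data)))
stopifnot(is.numeric(data$distance), is.numeric(data$fare))
```

If either check fails, the usual causes are a misspelled header or a stray non-numeric value in the file.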

The lm() function can now be used to create a simple regression model as follows,

model <- lm(fare ~ distance)

In the formula fare ~ distance, fare is the variable to be predicted and distance is the variable on which the prediction is based.
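As a variation, lm() can also look up the variables directly in a data frame through its data argument. The sketch below does that, assuming an inline data frame that mirrors the sample CSV. It then computes the differences between the actual and fitted fares, which are the discrepancies the regression line minimizes.

```r
# Inline copy of the sample data (assumed to match the CSV file)
data <- data.frame(
  distance = 1:10,
  fare = c(5, 27, 38, 42, 60, 69, 77, 87, 99, 100)
)

# Same model as before, with the variables taken from the data frame
model <- lm(fare ~ distance, data = data)

# Residuals: actual fare minus the fare fitted by the model
res <- data$fare - fitted(model)
print(round(res, 2))

# Sum of squared residuals, the quantity that least squares minimizes
print(sum(res^2))
```

Passing data explicitly keeps the model tied to one data frame instead of to loose variables in the workspace, which tends to be easier to maintain in longer scripts.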

Next, we wrap the distances in a data frame using the data.frame() function, because predict() expects its new data in that form, as follows,

distances <- data.frame(distance)

The predicted fares can be obtained using the predict() function as follows,

pred_fare <- predict(model,distances)
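The same predict() call also works for distances that are not in the original data. Here is a minimal sketch, assuming two hypothetical trip lengths of 11 and 12. Note that the column in the new data frame must be named distance, matching the name used in the model formula.

```r
# Inline copy of the sample data (assumed to match the CSV file)
data <- data.frame(
  distance = 1:10,
  fare = c(5, 27, 38, 42, 60, 69, 77, 87, 99, 100)
)
model <- lm(fare ~ distance, data = data)

# Hypothetical trip lengths that do not appear in the sample
new_trips <- data.frame(distance = c(11, 12))

# Predicted fares for the new distances
new_fares <- predict(model, new_trips)
print(new_fares)
```

For this data the predictions are roughly 117.67 and 128.08. Both distances lie outside the observed range of 1 to 10, so they rely on the straight-line trend continuing to hold.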

The coef() function can be used to extract model coefficients from the objects returned by modeling functions.

We can use it to find the intercept as follows,

print("Intercept")
print(coef(model)["(Intercept)"])

The slope of the regression line can be calculated by dividing the covariance of distance and fare by the variance of distance. This gives the same value as coef(model)["distance"]. It can be computed as follows,

print("Slope")
print(cov(distance,fare) / var(distance))
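As a quick check, the slope reported by coef() can be compared with the covariance-over-variance calculation. This is a minimal sketch on an inline copy of the sample data; the names slope_from_model and slope_by_hand are my own.

```r
# Inline copy of the sample data (assumed to match the CSV file)
data <- data.frame(
  distance = 1:10,
  fare = c(5, 27, 38, 42, 60, 69, 77, 87, 99, 100)
)
model <- lm(fare ~ distance, data = data)

# Slope as stored in the fitted model (unname() drops the "distance" label)
slope_from_model <- unname(coef(model)["distance"])

# Slope computed directly from covariance and variance
slope_by_hand <- cov(data$distance, data$fare) / var(data$distance)

# all.equal() compares the two values with a small numerical tolerance
print(all.equal(slope_from_model, slope_by_hand))
```

I use all.equal() rather than == because the two values are floating-point results obtained by different routes and may differ in the last few digits.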

Finally, we can plot the regression graph as follows,

png(file="linreg.png")
plot(distance,fare,col="red",main="Distance and Fare Linear Regression",cex=2,pch=16,xlab="Distance",ylab="Fare")
abline(model)
dev.off()

The plot() function plots the distances and fares on the X and Y axes respectively. The col parameter specifies the color of the data points, while main, xlab and ylab set the chart title and the axis labels. The abline() function draws the regression line on the plot using the regression model. The cex parameter specifies the scaling factor for the data points. The pch parameter specifies the plotting character to be used, 16 being a solid circle. The chart is saved to the file named in the png() function once dev.off() closes the graphics device.

Conclusion

I hope readers of this article find it useful and that it may help them to explore more concepts of machine learning using R programming.