The Basic Concepts Of Data Science

Yusuf Karatoprak

Introduction

The basic concepts of data science can be separated into two important parts. Some readers may object that an introduction should also cover supervised learning, unsupervised learning, and decision tree algorithms. My intention, however, is not to explain every concept of data science. This article is for readers who want to get started as data scientists.

The basic concepts of Data Science can be separated into two parts.

Regression
Classification

Why do we have to learn these two concepts? Regression lets us model the relationship between two variables by fitting a linear equation to the data. Classification assigns each data point to one of a set of categories; the model that performs this assignment is called a classifier.
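To make the difference concrete, here is a minimal sketch of a classifier. It is not part of the original example: the heights, the labels, and the nearest-mean rule are all assumptions chosen for illustration. Where regression predicts a number, the classifier predicts a category.

```python
import numpy as np

# Hypothetical training data: heights in cm with a made-up label per sample.
heights = np.array([150.0, 155.0, 160.0, 180.0, 185.0, 190.0])
labels = np.array(["short", "short", "short", "tall", "tall", "tall"])

# "Training": remember the mean height of each class.
class_names = np.unique(labels)
class_means = np.array([heights[labels == name].mean() for name in class_names])


def classify(height):
    """Return the class whose mean height is closest to the given height."""
    nearest = np.argmin(np.abs(class_means - height))
    return class_names[nearest]


print(classify(158.0))
print(classify(183.0))
```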

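A linear regression line has an equation of the form Y = b*X + a, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0). The code in this article writes the same line as y = m*x + b, with m as the slope and b as the intercept.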
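As a small worked example, the slope and intercept of the least-squares line can be computed directly with NumPy. The data points are made up for this sketch and lie exactly on Y = 2*X + 1, so the fit should recover b = 2 and a = 1 up to rounding error.

```python
import numpy as np

# Made-up points that lie exactly on the line Y = 2*X + 1.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

# Least-squares estimates for Y = b*X + a:
#   b = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))**2)
#   a = mean(Y) - b * mean(X)
x_centered = X - X.mean()
b = np.sum(x_centered * (Y - Y.mean())) / np.sum(x_centered ** 2)
a = Y.mean() - b * X.mean()

print("slope b =", b)
print("intercept a =", a)
```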

How can I do it with NumPy?

As mentioned before, linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.

(http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm)

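import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read the data set from C:\pybook\Data\LinearRegressionDataSet.csv
# (raw string, so the backslashes are not treated as escape sequences)
data = pd.read_csv(r"C:\pybook\Data\LinearRegressionDataSet.csv")
print(data)

# Convert the two columns to NumPy arrays
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
x = data["X"].to_numpy()
y = data["Y"].to_numpy()
print(x)
print(y)

# Fit a first-degree polynomial (a straight line): y = m * x + b
m, b = np.polyfit(x, y, 1)

# X values 0..149 used to draw the fitted line
a = np.arange(150)
plt.scatter(x, y)
plt.plot(a, m * a + b)

# Predict Y for an X value entered by the user
z = int(input("X value ?"))
prediction = m * z + b
print(prediction)
print("y=", m, "x+", b)
plt.scatter(z, prediction, c="red", marker=">")
plt.show()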

What is NumPy?

NumPy is a scientific computing library for Python. It provides array types and numerical routines, so you do not have to implement them from scratch.
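A brief illustration of what that means in practice: NumPy applies arithmetic to a whole array at once, so a line can be evaluated at many points without writing a loop. The slope and intercept values here are arbitrary assumptions.

```python
import numpy as np

m, b = 3.0, 2.0          # arbitrary slope and intercept for illustration
a = np.arange(5)         # the integers 0, 1, 2, 3, 4

# One expression evaluates the line at every element of a.
line = m * a + b
print(line)
```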

What is pandas?

pandas is a software library written in the Python programming language for data manipulation and analysis. In this article, it is used to read data from a CSV file.
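A minimal sketch of reading CSV data with pandas. To keep it self-contained, it reads from an in-memory string instead of a file on disk; the three rows come from the sample data shown in this article, and the column names X and Y follow the code above.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file (three rows of the article's sample data).
csv_text = "X,Y\n108,392.5\n19,46.2\n13,15.7\n"
data = pd.read_csv(io.StringIO(csv_text))

# Select each column by name and convert it to a NumPy array.
x = data["X"].to_numpy()
y = data["Y"].to_numpy()

print(data.shape)
print(x)
print(y)
```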

What is matplotlib?

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+.
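The object-oriented API mentioned above can be sketched as follows. The data points are assumptions for illustration. The Agg backend is selected so that the script also runs on machines without a display, and the figure is written to an in-memory buffer rather than shown in a window.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up points
y = np.array([2.1, 3.9, 6.2, 7.8])

# Create the figure and axes objects explicitly, then draw on the axes.
fig, ax = plt.subplots()
ax.scatter(x, y, label="data")
ax.plot(x, 2.0 * x, c="red", label="y = 2x")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.legend()

# Render the figure as PNG bytes in memory.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```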

Let’s walk through the code above.

First, we read the data from the CSV file with pandas. The data looks like this:

x	y
108	392.5
19	46.2
13	15.7
124	422.2
40	119.4
57	170.9
23	56.9
14	77.5
45	214
10	65.3
5	20.9
48	248.1

Next, we select the X and Y columns with x = data["X"] and y = data["Y"] and convert them to NumPy arrays. Fitting is the next important step: np.polyfit(x, y, 1) fits a straight line to the X and Y values and returns its slope and intercept. np.arange(150) produces the 150 integers from 0 to 149, which serve as the X values for drawing the fitted line. Running the code then produces a scatter plot of the data with the fitted line and the predicted point marked in red.
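The fitting step can be tried without the CSV file. This sketch uses the twelve rows listed above as its data and checks one property of a least-squares line with an intercept: the residuals average to zero. If the file contains more rows than are listed, the m and b printed here will differ from those of the full data set.

```python
import numpy as np

# The twelve (x, y) rows listed in the table above.
x = np.array([108, 19, 13, 124, 40, 57, 23, 14, 45, 10, 5, 48], dtype=float)
y = np.array([392.5, 46.2, 15.7, 422.2, 119.4, 170.9,
              56.9, 77.5, 214.0, 65.3, 20.9, 248.1])

# Degree 1 asks polyfit for a straight line; it returns [slope, intercept].
m, b = np.polyfit(x, y, 1)

# np.arange(150) yields the integers 0..149, used as X values for the line.
a = np.arange(150)
line = m * a + b

# Differences between the observed y values and the fitted line.
residuals = y - (m * x + b)
print("y =", m, "* x +", b)
print("mean residual:", residuals.mean())
```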

Let’s do it again with scikit-learn

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression as lr
import matplotlib.pyplot as plt

data = pd.read_csv(r"C:\pybook\Data\LinearRegressionDataSet.csv")

# scikit-learn expects 2-D arrays of shape (n_samples, n_features);
# reshape(-1, 1) infers the row count (63 for this data set)
x = data["X"].to_numpy().reshape(-1, 1)
y = data["Y"].to_numpy().reshape(-1, 1)

linearregression = lr()
linearregression.fit(x, y)
linearregression.predict(x)

# y = m * x + b
# m = coef_ (slope), b = intercept_
m = linearregression.coef_[0, 0]
b = linearregression.intercept_[0]

a = np.arange(150)
plt.scatter(x, y)
plt.scatter(a, m * a + b, c="red")
plt.show()

The third line imports the LinearRegression class from the sklearn library. Reading the CSV file and reshaping the X and Y values are routine steps. We have to focus on the fit and predict methods:

linearregression.fit(x,y)
linearregression.predict(x)
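A minimal sketch of these two calls on made-up data (the points lie exactly on y = 2x + 1, an assumption for illustration). Note that predict expects a 2-D array even for a single value, hence the double brackets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)  # shape (5, 1)
y = 2.0 * x.ravel() + 1.0                                # shape (5,)

model = LinearRegression()
model.fit(x, y)                    # learn slope and intercept from the data

# Predict y for a new x value; the input must be 2-D: (n_samples, n_features).
prediction = model.predict(np.array([[10.0]]))
print(prediction)
```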

coef_ is an attribute, not a function: after fitting, it holds the slope. intercept_ is likewise an attribute and holds the intercept b.
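Both attributes can be inspected after fitting. The sketch below uses made-up data on the line y = 2x + 1 and shows that their shapes depend on whether y is passed as a 1-D or a 2-D array, which matters when you use them in arithmetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(5, dtype=float).reshape(-1, 1)   # made-up inputs, shape (5, 1)
y_1d = 2.0 * x.ravel() + 1.0                   # targets, shape (5,)
y_2d = y_1d.reshape(-1, 1)                     # same targets, shape (5, 1)

model_1d = LinearRegression().fit(x, y_1d)
model_2d = LinearRegression().fit(x, y_2d)

# With 1-D y: coef_ has shape (1,) and intercept_ is a scalar.
print(model_1d.coef_.shape, np.ndim(model_1d.intercept_))

# With 2-D y: coef_ has shape (1, 1) and intercept_ has shape (1,).
print(model_2d.coef_.shape, model_2d.intercept_.shape)

# Extract plain numbers from the 2-D case.
slope = model_2d.coef_[0, 0]
intercept = model_2d.intercept_[0]
```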

In conclusion, linear regression and classification are basic concepts for people who want to become data scientists. We used the equation y = m*x + b here because it is a simple way to understand regression, although there are other methods for linear regression. Fitting the CSV columns and plotting the predictions is a good way to build visual intuition for anyone who wants to become a data scientist.
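One of those other methods, sketched here as an illustration rather than as part of the original example, is to solve the least-squares problem directly with np.linalg.lstsq. The data is made up and lies on y = 2x + 1.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # made-up data on y = 2x + 1
y = 2.0 * x + 1.0

# Design matrix with a column of x values and a column of ones,
# so that design @ [m, b] equals m * x + b.
design = np.column_stack([x, np.ones_like(x)])

# Solve for the [m, b] that minimizes the squared error.
solution, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
m, b = solution
print("y =", m, "* x +", b)
```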

Summary

In this article, we learned about the basic concepts of data science.