Data Science- A Beginner's Tutorial

Rohit Gupta

Introduction

In the previous article, we studied Deep Learning. I believe that if we can relate a concept to ourselves or our lives, there is a greater chance of understanding it. So I will try to explain everything by relating it to humans.

What is Data?

Data are individual units of information. A datum describes a single quality or quantity of some object or phenomenon. In analytical processes, data are represented by variables. Although the terms "data", "information" and "knowledge" are often used interchangeably, each of these terms has a distinct meaning.

Data is measured, collected and reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools. Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing.

Everything and anything can be called data. Data is an abstract term that covers every kind of recorded information. The amount of data generated on a daily basis can be estimated from the following figures:

Google: 24 petabytes of data processed per day
Facebook: 10 million photos uploaded every hour
YouTube: 1 hour of video uploaded every second
Twitter: 400 million tweets per day
Astronomy: satellite data is in the hundreds of petabytes

Note: 1 petabyte is 1 million gigabytes, or 1 billion megabytes, or 1 quadrillion bytes (1000 trillion bytes, or 1e15 bytes)
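
The unit arithmetic behind a petabyte can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) prefixes, where each step up is a factor of 1000; the constant and function names are made up for this illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
MEGABYTE = 10**6
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15


def petabytes_to(unit_in_bytes, petabytes=1):
    """Convert a number of petabytes into the given unit."""
    return petabytes * PETABYTE // unit_in_bytes


print(petabytes_to(MEGABYTE))      # megabytes in 1 PB
print(petabytes_to(GIGABYTE))      # gigabytes in 1 PB
print(petabytes_to(GIGABYTE, 24))  # gigabytes in 24 PB
```

If binary prefixes (factors of 1024) were used instead, the numbers would be slightly different, so it helps to state which convention is meant.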

So from the above values, you can easily see that, taken together, we produce petabytes of data every single day. Even the figures written above are themselves a type of data.

What is Data Science?

Data Science is a detailed study of the flow of information from the colossal amounts of data present in an organization’s repository. It involves obtaining meaningful insights from raw and unstructured data which is processed through analytical, programming, and business skills.

Data science is a "concept to unify statistics, data analysis, machine learning, and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science.

Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.

What is Data Munging?

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.

Raw data can be unstructured and messy, with information coming from disparate data sources, mismatched or missing records, and a slew of other tricky issues. Data munging is a term to describe the data wrangling to bring together data into cohesive views, as well as the janitorial work of cleaning up data so that it is polished and ready for downstream usage. This requires good pattern-recognition sense and clever hacking skills to merge and transform masses of database-level information. If not properly done, dirty data can obfuscate the 'truth' hidden in the data set and completely mislead results. Thus, any data scientist must be skillful and nimble at data munging in order to have accurate, usable data before applying more sophisticated analytical tactics.

The downstream uses of munged data may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses.
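
To make the idea of munging concrete, here is a minimal sketch using pandas. The records, column names, and cleaning rules are all made up for this illustration; real data usually needs rules chosen after inspecting it.

```python
import pandas as pd

# Hypothetical raw records: inconsistent casing, stray spaces,
# duplicated rows, and numbers stored as text.
raw = pd.DataFrame({
    "customer": [" Alice", "BOB ", "bob", " Alice", "Carol"],
    "city": ["Delhi", "mumbai", "Mumbai", "Delhi", None],
    "amount": ["120", "80", "80", "120", "n/a"],
})

clean = raw.copy()
# Normalize whitespace and casing so equal names compare equal.
clean["customer"] = clean["customer"].str.strip().str.title()
clean["city"] = clean["city"].str.title()
# Text that cannot be parsed as a number becomes NaN instead of raising.
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
# After normalization, repeated records can be detected and dropped.
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Note that the duplicates only become visible after the names are normalized, which is why the order of the steps matters.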

Difference between Business Intelligence and Data Science

Features	Business Intelligence (BI)	Data Science
Data Sources	Structured (Usually SQL, often Data Warehouse)	Both Structured and Unstructured (logs, cloud data, SQL, NoSQL, text)
Approach	Statistics and Visualization	Statistics, Machine Learning, Graph Analysis, Natural Language Processing (NLP)
Focus	Past and Present	Present and Future
Tools	Pentaho, Microsoft BI, QlikView, R	RapidMiner, BigML, Weka, R
Coining of Term	Business Analytics has been used since the late 19th Century when it was put in place by Frederick Winslow Taylor	DJ Patil and Jeff Hammerbacher who were working in LinkedIn and Facebook respectively, first coined the term in 2008
Concept	Use of statistical concepts to extract insights from business data	The interdisciplinary field of data inference, algorithm building, and systems to gain insights from data
Top 5 Industries	1. Financial 2. Technology 3. Mix of fields 4. CRM/Marketing 5. Retail	1. Technology 2. Financial 3. Mix of fields 4. Internet-based 5. Academic
Coding	It does not involve much coding. More statistics oriented	Coding is used widely. The field is a combination of traditional analytics practices with sound knowledge of computer science
Language Recommendations	C/C++/C#, JAVA, MATLAB, PYTHON, R, SAS, SCALA, SQL	C/C++/C#, HASKELL, JAVA, MATLAB, PYTHON, R, SAS, SCALA, SQL, JULIA, STATA
Statistics	The whole analysis is based on statistical concepts	Statistics is used at the end of analysis following algorithm building and coding
Data Needed	Predominantly structured data	Both structured and unstructured data

The following image compares the popularity of Business Intelligence (blue) and Data Science (red). Source: Google Trends

Difference between Machine Learning and Data Science

Features	Machine Learning	Data Science
Scope	Accurately classify or predict the outcome for a new data point by learning patterns from historical data using mathematical models	Create insights from data while dealing with all real-world complexities. This includes tasks like understanding requirements, extracting data, etc.
Input Data	Input data for ML will be transformed specifically for algorithms used. Feature scaling, word embedding or adding polynomial features are some examples	Most of the input data is generated as human consumable data which is to be read or analyzed by humans like tabular data or images
System Complexity	1. Major complexity is with algorithms and mathematical concepts behind that 2. Ensemble models will have more than one ML model and each will have a weighted contribution on the final output	1. Components for handling unstructured raw data coming 2. A lot of moving components typically scheduled by an orchestration layer to synchronize independent jobs
Preferred Skill Set	1. Strong Maths understanding 2. Python/R programming 3. Data wrangling with SQL 4. Model-specific visualization	1. Domain expertise 2. ETL and data profiling 3. Strong SQL 4. NoSQL systems 5. Standard reporting/visualization
Hardware Specification	1. GPUs are preferred for intensive vector operations 2. More powerful versions like TPUs are on the way	1. Horizontally scalable systems preferred to handle massive data 2. High RAM and SSDs used to overcome I/O bottleneck
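
The input transformations named in the table can be sketched in a few lines. The following is a minimal illustration of feature scaling (standardization) and of adding a polynomial feature, using only the Python standard library; the function names and the sample values are made up for this example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def add_polynomial_feature(values, degree=2):
    """Pair each value with its power, e.g. x -> (x, x**2) for degree 2."""
    return [(v, v ** degree) for v in values]


ages = [22.0, 38.0, 26.0, 35.0]  # made-up sample values
scaled = standardize(ages)

print(scaled)
print(add_polynomial_feature([2, 3]))
```

Libraries such as scikit-learn provide ready-made versions of these transformations; the hand-written functions here are only meant to show what the transformations do.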

The following image compares the popularity of Machine Learning (blue) and Data Science (red). Source: Google Trends

Difference between Data Science and Artificial Intelligence

Factors	Data Science	Artificial Intelligence
Scope	Involves various underlying data operations	Limited to the implementation of ML algorithms
Type of Data	Structured and Unstructured	Standardized in the form of embeddings and vectors
Tools	R, Python, SAS, SPSS, TensorFlow, Keras, Scikit-Learn	Scikit-Learn, Caffe, PyTorch, TensorFlow, Shogun, Mahout
Applications	Advertising, Marketing, Internet Search Engines	Manufacturing, Automation, Robotics, Transport, Healthcare

The following image compares the popularity of Data Science (blue) and Artificial Intelligence (red). Source: Google Trends

Advantages/Features of Data Science

1. It detects and corrects errors in data sets with the help of data cleansing. This improves the quality of the data and consequently benefits both customers and institutions such as banks, insurance, and finance companies.

2. It removes duplicate information from data sets and hence saves a large amount of memory space. This decreases the cost to the company.

3. It helps in displaying relevant advertisements on online shopping websites based on the historic data and purchase behavior of the users. Machine learning algorithms are applied for this purpose. This helps in increasing the revenue and productivity of the companies.

4. It reduces banking risks by identifying probable fraudulent customers based on historic data analysis. This helps institutions decide whether or not to issue loans or credit cards to applicants.

5. It is used by security agencies for surveillance and monitoring purposes based on information collected by a huge number of sensors. This helps in preventing wrongdoing and/or calamities.

Disadvantages/Shortcomings of Data Science

1. It may breach the privacy of customers, as information such as their purchases, online transactions, and subscriptions is visible to their parent companies. The companies may exchange these useful customer databases for their mutual benefit.

2. The cost of data analytics tools varies based on the applications and features supported. Moreover, some data analytics tools are complex to use and require training. This increases the cost to a company willing to adopt data analytics tools or software.

3. The information obtained using data analytics can also be misused against a group of people of a certain country, community, or caste.

4. It is very difficult to select the right data analytics tools, because this requires knowledge of the tools and of their accuracy in analyzing the relevant data for a given application. This increases the time and cost to the company.

What are the skills required to be a Data Scientist?

A Data Scientist is a professional with the capabilities to gather large amounts of data to analyze and synthesize the information into actionable plans for companies and other organizations. Following are some requirements that need to be met so as to become a good Data Scientist:

1. Mathematics Expertise

At the heart of mining data insights and building data products is the ability to view the data through a quantitative lens. There are textures, dimensions, and correlations in data that can be expressed mathematically. Finding solutions utilizing data becomes a brain teaser of heuristics and quantitative technique. Solutions to many business problems involve building analytic models grounded in hard math, where being able to understand the underlying mechanics of those models is key to success in building them. Also, a common misconception is that data science is all about statistics. While statistics is important, it is not the only type of math utilized.

First, there are two branches of statistics – classical statistics and Bayesian statistics. When most people refer to stats they are generally referring to classical stats, but knowledge of both types is helpful. Furthermore, many inferential techniques and machine learning algorithms lean on the knowledge of linear algebra.

For example, a popular method to discover hidden characteristics in a data set is SVD, which is grounded in matrix math and has much less to do with classical stats. Overall, it is helpful for data scientists to have breadth and depth in their knowledge of mathematics.
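
As a small illustration of how SVD exposes hidden structure, the sketch below factors a tiny made-up matrix with NumPy's `np.linalg.svd`. The matrix is constructed so that a single hidden factor explains all of it; the variable names and the threshold `1e-10` are choices made for this example only.

```python
import numpy as np

# Hypothetical ratings matrix: rows are users, columns are items.
# Every row is a multiple of [1, 2, 3], so one hidden factor
# explains the whole matrix.
ratings = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
])

# Factor the matrix as U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Count the singular values that are not numerically zero.
rank = int(np.sum(s > 1e-10))

# Rebuild the matrix from the largest singular value only.
approx = s[0] * np.outer(U[:, 0], Vt[0, :])

print("singular values:", np.round(s, 6))
print("effective rank:", rank)
print("rank-1 reconstruction matches:", np.allclose(approx, ratings))
```

In real data the smaller singular values are rarely exactly zero, so choosing how many to keep is a judgment call rather than a fixed rule.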

2. Technology and Hacking

First, let's clarify that we are not talking about hacking as in breaking into computers. We're referring to the tech programmer subculture meaning of hacking – i.e., creativity and ingenuity in using technical skills to build things and find clever solutions to problems.

Why is hacking ability important? Because data scientists utilize technology in order to wrangle enormous data sets and work with complex algorithms, and it requires tools far more sophisticated than Excel. Data scientists need to be able to code — prototype quick solutions, as well as integrate with complex data systems. Core languages associated with data science include SQL, Python, R, and SAS. On the periphery are Java, Scala, Julia, and others. But it is not just knowing language fundamentals. A hacker is a technical ninja, able to creatively navigate their way through technical challenges in order to make their code work.

Along these lines, a data science hacker is a solid algorithmic thinker, having the ability to break down messy problems and recompose them in ways that are solvable. This is critical because data scientists operate within a lot of algorithmic complexity. They need to have a strong mental comprehension of high-dimensional data and tricky data control flows, and full clarity on how all the pieces come together to form a cohesive solution.

3. Strong Business Acumen

It is important for a data scientist to be a tactical business consultant. Working so closely with data, data scientists are positioned to learn from data in ways no one else can. That creates the responsibility to translate observations to shared knowledge and contribute to strategy on how to solve core business problems. This means a core competency of data science is using data to cogently tell a story. No data-puking – rather, present a cohesive narrative of problem and solution, using data insights as supporting pillars, that lead to guidance.

Having this business acumen is just as important as having acumen for tech and algorithms. There needs to be clear alignment between data science projects and business goals. Ultimately, the value doesn't come from data, math, and tech itself. It comes from leveraging all of the above to build valuable capabilities and have a strong business influence.

In short, the skills required to be a data scientist are:

Statistics
Programming skills
Critical thinking
Knowledge of AI, ML, and Deep Learning
Comfort with math
Good Knowledge of Python, R, SAS, and Scala
Communication
Data Wrangling
Data Visualization
Ability to understand analytical functions
Experience with SQL
Ability to work with unstructured data

Data Science Components

Now we will discuss some of the key components of Data Science, which are:

1. Data (and Its Various Types)

The raw dataset is the foundation of Data Science, and it can be of various types like structured data (mostly in a tabular form) and unstructured data (images, videos, emails, PDF files, etc.)

2. Programming (Python and R)

Data management and analysis are done by computer programming. In Data Science, two programming languages are most popular: Python and R.

3. Statistics and Probability

Data is manipulated to extract information out of it. The mathematical foundation of Data Science is statistics and probability. Without having a clear knowledge of statistics and probability, there is a high possibility of misinterpreting data and reaching incorrect conclusions. That’s the reason why statistics and probability play a crucial role in Data Science.
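
A quick example shows how easily a summary statistic can mislead. This sketch uses Python's built-in `statistics` module on made-up ticket fares in which a single extreme value dominates the mean.

```python
from statistics import mean, median

# Made-up ticket fares: most are modest, one is extremely high.
fares = [7.25, 8.05, 7.90, 13.00, 512.33]

average = mean(fares)
middle = median(fares)
below_average = sum(1 for fare in fares if fare < average)

print("mean fare:  ", round(average, 2))
print("median fare:", middle)
print("fares below the mean:", below_average, "of", len(fares))
```

Reporting only the mean here would suggest that a typical fare is far higher than what most passengers in this sample paid, which is exactly the kind of misreading that a grounding in statistics helps to avoid.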

4. Machine Learning

As a Data Scientist, every day, you will be using Machine Learning algorithms such as regression and classification methods. It is very important for a Data Scientist to know Machine learning as a part of their job so that they can predict valuable insights from available data.

5. Big Data

In the current world, raw data is compared with crude oil: just as we extract refined oil from crude oil, by applying Data Science we can extract different kinds of information from raw data. Tools used by Data Scientists to process big data include Java, Hadoop, R, Pig, Apache Spark, etc.

What is Data Science Process?

The Data Science Process is the lifecycle of Data Science. It consists of a chronological set of steps, divided into the following 6 phases:

Phase 1: Discovery

The first phase in the Data Science life cycle is data discovery. It includes ways to discover data from various sources, which could be in an unstructured format like videos or images, in a structured format like text files, or in relational database systems. Organizations are also looking into customer social media data, and the like, to understand the customer mindset better.

As an example, suppose that our objective as Data Scientists in this stage is to boost the sales of Mr. X’s retail store. Here, factors affecting the sales could be:

Store location
Staff
Working hours
Promotions
Product placement
Product pricing
Competitors’ location and promotions, and so on

Keeping these factors in mind, we would develop clarity on the data and procure this data for our analysis. At the end of this stage, we would collect all data that pertain to the elements listed above.

Phase 2: Data Preparation

Once the data discovery phase is completed, the next stage is data preparation. It includes converting disparate data into a common format in order to work with it seamlessly. This process involves collecting clean data subsets and inserting suitable defaults, and it can also involve more complex methods like estimating missing values by modeling. Once the data cleaning is done, the next step is to integrate the data into a consolidated dataset for analysis. This involves merging two or more tables that describe the same objects but store different information, or summarizing fields in a table using aggregation. Here, we would also explore the data to understand what patterns and values our datasets have.
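
The preparation steps described here (inserting defaults, aggregating, and merging tables about the same objects) can be sketched with pandas. The two tables, their column names, and the choice of the median as the default are assumptions made for this illustration, loosely following the retail store example.

```python
import pandas as pd

# Two hypothetical tables describing the same stores.
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "location": ["North", "South", "East"],
})
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 2],
    "amount": [100.0, 150.0, 80.0, None, 120.0],
})

# Insert a default for the missing amount (here, the column median).
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# Summarize the sales per store using aggregation.
totals = sales.groupby("store_id", as_index=False)["amount"].sum()

# Merge the summary onto the store table; a left join keeps every store.
report = stores.merge(totals, on="store_id", how="left")

# Stores without any sales rows end up with NaN, so set them to 0.
report["amount"] = report["amount"].fillna(0.0)

print(report)
```

The left join is a deliberate choice: an inner join would silently drop the store that has no sales records, which is often the kind of store the analysis should notice.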

Phase 3: Model Planning

Here, you will determine the methods and techniques to draw the relationships between variables. These relationships will set the base for the algorithms which you will implement in the next phase. You will apply Exploratory Data Analysis (EDA) using various statistical formulas and visualization tools.

R has a complete set of modeling capabilities and provides a good environment for building interpretive models.
SQL Analysis Services can perform in-database analytics using common data mining functions and basic predictive models.
SAS/ACCESS can be used to access data from Hadoop and is used for creating repeatable and reusable model flow diagrams.

Phase 4: Model Building

In this phase, you will develop datasets for training and testing purposes. You will consider whether your existing tools will suffice for running the models or whether you will need a more robust environment (like fast and parallel processing). You will analyze various learning techniques like classification, association, and clustering to build the model.
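
Developing datasets for training and testing usually means holding back part of the data so the model can be evaluated on rows it has never seen. Below is a minimal hand-written sketch using only the standard library; the function name, the 20% test fraction, and the seed are assumptions for this example, and libraries such as scikit-learn offer their own ready-made equivalent.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for 10 data records
train, test = train_test_split(rows, test_fraction=0.2)

print("training rows:", len(train))
print("testing rows: ", len(test))
```

Shuffling before splitting matters when the data is stored in some meaningful order, because otherwise the test set may not resemble the training set.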

Common tools for Model Building

SAS Enterprise Miner
WEKA
SPSS Modeler
Matlab
Alpine Miner
Statistica

Phase 5: Operationalize

In this phase, you deliver final reports, briefings, code, and technical documents. In addition, sometimes a pilot project is also implemented in a real-time production environment. This will provide you a clear picture of the performance and other related constraints on a small scale before full deployment.

Phase 6: Communicate Results

Now it is important to evaluate if you have been able to achieve the goal that you had planned in the first phase. So, in the last phase, you identify all the key findings, communicate to the stakeholders and determine if the results of the project are a success or a failure based on the criteria developed in Phase 1.

Data Science Python Implementation

By now you have gained a lot of theoretical knowledge about the concept. Now let's see how we implement the concept using Python.

Here, I am using the Titanic dataset; you can download titanic.csv either from here or from Kaggle. I am using Google Colab for this, but you can use Jupyter Notebook or any other tool.

Before we start with this, it is highly recommended you read the following tutorials

1. Loading Data to Google Colab

To know about Google Colab please click.

I use Google Colab for the following reasons:

unlike a local Jupyter Notebook setup, the commonly used libraries are already installed
the code runs on Google's servers, where GPUs and TPUs are available to speed up processing
we can perform Machine Learning even on systems with low RAM and limited capabilities (I think this is the most important reason of all)

There are three ways of uploading a dataset to Google Colab. The following is the one I find simplest; you can look up and use the other two.

from google.colab import files
uploaded = files.upload()

Click upload and select the file you want to upload; in my case it is "titanic.csv". After that, it will start uploading automatically.

2. Preparing the Notebook (Google Colab)

Here, we will set up the environment of the notebook/Google Colab, and then import all the required Python libraries.

from IPython.display import HTML, display

# Center PNG plot output by styling the notebook's .output_png class.
# display() is needed so that the style block is actually injected.
display(HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
"""))

# Render plots inline in the notebook.
%matplotlib inline

# Silence library warnings to keep the output readable.
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Show up to 100 columns when printing a DataFrame.
pd.options.display.max_columns = 100

# Default styling for all plots.
params = {
    'axes.labelsize': 'large',
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [25, 7]
}
plt.rcParams.update(params)

3. Loading the Dataset

We will be using pandas to load the dataset.

data = pd.read_csv('titanic.csv')

4. Print the shape or size of the data

data.shape

Output

Since the CSV file I am using has 891 rows and 12 columns, the output is (891, 12).
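
To try the loading and inspection steps without downloading any file, the sketch below reads a few made-up rows from an in-memory string. The rows and their values are invented for this illustration and only borrow some of the Titanic column names.

```python
import io

import pandas as pd

# A few made-up rows; the empty Age field in the last row is a missing value.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,30,7.5
2,1,1,female,40,70.0
3,1,3,female,,8.0
"""

# read_csv accepts any file-like object, not only a file path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (rows, columns)
print(sample.head(2))         # first two rows
print(sample.isnull().sum())  # missing values per column
```

The empty field is read as NaN, which is why counting nulls per column is a quick first check after loading any dataset.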

5. Print the top 5 rows of the data

print(data.head())

Output

6. To get a high-level description of the dataset

data.describe()

Output

The above output tells us that 177 values are missing from the Age column, as the count value for Age does not match the count value of the other columns. To fill them in, we replace the missing ages with the median age:

data['Age'] = data['Age'].fillna(data['Age'].median())

Now if we do data.describe(), we get the following output :

7. Data Visualization

You may like to read Python MatPlotLib and Python Seaborn article. Under this topic, we will visualize and try to understand the dataset.

7.1. Visualization of Sex and Survival Chances

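# Add a complementary column so survivors and victims can be stacked.
data['Died'] = 1 - data['Survived']
# Count survivors and victims per sex.
data.groupby('Sex')[['Survived', 'Died']].sum().plot(kind='bar', figsize=(3, 3),
                                                    stacked=True, color=['g', 'r']);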

Output

We can visualize the above figure in terms of ratios as:

# The mean of a 0/1 column is the fraction of passengers in that group.
data.groupby('Sex')[['Survived', 'Died']].mean().plot(kind='bar', figsize=(3, 3),
                                                     stacked=True, color=['g', 'r']);

Output

7.2. Visualization of the correlation of age variable with men and women

fig = plt.figure(figsize=(3, 3))
sns.violinplot(x='Sex', y='Age',
hue='Survived', data=data, split=True,
palette={0: "r", 1: "g"});

Output

So, from the above plots, we can infer that women survived at a higher rate than men.
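
The comparison of survival by sex can also be made numerically rather than visually. This is a minimal sketch on a tiny made-up sample that reuses the `Sex` and `Survived` column names; the values are invented and are not taken from the Titanic data.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the Titanic data.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones,
# so this gives the survival rate per group.
rates = sample.groupby("Sex")["Survived"].mean()

print(rates)
```

Running the same `groupby` on the real `data` frame should give the survival rates that the bar charts display.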

7.3. Visualization of the relationship between ticket fare and survival chances

figure = plt.figure(figsize=(5, 3))
plt.hist([data[data['Survived'] == 1]['Fare'], data[data['Survived'] == 0]['Fare']],
stacked=True, color = ['g','r'],
bins = 50, label = ['Survived','Dead'])
plt.xlabel('Fare')
plt.ylabel('Number of passengers')
plt.legend();

Output

7.4. Visualization of the relationship between ticket fare, age, and survival chances

plt.figure(figsize=(25, 7))
ax = plt.subplot()
ax.scatter(data[data['Survived'] == 1]['Age'], data[data['Survived'] == 1]['Fare'],
c='green', s=data[data['Survived'] == 1]['Fare'])
ax.scatter(data[data['Survived'] == 0]['Age'], data[data['Survived'] == 0]['Fare'],
c='red', s=data[data['Survived'] == 0]['Fare']);

Output

We can observe different clusters:

Large green dots between x=20 and x=45: adults with the largest ticket fares
Small red dots between x=10 and x=45: adults from the lower classes on the boat
Small green dots between x=0 and x=7: these are the children that were saved

7.5. Visualization of the relationship between ticket fare and class

ax = plt.subplot()
ax.set_ylabel('Average fare')
# Select the Fare column before averaging so that text columns are never averaged.
data.groupby('Pclass')['Fare'].mean().plot(kind='bar', figsize=(5, 3), ax=ax);

Output

7.6. Visualization of the relationship between ticket fare, port of embarkation, and survival chances

fig = plt.figure(figsize=(5, 3))
sns.violinplot(x='Embarked', y='Fare', hue='Survived', data=data, split=True, palette={0: "r", 1: "g"});

Output

From the above visualization, we can infer that the passengers who paid the highest fares, or who embarked at C (Cherbourg), survived the most.

Data Wrangling Python Implementation

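Here we will clean the data to make it ready for a machine-learning algorithm to work on, i.e. we will fill in missing values, check whether the data is continuous, and see whether the data needs any other modifications. The outputs below assume that the dataset has been loaded again from titanic.csv, since we already filled in the missing ages above.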

data.describe()

Output

From the above data, we see that the minimum age of the passengers is 0.42 years, i.e. about 5 months. This is suspicious because, according to the news, there was a two-month-old baby on board, i.e. the minimum age should be about 0.17.

Also, in the "Fare" column, we find that some of the values are 0, which means either that those passengers traveled for free or that the data is missing.

So, let us first check whether any of the cells in the data are empty.

data.count()

Output

So from the above output, it is crystal clear that we have some data missing, i.e. the corresponding cells are empty, in the "Embarked", "Age" and "Cabin" columns.

So, what do we do now? Either we have to delete the columns with missing data, or we have to fill in the missing places. Now, I will show you how I reduce the number of missing values.

Here are three ways of handling the missing values:

1. Replace each cell with value 0 with NaN

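# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
data = data.replace(0, np.nan)
# count the number of NaN values in each column
print(data.isnull().sum())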

Output

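So, we can see that the zero fares are now marked as missing, but some of the data got corrupted: legitimate zeros, such as 0 in "Survived", "SibSp" and "Parch", were turned into NaN as well. So let's go back to the original data, try another way, and see the result.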

2. Replace each missing value by the Column Mean

# fill each numeric column with its own mean; text columns are skipped
data.fillna(data.mean(numeric_only=True), inplace=True)
# count the number of NaN values in each column
print(data.isnull().sum())

Output

Here, we see that we were able to rectify "Age" and nothing got corrupted. But we were not able to rectify "Cabin" and "Embarked", because a mean only exists for numeric columns and these two columns contain text. So now let's try another way.

3. Replacing each NaN value with zero

data = data.replace(np.nan, 0)
# count the number of NaN values in each column
print(data.isnull().sum())

Output

So, by using this I was able to achieve a "no null" situation.

Note: Replacing the missing values gets rid of the nulls, but it is quite possible that it results in a loss of information or changes the meaning of the data.
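
One way to limit that risk is to choose a fill strategy per column instead of applying one rule to the whole table. The sketch below is only an illustration under stated assumptions: the rows are made up, a fare of 0 is assumed to mean "unknown", and the median, the most frequent value, and an "Unknown" label are just plausible defaults, not the only reasonable choices.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse some of the Titanic column names.
sample = pd.DataFrame({
    "Survived": [0, 1, 0, 1],
    "Age": [22.0, np.nan, 35.0, 38.0],
    "Fare": [7.25, 0.0, 8.05, 71.28],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

cleaned = sample.copy()
# Assumption: a fare of 0 means "unknown", so only this column is touched
# and the valid zeros in Survived stay as they are.
cleaned["Fare"] = cleaned["Fare"].replace(0, np.nan)
cleaned["Fare"] = cleaned["Fare"].fillna(cleaned["Fare"].median())
# Numeric column: the median is less sensitive to outliers than the mean.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
# Categorical column with few gaps: use the most frequent value.
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
# Mostly empty column: an explicit placeholder keeps the gap visible.
cleaned["Cabin"] = cleaned["Cabin"].fillna("Unknown")

print(cleaned.isnull().sum())

# Filling with zero instead would drag the average age down.
print("mean age, median fill:", cleaned["Age"].mean())
print("mean age, zero fill:  ", sample["Age"].fillna(0).mean())
```

The last two lines show the change of meaning mentioned in the note: a zero fill removes the nulls just as well, but it shifts the average age noticeably.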

Conclusion

In this article, we studied what data is, what data science is, what data munging is, the differences between data science and business intelligence, machine learning, and artificial intelligence, the advantages and disadvantages of data science, the skills required to be a data scientist, the components of data science, the data science process, and a Python implementation of some of the data science concepts. I hope you were able to understand everything. If you have any doubts, please post your query in the comments.

In the next article, we will learn about Linear Regression.

Congratulations!!! You have climbed the next step in becoming a successful ML Engineer.


MCN Solutions Pvt. Ltd.

Technical Lead