# What Is Data Visualization In Machine Learning And How Does It Work

This article explains what data visualization is and demonstrates basic visualization techniques with Matplotlib, pandas, and Seaborn on the Iris data set.

## Data Visualization

Data visualization is the process of turning large data sets into statistical and graphical representations. It is an essential part of data science and knowledge discovery because it makes data less confusing and more accessible.

## Why Data Visualization?

Visualization condenses a large, complex body of data into charts or graphs that are quick to absorb and easy to understand. It spares the audience from reading through large tables and holds their interest longer.

## Types of Analysis

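Univariate plots: In univariate analysis, we examine a single feature at a time to study its distribution.

Bivariate plots: In bivariate analysis, we compare exactly two features to analyze how they relate.

Multivariate plots: When we compare more than two features at once, the analysis is called multivariate.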

## Python Libraries

Python offers many libraries and packages for data analysis. Here, we will look at some of the most frequently used libraries for visualization.

**Requirements**

- Spyder IDE 3.7
- Iris data sets

## Statistical overview

**Importing packages and libraries**

Here, I will run all demonstrations in the Spyder IDE from the Anaconda distribution, which provides advanced editing, interactive testing, debugging, and flexible analysis with less code.

For convenience, I will import each library under a short alias: matplotlib.pyplot as plt, pandas as pd, and seaborn as sns.

- import pandas as pd
- import matplotlib.pyplot as plt
- import seaborn as sns

**Reading data sets**

Let’s read the Iris data set with the help of the pandas package and store it in the “iris” variable. The general pattern is “variable = package.read_csv(path to data set)”.

- #Read Data

- iris=pd.read_csv('iris.csv')
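The following is a minimal sketch of the same read step that runs without the file on disk. The inline text stands in for “iris.csv”; its column names (Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, iris-Species) are assumed from the names used later in this article, and the six rows are only an illustrative sample, not the full data set.

```python
import io

import pandas as pd

# Stand-in for iris.csv: a few rows in the layout this article assumes.
# The header names are an assumption based on the columns used in later plots.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,iris-Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
"""

# read_csv accepts a file path or any file-like object
iris_sample = pd.read_csv(io.StringIO(csv_text))

# Check that the columns and their types came through as expected
print(iris_sample.columns.tolist())
print(iris_sample.dtypes)
```

With the real file, passing the path string to “read_csv()” as shown above gives the same kind of data frame.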

## Data frame overview

Let’s take a quick look at the data frame to get a basic idea of its contents in four easy steps, given below.

**Step 1 - Data set shape**

The “shape” attribute tells us how many observations (rows) and columns the data frame holds. It is an attribute, not a method, so it is used without parentheses.

Here we will see what happens after executing the commands.

- #Shape

- print(iris.shape)

The Iris data set contains 150 observations in six columns; the sepal and petal measurements are in centimeters.

**Step 2 - Peek at the data sets**

- #First 15 rows of Iris

- print(iris.head(15))

**Step 3 - Distribution of class**

- #Size

- print(iris.groupby('iris-Species').size())

The Iris data set contains 50 instances of each of the three classes.

**Step 4 - Data sets summary**

The “describe()” function gives a quick statistical summary of a large data set, such as the min, max, and mean values of each numeric column.

The command goes like this,

- #Describe

- print(iris.describe())
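The four overview steps can be tried together on a small frame. This is only a sketch: “iris_sample” is a hypothetical six-row sample in the assumed column layout, so the numbers it prints differ from those of the full 150-row data set.

```python
import pandas as pd

# Hypothetical six-row sample; column names follow the ones used in this article
iris_sample = pd.DataFrame({
    "Id": [1, 2, 51, 52, 101, 102],
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "iris-Species": ["Iris-setosa", "Iris-setosa",
                     "Iris-versicolor", "Iris-versicolor",
                     "Iris-virginica", "Iris-virginica"],
})

# Step 1: shape is an attribute holding (rows, columns)
n_rows, n_cols = iris_sample.shape

# Step 2: head(n) returns the first n rows
first_rows = iris_sample.head(3)

# Step 3: number of observations per class label
class_counts = iris_sample.groupby("iris-Species").size()

# Step 4: summary statistics of the numeric columns only
summary = iris_sample.describe()

print(n_rows, n_cols)
print(first_rows)
print(class_counts)
print(summary)
```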

## Visualization

Let’s plot the data frame for a clearer understanding, with the help of pandas, Matplotlib, and Seaborn.

- #Seaborn plot example

- sns.set_style("darkgrid")
- sns.FacetGrid(iris, hue="iris-Species", height=4) \
- .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
- .add_legend()
- plt.title('Iris Flowers')
- plt.xlabel('X-Label')
- plt.ylabel('Y-Label')
- plt.show()

- plt.title() - Sets the title of the plot.
- plt.grid() - Enables horizontal and vertical grid lines in the background of the plot.
- sns.set_style() - Sets the aesthetic style of Seaborn plots, including whether a grid is drawn.
- sns.FacetGrid() - Takes the data frame as input and uses the row, column, and hue arguments to structure the grid.
- plt.xlabel() - Sets the label of the X axis.
- plt.ylabel() - Sets the label of the Y axis.
- add_legend() - A FacetGrid method that adds a legend identifying the color used for each class.

Following the earlier definitions (Types of Analysis), we will now demonstrate various visualization techniques.

## Univariate analysis

**Boxplot**

- #Seaborn Boxplot

- sns.boxplot(x='iris-Species',y='SepalLengthCm',data=iris)

- plt.show()

The above commands draw a univariate box plot of the Iris data. The X-axis holds the class labels and the Y-axis holds the distribution of a measurement, here sepal length. Each species appears in a different color, with its quartiles, whiskers, and outliers.

The box plot shows how the values are spread and which points are outliers for each species. Only Iris virginica has an outlier point, and Iris setosa has the lowest values.

For each species, the box spans the quartiles, and the whiskers extend to the minimum and maximum values that are not outliers.
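To make the quartiles, whiskers, and outliers concrete, here is a small sketch of how they can be computed by hand. It assumes the common 1.5 × IQR rule for the whiskers, which is the Matplotlib default; the “values” array is illustrative and is not the full Iris column.

```python
import numpy as np

# Illustrative measurements only (not the full sepal length column)
values = np.array([4.9, 5.8, 6.3, 6.4, 6.5, 6.7, 6.9, 7.1, 7.2, 7.9])

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers (assumed rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]

# Whiskers end at the most extreme points that are not outliers
whisker_low = values[~is_outlier].min()
whisker_high = values[~is_outlier].max()

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```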

**Distribution plot**

A distribution plot generally combines a histogram and an estimate of the probability density function in a single figure.

Here is how we do this univariate analysis. Note that “distplot()” is deprecated since seaborn 0.11; “histplot()” with “kde=True” draws the same combination of histogram and density curve.

- #Seaborn Distribution plot

- sns.histplot(iris["SepalLengthCm"], bins=20, kde=True)

- plt.show()

The “histplot()” method takes the Iris values and the number of bins and draws the distribution plot with the help of the seaborn library.

In the above figure, the histogram shows the data distribution grouped into bins, and the height of each bar shows the number of sepal length observations in that bin.
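The binning behind a histogram can be sketched with NumPy alone. The “values” array below is a short illustrative sample, and five bins are used instead of 20 only because the sample is small.

```python
import numpy as np

# Illustrative sepal length sample (not the full column)
values = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])

# counts[i] is the number of observations that fall between edges[i] and edges[i + 1]
counts, edges = np.histogram(values, bins=5)

# With density=True the bar heights are rescaled so the total bar area is 1,
# which puts the histogram on the same scale as a density curve
density, _ = np.histogram(values, bins=5, density=True)
total_area = (density * np.diff(edges)).sum()

print(counts, edges)
print(total_area)
```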

**Bar chart with count plot**

- #Seaborn Countplot

- sns.countplot(x='iris-Species', data=iris)

- plt.show()

The “countplot()” method counts the observations in each category of a categorical variable and shows the counts as bars.

In the above figure, we can see how many observations each Iris species contains.

Each species has the same number of observations in the data set (50 each), as we saw with the “groupby()” and “size()” methods.
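The numbers behind a count plot can be checked with pandas directly. In this sketch the “species” series is a hypothetical, deliberately unbalanced sample so that the bars would differ; in the full data set every count is 50.

```python
import pandas as pd

# Hypothetical class labels (unbalanced on purpose for illustration)
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica",
     "Iris-setosa", "Iris-virginica", "Iris-setosa"],
    name="iris-Species",
)

# One count per category: these are the bar heights of a count plot
counts = species.value_counts()
print(counts)
```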

**Violin plot**

The violin plot generally works like a combination of a box plot and a kernel density estimate (KDE).

- #Seaborn Violin plot

- sns.violinplot(x='iris-Species',y='SepalWidthCm',data=iris)
- plt.show()

The above code puts the class label on the X-axis and sepal width on the Y-axis.

In the above figure, the width of each violin shows where the sepal width values of each of the three Iris species are most dense. Iris setosa has the highest sepal width values among the three species.
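The outline of a violin is a kernel density estimate. Below is a minimal sketch of a Gaussian KDE written with NumPy. The function name “gaussian_kde_1d” and the fixed bandwidth of 0.2 are assumptions made for illustration (seaborn chooses a bandwidth automatically), and “widths” is a small illustrative sample.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Estimate a density on `grid` as the average of Gaussian bumps, one per sample."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Illustrative sepal width sample (not the full column)
widths = np.array([3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1])
grid = np.linspace(1.5, 5.5, 401)

# Hypothetical fixed bandwidth; a smaller value gives a bumpier curve
density = gaussian_kde_1d(widths, grid, bandwidth=0.2)

# A density integrates to 1; this Riemann sum should be close to that
area = (density * (grid[1] - grid[0])).sum()
peak_location = grid[density.argmax()]
print(area, peak_location)
```

A violin plot mirrors this curve on both sides of a central axis, so the widest part of the violin marks where the values are most concentrated.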

## Bivariate Analysis

Here, we move on to plots that compare two features of the Iris data set.

**Scatter plot**

- #Pandas Scatter plot

- iris.plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm',label='iris',color='red')

- plt.show()

The pandas “plot()” method with “kind='scatter'” takes two numeric columns from the data set and displays them as a simple scatter plot.

Seaborn helps us display a more attractive graphical representation of a large amount of data. The entire data set is drawn as a scatter plot that shows the relationship between the two features, colored by class.

- #Seaborn Scatter plot

- sns.FacetGrid(iris, hue="iris-Species", height=5) \
- .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
- .add_legend()
- plt.show()

The “hue” argument colors the points according to the Iris species (class label).

In the above figure, the points are colored by class label, and the legend appears on the right side of the figure.

This seaborn scatter plot gives a clear picture of the data distribution. Iris versicolor and Iris virginica overlap in sepal length and sepal width.
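What “hue” does can be sketched in plain Matplotlib: split the frame by class and draw one scatter series per group. This is not how seaborn is implemented internally, only an equivalent idea. “iris_sample” is a hypothetical six-row sample, and the “Agg” backend line is there only so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit it in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical six-row sample in the assumed column layout
iris_sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "iris-Species": ["Iris-setosa", "Iris-setosa",
                     "Iris-versicolor", "Iris-versicolor",
                     "Iris-virginica", "Iris-virginica"],
})

fig, ax = plt.subplots()

# One scatter call per class; Matplotlib assigns each call the next color
for species, group in iris_sample.groupby("iris-Species"):
    ax.scatter(group["SepalLengthCm"], group["SepalWidthCm"], label=species)

ax.set_xlabel("SepalLengthCm")
ax.set_ylabel("SepalWidthCm")
ax.legend(title="iris-Species")

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```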

## Multivariate plot

**Pair plot**

- #Seaborn Pair plot

- sns.pairplot(iris,hue='iris-Species',kind='reg')
- plt.show()

In the above figure, with “kind='reg'”, the off-diagonal panels show scatter plots with fitted regression lines and the diagonal panels show the univariate distribution of each variable, each in a different color per class label. Iris setosa has quite different petal length and petal width values, which is why it is separated from the others. Together, the panels give the pairwise relationships between all variables, with the univariate distributions on the diagonal axes.
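The layout of a pair plot can be enumerated without drawing anything. The sketch below assumes the four measurement columns as features; the “panels” dictionary is a hypothetical helper structure, not part of seaborn.

```python
from itertools import product

# Assumed feature columns (the four measurements)
features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# A pair plot is an n x n grid: the row feature goes on the y-axis, the column feature on the x-axis
panels = {}
for row, col in product(range(len(features)), repeat=2):
    kind = "univariate" if row == col else "bivariate"
    panels[(row, col)] = (kind, features[col], features[row])

n_univariate = sum(1 for kind, _, _ in panels.values() if kind == "univariate")
n_bivariate = len(panels) - n_univariate
print(len(panels), n_univariate, n_bivariate)
```

If the data frame still contains the “Id” column, “pairplot()” treats it as one more numeric feature, which adds an extra row and column to the grid.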

**Heat map**

A heat map is a 2D graph that takes an entire data frame and highlights features with high positive or negative values. It creates a grid-like plot where each tile is colored according to its value. It helps us find the correlation coefficient between different features. It is useful for cluster analysis or when dealing with large data sets. An example is given below.

- #Seaborn Heatmap

- sns.heatmap(iris.select_dtypes('number').corr(),linewidths=0.3,vmax=1.0,square=True, linecolor='black',annot=True)
- plt.show()

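The “heatmap()” method accepts arguments that control its appearance: “annot=True” writes the value in each tile, the line width and “linecolor” arguments draw borders between the tiles, and “square=True” keeps each tile square.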
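The matrix that the heat map colors is a plain correlation table. Here is a sketch of computing it on a hypothetical six-row sample; “select_dtypes('number')” is used so that the text column with the class labels is left out before the correlation is computed.

```python
import pandas as pd

# Hypothetical six-row sample in the assumed column layout
iris_sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "iris-Species": ["Iris-setosa", "Iris-setosa",
                     "Iris-versicolor", "Iris-versicolor",
                     "Iris-virginica", "Iris-virginica"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = iris_sample.select_dtypes("number").corr()

# Each cell is between -1 and 1, and every feature correlates perfectly with itself
print(corr.round(2))
```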

**Joint plot**

The joint plot combines univariate and bivariate plots in one figure.

- #Seaborn Joint plot

- sns.jointplot(x='SepalLengthCm',y='SepalWidthCm', data=iris, kind='resid')

- plt.show()

Here, sepal length is passed as the X variable and sepal width as the Y variable to display a joint plot with the help of the seaborn library.

In the above figure, the univariate plots at the top and right show the distributions of sepal length and sepal width, respectively. With “kind='resid'”, the central graph shows the residuals of a linear regression of sepal width on sepal length; use “kind='scatter'” to show the raw relationship between sepal length and sepal width instead.
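Since the call above uses “kind='resid'”, it may help to see what a residual is. The sketch below fits a straight line with ordinary least squares and subtracts it from the observed values; the arrays “x” and “y” are a small illustrative sample, and the sketch is not seaborn’s own implementation.

```python
import numpy as np

# Illustrative sample: sepal length (x) and sepal width (y)
x = np.array([5.1, 4.9, 7.0, 6.4, 6.3, 5.8])
y = np.array([3.5, 3.0, 3.2, 3.2, 3.3, 2.7])

# Least-squares straight line y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Residual = observed value minus the value predicted by the line
residuals = y - (slope * x + intercept)

# For a least-squares line with an intercept, the residuals sum to zero
print(residuals)
print(residuals.sum())
```

A residual plot with no visible pattern suggests that a straight line is a reasonable description of the relationship.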

**RadViz**

In multivariate analysis, we will demonstrate the RadViz algorithm. It places each feature of the data set as an anchor, spaced uniformly around the circumference of a circle.

It is based on a spring tension minimization algorithm: each observation is drawn as a point inside the circle, attached to every anchor by a spring whose stiffness is proportional to the normalized value of that feature, and the point settles where the spring forces balance.

RadViz needs clean numeric feature values, so missing or misspelled values in the data frame should be fixed first; some implementations report a data warning such as the percentage of missing values.

- #Pandas Radviz

- from pandas.plotting import radviz
- radviz(iris, "iris-Species")

- plt.show()

The “radviz()” function is imported from “pandas.plotting” and visualizes the Iris species according to their features.

The observations of the three Iris species are plotted within the circle, and the features are placed on its circumference.
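The point positions in a RadViz plot can be sketched with NumPy. This follows the usual formulation (min-max normalization, anchors on the unit circle, each point at the weighted average of the anchors); the array “X” is a four-row illustrative sample, and the variable names are my own.

```python
import numpy as np

# Illustrative sample: four observations with four features each
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9],
])

# Normalize every feature to the range [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# One anchor per feature, evenly spaced on the unit circle
n_features = X.shape[1]
angles = 2.0 * np.pi * np.arange(n_features) / n_features
anchors = np.column_stack([np.cos(angles), np.sin(angles)])

# Spring equilibrium: each point is the average of the anchors, weighted by its normalized values
weights = X_norm / X_norm.sum(axis=1, keepdims=True)
points = weights @ anchors

print(points)
```

An observation with one dominant feature is pulled toward that feature’s anchor, while an observation with balanced values lands near the center.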

**Andrews curves**

- #Pandas Andrews curves

- from pandas.plotting import andrews_curves
- andrews_curves(iris.drop("Id",axis=1),'iris-Species')

- plt.show()

Pandas provides the “andrews_curves()” function, which draws each observation as a smooth curve and can be seen as a smoothed version of a parallel coordinates plot.

Each class label is drawn in a different color, which makes the visualization easier to read.
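An Andrews curve turns one observation into a function of t by using its feature values as coefficients of a Fourier series: x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + ... The sketch below evaluates that formula with NumPy; the helper name “andrews_curve” is my own and the row is an illustrative observation.

```python
import numpy as np


def andrews_curve(row, t):
    """Evaluate x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    for i, coeff in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result


# One illustrative observation with four features, evaluated on [-pi, pi]
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t)
print(curve[:5])
```

Observations with similar feature values produce curves that stay close together, which is why the classes tend to show up as bundles of one color.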

## Conclusion

In this article, we had a quick overview of data visualization and why we use it for machine learning tasks. I hope you now understand how to apply these techniques.

