Generate Bell-Shaped Distribution: PySpark & Matplotlib in Fabric Notebook

Abiola David
1y

Introduction

In this article, we will explore how to generate and visualize a sample dataset that follows a bell-shaped, or normal, distribution using a Fabric Notebook. Before we dive into the code, let's understand what a bell-shaped distribution is. A bell-shaped distribution in statistics is a specific type of probability distribution that exhibits a symmetric, bell-like curve when plotted. This distribution is formally known as a normal distribution or Gaussian distribution, and some of its characteristics are as follows:

The graph of a bell-shaped distribution is symmetric around its mean (average) value, which means that the left and right tails of the distribution are mirror images of each other.
The highest point on the curve, known as the peak or mode, corresponds to the mean of the distribution. The mean, median, and mode all coincide at the center of the curve.
The spread or dispersion of the data is determined by the standard deviation. The larger the standard deviation, the wider and flatter the bell-shaped curve, indicating greater variability in the data.
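
The three characteristics above can be checked numerically. The following is a minimal sketch that uses NormalDist from Python's standard statistics module; the names narrow and wide are only illustrative, and the values 1.5 and sigma=2 are arbitrary choices made for this example.

```python
from statistics import NormalDist

# Two normal distributions with the same mean but different spreads
# (illustrative names and values, not part of the notebook demo).
narrow = NormalDist(mu=0, sigma=1)
wide = NormalDist(mu=0, sigma=2)

# Symmetry: the density at (mean - d) equals the density at (mean + d).
left, right = narrow.pdf(-1.5), narrow.pdf(1.5)
print(f"pdf(-1.5) = {left:.4f}, pdf(1.5) = {right:.4f}")

# The mean, median, and mode all sit at the center of the curve.
print(f"mean = {narrow.mean}, median = {narrow.median}, mode = {narrow.mode}")

# A larger standard deviation gives a lower peak, so the curve is wider.
print(f"peak height: sigma=1 -> {narrow.pdf(0):.4f}, sigma=2 -> {wide.pdf(0):.4f}")
```

Because the total area under each curve is 1, the lower peak of the sigma=2 curve means its probability is spread over a wider range of values.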

Fabric Notebook Demo

In this demo, we are going to use a PySpark notebook in Microsoft Fabric, together with NumPy and Matplotlib, to create the normal distribution. To do this, I opened a new Notebook in the Synapse Data Science experience of Microsoft Fabric.

In the code cell, I imported the necessary libraries and executed the following code.

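import numpy as np
import matplotlib.pyplot as plt

# Fix the seed so the same random sample is generated on every run
np.random.seed(42)

# Draw 1000 points from a standard normal distribution (mean 0, standard deviation 1)
sample_data = np.random.normal(loc=0, scale=1, size=1000)

# Plot a density-normalized histogram of the sample
plt.hist(sample_data, bins=30, density=True, color='skyblue', edgecolor='black')
plt.title('Sample Data - Bell-Shaped Distribution')
plt.xlabel('Values')
plt.ylabel('Probability Density')
plt.show()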

The code ran in the Fabric Notebook in just 2 sec 612 ms, which is quite amazing!

The call np.random.seed(42) ensures that the random numbers generated by NumPy are reproducible, meaning the same sequence of random numbers will be produced each time the code is run.
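
To see the effect of seeding, here is a small sketch. The helper draw is hypothetical and exists only for this illustration; the seed 7 is an arbitrary second value chosen for comparison.

```python
import numpy as np

def draw(seed, size=5):
    """Hypothetical helper: seed NumPy's global generator, then draw
    `size` values from a standard normal distribution."""
    np.random.seed(seed)
    return np.random.normal(loc=0, scale=1, size=size)

first = draw(42)
second = draw(42)   # same seed, so the same sequence is produced
other = draw(7)     # different seed, so a different sequence is produced

print("same seed gives identical values:     ", np.array_equal(first, second))
print("different seed gives identical values:", np.array_equal(first, other))
```

Re-seeding with the same value restarts the sequence from the same point, which is why rerunning the notebook cell reproduces the same histogram.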

The line sample_data = np.random.normal(loc=0, scale=1, size=1000) creates a sample of 1000 data points from a standard normal distribution (mean=0, standard deviation=1).
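
As a quick sanity check, the sample statistics should land close to the requested loc and scale, though not exactly on them, because the sample is finite. The sketch below assumes the same seed as the demo; the variable names sample_mean, sample_std, sample_median, and within_one_sd are introduced here for illustration.

```python
import numpy as np

np.random.seed(42)
sample_data = np.random.normal(loc=0, scale=1, size=1000)

# Summary statistics of the generated sample
sample_mean = sample_data.mean()
sample_std = sample_data.std(ddof=1)   # ddof=1 gives the sample standard deviation
sample_median = np.median(sample_data)

# Fraction of points within one standard deviation of the mean
# (roughly 68% is expected for a normal distribution)
within_one_sd = np.mean(np.abs(sample_data - sample_mean) <= sample_std)

print(f"mean   = {sample_mean:.3f}")
print(f"median = {sample_median:.3f}")
print(f"std    = {sample_std:.3f}")
print(f"share within one standard deviation = {within_one_sd:.1%}")
```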

Plotting a Histogram

The plt.hist() function generates the histogram to visualize the distribution of the sample data. Its arguments are:

bins=30: specifies the number of bins (intervals) in the histogram.
density=True: normalizes the histogram so that the total area of the bars equals 1, which means the y-axis shows probability density rather than raw counts.
color='skyblue': sets the color of the bars in the histogram.
edgecolor='black': sets the color of the edges of the bars.
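
The effect of bins=30 and density=True can be inspected without drawing anything. The sketch below uses np.histogram, which is the function plt.hist relies on for binning; the variable names are illustrative.

```python
import numpy as np

np.random.seed(42)
sample_data = np.random.normal(loc=0, scale=1, size=1000)

# Same binning as the plot: 30 equal-width bins over the range of the data
density, bin_edges = np.histogram(sample_data, bins=30, density=True)
counts, _ = np.histogram(sample_data, bins=30)

# With density=True, each bar height is count / (total count * bin width),
# so bar height times bin width, summed over all bins, is 1.
bin_widths = np.diff(bin_edges)
total_area = np.sum(density * bin_widths)

print(f"number of bins : {len(density)}")
print(f"total count    : {counts.sum()}")
print(f"total bar area : {total_area:.6f}")
```

Note that individual bar heights can exceed 1 when the bins are narrow; only the total area is constrained to equal 1.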

Adding Titles and Labels

plt.title('Sample Data - Bell-Shaped Distribution'): adds a title to the plot.
plt.xlabel('Values') and plt.ylabel('Probability Density'): add labels to the x-axis and y-axis, respectively.
plt.show(): displays the generated histogram.

[Figure: generated histogram of the sample data]

Conclusion

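In this article, we generated 1000 data points from a standard normal distribution with NumPy and visualized them as a histogram with Matplotlib in a Fabric Notebook. The bell-shaped distribution can be applied in various fields, especially in business and data analysis.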