Polars: The Fastest Dataframe Library You’ve Never Come Across!

Introduction

Pandas is the most popular library when it comes to working with structured data. The reason behind this is Pandas' powerful DataFrame object. A DataFrame is a table in which each column holds one type of data (sometimes called a field) and has a name, while each row represents a record or entity.
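
To make the idea concrete, here is a minimal sketch of a Polars DataFrame built from a Python dictionary (the column names and values below are made up purely for illustration):

import polars as pl

# A toy DataFrame: each named column holds one type of data,
# and each row is a single record.
df = pl.DataFrame({
    "country": ["Italy", "France", "Spain"],
    "points": [87, 92, 85],
    "price": [20.0, 35.0, 15.0],
})
print(df)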

Polars is one of the lesser-known libraries, and it is an alternative to Pandas that can be almost 3 times faster. Pandas is still one of the best tools out there for data manipulation and analysis, and Polars cannot replace it, at least for the time being. I just want to share this library so that you know about an alternative you can try out.
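
If you want to verify the speed claim on your own data, a rough timing sketch like the one below is enough (the file path is a placeholder, and the actual speed-up depends on the file size, the number of cores, and the library versions installed):

import time
import pandas as pd
import polars as pl

# Read the same CSV with both libraries and compare wall-clock time.
start = time.perf_counter()
pd_df = pd.read_csv("sample.csv")
print(f"pandas took {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
pl_df = pl.read_csv("sample.csv")
print(f"polars took {time.perf_counter() - start:.2f}s")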

Working with Polars

Polars can be installed from PyPI using the following command:

pip install polars

Importing libraries

Polars offers many functionalities that are similar to Pandas, so it won’t be a problem for anyone to switch over.

import polars as pl
import matplotlib.pyplot as plt
%matplotlib inline

Loading Dataset

data = pl.read_csv("../sample.csv")
print(type(data))
> <class 'polars.frame.DataFrame'>
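
If you ever need to hand the data over to Pandas-based tooling (plotting, scikit-learn, and so on), a Polars DataFrame can be converted back and forth; the sketch below assumes a Polars version that ships to_pandas()/from_pandas() and that pyarrow is installed:

# Convert the Polars DataFrame to Pandas and back (requires pyarrow).
pdf = data.to_pandas()
data_again = pl.from_pandas(pdf)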

Let us start with a basic Data Analysis.

Getting familiar with the dataset

data.shape
> (150930, 11)
data.columns

data.dtypes

data.head()

As you can see, this is a huge dataset: with 11 columns and 150k+ entries, we have a lot of data to analyze. The columns I am interested in are country, points, and price. Let us see what we can find.
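
Since only these three columns matter for the analysis, we can narrow the frame down to them; recent Polars versions do this with select (the older release used in this post also accepts plain bracket indexing):

# Keep only the columns we are going to analyze.
subset = data.select(["country", "points", "price"])
print(subset.head())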

Null Values

Before moving forward we have to take care of any null values that are present. We can find them easily using null_count().

data.null_count()

It turns out that around 13.5k entries are missing a value in the price column. We could drop these rows, since they make up less than 10% of the whole dataset, but instead we will fill them with another value such as the mean:

data['price'] = data['price'].fill_none('mean')
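
If you prefer the dropping route mentioned above, or you are on a current Polars release where fill_none has been renamed, both options look roughly like this (the method names assume a recent Polars version):

# Option 1: drop every row that still contains a null value.
data_dropped = data.drop_nulls()

# Option 2 (newer expression API): fill missing prices with the column mean.
data = data.with_columns(pl.col("price").fill_null(pl.col("price").mean()))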

Performing Analysis

Now we dig a little deeper and look at some statistical measures. This can give us useful insight into the dataset.

Our goal is to compare how price and points vary from country to country.

# Analysis of wine prices
print(f'Median price: {data["price"].median()}')
print(f'Average price: {data["price"].mean()}')
print(f'Maximum price: {data["price"].max()}')
print(f'Minimum price: {data["price"].min()}')

# Analysis of wine points
print(f'Median points: {data["points"].median()}')
print(f'Average points: {data["points"].mean()}')
print(f'Maximum points: {data["points"].max()}')
print(f'Minimum points: {data["points"].min()}')
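
In newer Polars versions the same summary can be computed in a single expression-based select, instead of pulling each statistic into Python one at a time (a sketch, assuming a recent release):

# Compute all the summary statistics for both columns in one pass.
summary = data.select([
    pl.col("price").median().alias("price_median"),
    pl.col("price").mean().alias("price_mean"),
    pl.col("price").max().alias("price_max"),
    pl.col("price").min().alias("price_min"),
    pl.col("points").median().alias("points_median"),
    pl.col("points").mean().alias("points_mean"),
    pl.col("points").max().alias("points_max"),
    pl.col("points").min().alias("points_min"),
])
print(summary)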

Thus we can see that a wine can be as cheap as 4 dollars and still taste great. Now let's see which countries sell these wines.

countries = data['country'].unique().to_list()
print(f'There are {len(countries)} countries in the list')
> There are 49 countries in the list
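
To see how the wines are distributed across those countries, value_counts on the country column gives a quick breakdown (the method name is the one used in recent Polars releases):

# Number of wines recorded for each country.
print(data['country'].value_counts())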

Scrolling through the dataset, we can see that there are two strange values in the country column: an undefined country (“”) and another “country” called ‘US-France’:

print(data[(data['country'] == '') | (data['country'] == 'US-France')])

Since there are just 6 entries with these weird values, I think it's safe to drop those rows.

data = data[(data['country'] != '') & (data['country'] != 'US-France')]
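
The boolean-mask indexing above follows the older Polars API; in current releases the idiomatic way to do this is an expression-based filter (a sketch, assuming a recent version):

# Drop the rows with an empty or 'US-France' country using an expression filter.
data = data.filter(~pl.col("country").is_in(["", "US-France"]))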

Now let's look at which countries have the best and the costliest wines.

# Wines with high points
print(data.groupby('country').select('points').mean().sort(by_column='points_mean', reverse=True))

# Wines which are costly
print(data.groupby('country').select('price').max().sort(by_column='price_max', reverse=True))
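
The groupby(...).select(...) chain above is the API of the early Polars release used in this post; in current versions the same aggregations are written with group_by and agg (a sketch, assuming a recent release):

# Average points per country, highest first (newer Polars API).
best_points = (
    data.group_by("country")
        .agg(pl.col("points").mean().alias("avg_points"))
        .sort("avg_points", descending=True)
)
print(best_points)

# Most expensive wine per country, priciest first.
max_price = (
    data.group_by("country")
        .agg(pl.col("price").max().alias("max_price"))
        .sort("max_price", descending=True)
)
print(max_price)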

Thus we can see that England produces some of the best wines, but the costliest one comes from France.

Conclusion

If you are interested in a more in-depth look at how the library works, I highly recommend reading this article by the creator of Polars himself.