Groupby Function In Pandas

Introduction

This article explains how grouping works in Pandas. Grouping is done through the ‘groupby’ function, which splits the data into separate groups on which we can perform aggregation, filtering, and transformation, or create graphs and charts for better analysis. In this article, we will cover:

  1. How the groupby function works and how to access group information
  2. The groupby process: split-apply-combine
  3. Aggregation
  4. Filtering

Setup

We will work with the Kaggle dataset at ‘https://www.kaggle.com/ramjasmaurya/best-cities-and-countries-for-startups-in-2021’, using the file ‘Best Countries for Startups’. This dataset analyses the best countries for startups.

import pandas as pd

df = pd.read_csv('Best Countries for Startups.csv')
df.columns

The last line lists the columns of the dataset, which include ‘ranking’, ‘country’, ‘total_score’ and ‘quality_score’.

Exploring groupby function

The groupby function helps us categorize the data and apply functions to each category for better analysis. In this article, we will categorize the data by the ‘country’ column and analyze the resulting groups.

df_grpby_country = df.groupby('country', sort=False)
df_grpby_country

‘df_grpby_country’ is of type ‘pandas.core.groupby.generic.DataFrameGroupBy’. A ‘DataFrameGroupBy’ object provides functions and attributes for accessing group information.

ngroups

The ‘ngroups’ attribute returns the number of groups.

df_grpby_country.ngroups # 100

groups

The ‘groups’ attribute returns a dictionary that maps each group name to the row labels belonging to that group.

df_grpby_country.groups

The list of numbers shown against each group holds the row labels of that group.
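Because every country in this dataset has a single row, the attributes above are easier to see on data with repeated keys. The following is a minimal sketch on a hypothetical toy DataFrame (the values and the names ‘toy’ and ‘toy_groups’ are made up for illustration and are not taken from the Kaggle file). It also uses ‘get_group’, which returns the rows of a single group.

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file: country "A" appears twice.
toy = pd.DataFrame({
    "country": ["A", "B", "A", "C"],
    "total_score": [10.0, 7.5, 9.0, 3.2],
})
toy_groups = toy.groupby("country", sort=False)

# Number of distinct groups.
n_groups = toy_groups.ngroups

# groups maps each key to an Index of row labels; convert to plain lists.
group_rows = {key: labels.tolist() for key, labels in toy_groups.groups.items()}

# The sub-DataFrame holding the rows of one group.
rows_for_a = toy_groups.get_group("A")

print(n_groups)
print(group_rows)
print(rows_for_a)
```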

size

The ‘size’ function returns the number of rows in each group.

df_grpby_country.size()

The output shows that every country's group contains exactly one row. These are the most important attributes and functions; a few more are available.
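One detail worth knowing is how ‘size’ differs from ‘count’: ‘size’ counts every row in a group, while ‘count’ skips missing values. The sketch below shows this on hypothetical toy data with one missing score; the names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data with one missing total_score in group "A".
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "total_score": [10.0, float("nan"), 7.5],
})
grouped = toy.groupby("country", sort=False)

sizes = grouped.size()                   # rows per group, missing values included
counts = grouped["total_score"].count()  # non-missing values per group

print(sizes)
print(counts)
```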

The groupby Process

The groupby process has three steps: split, apply, and combine. Step 1 splits the data into groups, Step 2 applies a function to every group, and Step 3 combines the results into a single object. In this article, we will look at aggregation and filtering as examples. To understand the aggregation functions, please refer to my other article on C# Corner. We can apply various functions to the groups, such as ‘max’, ‘min’, ‘count’, ‘agg’, and ‘mean’.

Let’s apply the ‘max’ function to every country’s total_score.

df_grpby_country.total_score.max()
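To make the three steps visible, the same kind of aggregation can be written out by hand. This is only a sketch on hypothetical toy data (the names ‘toy’, ‘pieces’, ‘manual’ and ‘builtin’ are illustrative); it is not how Pandas implements groupby internally, but it should produce the same result as the one-line call.

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file.
toy = pd.DataFrame({
    "country": ["A", "B", "A", "C"],
    "total_score": [10.0, 7.5, 9.0, 3.2],
})

pieces = {}
# Split: iterating a groupby yields (key, sub-DataFrame) pairs.
for country, group in toy.groupby("country", sort=False):
    # Apply: reduce each sub-DataFrame to a single value.
    pieces[country] = group["total_score"].max()

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces, name="total_score")

# The same split-apply-combine in a single call.
builtin = toy.groupby("country", sort=False)["total_score"].max()

print(manual)
print(builtin)
```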

The ‘agg’ function accepts a list of functions to apply. For example, we can pass sum, min, max, and mean to ‘agg’ in one call and get all the results together, rather than computing each of them individually.

df_grpby_country.agg(['max', 'min', 'count', 'median', 'mean'])

The ‘agg’ function applies every function in the list to every column. Under the ‘ranking’ column we get max, min, count, median, and mean, and the same applies to the other columns such as ‘total_score’ and ‘quality_score’.
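When only some columns or some functions are of interest, ‘agg’ can be narrowed down. The sketch below, on hypothetical toy data, shows two common variants: a list of functions on a single column, and a dictionary that maps each column to its own functions. The values are made up; only the column names ‘total_score’ and ‘quality_score’ come from the article.

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file.
toy = pd.DataFrame({
    "country": ["A", "B", "A", "C"],
    "total_score": [10.0, 7.5, 9.0, 3.2],
    "quality_score": [4.0, 3.0, 2.0, 1.0],
})
grouped = toy.groupby("country", sort=False)

# A list of functions on one column gives one result column per function.
score_stats = grouped["total_score"].agg(["max", "min", "mean"])

# A dictionary chooses different functions for different columns.
per_column = grouped.agg({"total_score": "max", "quality_score": ["min", "mean"]})

print(score_stats)
print(per_column)
```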

Let’s see another use case: find the top 5 countries by total_score and plot them using matplotlib.

import matplotlib.pyplot as plt

df_grpby_country = df.groupby('country', sort=False)
countriesWithMaxScoreDf = df_grpby_country['total_score'].max().nlargest(5)
plt.style.use('ggplot')
countriesWithMaxScoreDf.plot.bar()

Filtering

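Filtering is straightforward: we keep only the groups that satisfy a condition and drop the rest. The function passed to ‘filter’ must return a single True or False for each group, so the comparison is reduced with ‘all’. In our use case, we keep all the groups whose total_score is greater than 1.

df_grpby_country.filter(lambda x: (x['total_score'] > 1).all())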
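The effect of ‘filter’ is easier to see when groups have more than one row: the result contains all the original rows of every group that passes the test, in their original order. Here is a small sketch on hypothetical toy data, with an arbitrary threshold of 5 chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file.
toy = pd.DataFrame({
    "country": ["A", "B", "A", "C"],
    "total_score": [10.0, 7.5, 9.0, 3.2],
})

# Keep every row of each group whose best total_score exceeds 5.
# The lambda returns one boolean per group, as filter requires.
kept = toy.groupby("country", sort=False).filter(
    lambda g: g["total_score"].max() > 5
)

print(kept)
```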

Summary

Most of the time we need to work on groups of data, and ‘groupby’ is a neat way to do that in Pandas. With the three-step split-apply-combine process we discussed, once the groups are created it is easy to apply aggregation, filtering, and transformation. I hope you liked the article.
