Pandas Aggregation Functions

Introduction

In this article, we will see how aggregation works in Pandas. The Pandas library provides many aggregation functions that are simple to understand and apply, and they cover most of the common mathematical calculations. It is difficult to cover all of them in one article, and many are very similar or straightforward, so this article covers some of the important ones. Let’s have a look.

Setup

The setup is the same as in my other Pandas articles on C# Corner. We will work on a Kaggle dataset that provides YouTube trending video statistics (URL: https://www.kaggle.com/datasnaek/youtube-new). The file used in this article is ‘USvideos.csv’.

import numpy as np
import pandas as pd
df = pd.read_csv('USvideos.csv')
df.columns

Here, df.columns returns the column names of the dataset.

Pandas Aggregation functions

Let’s understand by example. First, we will sort the DataFrame in descending order of the number of ‘likes’.

likesdf = df.sort_values(by='likes', ascending=False)
likesdf.head()
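As a minimal sketch of the same sorting step, the snippet below uses a tiny made-up DataFrame instead of the Kaggle file. The name `sample` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical three-row stand-in for the video data (values are made up).
sample = pd.DataFrame({
    "title": ["a", "b", "c"],
    "likes": [10, 30, 20],
})

# sort_values returns a new DataFrame; the original row labels are kept.
by_likes = sample.sort_values(by="likes", ascending=False)

print(by_likes["title"].tolist())  # ['b', 'c', 'a']
```

Note that `sample` itself is left unchanged, which is why the article assigns the result to a new variable.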

The ‘likesdf’ DataFrame has many columns, such as ‘publish_time’, ‘comments’, etc. Let’s select only the numeric columns so that it is easier to apply aggregation functions.

newlikesdf = likesdf.select_dtypes(include=np.number)
newlikesdf.head()

The ‘newlikesdf’ DataFrame now contains only the numeric columns, such as ‘likes’, ‘dislikes’, ‘comment_count’, and ‘views’.
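To see what the numeric selection does, here is a small sketch on a made-up DataFrame. The column names mirror the dataset, but the frame `sample` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the video data; values are made up.
sample = pd.DataFrame({
    "title": ["a", "b", "c"],
    "likes": [10, 30, 20],
    "views": [100.0, 300.0, 200.0],
})

# Keep integer and float columns; the text column 'title' is dropped.
numeric = sample.select_dtypes(include=np.number)

print(list(numeric.columns))  # ['likes', 'views']
```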

sum

The ‘sum’ function calculates the sum of each column.

newlikesdf.sum()

Since sum() is applied to the entire DataFrame, the sum is calculated for every column. The sum function can be applied to an individual column as well.

newlikesdf['likes'].sum()
# 3041147198
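The difference between the DataFrame-wide sum and the single-column sum can be checked on a small made-up frame. The name `sample` and its numbers are assumptions, and the `axis=1` call is an extra illustration that the article does not use.

```python
import pandas as pd

# Hypothetical data; values are chosen so the totals are easy to verify.
sample = pd.DataFrame({
    "likes": [10, 30, 20],
    "dislikes": [1, 3, 2],
})

column_totals = sample.sum()          # Series with one total per column
likes_total = sample["likes"].sum()   # single number for one column
row_totals = sample.sum(axis=1)       # one total per row instead

print(column_totals["likes"], column_totals["dislikes"])  # 60 6
print(likes_total)                                        # 60
print(row_totals.tolist())                                # [11, 33, 22]
```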

max

The ‘max’ function computes the maximum value of every column.

newlikesdf.max() 

Just like max, another function ‘min’ is available to compute the minimum value.
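A short sketch of max and min side by side, on made-up data. The frame `sample` is hypothetical, and `idxmax` is added here only to show how to find which row holds the maximum.

```python
import pandas as pd

# Hypothetical data for illustration.
sample = pd.DataFrame({
    "likes": [10, 30, 20],
    "views": [100, 300, 50],
})

highest = sample.max()   # largest value in each column
lowest = sample.min()    # smallest value in each column

# Row label of the most-liked entry.
top_row = sample["likes"].idxmax()

print(highest.tolist())  # [30, 300]
print(lowest.tolist())   # [10, 50]
print(top_row)           # 1
```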

mean

The ‘mean’ function computes the mean of each column. Mathematically, the mean is the arithmetic average of a set of numbers. For example, the mean of the three numbers 1, 2, 3 is (1 + 2 + 3) / 3 = 2.

newlikesdf.mean()
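The 1, 2, 3 example can be reproduced directly. The sketch below also shows that missing values are skipped by default, which is worth knowing when working with real data. The Series names are made up.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3])
print(values.mean())  # 2.0

# Missing values (NaN) are ignored by default (skipna=True).
with_gap = pd.Series([1, 2, 3, np.nan])
mean_skipping_nan = with_gap.mean()
mean_keeping_nan = with_gap.mean(skipna=False)

print(mean_skipping_nan)  # 2.0
print(mean_keeping_nan)   # nan
```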

agg

The ‘agg’ function accepts a list of functions to apply. For example, we can pass any combination of sum, min, max, and mean to ‘agg’ and get all the results at once rather than computing them individually.

newlikesdf.agg(['sum', 'min'])

Let's add the 'max' function to the list.

newlikesdf.agg(['sum', 'min', 'max'])
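Here is a minimal sketch of what agg returns, using made-up data. The result has one row per function and one column per input column. The dictionary form at the end is an additional option that the article does not use; it is shown here only as an illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
sample = pd.DataFrame({
    "likes": [10, 30, 20],
    "views": [100, 300, 200],
})

# One row per aggregation, one column per original column.
summary = sample.agg(["sum", "min", "max"])
print(summary.loc["sum", "likes"])  # 60
print(summary.loc["max", "views"])  # 300

# A dict applies different functions to different columns;
# combinations that were not requested come back as NaN.
per_column = sample.agg({"likes": ["sum", "max"], "views": ["min"]})
print(per_column.loc["min", "views"])
```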

std

The ‘std’ function computes the standard deviation of each column. The standard deviation measures how widely the values are spread around the mean, so it is a measure of variation.

newlikesdf.std()
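One detail worth checking by hand: by default, Pandas computes the sample standard deviation (ddof=1), which divides by n - 1 rather than n. The sketch below uses a made-up series whose population standard deviation is exactly 2.

```python
import math
import pandas as pd

# Made-up values: mean is 5, squared deviations add up to 32.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = values.std()            # default ddof=1: sqrt(32 / 7)
population_std = values.std(ddof=0)  # ddof=0: sqrt(32 / 8) = 2.0

print(round(sample_std, 4))  # 2.1381
print(population_std)        # 2.0
```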

describe

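The ‘describe’ function produces a descriptive summary of the DataFrame and covers most of the functions we have seen so far. For each numeric column it reports count, mean, std, min, the 25%, 50%, and 75% percentiles, and max. It does not include sum, and ‘agg’ remains a separate function.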

newlikesdf.describe()
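The rows returned by describe can be inspected on a small made-up frame. The name `sample` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
sample = pd.DataFrame({"likes": [10, 20, 30, 40]})

stats = sample.describe()

# Row labels of the summary table.
print(list(stats.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

print(stats.loc["mean", "likes"])  # 25.0
print(stats.loc["50%", "likes"])   # 25.0 (the median)
```

Individual statistics can be read from the result with `.loc`, as shown for the mean and the median.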

Summary

There are a lot more functions available, for example:

  1. min – computes the minimum, just like max
  2. count – counts the non-null values
  3. var – calculates the variance
  4. sem – calculates the standard error of the mean
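The functions listed above can be tried on a small made-up column that contains one missing value. The frame `sample` is hypothetical, and the numbers are chosen so the results are easy to verify by hand.

```python
import math
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
sample = pd.DataFrame({"likes": [1.0, 2.0, 3.0, np.nan]})

n = sample["likes"].count()       # 3: the NaN is not counted
variance = sample["likes"].var()  # sample variance (ddof=1): 1.0
std_error = sample["likes"].sem() # std / sqrt(n) = 1 / sqrt(3)

print(n)                    # 3
print(variance)             # 1.0
print(round(std_error, 4))  # 0.5774
```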

