Pandas - Investigating Pipe Function

Introduction

This article explains the ‘pipe’ function in Pandas, a useful function that lets us chain multiple processing operations into a single expression. In this article, we will look at

  1. What is the Pipe function?
  2. How it works
  3. Understanding by example

Let’s explore

pipe function

Common operations when working with datasets include handling missing values, sorting, dropping duplicates, and removing unwanted data. We can write each of these operations as its own function and then chain the functions together through the ‘pipe’ function.

Syntax

df_final = (df.pipe(functionOne).pipe(functionTwo).pipe(functionThree))
df_final = (df.pipe(handle_missing_values).pipe(sort_df).pipe(drop_duplicates))
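
To make the syntax concrete, here is a minimal sketch on a small made-up DataFrame. The toy data and the bodies of handle_missing_values, sort_df, and drop_duplicates are assumptions for illustration only; real cleaning steps would depend on the dataset.

```python
import numpy as np
import pandas as pd


def handle_missing_values(dataframe):
    # Assumption: "handling" here simply means dropping rows with missing values.
    return dataframe.dropna()


def sort_df(dataframe):
    # Sort by the hypothetical 'likes' column, largest first.
    return dataframe.sort_values(by="likes", ascending=False)


def drop_duplicates(dataframe):
    # Remove rows that are exact duplicates of an earlier row.
    return dataframe.drop_duplicates()


# Made-up data: one missing value and one duplicated row.
toy = pd.DataFrame({
    "title": ["a", "b", "b", "c", "d"],
    "likes": [10, 30, 30, np.nan, 20],
})

# Each pipe call passes the result of the previous step to the next function.
toy_final = (toy.pipe(handle_missing_values)
                .pipe(sort_df)
                .pipe(drop_duplicates))

print(toy_final)
```

Each function takes a DataFrame and returns a new DataFrame, which is what allows the steps to be chained in any order.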

The pipe function allows chaining together functions that take a Series, DataFrame, or GroupBy object as a parameter.
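
The short sketch below shows pipe on each of these object types, using made-up data. The helper names (double, total_likes, min_likes) are hypothetical. It also shows that pipe forwards extra arguments to the function, which is handy when a step needs a parameter.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    "channel": ["x", "x", "y"],
    "likes": [1, 2, 3],
})


def double(series):
    # Receives a Series.
    return series * 2


def total_likes(grouped):
    # Receives a GroupBy object.
    return grouped["likes"].sum()


def min_likes(dataframe, threshold):
    # Receives a DataFrame plus an extra argument forwarded by pipe.
    return dataframe[dataframe["likes"] >= threshold]


doubled = toy["likes"].pipe(double)                  # Series.pipe
totals = toy.groupby("channel").pipe(total_likes)    # GroupBy.pipe
popular = toy.pipe(min_likes, threshold=2)           # DataFrame.pipe with a keyword argument

print(doubled.tolist())
print(totals.to_dict())
print(popular)
```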

Setup

In this article, we use the same dataset that I have used in my other Pandas articles: a Kaggle dataset that provides YouTube video trending statistics (URL: https://www.kaggle.com/datasnaek/youtube-new ). The file we are using is ‘USvideos.csv’.

import pandas as pd

df = pd.read_csv('USvideos.csv')
df.columns

The output lists the columns of the dataset. The examples below use the ‘likes’ and ‘title’ columns.

Examples of Pipe Function

The dataset has multiple columns. We will build a DataFrame that is sorted by the number of ‘likes’ in descending order, then keep only the rows with more than 1 million likes, then filter the videos by a substring of a music video title or music group name. After applying all these steps, we are left with the records we want.

Function One: Sort DataFrame by Descending order of 'likes'

def sortedByLikesInDescendingOrder(dataframe):
    # Use the function's own argument, not the global df
    return dataframe.sort_values(by='likes', ascending=False)

sortedByLikesInDescendingOrder(df).head()


Function Two: Filter rows which have more than a million likes

def filterMoreThanMillionLikes(dataframe):
    # Keep only the rows of the argument with more than 1,000,000 likes
    return dataframe[dataframe['likes'] > 1000000]

filterMoreThanMillionLikes(df).head()


Function Three: In this function, we will filter rows where the title starts with 'BTS'

def filterByTitle(dataframe):
    return dataframe[dataframe.title.str.startswith('BTS')]

filterByTitle(df).head()
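
Note that str.startswith only matches titles that begin with the given text. If the goal is to match the text anywhere in the title, str.contains is the usual alternative. The sketch below compares the two on made-up titles; it is an illustration, not a change to the function above.

```python
import pandas as pd

# Made-up titles for illustration.
titles = pd.Series(["BTS - song A", "Best of BTS live", "Other video"])

# Matches only titles that begin with 'BTS'.
starts_with = titles[titles.str.startswith("BTS")]

# Matches 'BTS' anywhere in the title; regex=False treats the pattern as plain text.
contains = titles[titles.str.contains("BTS", regex=False)]

print(starts_with.tolist())
print(contains.tolist())
```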


These three functions are independent of each other. This is where the ‘pipe’ function comes in: we can chain all of them together to get the result.

df_final = (df.pipe(sortedByLikesInDescendingOrder)
              .pipe(filterMoreThanMillionLikes)
              .pipe(filterByTitle))

df_final.head() 
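
A pipe chain is equivalent to nesting the function calls, but it reads top to bottom in the order the steps run. The sketch below checks this on a small made-up DataFrame, since the toy titles and like counts are assumptions and not rows from ‘USvideos.csv’.

```python
import pandas as pd


def sortedByLikesInDescendingOrder(dataframe):
    return dataframe.sort_values(by="likes", ascending=False)


def filterMoreThanMillionLikes(dataframe):
    return dataframe[dataframe["likes"] > 1000000]


def filterByTitle(dataframe):
    return dataframe[dataframe["title"].str.startswith("BTS")]


# Made-up rows for illustration.
toy = pd.DataFrame({
    "title": ["BTS - song A", "Other video", "BTS - song B", "BTS - song C"],
    "likes": [2500000, 3000000, 500000, 4000000],
})

# Chained version: steps are listed in the order they are applied.
piped = (toy.pipe(sortedByLikesInDescendingOrder)
            .pipe(filterMoreThanMillionLikes)
            .pipe(filterByTitle))

# Nested version: the same steps, read from the inside out.
nested = filterByTitle(
    filterMoreThanMillionLikes(
        sortedByLikesInDescendingOrder(toy)))

print(piped)
print(piped.equals(nested))
```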


Let’s validate the results. First, check that all the music video titles contain the string ‘BTS’.

df_final['title']


All the rows contain the string 'BTS'. Next, validate the number of 'likes'.

df_final['likes']
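
Instead of inspecting the columns by eye, the same checks can be written as boolean tests. The sketch below uses a small made-up stand-in for df_final (the name df_final_toy and its rows are assumptions); the same expressions should work on the real result.

```python
import pandas as pd

# Hypothetical stand-in for df_final, with made-up rows.
df_final_toy = pd.DataFrame({
    "title": ["BTS - song C", "BTS - song A"],
    "likes": [4000000, 2500000],
})

# Every title starts with 'BTS'.
all_titles_match = df_final_toy["title"].str.startswith("BTS").all()

# Every row has more than a million likes.
all_above_million = (df_final_toy["likes"] > 1000000).all()

# The rows are still sorted by likes in descending order.
is_sorted_descending = df_final_toy["likes"].is_monotonic_decreasing

print(all_titles_match, all_above_million, is_sorted_descending)
```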


Summary

What we have explored here is the beauty of higher-order functions. Higher-order functions treat functions as values, which is exactly what we did through the ‘pipe’ function. It is a built-in function that takes care of chaining, and it is usually sufficient; if required, we can write our own pipe function with customized behavior.
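
As a rough idea of what a custom pipe function could look like, here is a minimal sketch built on functools.reduce. The name custom_pipe, the helper functions, and the toy data are all hypothetical; this is one possible design, not how Pandas implements pipe.

```python
from functools import reduce

import pandas as pd


def custom_pipe(data, *functions):
    """Apply each function in order, feeding each result to the next one."""
    return reduce(lambda result, func: func(result), functions, data)


def sort_by_likes(dataframe):
    return dataframe.sort_values(by="likes", ascending=False)


def top_two(dataframe):
    return dataframe.head(2)


# Made-up data for illustration.
toy = pd.DataFrame({"likes": [1, 3, 2]})

custom_result = custom_pipe(toy, sort_by_likes, top_two)
builtin_result = toy.pipe(sort_by_likes).pipe(top_two)

print(custom_result)
print(custom_result.equals(builtin_result))
```

A custom version like this is a natural place to add behavior such as logging the shape of the data after every step.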