Pandas - Filtering DataFrame

Introduction

The article explains how filtering works in Pandas, the most important common operation in Data Analysis is to filter the data, in Pandas, there are quite a few ways to filter the DataFrame. Let's explore all of them

Setup

We will work on a Kaggle dataset that provides YouTube video trending statistics, URL: https://www.kaggle.com/datasnaek/youtube-new, and the file we are using is ‘USvideos.csv’.

df = pd.read_csv('USvideos.csv')
df.columns   

The columns of the data set are

Using Logical Operators

In Data Analysis, the most common scenario is to find the data which is greater / lesser in DataFrame, For example, in our YouTube trending video analysis, filter DataFrame which has more than a million likes.

First, create our DataFrame and sort it by the ‘likes’ column in ascending order to understand our dataset and check what the total count is,

likesdf = df.sort_values('likes', ascending=False)
len(likesdf) // 40949

Upon executing the ‘len’ function on the DataFrame the total number of rows in the dataset is 40,949, let’s filter the ‘likesdf’ and identify the videos which have more than a million likes

millionLikesDF = likesdf[likesdf['likes'] > 1000000] 
len(millionLikesDF) // 369

The total number returned is 369, to get the ‘title’ we need to simply pass the ‘title’ column in the DataFrame.

We can combine multiple logical operators for getting the more granular results, lets find videos which is liked and disliked more than one million times

DF = likesdf[(likesdf['likes'] > 1000000) & (likesdf['dislikes'] > 1000000)] //12

Upon executing we found that there are 12 videos in the dataset which is liked and disliked more than a million times.

Filtering Textual Data

For filtering the text data we should use the ‘Str’ accessor, there are various efficient functions available for us to filter Strings in Pandas.

Filter the videos whose title starts with ‘BTS’.

likesdf[likesdf.title.str.startswith('BTS')] //151

Total 151 videos exist in the dataset whose title starts with ‘BTS’, similarly, we can use the ‘contains’ function for filtering the videos based on any pattern in the title.

How about filtering the title whose title does not start with ‘BTS’, in that case, we should be using the tilde operator (~).

likesdf[~likesdf.title.str.startswith('BTS')] //40798

Total number = 40798 and 40798 + 151 = 40949, which is our total data set count.

Query Function

The query function, query the columns of the DataFrame with a Boolean expression. Using the query function, let’s filter the dataset and find the videos which has more than a million likes.

likesdf.query('likes > 1000000') //369

using query function we can filter based on multiple conditions, let’s identify the same scenario to find the videos which have more than one million likes and dislikes.

likesdf.query('likes > 1000000 and dislikes > 1000000') //12 rows.

isin

isin function is also used for filtering, the result will be returned only when all the elements match in the List on which ‘isin’ is performed.

titleList = ['BTS', 'Hollywood']
likesdf[likesdf.title.isin(titleList)]

Nothing will be returned because we don’t have any title in the dataset

titleList = ['DJI Live - Adventure Unfolds']
likesdf[likesdf.title.isin(titleList)]

This time result will be returned because the match found

loc / iloc

The loc function selects rows/columns using labels, whereas the iloc filters are based on the indices.

loc: Let’s explore .loc with an example, filter the rows in the DataFrame where category_id is 10.

likesdf.loc[likesdf['category_id'] == 10]

We can combine logical operators like and/or within the loc function as well. If we want to locate only single row data, for example, filter the ‘title’ of the 10th row of the DataFrame, we can pass index, in that case,

likesdf.loc[10, 'title']  

iloc: The iloc should be used for filtering the DataFrame based on row/column indices. Filter the data of the 0th row and 0th column in the DataFrame

likesdf.iloc[0,0] 

Or inside a loop, we can do,

for i in range(len(likesdf)):
     print(likesdf.iloc[i,3], likesdf.iloc[i,4])

Finding Largest or Smallest Value in DataFrame

For filtering the largest or smallest rows in specific columns in DataFrame, we can use ‘nlargest’ and ‘nsmallest’ functions.

Filter top two most liked video’s

likesdf.nlargest(2, 'likes')

For least liked videos

likesdf.nsmallest(2, 'likes')

These are the most efficient approaches of Data filtering in Pandas, I hope you find this article useful, and Thank You for reading.