Pandas: Sorting DataFrame

Sameer Shukla

Introduction

This article explains how sorting works in Pandas. A DataFrame is a two-dimensional data structure, much like a table, that holds rows and columns. In Pandas, we can sort a DataFrame either by a single column or by multiple columns.

Setup: We will work with a Kaggle dataset of YouTube trending video statistics, URL: https://www.kaggle.com/datasnaek/youtube-new. The file we are using is ‘USvideos.csv’.

import pandas as pd

df = pd.read_csv('USvideos.csv')
df.columns

The columns of the dataset include ‘title’, ‘likes’ and ‘dislikes’, which the examples in this article rely on.

sort_values: In Pandas, the ‘sort_values’ function is used to sort a DataFrame by the values in one or more columns. It has the following syntax:

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

We will explore several of these parameters with examples to understand in practice what they do.
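Two parameters in the signature, ‘na_position’ and ‘ignore_index’, are easiest to understand on a tiny table. The following is a minimal sketch that uses a hypothetical four-row DataFrame (the `videos` name and its values are made up for illustration, not taken from the Kaggle file):

```python
import pandas as pd

# Hypothetical sample data; the real dataset has many more rows and columns.
videos = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "likes": [10.0, float("nan"), 30.0, 20.0],
})

# By default (na_position='last') missing values are placed at the end.
nan_last = videos.sort_values("likes")

# na_position='first' places missing values at the top instead.
nan_first = videos.sort_values("likes", na_position="first")

# ignore_index=True relabels the sorted rows 0, 1, ..., n-1.
renumbered = videos.sort_values("likes", ignore_index=True)

print(nan_last)
print(nan_first)
print(renumbered)
```

Without ‘ignore_index=True’, every row keeps the index label it had before sorting.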

Sorting by Single Column

To sort the DataFrame by a single column, we can pass the column name either positionally or through the ‘by’ keyword; both work the same way. As we have seen in the columns, there is a column named ‘likes’, which represents the total number of likes:

df.sort_values(by='likes', ascending=False)

by: the column name, or list of column names, to sort by.

ascending: the sort order. By default, sort_values sorts in ascending order; to sort in descending order, pass ascending=False.
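To try this without downloading the dataset, here is a small sketch on a hypothetical three-row DataFrame (the `videos` name and its values are illustrative assumptions). It also checks that the positional and keyword forms give the same result:

```python
import pandas as pd

# Hypothetical sample data standing in for the Kaggle file.
videos = pd.DataFrame({
    "title": ["Cats", "Dogs", "Birds"],
    "likes": [150, 900, 420],
})

# Column name passed through the 'by' keyword.
by_keyword = videos.sort_values(by="likes", ascending=False)

# Same column name passed positionally.
positional = videos.sort_values("likes", ascending=False)

print(by_keyword)
print(by_keyword.equals(positional))
```

The original `videos` DataFrame is left untouched; sort_values returns a new, sorted DataFrame.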

Sorting by Multiple Columns

To sort a DataFrame by multiple columns, pass the column names as a list to the sort_values function.

Let’s sort the DataFrame by both the ‘likes’ and ‘dislikes’ columns. Rows are ordered by ‘likes’ first, and ‘dislikes’ decides the order among rows that have the same number of likes.

The ascending parameter also accepts a list, with one entry per column, which lets us sort each column in a different order.

df.sort_values(['likes','dislikes'], ascending=[False, False])
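The effect of the second column only shows up when the first column has ties. The sketch below uses a hypothetical DataFrame (names and values are assumptions for illustration) in which three videos share the same ‘likes’ value, and compares two different ‘ascending’ lists:

```python
import pandas as pd

# Hypothetical sample data with a three-way tie on 'likes'.
videos = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "likes": [100, 100, 50, 100],
    "dislikes": [5, 20, 1, 10],
})

# Most liked first; among equal likes, most disliked first.
both_desc = videos.sort_values(["likes", "dislikes"], ascending=[False, False])

# Most liked first; among equal likes, least disliked first.
mixed = videos.sort_values(["likes", "dislikes"], ascending=[False, True])

print(both_desc)
print(mixed)
```

The list passed to ‘ascending’ must have the same length as the list of columns.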

Understanding the parameters of the function

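kind: The kind parameter selects the sorting algorithm: quicksort, mergesort, heapsort or stable. By default, quicksort is used. Of these, mergesort and stable guarantee that rows with equal values keep their original relative order. For a DataFrame, this option is only applied when sorting on a single column or label.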

df.sort_values('likes', ascending=False, kind="mergesort")
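One practical reason to choose mergesort is stability: rows with equal sort values keep the order they had before sorting. A minimal sketch, assuming a hypothetical DataFrame with repeated ‘likes’ values:

```python
import pandas as pd

# Hypothetical sample data with repeated 'likes' values.
videos = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "likes": [5, 3, 5, 3],
})

# mergesort is stable: 'b' stays ahead of 'd', and 'a' stays ahead of 'c'.
stable_sorted = videos.sort_values("likes", kind="mergesort")

print(stable_sorted)
```

With the default quicksort, the relative order of tied rows is not guaranteed.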

Inplace sorting: By default, sort_values returns a new sorted DataFrame and leaves the original unchanged. To sort the original DataFrame in place, pass inplace=True; the call then returns None.

df.sort_values('likes', ascending=False, kind="mergesort", inplace=True)
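A common mistake is to assign the result of an in-place sort back to a variable. The sketch below, on a hypothetical three-row DataFrame, shows what the call returns and what happens to the original object:

```python
import pandas as pd

# Hypothetical sample data.
videos = pd.DataFrame({
    "title": ["Cats", "Dogs", "Birds"],
    "likes": [150, 900, 420],
})

# With inplace=True the DataFrame itself is reordered and None is returned.
result = videos.sort_values("likes", ascending=False, inplace=True)

print(result)   # None
print(videos)   # rows are now ordered by 'likes', descending
```

Writing `videos = videos.sort_values(..., inplace=True)` would therefore replace the DataFrame with None.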

key: Through ‘key’ we can apply a function to the values just before sorting. The function can be a named function or a simple lambda; it receives a ‘Series’ and must return a ‘Series’ of the same shape. The key only affects the ordering; the values stored in the DataFrame are not changed.

df.sort_values(by='title', key=lambda x: x.str.lower())
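To see why the lowercase key matters, compare it with the default string ordering, in which uppercase letters sort before lowercase ones. A minimal sketch with hypothetical titles:

```python
import pandas as pd

# Hypothetical sample data with mixed-case titles.
videos = pd.DataFrame({"title": ["apple", "Banana", "cherry"]})

# Default ordering is case-sensitive: 'B' sorts before 'a'.
case_sensitive = videos.sort_values(by="title")

# The key lowercases the values only for comparison purposes.
case_insensitive = videos.sort_values(by="title", key=lambda s: s.str.lower())

print(case_sensitive["title"].tolist())
print(case_insensitive["title"].tolist())
```

The titles in the result keep their original casing; only the comparison uses the lowercased values.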

Understanding the ‘sort_index’ function: The sort_index function sorts the DataFrame by its index labels rather than by column values. Let’s revisit what we did with sort_values:

sdf = df.sort_values(by='likes', ascending=False)

When we sort the DataFrame using sort_values, pandas reorders the rows by the values in the given column, and each row keeps its original index label, so the index of ‘sdf’ is no longer in order. To sort the ‘sdf’ DataFrame by its index, we can use the sort_index function:

sdf.sort_index()
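The round trip can be checked on a small example. This sketch assumes a hypothetical three-row DataFrame with the default 0, 1, 2 index:

```python
import pandas as pd

# Hypothetical sample data with the default integer index.
videos = pd.DataFrame({
    "title": ["Cats", "Dogs", "Birds"],
    "likes": [150, 900, 420],
})

# Sorting by values shuffles the index labels along with the rows.
sdf = videos.sort_values(by="likes", ascending=False)
print(sdf.index.tolist())

# Sorting by index puts the rows back in their original order.
restored = sdf.sort_index()
print(restored.equals(videos))
```

Like sort_values, sort_index returns a new DataFrame unless inplace=True is passed.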

To sort the columns of the DataFrame by their labels, we pass ‘axis=1’ to the sort_index function.

sdf.sort_index(axis=1)
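With ‘axis=1’ the column labels are reordered alphabetically while the rows stay where they are. A minimal sketch on a hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical sample data; column order is title, likes, dislikes.
videos = pd.DataFrame({
    "title": ["Cats", "Dogs"],
    "likes": [150, 900],
    "dislikes": [3, 12],
})

# axis=1 sorts the column labels; row order is unchanged.
by_column = videos.sort_index(axis=1)

print(list(by_column.columns))
```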

Thank you for reading.