Pandas - Useful DataFrame functions

Introduction

This article explains some general DataFrame functions and attributes that are useful for getting a first look at the data. The articles I have written so far on Pandas don’t cover these basics.

The article covers the following DataFrame functions:

  • head & tail functions
  • What is transpose and how it works
  • dtypes & select_dtypes
  • other basic functions like info, shape, ndim, memory_usage

These functions help in understanding the data we are dealing with. Let’s explore.

Setup

The setup is the same as in my other pandas articles on CSharpCorner. We will work on a Kaggle dataset that provides YouTube video trending statistics. URL: https://www.kaggle.com/datasnaek/youtube-new

import pandas as pd
df = pd.read_csv('USvideos.csv')  # load the data set into a DataFrame
df.columns  # column names

The ‘df.columns’ attribute lists the 16 columns of the data set.

head & tail

These are very basic functions: ‘head(n)’ returns the first n rows, and ‘tail(n)’ returns the last n rows.

df.head(2) #first 2 rows 
df.tail(2) #last 2 rows 

If no argument is passed, both functions default to 5 rows. Another handy trick: to look at the head and tail together in a single DataFrame, combine them with ‘pd.concat’. Older tutorials use ‘DataFrame.append’ for this, but it was deprecated in pandas 1.4 and removed in pandas 2.0. For example:

head_df = df.head(2)
newdf = pd.concat([head_df, df.tail(2)])
newdf


The resulting index labels are 0 and 1 (the first two rows) and 40947 and 40948 (the last two rows).
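The same idea can be tried without downloading the data set. The sketch below uses a small hypothetical frame (the name `demo` and its values are made up for illustration) to show the default row count and how the original index labels survive the combination.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for the YouTube data set
demo = pd.DataFrame({
    "video_id": [f"v{i}" for i in range(10)],
    "views": [i * 100 for i in range(10)],
})

first_rows = demo.head()    # no argument: first 5 rows
last_rows = demo.tail(3)    # last 3 rows

# Stack the first two and last two rows; the original index labels are kept
ends = pd.concat([demo.head(2), demo.tail(2)])
print(ends)
```

Because the index labels are preserved, it is easy to see which rows came from the top and which from the bottom.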

transpose

To transpose means to interchange or re-position, and that is exactly what ‘transpose’ does: it swaps the index and the columns of a DataFrame, reflecting the data over its main diagonal so that rows become columns and columns become rows. You can call ‘transpose()’ on the DataFrame or use the ‘.T’ attribute; both give the same result.

df.transpose() # returns 16 rows * 40949 Columns
df.T # returns 16 rows * 40949 Columns


Transposing is useful because it changes the angle from which we look at the DataFrame; a frame with many columns, for instance, can be easier to scan once its columns are laid out as rows.
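A minimal sketch on a hypothetical three-row frame (the name `demo` and its values are illustrative) shows the shape swap. It also shows a side effect worth knowing: when the columns have different types, the transposed columns generally fall back to the generic object dtype, because each new column now mixes strings and numbers.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
demo = pd.DataFrame({
    "title": ["a", "b", "c"],
    "views": [10, 20, 30],
})

flipped = demo.T                      # same as demo.transpose()
print(demo.shape, flipped.shape)      # rows and columns trade places
print(flipped.index.tolist())         # old column names are now the index
print(flipped.dtypes)                 # mixed content per column
```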

dtypes & select_dtypes

Read ‘dtypes’ as “data types”: the dtypes attribute returns a Series holding the data type of every column in the DataFrame. It is important to know which data types we are dealing with before applying a transformation or any other function to a specific column.

df.dtypes

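As a rough illustration on a hypothetical frame (the names and values are invented), the returned Series is indexed by column name, so a single column's type can be looked up directly and then converted with ‘astype’ if needed.

```python
import pandas as pd

# Hypothetical frame with three differently typed columns
demo = pd.DataFrame({
    "views": [10, 20],
    "ratio": [0.5, 0.25],
    "removed": [True, False],
})

types = demo.dtypes          # Series indexed by column name
print(types)
print(types["ratio"])        # dtype of a single column

# Convert a column once we know its current type is not what we want
demo["views"] = demo["views"].astype("float64")
print(demo["views"].dtype)
```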

select_dtypes

Explanations of select_dtypes are sparse in tutorials, but the function name says what it does: it takes one or more data types and returns the subset of columns that match. Pass the types through the ‘include’ and/or ‘exclude’ parameters, either as a single value or as a list; at least one of the two must be supplied.

df.select_dtypes(include='int64') 
df.select_dtypes(include=['int64','bool'])


Similarly, to exclude the object and int64 columns and return everything else:

df.select_dtypes(exclude=['int64','object'])
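The following sketch runs both parameters against a small hypothetical frame (column names and values are made up). It also uses the generic ‘number’ selector, which matches integer and float columns but not booleans.

```python
import pandas as pd

# Hypothetical frame with text, integer, float and boolean columns
demo = pd.DataFrame({
    "title": ["a", "b"],
    "views": pd.Series([10, 20], dtype="int64"),
    "ratio": [0.5, 0.25],
    "removed": [True, False],
})

numeric = demo.select_dtypes(include="number")              # ints and floats
counts_and_flags = demo.select_dtypes(include=["int64", "bool"])
non_numeric = demo.select_dtypes(exclude="number")          # everything else

print(numeric.columns.tolist())
print(counts_and_flags.columns.tolist())
print(non_numeric.columns.tolist())
```

The result is always a DataFrame, and the matching columns keep their original order.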

info

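The info function prints a concise summary of the DataFrame: its class, the index range (and so the number of rows), the total number of columns, each column’s name, non-null count and dtype, a count of columns per dtype, and the memory usage. Note that it prints the summary rather than returning it.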

df.info()

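Since info prints instead of returning a value, the summary cannot be assigned to a variable directly. One possible workaround, sketched here on a hypothetical frame, is to pass a text buffer through the ‘buf’ parameter and read the summary back as a string.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value in the text column
demo = pd.DataFrame({
    "title": ["a", None, "c"],
    "views": [10, 20, 30],
})

result = demo.info()        # prints the summary; the return value is None

# Capture the summary as text instead of printing it
buffer = io.StringIO()
demo.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The non-null counts in the summary are a quick way to spot columns with missing values.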

Other functions and attributes

shape: The shape attribute returns a tuple with the row and column counts.

df.shape # (40949, 16)

ndim: The ndim attribute returns the number of dimensions as an integer: 1 for a Series and 2 for a DataFrame.

df.ndim #2 as it’s a DataFrame
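To make the Series versus DataFrame difference concrete, here is a short sketch on a hypothetical frame; selecting a single column yields a Series, whose shape is a one-element tuple.

```python
import pandas as pd

# Hypothetical frame: 3 rows, 2 columns
demo = pd.DataFrame({"title": ["a", "b", "c"], "views": [10, 20, 30]})

n_rows, n_cols = demo.shape      # unpack the (rows, columns) tuple
print(n_rows, n_cols)            # 3 2
print(demo.ndim)                 # 2

views = demo["views"]            # a single column is a Series
print(views.shape)               # (3,)
print(views.ndim)                # 1
```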

memory_usage(): The memory_usage function returns a Series with the memory usage of each column in bytes; by default it also includes an entry for the index.

df.memory_usage()
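The function also accepts ‘index’ and ‘deep’ parameters. The sketch below, on a hypothetical frame, shows both; with ‘deep=True’ pandas inspects the actual contents of object columns such as strings, so the figure for text columns is usually larger and closer to the real footprint.

```python
import pandas as pd

# Hypothetical frame with one text column and one int64 column
demo = pd.DataFrame({
    "title": ["a", "bb", "ccc"],
    "views": pd.Series([10, 20, 30], dtype="int64"),
})

per_column = demo.memory_usage()             # includes an "Index" entry
no_index = demo.memory_usage(index=False)    # columns only
deep = demo.memory_usage(deep=True)          # inspects string contents too

print(per_column)
print(no_index["views"])                     # 3 values * 8 bytes each
print(deep.sum())                            # total bytes for the frame
```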

Summary

The functions covered in this article are important for understanding the data: the number of rows and columns, their types, and other basic information. They are small functions, but extremely useful.

