Introduction
In this article I am going to demonstrate how to use dplyr package in R along with a planes dataset. We will be using various functions provided with thr dplyr package to manipulate and transform the data and to create a subset of data as well. The functions which we will be using are filter(), arrange() and select().
Loading package and dataset
We will be using predefined planes dataset which belongs to a package named nycflights13. Therefore we need to load the package first as follows,
Now we need to load the dataset planes as we will be using it to transform and manipulate the data,
The above code will generate the following output,
- > planes
-
- tailnum year type manufacturer model engines seats speed engine
- <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
- 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
- 2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 6 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 7 N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 8 N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 9 N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 10 N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
-
Now we need to load the library named dplyr to use the functions of the package dplyr. We can use the below syntax to load the dplyr library,
The above code will generate the following output,
- > library(dplyr)
-
- Attaching package: ‘dplyr’
-
- The following objects are masked from ‘package:stats’:
-
- filter, lag
-
- The following objects are masked from ‘package:base’:
-
- intersect, setdiff, setequal, union
-
- >
Filter() function to filter data on the basis of variable names
Filter function is used to select and display observations according to the values of the arguments included inside the function. Filter function creates subsets of observations. Inside the function the data frame name is the first argument and the second argument is the name of the variable of the dataset whose values we will be using to filter the data frame.
Let us discuss this function with the help of the example below,
- > filter(planes, year == 1999)
-
- tailnum year type manufacturer model engines seats speed engine
- <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
- 1 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 2 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 3 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 4 N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 5 N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 6 N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 7 N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 8 N111US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 9 N112US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
- 10 N113UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
-
In the argument part of the filter function, we have mentioned year == 1999, therefore a new dataset containing all the observations of year 1999 will be displayed.
- > filter(planes, year == 1999, seats == 55)
-
- tailnum year type manufacturer model engines seats speed engine
- <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
- 1 N12967 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 2 N13964 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 3 N13965 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 4 N13968 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 5 N13969 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 6 N13970 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 7 N13975 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 8 N13978 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 9 N13979 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 10 N14972 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
-
- >
In the argument part of the filter function, we have mentioned the value of year as 1999 and the value of seats as 55, therefore filtering operation will be executed by dplyr library and a new dataset containing all the observations of 1999 and containing 55 seats will be displayed.
Using assignment operator
The result of the new subset of dataset that has been created can also be saved in another variable using assignment operator,
- > df <- filter(planes, year == 2000, seats == 55)
If we want to save the subset in a variable and print the result at the same time then we can include the entire assignment syntax inside parentheses as follows,
- > (df <- filter(planes, year == 2000, seats == 55))
The above code will generate the following output,
- > (df <- filter(planes, year == 2000, seats == 55))
-
- tailnum year type manufacturer model engines seats speed engine
- <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
- 1 N12921 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 2 N12922 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 3 N12924 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 4 N12996 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 5 N13988 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 6 N13989 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 7 N13992 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 8 N13994 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 9 N13995 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
- 10 N13997 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
-
- >
Arrange function to arrange data in different orders
arrange() function is used to select and display rows either in ascending or descending order according to the order mentioned inside arrange() function as argument. It takes the name of the variables as arguments and displays data in two different orders.
- > arrange(planes, year, seats, speed)
The above code will generate the following output,
- > arrange(planes, year, seats, speed)
-
- tailnum year type manufacturer model engines seats speed engine
- <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
- 1 N381AA 1956 Fixed wing multi engine DOUGLAS DC-7BF 4 102 232 Reciprocating
- 2 N201AA 1959 Fixed wing single engine CESSNA 150 1 2 90 Reciprocating
- 3 N567AA 1959 Fixed wing single engine DEHAVILLAND OTTER DHC-3 1 16 95 Reciprocating
- 4 N378AA 1963 Fixed wing single engine CESSNA 172E 1 4 105 Reciprocating
- 5 N575AA 1963 Fixed wing single engine CESSNA 210-5(205) 1 6 NA Reciprocating
- 6 N14629 1965 Fixed wing multi engine BOEING 737-524 2 149 NA Turbo-fan
- 7 N615AA 1967 Fixed wing multi engine BEECH 65-A90 2 9 202 Turbo-prop
- 8 N425AA 1968 Fixed wing single engine PIPER PA-28-180 1 4 107 Reciprocating
- 9 N383AA 1972 Fixed wing multi engine BEECH E-90 2 10 NA Turbo-prop
- 10 N364AA 1973 Fixed wing multi engine CESSNA 310Q 2 6 167 Reciprocating
-
We can also use desc() function as argument inside arrange function to display the rows and columns of the new data frame reordered in descending order,
- > arrange(planes, desc(year))
-
- tailnum year type manufacturer model engines seats speed engine
- <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
- 1 N150UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
- 2 N151UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
- 3 N152UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
- 4 N153UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
- 5 N154UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
- 6 N155UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
- 7 N156UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
- 8 N157UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
- 9 N198UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
- 10 N199UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
-
- >
Select() function to select columns of a table
Select function is used to display a subset of dataset containing only those columns which are mentioned inside select function as arguments. The select() function also displays data based on the conditions mentioned inside select() function. These conditions are applied on the variables of the dataset loaded at the beginning.
Displaying columns by including column names as argument. To select columns by names we can use the below syntax,
- select(planes, year, manufacturer, seats)
The above code will generate the following output,
- > select(planes, year, manufacturer, seats)
-
- year manufacturer seats
- <int> <chr> <int>
- 1 2004 EMBRAER 55
- 2 1998 AIRBUS INDUSTRIE 182
- 3 1999 AIRBUS INDUSTRIE 182
- 4 1999 AIRBUS INDUSTRIE 182
- 5 2002 EMBRAER 55
- 6 1999 AIRBUS INDUSTRIE 182
- 7 1999 AIRBUS INDUSTRIE 182
- 8 1999 AIRBUS INDUSTRIE 182
- 9 1999 AIRBUS INDUSTRIE 182
- 10 1999 AIRBUS INDUSTRIE 182
-
- >
To display all the columns between type and engines, we can use the below syntax,
- > select(planes, type:engines)
The above code will generate the following output,
- > select(planes, type:engines)
-
- type manufacturer model engines
- <chr> <chr> <chr> <int>
- 1 Fixed wing multi engine EMBRAER EMB-145XR 2
- 2 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
- 3 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
- 4 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
- 5 Fixed wing multi engine EMBRAER EMB-145LR 2
- 6 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
- 7 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
- 8 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
- 9 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
- 10 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
-
- >
To display all the selected columns excluding those from type to engines, we can use the below syntax,
- > select(planes, -(type:engines))
The above code will generate the following output,
- > select(planes, -(type:engines))
-
- tailnum year seats speed engine
- <chr> <int> <int> <int> <chr>
- 1 N10156 2004 55 NA Turbo-fan
- 2 N102UW 1998 182 NA Turbo-fan
- 3 N103US 1999 182 NA Turbo-fan
- 4 N104UW 1999 182 NA Turbo-fan
- 5 N10575 2002 55 NA Turbo-fan
- 6 N105UW 1999 182 NA Turbo-fan
- 7 N107US 1999 182 NA Turbo-fan
- 8 N108UW 1999 182 NA Turbo-fan
- 9 N109UW 1999 182 NA Turbo-fan
- 10 N110UW 1999 182 NA Turbo-fan
-
- >
Summary
In this article I demonstrated how to use dplyr package in R along with planes dataset. We have used various functions provided with dplyr package to manipulate and transform the data and to create a subset of data as well. Various functions such as filter(), arrange() and select() are used. Proper coding snippets and outputs are also provided.