How To Create Subsets Of Data Using Logical Values In R

Introduction

 
In this article, I am going to demonstrate how to create subsets of data using logical values, so that relevant data can be extracted from a dataset before building a machine learning model. A logical condition placed inside single square brackets, usually together with the dollar operator, keeps only those observations for which the condition evaluates to TRUE and excludes those for which it evaluates to FALSE.
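As a minimal sketch of this idea, the example below uses a small vector typed in by hand (the names depth, is_shallow and shallow are illustrative assumptions, not part of any dataset). The comparison produces one TRUE or FALSE per element, and the single square brackets keep the elements marked TRUE.

```r
# Illustrative values typed in by hand
depth <- c(562, 650, 42, 626, 195)

# The comparison returns one logical value per element
is_shallow <- depth < 600
print(is_shallow)   # TRUE FALSE TRUE FALSE TRUE

# Single square brackets keep only the elements marked TRUE
shallow <- depth[is_shallow]
print(shallow)      # 562 42 195
```

Storing the condition in its own variable first is optional, but it makes it easy to inspect which observations will be kept before the subset is created.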
 
Extracting data from a dataset, or creating subsets of it, is a data pre-processing step used in R to obtain clean and relevant data, so that a machine learning model trained on it can make accurate predictions.
 
R provides several kinds of objects, such as data frames, vectors, arrays and lists, from which subsets can be created and in which the resulting subsets can be stored. Different methods are available for subsetting each of them.
 
Pre-processing is one of the most important steps in analysing data in R, and several operators can be used to create a subset of a dataset.
 

Different types of operators for creating subsets of data

 
There are three kinds of operators that can be used to create subsets: the dollar operator, the double square brackets operator and the single square brackets operator. They are described below.
 
Dollar operator
 
The dollar operator selects a single variable of a dataset by name. By writing the dataset name, the dollar sign and the variable name, we select one variable at a time and obtain a subset containing that variable alone. When the dollar operator is used with a data frame, the result is a vector.
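A short sketch of this behaviour, using the built-in quakes dataset (the variable name mag_values is an illustrative assumption):

```r
# quakes ships with base R, so nothing needs to be loaded
mag_values <- quakes$mag

# The result is a plain numeric vector, not a data frame
print(is.data.frame(mag_values))   # FALSE
print(is.numeric(mag_values))      # TRUE

# It holds one value per observation of the data frame
print(length(mag_values) == nrow(quakes))   # TRUE
```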
 
Now we will look at some examples of how to use the dollar operator to create subsets of a dataset using logical values. We will use the built-in quakes dataset and first create a smaller working subset of it as follows,
    > data = quakes[-(30:990),]
    > data
            lat   long depth mag stations
    1    -20.42 181.62   562 4.8       41
    2    -20.62 181.03   650 4.2       15
    3    -26.00 184.10    42 5.4       43
    4    -17.97 181.66   626 4.1       19
    5    -20.42 181.96   649 4.0       11
    6    -19.68 184.31   195 4.0       12
    7    -11.70 166.10    82 4.8       43
    8    -28.11 181.93   194 4.4       15
    9    -28.74 181.74   211 4.7       35
    10   -17.47 179.59   622 4.3       19
    11   -21.44 180.69   583 4.4       13
    12   -12.26 167.00   249 4.6       16
    13   -18.54 182.11   554 4.4       19
    14   -21.00 181.66   600 4.4       10
    15   -20.70 169.92   139 6.1       94
    16   -15.94 184.95   306 4.3       11
    17   -13.64 165.96    50 6.0       83
    18   -17.83 181.50   590 4.5       21
    19   -23.50 179.78   570 4.4       13
    20   -22.63 180.31   598 4.4       18
    21   -20.84 181.16   576 4.5       17
    22   -10.98 166.32   211 4.2       12
    23   -23.30 180.16   512 4.4       18
    24   -30.20 182.00   125 4.7       22
    25   -19.66 180.28   431 5.4       57
    26   -17.94 181.49   537 4.0       15
    27   -14.72 167.51   155 4.6       18
    28   -16.46 180.79   498 5.2       79
    29   -20.97 181.47   582 4.5       25
    991  -20.73 181.42   575 4.3       18
    992  -15.45 181.42   409 4.3       27
    993  -20.05 183.86   243 4.9       65
    994  -17.95 181.37   642 4.0       17
    995  -17.70 188.10    45 4.2       10
    996  -25.93 179.54   470 4.4       22
    997  -12.28 167.06   248 4.7       35
    998  -20.13 184.20   244 4.5       34
    999  -17.40 187.80    40 4.5       14
    1000 -21.59 170.56   165 6.0      119
As we can see from the code above, a subset of the quakes dataset has been created. It contains all five variables but only those observations whose row numbers are not listed inside the parentheses preceded by the negative sign, that is, rows 1 to 29 and 991 to 1000.
    > data = quakes[-(40:980),-(2:4)]
    > data
            lat stations
    1    -20.42       41
    2    -20.62       15
    3    -26.00       43
    4    -17.97       19
    5    -20.42       11
    6    -19.68       12
    7    -11.70       43
    8    -28.11       15
    9    -28.74       35
    10   -17.47       19
    11   -21.44       13
    12   -12.26       16
    13   -18.54       19
    14   -21.00       10
    15   -20.70       94
    16   -15.94       11
    17   -13.64       83
    18   -17.83       21
    19   -23.50       13
    20   -22.63       18
    21   -20.84       17
    22   -10.98       12
    23   -23.30       18
    24   -30.20       22
    25   -19.66       57
    26   -17.94       15
    27   -14.72       18
    28   -16.46       79
    29   -20.97       25
    30   -19.84       17
    31   -22.58       21
    32   -16.32       30
    33   -15.55       42
    34   -23.55       10
    35   -16.30       10
    36   -25.82       13
    37   -18.73       17
    38   -17.64       17
    39   -17.66       17
    981  -20.82       67
    982  -22.95       21
    983  -28.22       49
    984  -27.99       22
    985  -15.54       17
    986  -12.37       16
    987  -22.33       51
    988  -22.70       27
    989  -17.86       12
    990  -16.00       33
    991  -20.73       18
    992  -15.45       27
    993  -20.05       65
    994  -17.95       17
    995  -17.70       10
    996  -25.93       22
    997  -12.28       35
    998  -20.13       34
    999  -17.40       14
    1000 -21.59      119
As we can see from the code above, a second subset of the quakes dataset has been created. It excludes the observations (rows 40 to 980) and the variables (columns 2 to 4) that are listed inside the parentheses preceded by the negative sign, leaving only the lat and stations variables for the remaining 59 observations.
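The effect of the negative sign can be checked with a small sketch that builds the same subsets from positive indices and column names instead (the variable names dropped_rows, kept_rows, dropped_cols and kept_cols are illustrative assumptions):

```r
# Dropping rows 30 to 990 is the same as keeping rows 1 to 29 and 991 to 1000
dropped_rows <- quakes[-(30:990), ]
kept_rows    <- quakes[c(1:29, 991:1000), ]
print(identical(dropped_rows, kept_rows))
print(nrow(dropped_rows))   # 39

# Dropping columns 2 to 4 is the same as keeping lat and stations
dropped_cols <- quakes[, -(2:4)]
kept_cols    <- quakes[, c("lat", "stations")]
print(identical(dropped_cols, kept_cols))
```

Negative indices are convenient when only a few rows or columns have to be removed, while positive indices or names are easier to read when only a few have to be kept.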
 
Now we will use the dollar operator with a logical condition on the lat variable. The outputs in the rest of this section contain 39 observations, so they correspond to the first subset, which is assigned to data again here,
    > data = quakes[-(30:990),]
    > ds = data$lat[data$lat<20]
    > ds
     [1] -20.42 -20.62 -26.00 -17.97 -20.42 -19.68 -11.70 -28.11 -28.74 -17.47
    [11] -21.44 -12.26 -18.54 -21.00 -20.70 -15.94 -13.64 -17.83 -23.50 -22.63
    [21] -20.84 -10.98 -23.30 -30.20 -19.66 -17.94 -14.72 -16.46 -20.97 -20.73
    [31] -15.45 -20.05 -17.95 -17.70 -25.93 -12.28 -20.13 -17.40 -21.59
As we can see from the above output, the dollar operator selects the lat variable, and the condition inside the square brackets keeps only the elements for which it evaluates to TRUE. The result is stored in a variable named ds. Because every latitude in this subset is below 20, all 39 values are returned.
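The sketch below, which rebuilds the same 39-row subset so that it runs on its own, shows how to check whether a condition actually excludes anything. The threshold of -20 and the names keep and south are illustrative assumptions chosen only to make the condition selective.

```r
# Same 39-row subset that produced the output above
data <- quakes[-(30:990), ]

# The condition is a logical vector with one entry per observation
keep <- data$lat < 20
print(all(keep))   # TRUE: every latitude in this subset is negative

# A threshold inside the range of the data keeps only part of the values
south <- data$lat[data$lat < -20]
print(length(south) < length(data$lat))   # TRUE
```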
    > df = data$stations[data$stations<40]
    > df
     [1] 15 19 11 12 15 35 19 13 16 19 10 11 21 13 18 17 12 18 22 15 18 25 18 27 17
    [26] 10 22 35 34 14
As we can see from the above output, the same approach is applied to the stations variable. The subset contains the 30 values of stations that are below 40 and is stored in a variable named df. The nine values for which the condition does not return TRUE are excluded.
    > dn = data$depth[data$depth<600]
    > dn
     [1] 562  42 195  82 194 211 583 249 554 139 306  50 590 570 598 576 211 512 125
    [20] 431 537 155 498 582 575 409 243  45 470 248 244  40 165
As we can see from the above output, the subset contains the 33 values of the depth variable that are below 600 and is stored in a variable named dn. The six values for which the condition does not return TRUE are excluded. The variable is named depth; the dollar operator also accepts the abbreviation dept through partial matching of names, but writing the full name is safer.
    > da = data$mag[data$mag<10]
    > da
     [1] 4.8 4.2 5.4 4.1 4.0 4.0 4.8 4.4 4.7 4.3 4.4 4.6 4.4 4.4 6.1 4.3 6.0 4.5 4.4
    [20] 4.4 4.5 4.2 4.4 4.7 5.4 4.0 4.6 5.2 4.5 4.3 4.3 4.9 4.0 4.2 4.4 4.7 4.5 4.5
    [39] 6.0
As we can see from the above output, the subset contains the values of the mag variable and is stored in a variable named da. Every magnitude in the subset is below 10, so the condition returns TRUE for all 39 observations and none of them is excluded.
 
Double square brackets operator
 
The double square brackets operator can be used to create subsets of data containing either all observations of a single variable of a dataset or, combined with a further index, just a single observation of that variable. The variable can be selected either by its index position or by its name. The double square brackets operator can be used with a data frame.
    > data[['long']]
     [1] 181.62 181.03 184.10 181.66 181.96 184.31 166.10 181.93 181.74 179.59 180.69 167.00 182.11 181.66 169.92 184.95 165.96 181.50 179.78 180.31 181.16 166.32 180.16
    [24] 182.00 180.28 181.49 167.51 180.79 181.47 182.37 179.24 166.74 185.05 180.80 186.00 179.33 169.23 181.28 181.40 169.33 176.78 186.10 179.82 186.04 169.41 182.30
    [47] 181.70 166.32 180.08 185.25
As we can see, the above code snippet returns the single variable long as a vector. The argument inside the double square brackets is the variable name. The output holds 50 values, so data here appears to be a 50-row subset of quakes rather than one of the subsets created earlier.
    > data[[3]]
     [1] 562 650  42 626 649 195  82 194 211 622 583 249 554 600 139 306  50 590 570 598 576 211 512 125 431 537 155 498 582 328 553  50 292 349  48 600 206 574 585 230
    [41] 263  96 511  94 246  56 329  70 493 129
As we can see, the above code snippet returns the single variable depth. The argument inside the double square brackets is the index position of depth, which is the third variable of the dataset.
    > data[[3]][2]
    [1] 650
As we can see, the above code snippet returns a single observation of the variable depth. The index inside the double square brackets selects the third variable, and the index inside the single square brackets that follow selects the second observation of that variable.
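The following sketch, run against the full quakes dataset, illustrates that selection by name and by position give the same vector, and that chaining a single-bracket index picks one cell (the names by_name and by_index are illustrative assumptions):

```r
# Selecting a variable by name or by position gives the same vector
by_name  <- quakes[["depth"]]
by_index <- quakes[[3]]
print(identical(by_name, by_index))   # TRUE

# [[3]][2] takes variable 3 and then its second element,
# which is the same cell as row 2, column 3
print(quakes[[3]][2])   # 650
print(quakes[2, 3])     # 650
```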
 
Single square brackets operator
 
The single square brackets operator can be used to create subsets of data containing all or some observations of one or more variables of a dataset. Now we will look at some examples of how to use the single square brackets operator together with logical values and the dollar operator to create subsets of the full quakes dataset as follows,
    > data = quakes
    > data    # prints all 1000 observations (output omitted)
    > da = data$depth[data$depth<100]
    > da
      [1] 42 82 50 50 48 96 94 56 70 46 84 40 96 75 69 50 72 42 42 46 64 82 81 49 94
     [26] 63 53 42 97 48 56 69 93 42 59 40 99 67 45 93 90 65 71 57 74 44 48 46 97 65
     [51] 82 67 55 74 49 93 83 61 42 56 68 69 45 43 65 80 51 68 69 61 69 51 55 54 59
     [76] 56 65 60 40 48 56 44 52 41 40 99 66 47 70 57 80 82 90 45 45 95 65 54 47 94
    [101] 80 54 57 49 62 63 51 45 63 66 58 70 50 58 69 70 41 51 64 45 50 44 68 47 40
    [126] 85 98 58 89 49 40 42 76 63 93 64 83 40 62 75 44 63 40 70 41 82 50 70 74 89
    [151] 53 68 52 66 51 67 64 47 49 75 60 75 56 48 53 85 57 79 82 93 47 98 61 83 55
    [176] 86 78 45 50 57 66 57 89 85 50 75 46 50 80 86 83 70 74 40 87 63 47 71 42 97
    [201] 56 43 93 66 70 54 82 43 77 68 71 68 99 40 62 94 56 49 42 69 48 47 76 61 90
    [226] 57 69 51 44 51 63 87 61 60 63 82 41 40 60 43 54 68 42 43 42 75 71 60 69 45
    [251] 40
As we can see from the above output, the less than operator returns TRUE for every observation whose depth is below 100. Only these 251 of the 1000 observations are kept, and the result is stored in a variable named da.
    > ds = data$long[data$long<170]
    > ds
      [1] 166.10 167.00 169.92 165.96 166.32 167.51 166.74 169.23 169.33 169.41
     [11] 166.32 166.22 166.20 167.06 167.53 167.06 169.71 166.54 169.49 167.40
     [21] 169.48 166.97 167.89 168.98 168.02 169.46 167.10 167.62 165.80 167.68
     [31] 166.07 169.84 166.24 167.16 169.42 169.31 169.09 166.66 166.53 166.00
     [41] 169.50 166.26 167.24 169.33 169.01 167.24 168.80 166.20 169.32 169.28
     [51] 169.58 169.63 169.24 167.10 167.32 166.36 165.77 166.24 166.60 166.29
     [61] 166.47 169.21 167.95 167.14 167.33 165.99 166.14 167.51 169.14 167.26
     [71] 167.26 169.15 169.48 166.37 168.52 167.70 167.32 167.50 166.06 169.04
     [81] 166.87 165.98 165.96 165.76 166.02 167.38 167.18 167.01 167.01 166.83
     [91] 166.94 167.25 166.69 167.34 167.42 166.90 166.85 166.80 166.91 167.54
    [101] 166.18 168.71 166.62 166.49 167.26 167.16 166.36 168.75 167.15 166.28
    [111] 169.76 166.78 168.98 168.69 165.67 167.39 167.91 166.07 166.10 167.10
    [121] 169.37 169.10 167.32 167.18 167.91 168.08 169.71 167.24 169.66 167.03
    [131] 167.43 166.75 167.41 166.55 165.80 166.64 169.46 169.52 167.10 168.93
    [141] 166.90 168.63 169.44 169.90 166.56 167.23 167.24 166.66 169.63 167.02
    [151] 167.05 167.01 166.20 166.30 169.50 167.11 166.53 169.53 165.97 169.75
    [161] 167.95 167.32 166.01 167.44 166.72 166.98 169.05 166.93 167.06
As we can see from the above output, the less than operator returns TRUE for every observation whose long value is below 170. Only these 169 observations are kept, and the result is stored in a variable named ds.
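The examples above return a single variable as a vector. As a hedged sketch of how the single square brackets can keep several variables at once, a logical condition can be placed in the row position and a vector of variable names in the column position (the name shallow and the choice of three variables are illustrative assumptions):

```r
# Rows: observations with depth below 100; columns: three named variables
shallow <- quakes[quakes$depth < 100, c("lat", "long", "depth")]

print(is.data.frame(shallow))   # TRUE: the result is still a data frame
print(names(shallow))           # "lat" "long" "depth"
print(nrow(shallow) == sum(quakes$depth < 100))   # TRUE
```

Because the result is a data frame, the matching observations stay aligned across all selected variables, which is usually what further analysis needs.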
 
The difference between the double square brackets operator and the single square brackets operator lies in how many variables they select and in what they return. [[ selects a single variable and returns its observations, whereas [ can select multiple variables and returns a subset of the same type as the dataset. For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.
 
The [[ form allows only a single element to be selected using integer or character indices, whereas [ allows indexing by vectors. Note, though, that for a list or other recursive object, the index given to [[ can be a vector, and each element of the vector is applied in turn to the list, the selected component, the selected component of that component, and so on. The result is still a single element.
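The behaviour described for lists can be illustrated with a small nested list. The list site, its component names and its values are hypothetical and serve only as an example.

```r
# A small nested list; names and values are hypothetical
site <- list(name = "Fiji", coords = list(lat = -20.42, long = 181.62))

# [ returns a list containing the selected element
print(is.list(site["name"]))      # TRUE
# [[ returns the element itself
print(is.list(site[["name"]]))    # FALSE
print(site[["name"]])             # "Fiji"

# A vector index inside [[ is applied level by level:
# first the component "coords", then its component "long"
print(site[[c("coords", "long")]])   # 181.62
print(site[[c(2, 2)]])               # 181.62
```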
 
Now we will discuss how to use the above-mentioned operators to create subsets using logical values, working with datasets that contain all of their variables and observations.
 

Creating subsets using logical values

 
The single square brackets operator can create a subset containing more than one variable. To do so, we list the required variables inside the single square brackets.
 
A subset using logical values is created by writing the dataset name, the dollar operator and the variable name, followed by single square brackets that contain a logical condition, as in data$area[data$area<10000]. Such a subset contains only those observations for which the condition evaluates to TRUE; observations for which it evaluates to FALSE are excluded.
 
Now we will use the predefined dataset rock, a data frame containing four variables and 48 observations, to create subsets using logical values. The same approach is then applied to mtcars, another predefined dataset available in R. The structure of the rock dataset is as follows,
    > str(rock)
    'data.frame':   48 obs. of  4 variables:
     $ area : int  4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
     $ peri : num  2792 3893 3931 3869 3949 ...
     $ shape: num  0.0903 0.1486 0.1833 0.1171 0.1224 ...
     $ perm : num  6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
The subsets of the rock dataset using logical values are as follows. The dataset is first assigned to data,
    > data = rock
    > da = data$area[data$area<10000]
    > da
     [1] 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 9364 8624 8868 9417 8874
    [16] 9867 7838 8233 6360 4193 7416 5246 6509 4895 6775 7894 5980 5318 7392 7894
    [31] 3469 1468 3524 5267 5048 1016 5605 8793 3475 1651 5514 9718
As we can see from the above output, the less than operator returns TRUE for every observation whose area is below 10000. These 42 of the 48 observations are kept, and the result is stored in a variable named da.
    > ds = data$peri[data$peri<4000]
    > ds
     [1] 2791.900 3892.600 3930.660 3869.320 3948.540 3682.040 3098.650 3986.240
     [9] 3518.040 3999.370 3629.070 3428.740 3518.440 1977.390 1379.350 1916.240
    [17] 1585.420 1851.210 1239.660 1728.140 1461.060 1426.760  990.388 1350.760
    [25] 1461.060 1376.700  476.322 1189.460 1644.960  941.543  308.642 1145.690
    [33] 2280.490 1174.110  597.808 1455.880 1485.580
As we can see from the above output, the less than operator returns TRUE for every observation whose peri value is below 4000. These 37 observations are kept, and the result is stored in a variable named ds.
    > dn = data$shape[data$shape<1]
    > dn
     [1] 0.0903296 0.1486220 0.1833120 0.1170630 0.1224170 0.1670450 0.1896510
     [8] 0.1641270 0.2036540 0.1623940 0.1509440 0.1481410 0.2285950 0.2316230
    [15] 0.1725670 0.1534810 0.2043140 0.2627270 0.2000710 0.1448100 0.1138520
    [22] 0.2910290 0.2400770 0.1618650 0.2808870 0.1794550 0.1918020 0.1330830
    [29] 0.2252140 0.3412730 0.3116460 0.2760160 0.1976530 0.3266350 0.1541920
    [36] 0.2760160 0.1769690 0.4387120 0.1635860 0.2538320 0.3286410 0.2300810
    [43] 0.4641250 0.4204770 0.2007440 0.2626510 0.1824530 0.2004470
As we can see from the above output, the less than operator returns TRUE for every observation whose shape value is below 1. Here that is true of all 48 observations, so none is excluded, and the result is stored in a variable named dn.
    > dh = data$perm[data$perm<500]
    > dh
     [1]   6.3   6.3   6.3   6.3  17.1  17.1  17.1  17.1 119.0 119.0 119.0 119.0  82.4  82.4  82.4  82.4  58.6  58.6  58.6  58.6 142.0 142.0 142.0 142.0 100.0 100.0 100.0
    [28] 100.0
As we can see from the above output, the less than operator returns TRUE for every observation whose perm value is below 500. These 28 observations are kept, and the result is stored in a variable named dh.
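The examples above extract one variable at a time. As a sketch of a closely related use, the same condition can be placed in the row position of the single square brackets to keep the matching observations together with all four variables (the name low_perm is an illustrative assumption):

```r
# Keep the rows of rock whose perm value is below 500, with all variables
low_perm <- rock[rock$perm < 500, ]

print(nrow(low_perm))             # 28, matching the length of dh above
print(names(low_perm))            # "area" "peri" "shape" "perm"
print(all(low_perm$perm < 500))   # TRUE
```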
 
The structure of the mtcars dataset is as follows,
    > str(mtcars)
    'data.frame':   32 obs. of  11 variables:
     $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
     $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
     $ disp: num  160 160 108 258 360 ...
     $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
     $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
     $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
     $ qsec: num  16.5 17 18.6 19.4 17 ...
     $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
     $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
     $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
     $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
The subsets of the mtcars dataset using logical values are as follows. The dataset is first assigned to data,
    > data = mtcars
    > ds = data$mpg[data$mpg<30]
    > ds
     [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 21.5 15.5 15.2 13.3 19.2 27.3 26.0 15.8 19.7 15.0 21.4
As we can see from the above output, the less than operator returns TRUE for every observation whose mpg value is below 30. These 28 of the 32 observations are kept, and the result is stored in a variable named ds.
    > da = data$cyl[data$cyl<10]
    > da
     [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
As we can see from the above output, the less than operator returns TRUE for every observation whose cyl value is below 10. Here that is true of all 32 observations, so none is excluded, and the result is stored in a variable named da.
    > df = data$disp[data$disp<1000]
    > df
     [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0 400.0  79.0 120.3
    [28]  95.1 351.0 145.0 301.0 121.0
As we can see from the above output, the less than operator returns TRUE for every observation whose disp value is below 1000. Here that is true of all 32 observations, so none is excluded, and the result is stored in a variable named df.
    > dn = data$hp[data$hp<300]
    > dn
     [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97 150 150 245 175  66  91 113 264 175 109
As we can see from the above output, the less than operator returns TRUE for every observation whose hp value is below 300. These 31 of the 32 observations are kept, and the result is stored in a variable named dn.
    > dg = data$drat[data$drat<10]
    > dg
     [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62 3.54 4.11
As we can see from the above output, the less than operator returns TRUE for every observation whose drat value is below 10. Here that is true of all 32 observations, so none is excluded, and the result is stored in a variable named dg.
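To see which observations a condition leaves out, the logical vector can be negated with the ! operator. The sketch below applies this to the hp condition used above; the names excluded and position are illustrative assumptions.

```r
# ! negates the condition, so this keeps the rows where hp < 300 is FALSE
excluded <- mtcars[!(mtcars$hp < 300), ]
print(rownames(excluded))   # "Maserati Bora"
print(excluded$hp)          # 335

# which() gives the position of the excluded observation
position <- which(!(mtcars$hp < 300))
print(position)             # 31
```

Checking the excluded observations in this way is a quick sanity test that a threshold removes what was intended and nothing else.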
 

Summary

 
In this article, I demonstrated how to create subsets of a dataset using logical values in order to extract relevant data for analysis. The dollar, double square brackets and single square brackets operators were applied to the quakes, rock and mtcars datasets, and code snippets along with their outputs were provided.