How To Create Subsets Using Positive Numbers In R

Introduction

 
In this article, I demonstrate how to create subsets of data in R using positive numbers, so that the relevant data can be extracted from a dataset for analysis or for building a machine learning model. Creating a subset is a data pre-processing step: it yields the clean, relevant data that a model needs to make accurate predictions.
 
Pre-processing often involves creating subsets of a dataset for further analysis. R provides several objects, such as data frames, vectors, arrays and lists, that can hold a subset, and each of them can be subset in more than one way.
 
Pre-processing is one of the most important tasks when analysing data in R. Three operators can be used to create subsets of a dataset. They are as follows.
 
Dollar operator
 
The dollar operator ($) extracts a single variable from a dataset. Writing the dataset name, the dollar operator and a variable name selects one variable at a time and returns it as a vector. In other words, using the dollar operator on a data frame produces a vector, not a data frame.
 
The following examples show how to use the dollar operator to create a subset of a dataset. All the operator examples in this article use the first 50 rows of the quakes dataset:
    > data = quakes[1:50,]
    > data
          lat   long depth mag stations
    1  -20.42 181.62   562 4.8       41
    2  -20.62 181.03   650 4.2       15
    3  -26.00 184.10    42 5.4       43
    4  -17.97 181.66   626 4.1       19
    5  -20.42 181.96   649 4.0       11
    6  -19.68 184.31   195 4.0       12
    7  -11.70 166.10    82 4.8       43
    8  -28.11 181.93   194 4.4       15
    9  -28.74 181.74   211 4.7       35
    10 -17.47 179.59   622 4.3       19
    11 -21.44 180.69   583 4.4       13
    12 -12.26 167.00   249 4.6       16
    13 -18.54 182.11   554 4.4       19
    14 -21.00 181.66   600 4.4       10
    15 -20.70 169.92   139 6.1       94
    16 -15.94 184.95   306 4.3       11
    17 -13.64 165.96    50 6.0       83
    18 -17.83 181.50   590 4.5       21
    19 -23.50 179.78   570 4.4       13
    20 -22.63 180.31   598 4.4       18
    21 -20.84 181.16   576 4.5       17
    22 -10.98 166.32   211 4.2       12
    23 -23.30 180.16   512 4.4       18
    24 -30.20 182.00   125 4.7       22
    25 -19.66 180.28   431 5.4       57
    26 -17.94 181.49   537 4.0       15
    27 -14.72 167.51   155 4.6       18
    28 -16.46 180.79   498 5.2       79
    29 -20.97 181.47   582 4.5       25
    30 -19.84 182.37   328 4.4       17
    31 -22.58 179.24   553 4.6       21
    32 -16.32 166.74    50 4.7       30
    33 -15.55 185.05   292 4.8       42
    34 -23.55 180.80   349 4.0       10
    35 -16.30 186.00    48 4.5       10
    36 -25.82 179.33   600 4.3       13
    37 -18.73 169.23   206 4.5       17
    38 -17.64 181.28   574 4.6       17
    39 -17.66 181.40   585 4.1       17
    40 -18.82 169.33   230 4.4       11
    41 -37.37 176.78   263 4.7       34
    42 -15.31 186.10    96 4.6       32
    43 -24.97 179.82   511 4.4       23
    44 -15.49 186.04    94 4.3       26
    45 -19.23 169.41   246 4.6       27
    46 -30.10 182.30    56 4.9       34
    47 -26.40 181.70   329 4.5       24
    48 -11.77 166.32    70 4.4       18
    49 -24.12 180.08   493 4.3       21
    50 -18.97 185.25   129 5.1       73
Now we will use the dollar operator with the lat variable:
    > ds = data$lat
    > ds
     [1] -20.42 -20.62 -26.00 -17.97 -20.42 -19.68 -11.70 -28.11 -28.74 -17.47 -21.44 -12.26 -18.54 -21.00 -20.70 -15.94 -13.64 -17.83 -23.50 -22.63 -20.84 -10.98 -23.30
    [24] -30.20 -19.66 -17.94 -14.72 -16.46 -20.97 -19.84 -22.58 -16.32 -15.55 -23.55 -16.30 -25.82 -18.73 -17.64 -17.66 -18.82 -37.37 -15.31 -24.97 -15.49 -19.23 -30.10
    [47] -26.40 -11.77 -24.12 -18.97
As the output shows, combining the dataset name, the dollar operator and the variable name creates a subset of the quakes data. The subset contains the observations of the lat variable and is stored in a variable named ds.
    > df = data$long
    > df
     [1] 181.62 181.03 184.10 181.66 181.96 184.31 166.10 181.93 181.74 179.59 180.69 167.00 182.11 181.66 169.92 184.95 165.96 181.50 179.78 180.31 181.16 166.32 180.16
    [24] 182.00 180.28 181.49 167.51 180.79 181.47 182.37 179.24 166.74 185.05 180.80 186.00 179.33 169.23 181.28 181.40 169.33 176.78 186.10 179.82 186.04 169.41 182.30
    [47] 181.70 166.32 180.08 185.25
Here the subset contains the observations of the long variable and is stored in a variable named df.
    > da = data$depth
    > da
     [1] 562 650  42 626 649 195  82 194 211 622 583 249 554 600 139 306  50 590 570 598 576 211 512 125 431 537 155 498 582 328 553  50 292 349  48 600 206 574 585 230
    [41] 263  96 511  94 246  56 329  70 493 129
Here the subset contains the observations of the depth variable and is stored in a variable named da.
    > dn = data$mag
    > dn
     [1] 4.8 4.2 5.4 4.1 4.0 4.0 4.8 4.4 4.7 4.3 4.4 4.6 4.4 4.4 6.1 4.3 6.0 4.5 4.4 4.4 4.5 4.2 4.4 4.7 5.4 4.0 4.6 5.2 4.5 4.4 4.6 4.7 4.8 4.0 4.5 4.3 4.5 4.6 4.1 4.4
    [41] 4.7 4.6 4.4 4.3 4.6 4.9 4.5 4.4 4.3 5.1
Here the subset contains the observations of the mag variable and is stored in a variable named dn.
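
To check what the dollar operator returns, a short test along the following lines can help. This is a minimal sketch that assumes the built-in quakes dataset is available, as in the examples above; the name lat_values is made up for illustration. It also shows that $ accepts a unique prefix of a column name on a data frame, which is why writing the full name is the safer habit.

```r
# Assumption: the built-in quakes dataset (datasets package) is loaded by default.
data <- quakes[1:50, ]

# Extract one column by name; the result is a plain numeric vector.
lat_values <- data$lat
is.numeric(lat_values)    # TRUE
is.null(dim(lat_values))  # TRUE: a vector has no dimensions
length(lat_values)        # 50, one value per row of `data`

# `$` partially matches names, so the unique prefix "dept"
# resolves to the depth column.
identical(data$dept, data$depth)  # TRUE
```

Because partial matching silently depends on the prefix staying unique, spelling out data$depth keeps the code robust if columns are added later.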
 
Double square brackets operator
 
The double square brackets operator ([[) creates a subset containing either all the observations of a single variable or, combined with a further index, a single observation of that variable. The variable can be identified by its name or by its index position. The operator can be used with a data frame.
    > data[['long']]
     [1] 181.62 181.03 184.10 181.66 181.96 184.31 166.10 181.93 181.74 179.59 180.69 167.00 182.11 181.66 169.92 184.95 165.96 181.50 179.78 180.31 181.16 166.32 180.16
    [24] 182.00 180.28 181.49 167.51 180.79 181.47 182.37 179.24 166.74 185.05 180.80 186.00 179.33 169.23 181.28 181.40 169.33 176.78 186.10 179.82 186.04 169.41 182.30
    [47] 181.70 166.32 180.08 185.25
As we can see, the above code creates a subset containing the single variable long. The argument inside the double square brackets is the variable name.
    > data[[3]]
     [1] 562 650  42 626 649 195  82 194 211 622 583 249 554 600 139 306  50 590 570 598 576 211 512 125 431 537 155 498 582 328 553  50 292 349  48 600 206 574 585 230
    [41] 263  96 511  94 246  56 329  70 493 129
As we can see, the above code creates a subset containing the single variable depth. The argument inside the double square brackets is the index position of the depth variable.
    > data[[3]][2]
    [1] 650
As we can see, the above code creates a subset containing a single observation of the depth variable. The index inside the double square brackets selects the column (the third variable, depth), and the index inside the single square brackets that follow selects the second observation of that column.
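
The three forms above can be compared directly. The sketch below assumes the same first 50 rows of quakes; by_name and by_index are illustrative names, not part of the original examples. It also shows the two-index form of [[ on a data frame, which takes a row position and a column position.

```r
# Assumption: the built-in quakes dataset is available.
data <- quakes[1:50, ]

by_name  <- data[["depth"]]  # select the column by name
by_index <- data[[3]]        # select the same column by position
identical(by_name, by_index) # TRUE

# A single observation: take column 3, then element 2 of that vector.
data[[3]][2]                 # 650

# Equivalent two-index form: row 2, column 3.
data[[2, 3]]                 # 650
```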
 
Single square brackets operator
 
The single square brackets operator ([) creates a subset containing the observations of one or more variables of a dataset. The following examples show how to use it:
    > data[1:8,]
         lat   long depth mag stations
    1 -20.42 181.62   562 4.8       41
    2 -20.62 181.03   650 4.2       15
    3 -26.00 184.10    42 5.4       43
    4 -17.97 181.66   626 4.1       19
    5 -20.42 181.96   649 4.0       11
    6 -19.68 184.31   195 4.0       12
    7 -11.70 166.10    82 4.8       43
    8 -28.11 181.93   194 4.4       15
As the output shows, the single square brackets operator created a subset of the quakes data containing the first eight observations of all the variables.
    > data[c(3,1,4)]
       depth    lat mag
    1    562 -20.42 4.8
    2    650 -20.62 4.2
    3     42 -26.00 5.4
    4    626 -17.97 4.1
    5    649 -20.42 4.0
    6    195 -19.68 4.0
    7     82 -11.70 4.8
    8    194 -28.11 4.4
    9    211 -28.74 4.7
    10   622 -17.47 4.3
    11   583 -21.44 4.4
    12   249 -12.26 4.6
    13   554 -18.54 4.4
    14   600 -21.00 4.4
    15   139 -20.70 6.1
    16   306 -15.94 4.3
    17    50 -13.64 6.0
    18   590 -17.83 4.5
    19   570 -23.50 4.4
    20   598 -22.63 4.4
    21   576 -20.84 4.5
    22   211 -10.98 4.2
    23   512 -23.30 4.4
    24   125 -30.20 4.7
    25   431 -19.66 5.4
    26   537 -17.94 4.0
    27   155 -14.72 4.6
    28   498 -16.46 5.2
    29   582 -20.97 4.5
    30   328 -19.84 4.4
    31   553 -22.58 4.6
    32    50 -16.32 4.7
    33   292 -15.55 4.8
    34   349 -23.55 4.0
    35    48 -16.30 4.5
    36   600 -25.82 4.3
    37   206 -18.73 4.5
    38   574 -17.64 4.6
    39   585 -17.66 4.1
    40   230 -18.82 4.4
    41   263 -37.37 4.7
    42    96 -15.31 4.6
    43   511 -24.97 4.4
    44    94 -15.49 4.3
    45   246 -19.23 4.6
    46    56 -30.10 4.9
    47   329 -26.40 4.5
    48    70 -11.77 4.4
    49   493 -24.12 4.3
    50   129 -18.97 5.1
The above code extracts the variables whose index positions are given inside the single square brackets, together with all their observations. The result is a subset containing the variables at index positions 3, 1 and 4, in that order.
    > data[c(2,4,1)]
         long mag    lat
    1  181.62 4.8 -20.42
    2  181.03 4.2 -20.62
    3  184.10 5.4 -26.00
    4  181.66 4.1 -17.97
    5  181.96 4.0 -20.42
    6  184.31 4.0 -19.68
    7  166.10 4.8 -11.70
    8  181.93 4.4 -28.11
    9  181.74 4.7 -28.74
    10 179.59 4.3 -17.47
    11 180.69 4.4 -21.44
    12 167.00 4.6 -12.26
    13 182.11 4.4 -18.54
    14 181.66 4.4 -21.00
    15 169.92 6.1 -20.70
    16 184.95 4.3 -15.94
    17 165.96 6.0 -13.64
    18 181.50 4.5 -17.83
    19 179.78 4.4 -23.50
    20 180.31 4.4 -22.63
    21 181.16 4.5 -20.84
    22 166.32 4.2 -10.98
    23 180.16 4.4 -23.30
    24 182.00 4.7 -30.20
    25 180.28 5.4 -19.66
    26 181.49 4.0 -17.94
    27 167.51 4.6 -14.72
    28 180.79 5.2 -16.46
    29 181.47 4.5 -20.97
    30 182.37 4.4 -19.84
    31 179.24 4.6 -22.58
    32 166.74 4.7 -16.32
    33 185.05 4.8 -15.55
    34 180.80 4.0 -23.55
    35 186.00 4.5 -16.30
    36 179.33 4.3 -25.82
    37 169.23 4.5 -18.73
    38 181.28 4.6 -17.64
    39 181.40 4.1 -17.66
    40 169.33 4.4 -18.82
    41 176.78 4.7 -37.37
    42 186.10 4.6 -15.31
    43 179.82 4.4 -24.97
    44 186.04 4.3 -15.49
    45 169.41 4.6 -19.23
    46 182.30 4.9 -30.10
    47 181.70 4.5 -26.40
    48 166.32 4.4 -11.77
    49 180.08 4.3 -24.12
    50 185.25 5.1 -18.97
The above code extracts the variables whose index positions are given inside the single square brackets, together with all their observations. The result is a subset containing the variables at index positions 2, 4 and 1, in that order.
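
Row positions and column positions can also be combined in a single call to [, with the rows before the comma and the columns after it. The following is a small sketch under the same assumption (the first 50 rows of quakes); the name small is chosen only for illustration.

```r
# Assumption: the built-in quakes dataset is available.
data <- quakes[1:50, ]

# Rows 1 to 8 of the columns at positions 3, 1 and 4, in that order.
small <- data[1:8, c(3, 1, 4)]
names(small)  # "depth" "lat" "mag"
dim(small)    # 8 3

# The same subset selected by column name instead of position.
identical(small, data[1:8, c("depth", "lat", "mag")])  # TRUE
```

Selecting by name gives the same result here and keeps working if the column order of the dataset changes, whereas positions are shorter to type.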
 
The difference between the double square brackets operator and the single square brackets operator lies in how many variables they select and in the type of the result. The [[ operator extracts a single variable and returns its observations, whereas [ can select several variables and returns a subset of the same type as the dataset. For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.
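
The difference in result type is easy to see with class(). The sketch below assumes the first 50 rows of quakes as before; the list x and its values are made up for illustration.

```r
# Assumption: the built-in quakes dataset is available.
data <- quakes[1:50, ]

class(data[3])    # "data.frame": a one-column data frame
class(data[[3]])  # "integer": the depth column itself

# The same distinction on a plain list (hypothetical example values).
x <- list(a = 1:3, b = "text")
class(x["a"])     # "list": a list holding the selected element
class(x[["a"]])   # "integer": the element itself
```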
 
The [[ form allows only a single element to be selected using integer or character indices, whereas [ allows indexing by vectors. Note though that for a list or other recursive object, the index can be a vector and each element of the vector is applied in turn to the list, the selected component, the selected component of that component, and so on. The result is still a single element.
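
Recursive indexing with a vector inside [[ can be illustrated with a small nested list. The list below is a made-up example, not data from this article: each element of the index vector descends one level, and the result is a single element.

```r
# Hypothetical nested list used only to illustrate recursive indexing.
nested <- list(a = list(b = list(c = 42)))

# Each index is applied in turn: nested$a, then its b, then its c.
nested[[c("a", "b", "c")]]  # 42
nested[[c(1, 1, 1)]]        # 42, the same path by position

# With [ the index vector selects elements of the top-level list instead.
class(nested["a"])          # "list"
```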
 
Next, we will use the operators described above to create subsets containing a specified number of variables of a dataset. The method uses positive numerical values, that is, the index positions of the variables to keep.
 

Creating subsets using positive numerical values

 
The single square brackets operator can create a subset containing more than one variable. To do so, we list the index positions of the required variables inside the single square brackets.
 
A subset using positive numerical values is created by writing the dataset name followed by the single square brackets operator. The subset contains only those variables of the dataset whose index positions appear inside the square brackets, in the order in which they are listed. When more than one position is given, the positions are combined with c().
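
A few quick checks make these rules concrete. This is a minimal sketch using the built-in rock dataset; the name result is made up, and the tryCatch wrapper is only there so that the example runs to the end when an invalid position is used.

```r
# Assumption: the built-in rock dataset (datasets package) is available.
names(rock)            # "area" "peri" "shape" "perm"

# The order of the result follows the order of the index vector.
names(rock[c(4, 1)])   # "perm" "area"

# A single positive number still returns a data frame, with one column.
class(rock[2])         # "data.frame"
ncol(rock[2])          # 1

# A position beyond the last column is an error, caught here for illustration.
result <- tryCatch(rock[5], error = function(e) "error")
result                 # "error"
```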
 
We will first use the predefined dataset rock, a data frame containing four variables and 48 observations, and then the predefined mtcars dataset. The structure of the rock dataset is as follows:
    > str(rock)
    'data.frame':   48 obs. of  4 variables:
     $ area : int  4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
     $ peri : num  2792 3893 3931 3869 3949 ...
     $ shape: num  0.0903 0.1486 0.1833 0.1171 0.1224 ...
     $ perm : num  6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
A subset of the rock dataset using positive numerical values is created as follows:
    > rock[c(4,1,2,3)]
         perm  area     peri     shape
    1     6.3  4990 2791.900 0.0903296
    2     6.3  7002 3892.600 0.1486220
    3     6.3  7558 3930.660 0.1833120
    4     6.3  7352 3869.320 0.1170630
    5    17.1  7943 3948.540 0.1224170
    6    17.1  7979 4010.150 0.1670450
    7    17.1  9333 4345.750 0.1896510
    8    17.1  8209 4344.750 0.1641270
    9   119.0  8393 3682.040 0.2036540
    10  119.0  6425 3098.650 0.1623940
    11  119.0  9364 4480.050 0.1509440
    12  119.0  8624 3986.240 0.1481410
    13   82.4 10651 4036.540 0.2285950
    14   82.4  8868 3518.040 0.2316230
    15   82.4  9417 3999.370 0.1725670
    16   82.4  8874 3629.070 0.1534810
    17   58.6 10962 4608.660 0.2043140
    18   58.6 10743 4787.620 0.2627270
    19   58.6 11878 4864.220 0.2000710
    20   58.6  9867 4479.410 0.1448100
    21  142.0  7838 3428.740 0.1138520
    22  142.0 11876 4353.140 0.2910290
    23  142.0 12212 4697.650 0.2400770
    24  142.0  8233 3518.440 0.1618650
    25  740.0  6360 1977.390 0.2808870
    26  740.0  4193 1379.350 0.1794550
    27  740.0  7416 1916.240 0.1918020
    28  740.0  5246 1585.420 0.1330830
    29  890.0  6509 1851.210 0.2252140
    30  890.0  4895 1239.660 0.3412730
    31  890.0  6775 1728.140 0.3116460
    32  890.0  7894 1461.060 0.2760160
    33  950.0  5980 1426.760 0.1976530
    34  950.0  5318  990.388 0.3266350
    35  950.0  7392 1350.760 0.1541920
    36  950.0  7894 1461.060 0.2760160
    37  100.0  3469 1376.700 0.1769690
    38  100.0  1468  476.322 0.4387120
    39  100.0  3524 1189.460 0.1635860
    40  100.0  5267 1644.960 0.2538320
    41 1300.0  5048  941.543 0.3286410
    42 1300.0  1016  308.642 0.2300810
    43 1300.0  5605 1145.690 0.4641250
    44 1300.0  8793 2280.490 0.4204770
    45  580.0  3475 1174.110 0.2007440
    46  580.0  1651  597.808 0.2626510
    47  580.0  5514 1455.880 0.1824530
    48  580.0  9718 1485.580 0.2004470
The structure of the mtcars dataset is as follows:
    > str(mtcars)
    'data.frame':   32 obs. of  11 variables:
     $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
     $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
     $ disp: num  160 160 108 258 360 ...
     $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
     $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
     $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
     $ qsec: num  16.5 17 18.6 19.4 17 ...
     $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
     $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
     $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
     $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
A subset of the mtcars dataset using positive numerical values is created as follows:
    > mtcars[c(6,5,3,8)]
                           wt drat  disp vs
    Mazda RX4           2.620 3.90 160.0  0
    Mazda RX4 Wag       2.875 3.90 160.0  0
    Datsun 710          2.320 3.85 108.0  1
    Hornet 4 Drive      3.215 3.08 258.0  1
    Hornet Sportabout   3.440 3.15 360.0  0
    Valiant             3.460 2.76 225.0  1
    Duster 360          3.570 3.21 360.0  0
    Merc 240D           3.190 3.69 146.7  1
    Merc 230            3.150 3.92 140.8  1
    Merc 280            3.440 3.92 167.6  1
    Merc 280C           3.440 3.92 167.6  1
    Merc 450SE          4.070 3.07 275.8  0
    Merc 450SL          3.730 3.07 275.8  0
    Merc 450SLC         3.780 3.07 275.8  0
    Cadillac Fleetwood  5.250 2.93 472.0  0
    Lincoln Continental 5.424 3.00 460.0  0
    Chrysler Imperial   5.345 3.23 440.0  0
    Fiat 128            2.200 4.08  78.7  1
    Honda Civic         1.615 4.93  75.7  1
    Toyota Corolla      1.835 4.22  71.1  1
    Toyota Corona       2.465 3.70 120.1  1
    Dodge Challenger    3.520 2.76 318.0  0
    AMC Javelin         3.435 3.15 304.0  0
    Camaro Z28          3.840 3.73 350.0  0
    Pontiac Firebird    3.845 3.08 400.0  0
    Fiat X1-9           1.935 4.08  79.0  1
    Porsche 914-2       2.140 4.43 120.3  0
    Lotus Europa        1.513 3.77  95.1  1
    Ford Pantera L      3.170 4.22 351.0  0
    Ferrari Dino        2.770 3.62 145.0  0
    Maserati Bora       3.570 3.54 301.0  0
    Volvo 142E          2.780 4.11 121.0  1
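
For wide or long datasets it is often enough to check the column names and preview a few rows rather than print the whole subset. The following sketch assumes the built-in mtcars dataset; the name cars_subset is made up for illustration.

```r
# Assumption: the built-in mtcars dataset (datasets package) is available.
cars_subset <- mtcars[c(6, 5, 3, 8)]

names(cars_subset)    # "wt" "drat" "disp" "vs"
dim(cars_subset)      # 32 4
head(cars_subset, 3)  # preview the first three rows only

# Positions 6, 5, 3 and 8 correspond to these column names.
identical(cars_subset, mtcars[c("wt", "drat", "disp", "vs")])  # TRUE
```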

Summary

 
In this article, I demonstrated how to create subsets of a dataset using positive numerical values in order to extract the relevant data for analysis. The dollar, double square brackets and single square brackets operators were applied to the quakes, rock and mtcars datasets, and each code snippet was shown together with its output.