How To Create A Blank Subset Of Data In R

Introduction

 
In this article, I demonstrate how to create a blank subset of a dataset in R, that is, a subset taken with an empty index so that it keeps every variable and observation. Creating subsets is a data pre-processing step used in R to extract clean, relevant data from a dataset, for example before training a machine learning model.
 
Pre-processing often involves creating subsets of a dataset for further analysis. R provides several kinds of objects, such as data frames, vectors, arrays, and lists, that can be subsetted and that can store the resulting values. The way a subset is created differs slightly between vectors, arrays, data frames, and lists.
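As a rough illustration of how subsetting looks for each of these object types, here is a minimal sketch. The objects v, l, m, and df and their values are made up for this example and are not part of the article's datasets, apart from the first three rows of mtcars.

```r
# Illustrative objects (hypothetical values chosen for this sketch)
v  <- c(a = 10, b = 20, c = 30)           # named numeric vector
l  <- list(id = 1L, tags = c("x", "y"))   # list with two elements
m  <- matrix(1:6, nrow = 2)               # 2 x 3 matrix (a two-dimensional array)
df <- head(mtcars, 3)                     # small data frame

v_sub  <- v[c("a", "c")]        # vector subset selected by name
l_sub  <- l["tags"]             # list subset: still a list, with one element
m_sub  <- m[1, ]                # first row of the matrix, returned as a vector
df_sub <- df[c("mpg", "cyl")]   # data frame subset with two columns
```

In each case the single square brackets return an object of the same general kind as the original, except that selecting one row of a matrix drops it to a plain vector by default.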
 
Pre-processing is one of the most important tasks when analysing data in R. Several operators can be used to create a subset of a dataset, as described below.
 

Different types of operators for creating subsets of data

 
Three operators can be used to create different kinds of subsets: the dollar operator ($), the double square brackets operator ([[), and the single square brackets operator ([).
 
Dollar operator ($)
 
The dollar operator extracts a single variable from a dataset. Writing the dataset name, the dollar operator, and the variable name (for example, data$mpg) selects that one variable and returns its observations as a vector. In other words, using the dollar operator on a data frame produces a vector, not a data frame.
 
The following examples show how to use the dollar operator to create subsets of a dataset.
 
All the examples in this section use the built-in mtcars dataset.
  > data <- mtcars
  > data
                       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
  Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
  Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
  Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
  Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
  Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
  Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
  Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
  Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
  Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
  Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
  Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
  Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
  Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
  Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
  Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
  Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
  Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
  Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
  Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
  Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
  Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
  Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
  AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
  Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
  Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
  Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
  Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
  Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
  Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
  Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
  Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
  Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Now we will use the dollar operator with the mpg variable.
  > ds <- data$mpg
  > ds
   [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
  [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
  [31] 15.0 21.4
As the output above shows, combining the dataset name, the dollar operator, and a variable name creates a subset of the mtcars dataset. The subset contains the observations of the mpg variable and is stored in a variable named ds.
  > df <- data$cyl
  > df
   [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
In the same way, data$cyl creates a subset containing the observations of the cyl variable, stored in a variable named df. Despite its name, df is a numeric vector, not a data frame.
  > da <- data$disp
  > da
   [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0 400.0  79.0 120.3
  [28]  95.1 351.0 145.0 301.0 121.0
Here the subset contains the observations of the disp variable and is stored in a variable named da.
  > dn <- data$hp
  > dn
   [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109
Here the subset contains the observations of the hp variable and is stored in a variable named dn.
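To confirm that the dollar operator returns a plain vector rather than a data frame, the result can be inspected directly. The following is a minimal sketch; the names mpg_values and col are illustrative and do not appear in the examples above.

```r
data <- mtcars

# Extract one variable with the dollar operator
mpg_values <- data$mpg

is.vector(mpg_values)        # the result is a plain vector
is.data.frame(mpg_values)    # and not a data frame
length(mpg_values) == nrow(data)   # one value per observation

# The dollar operator takes a literal variable name. When the name is held
# in a character variable, [[ can be used to reach the same column.
col <- "mpg"
same_column <- identical(data[[col]], data$mpg)
```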
 
Double square brackets operator ([[)
 
The double square brackets operator creates a subset containing either all observations of a single variable or, combined with a further index, a single observation of that variable. The variable can be identified either by its name or by its index position. The double square brackets operator can be used with data frames.
  > data[['mpg']]
   [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
The code snippet above creates a subset containing the single variable mpg. The argument inside the double square brackets is the variable name.
  > data[[3]]
   [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0 400.0  79.0 120.3
  [28]  95.1 351.0 145.0 301.0 121.0
The code snippet above creates a subset containing the single variable disp. The argument inside the double square brackets is the index position of the disp variable, which is the third column of mtcars.
  > data[[3]][2]
  [1] 160
The code snippet above extracts a single observation of the disp variable. The double square brackets select the third column (disp) as a vector, and the single square brackets that follow select the second element of that vector, which is the value in row 2.
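The equivalence between selecting by name and by position, and the way the chained index works, can be checked with a short sketch. The variable names by_name, by_position, and second_disp are illustrative only.

```r
data <- mtcars

# Selecting the same column by name and by index position
by_name     <- data[["disp"]]
by_position <- data[[3]]
same_vector <- identical(by_name, by_position)

# Chained indexing: column 3 first, then element 2 of the resulting vector
second_disp <- data[[3]][2]

# The same value addressed as row 2, column 3 of the data frame
same_value <- identical(second_disp, data[2, 3])
```

Both same_vector and same_value should come out TRUE, since all of these expressions address the same underlying column.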
 
Single square brackets operator ([)
 
The single square brackets operator creates subsets containing all observations of one or more variables of a dataset. The following examples show how to use it,
  > data[]
                       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
  Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
  Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
  Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
  Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
  Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
  Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
  Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
  Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
  Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
  Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
As the output above shows, the single square brackets operator with no index returns a subset of the mtcars dataset containing all of its variables and observations (only the first ten rows are reproduced here).
  > data[c(3,2,5)]
                       disp cyl drat
  Mazda RX4           160.0   6 3.90
  Mazda RX4 Wag       160.0   6 3.90
  Datsun 710          108.0   4 3.85
  Hornet 4 Drive      258.0   6 3.08
  Hornet Sportabout   360.0   8 3.15
  Valiant             225.0   6 2.76
  Duster 360          360.0   8 3.21
  Merc 240D           146.7   4 3.69
  Merc 230            140.8   4 3.92
  Merc 280            167.6   6 3.92
  Merc 280C           167.6   6 3.92
  Merc 450SE          275.8   8 3.07
  Merc 450SL          275.8   8 3.07
  Merc 450SLC         275.8   8 3.07
  Cadillac Fleetwood  472.0   8 2.93
  Lincoln Continental 460.0   8 3.00
  Chrysler Imperial   440.0   8 3.23
  Fiat 128             78.7   4 4.08
  Honda Civic          75.7   4 4.93
  Toyota Corolla       71.1   4 4.22
  Toyota Corona       120.1   4 3.70
  Dodge Challenger    318.0   8 2.76
  AMC Javelin         304.0   8 3.15
  Camaro Z28          350.0   8 3.73
  Pontiac Firebird    400.0   8 3.08
  Fiat X1-9            79.0   4 4.08
  Porsche 914-2       120.3   4 4.43
  Lotus Europa         95.1   4 3.77
  Ford Pantera L      351.0   8 4.22
  Ferrari Dino        145.0   6 3.62
  Maserati Bora       301.0   8 3.54
  Volvo 142E          121.0   4 4.11
The code above selects the variables at index positions 3, 2, and 5 (disp, cyl, and drat), in the order given inside the single square brackets, and returns all of their observations as a new data frame.
  > data[c(4,3,6)]
                       hp  disp    wt
  Mazda RX4           110 160.0 2.620
  Mazda RX4 Wag       110 160.0 2.875
  Datsun 710           93 108.0 2.320
  Hornet 4 Drive      110 258.0 3.215
  Hornet Sportabout   175 360.0 3.440
  Valiant             105 225.0 3.460
  Duster 360          245 360.0 3.570
  Merc 240D            62 146.7 3.190
  Merc 230             95 140.8 3.150
  Merc 280            123 167.6 3.440
  Merc 280C           123 167.6 3.440
  Merc 450SE          180 275.8 4.070
  Merc 450SL          180 275.8 3.730
  Merc 450SLC         180 275.8 3.780
  Cadillac Fleetwood  205 472.0 5.250
  Lincoln Continental 215 460.0 5.424
  Chrysler Imperial   230 440.0 5.345
  Fiat 128             66  78.7 2.200
  Honda Civic          52  75.7 1.615
  Toyota Corolla       65  71.1 1.835
  Toyota Corona        97 120.1 2.465
  Dodge Challenger    150 318.0 3.520
  AMC Javelin         150 304.0 3.435
  Camaro Z28          245 350.0 3.840
  Pontiac Firebird    175 400.0 3.845
  Fiat X1-9            66  79.0 1.935
  Porsche 914-2        91 120.3 2.140
  Lotus Europa        113  95.1 1.513
  Ford Pantera L      264 351.0 3.170
  Ferrari Dino        175 145.0 2.770
  Maserati Bora       335 301.0 3.570
  Volvo 142E          109 121.0 2.780
The code above selects the variables at index positions 4, 3, and 6 (hp, disp, and wt), in the order given inside the single square brackets, and returns all of their observations as a new data frame.
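Columns can also be selected by name instead of by position, which tends to be easier to read and does not break if the column order changes. The sketch below is illustrative; the variable names by_position, by_name, and first_rows are not from the examples above.

```r
data <- mtcars

# Select the same three columns by position and by name
by_position <- data[c(4, 3, 6)]
by_name     <- data[c("hp", "disp", "wt")]

# Both forms keep the requested column order and every row
names(by_position)
dim(by_position)
same_subset <- identical(by_position, by_name)

# With a comma, the first index selects rows and the second selects columns
first_rows <- data[1:5, c("hp", "disp")]
```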
 
The difference between the double and single square brackets operators lies in how many variables they select and what they return. The [[ operator extracts a single variable and returns its observations as a vector, whereas [ can select multiple variables and returns a subset of the same type as the dataset, here a data frame.
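The difference in return type is easy to see by asking for the same single column with each operator. This is a minimal sketch; the variable names are illustrative.

```r
data <- mtcars

# Same column, two operators, two different kinds of result
single_brackets <- data["mpg"]     # a data frame with one column
double_brackets <- data[["mpg"]]   # a numeric vector

class(single_brackets)
class(double_brackets)

# Functions that expect a numeric vector work directly on the [[ result
mean_mpg <- mean(double_brackets)
```

Choosing between the two therefore depends on what the next step expects: a vector of values, or a data frame that can be subsetted further.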
 
For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.
 
The [[ form allows only a single element to be selected using integer or character indices, whereas [ allows indexing by vectors. Note though that for a list or other recursive object, the index can be a vector and each element of the vector is applied in turn to the list, the selected component, the selected component of that component, and so on. The result is still a single element.
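The behaviour described for lists, including the case where a vector index is applied step by step to nested components, can be sketched as follows. The config list and its contents are hypothetical and exist only for this illustration.

```r
# A small nested list (hypothetical structure, for illustration only)
config <- list(
  model = list(name = "lm", params = c(alpha = 0.1, beta = 0.5)),
  seed  = 42
)

as_list    <- config["seed"]      # [ returns a list containing the element
as_element <- config[["seed"]]    # [[ returns the element itself

# A vector index with [[ is applied in turn: first "model", then "name"
nested_name <- config[[c("model", "name")]]

# The same idea with positions: element 1, then element 2 of that element
nested_params <- config[[c(1, 2)]]
```

Here nested_name is the same object as config[["model"]][["name"]], and nested_params is the params vector, so the result of [[ is still a single element even though the index is a vector.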
 
So far, the operators have been used to create subsets containing a specified number of variables. The next section shows how to create blank subsets, which contain all the variables and observations of a dataset.
 

Creating blank subsets of a dataset

 
The single square brackets operator can create a subset containing one or more variables. To obtain a subset of multiple variables, list the required column positions or names inside the single square brackets.
 
The examples below use the predefined rock dataset, a data frame with four variables and 48 observations, to create blank subsets containing all the variables of a dataset.
 
A blank subset is created by writing the dataset name followed by empty single square brackets, for example rock[]. Because no index is supplied, nothing is filtered out and the result contains every variable and every observation of the dataset. Supplying column names or positions inside the brackets instead restricts the result to those variables.
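A quick way to check that a blank subset really keeps everything is to compare it with the original dataset. The sketch below also shows one common use of the empty brackets on the left-hand side of an assignment, where they replace the contents of an object while keeping its structure; this second part goes slightly beyond the article's examples, and the names blank and rock_copy are illustrative.

```r
# A blank subset keeps every row and column
blank <- rock[]
same_as_original <- identical(blank, rock)
dim(blank)

# Assigning into a blank subset replaces the values but keeps the object
# a data frame with the same column names
rock_copy <- rock
rock_copy[] <- lapply(rock_copy, as.character)

is.data.frame(rock_copy)
sapply(rock_copy, class)
```

Without the empty brackets, rock_copy <- lapply(rock_copy, as.character) would replace the data frame with a plain list, which is why the blank form is useful in assignments.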
 
The following examples create blank subsets of two predefined R datasets, rock and mtcars. The structure of the rock dataset is as follows,
  > str(rock)
  'data.frame':   48 obs. of  4 variables:
   $ area : int  4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
   $ peri : num  2792 3893 3931 3869 3949 ...
   $ shape: num  0.0903 0.1486 0.1833 0.1171 0.1224 ...
   $ perm : num  6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
The blank subset for the above rock dataset is as follows,
  > rock[]
      area     peri     shape   perm
  1   4990 2791.900 0.0903296    6.3
  2   7002 3892.600 0.1486220    6.3
  3   7558 3930.660 0.1833120    6.3
  4   7352 3869.320 0.1170630    6.3
  5   7943 3948.540 0.1224170   17.1
  6   7979 4010.150 0.1670450   17.1
  7   9333 4345.750 0.1896510   17.1
  8   8209 4344.750 0.1641270   17.1
  9   8393 3682.040 0.2036540  119.0
  10  6425 3098.650 0.1623940  119.0
  11  9364 4480.050 0.1509440  119.0
  12  8624 3986.240 0.1481410  119.0
  13 10651 4036.540 0.2285950   82.4
  14  8868 3518.040 0.2316230   82.4
  15  9417 3999.370 0.1725670   82.4
  16  8874 3629.070 0.1534810   82.4
  17 10962 4608.660 0.2043140   58.6
  18 10743 4787.620 0.2627270   58.6
  19 11878 4864.220 0.2000710   58.6
  20  9867 4479.410 0.1448100   58.6
  21  7838 3428.740 0.1138520  142.0
  22 11876 4353.140 0.2910290  142.0
  23 12212 4697.650 0.2400770  142.0
  24  8233 3518.440 0.1618650  142.0
  25  6360 1977.390 0.2808870  740.0
  26  4193 1379.350 0.1794550  740.0
  27  7416 1916.240 0.1918020  740.0
  28  5246 1585.420 0.1330830  740.0
  29  6509 1851.210 0.2252140  890.0
  30  4895 1239.660 0.3412730  890.0
  31  6775 1728.140 0.3116460  890.0
  32  7894 1461.060 0.2760160  890.0
  33  5980 1426.760 0.1976530  950.0
  34  5318  990.388 0.3266350  950.0
  35  7392 1350.760 0.1541920  950.0
  36  7894 1461.060 0.2760160  950.0
  37  3469 1376.700 0.1769690  100.0
  38  1468  476.322 0.4387120  100.0
  39  3524 1189.460 0.1635860  100.0
  40  5267 1644.960 0.2538320  100.0
  41  5048  941.543 0.3286410 1300.0
  42  1016  308.642 0.2300810 1300.0
  43  5605 1145.690 0.4641250 1300.0
  44  8793 2280.490 0.4204770 1300.0
  45  3475 1174.110 0.2007440  580.0
  46  1651  597.808 0.2626510  580.0
  47  5514 1455.880 0.1824530  580.0
  48  9718 1485.580 0.2004470  580.0
The structure of mtcars dataset is as follows,
  > str(mtcars)
  'data.frame':   32 obs. of  11 variables:
   $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
   $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
   $ disp: num  160 160 108 258 360 ...
   $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
   $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
   $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
   $ qsec: num  16.5 17 18.6 19.4 17 ...
   $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
   $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
   $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
   $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
The blank subset of mtcars dataset is as follows,
  > mtcars[]
                       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
  Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
  Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
  Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
  Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
  Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
  Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
  Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
  Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
  Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
  Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
  Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
  Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
  Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
  Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
  Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
  Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
  Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
  Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
  Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
  Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
  Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
  Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
  AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
  Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
  Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
  Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
  Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
  Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
  Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
  Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
  Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
  Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Summary

 
In this article, I demonstrated how to create a blank subset of a dataset, along with subsets of selected variables, in order to extract relevant data. The dollar, double square brackets, and single square brackets operators were applied to the mtcars and rock datasets, with code snippets and their outputs.