How To Create Subsets Using Negative Numerical Data In R

Introduction

In this article, I demonstrate how to create subsets of data in R using negative numerical values (negative indices), so that the relevant data can be extracted from a dataset before building a machine learning model. A negative index creates a subset that contains every observation except those whose index positions are listed inside the square brackets with a minus sign.
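As a minimal sketch of the idea, the following uses a small made-up vector (x and its values are illustrative and not part of the datasets used in this article) to show that a negative index drops elements instead of selecting them.

```r
# Illustrative vector; the values are arbitrary.
x <- c(10, 20, 30, 40, 50)

drop_one   <- x[-2]        # everything except the 2nd element
drop_range <- x[-(2:4)]    # everything except elements 2 to 4
drop_ends  <- x[-c(1, 5)]  # everything except the first and last element

print(drop_one)    # 10 30 40 50
print(drop_range)  # 10 50
print(drop_ends)   # 20 30 40
```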

Extracting data from a dataset, or creating a subset of it, is part of data pre-processing in R. The goal is to obtain clean and relevant data so that a machine learning model can make accurate predictions.

Subsets are usually created during pre-processing, before any further analysis of the data. R provides several objects, such as data frames, vectors, arrays and lists, that can be subset and that can store the resulting values. Different methods are available for subsetting each of them.

Pre-processing is one of the most important jobs in any analysis in R. Several operators can be used to create a subset of a dataset, as described in the following sections.

Different types of operators for creating subsets of data

Three kinds of operators can be used to create subsets:

Dollar operator

The dollar operator selects a single variable of a dataset by name. Writing the dataset name, the dollar operator and the variable name returns that variable alone, one variable at a time. When the dollar operator is used with a data frame, the result is a vector.
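A short sketch of this behaviour, using the built-in quakes data frame (the names mag and mag_rest are illustrative): the dollar operator returns a plain vector, which can then take a negative index of its own.

```r
# quakes ships with R as a data frame of 1000 observations.
mag <- quakes$mag

print(class(quakes))    # "data.frame"
print(is.vector(mag))   # TRUE: a single column comes back as a vector
print(length(mag))      # 1000, one value per observation

# A negative index on the extracted vector drops elements by position.
mag_rest <- quakes$mag[-(1:10)]
print(length(mag_rest)) # 990
```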

We will now look at some examples of how to use the dollar operator to create subsets with negative numerical values. The examples use the quakes dataset. First, we create a smaller data frame by excluding rows 30 to 990 as follows,

> data = quakes[-(30:990),]
> data
        lat   long depth mag stations
1    -20.42 181.62   562 4.8       41
2    -20.62 181.03   650 4.2       15
3    -26.00 184.10    42 5.4       43
4    -17.97 181.66   626 4.1       19
5    -20.42 181.96   649 4.0       11
6    -19.68 184.31   195 4.0       12
7    -11.70 166.10    82 4.8       43
8    -28.11 181.93   194 4.4       15
9    -28.74 181.74   211 4.7       35
10   -17.47 179.59   622 4.3       19
11   -21.44 180.69   583 4.4       13
12   -12.26 167.00   249 4.6       16
13   -18.54 182.11   554 4.4       19
14   -21.00 181.66   600 4.4       10
15   -20.70 169.92   139 6.1       94
16   -15.94 184.95   306 4.3       11
17   -13.64 165.96    50 6.0       83
18   -17.83 181.50   590 4.5       21
19   -23.50 179.78   570 4.4       13
20   -22.63 180.31   598 4.4       18
21   -20.84 181.16   576 4.5       17
22   -10.98 166.32   211 4.2       12
23   -23.30 180.16   512 4.4       18
24   -30.20 182.00   125 4.7       22
25   -19.66 180.28   431 5.4       57
26   -17.94 181.49   537 4.0       15
27   -14.72 167.51   155 4.6       18
28   -16.46 180.79   498 5.2       79
29   -20.97 181.47   582 4.5       25
991  -20.73 181.42   575 4.3       18
992  -15.45 181.42   409 4.3       27
993  -20.05 183.86   243 4.9       65
994  -17.95 181.37   642 4.0       17
995  -17.70 188.10    45 4.2       10
996  -25.93 179.54   470 4.4       22
997  -12.28 167.06   248 4.7       35
998  -20.13 184.20   244 4.5       34
999  -17.40 187.80    40 4.5       14
1000 -21.59 170.56   165 6.0      119

As the output shows, a subset of the quakes dataset has been created. It contains all the variables but only those observations whose row positions are not listed inside the parentheses after the minus sign, so rows 30 to 990 are excluded.
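The parentheses around the range matter. The sketch below uses a small illustrative vector (x, negated_range, unary_first and result are made-up names) to show that the minus sign binds more tightly than the colon, so leaving the parentheses out produces a mix of negative and positive positions, which R refuses to use as an index.

```r
x <- 1:10

negated_range <- -(2:4)  # -2 -3 -4: drop positions 2 to 4
unary_first   <- -2:4    # -2 -1 0 1 2 3 4: runs from -2 up to 4

print(x[negated_range])  # 1 5 6 7 8 9 10

# Mixing negative and positive positions in one index is an error,
# so the failure is caught here instead of stopping the script.
result <- tryCatch(x[unary_first], error = function(e) "error")
print(result)            # "error"
```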

> data = quakes[-(40:980),-(2:4)]
> data
        lat stations
1    -20.42       41
2    -20.62       15
3    -26.00       43
4    -17.97       19
5    -20.42       11
6    -19.68       12
7    -11.70       43
8    -28.11       15
9    -28.74       35
10   -17.47       19
11   -21.44       13
12   -12.26       16
13   -18.54       19
14   -21.00       10
15   -20.70       94
16   -15.94       11
17   -13.64       83
18   -17.83       21
19   -23.50       13
20   -22.63       18
21   -20.84       17
22   -10.98       12
23   -23.30       18
24   -30.20       22
25   -19.66       57
26   -17.94       15
27   -14.72       18
28   -16.46       79
29   -20.97       25
30   -19.84       17
31   -22.58       21
32   -16.32       30
33   -15.55       42
34   -23.55       10
35   -16.30       10
36   -25.82       13
37   -18.73       17
38   -17.64       17
39   -17.66       17
981  -20.82       67
982  -22.95       21
983  -28.22       49
984  -27.99       22
985  -15.54       17
986  -12.37       16
987  -22.33       51
988  -22.70       27
989  -17.86       12
990  -16.00       33
991  -20.73       18
992  -15.45       27
993  -20.05       65
994  -17.95       17
995  -17.70       10
996  -25.93       22
997  -12.28       35
998  -20.13       34
999  -17.40       14
1000 -21.59      119

As the output shows, a subset of the quakes dataset has been created. It excludes the observations and the variables whose positions are listed inside the parentheses after the minus sign: rows 40 to 980 and columns 2 to 4 (long, depth and mag).
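To confirm what the two calls above keep without printing every row, one can check the dimensions and column names. This is only a sketch; rows_only and rows_and_cols are illustrative names.

```r
# Drop rows 30 to 990, keep every column.
rows_only <- quakes[-(30:990), ]
print(dim(rows_only))        # 39 5

# Drop rows 40 to 980 and columns 2 to 4 in one call.
rows_and_cols <- quakes[-(40:980), -(2:4)]
print(dim(rows_and_cols))    # 59 2
print(names(rows_and_cols))  # "lat" "stations"
```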

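Now we will use the dollar operator with the lat variable. The examples that follow also use the long, depth and mag variables, so they assume that data still has all five columns, for example data = quakes[-(40:980),].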

> ds = data$lat[-(10:20)]
> ds
 [1] -20.42 -20.62 -26.00 -17.97 -20.42 -19.68 -11.70 -28.11 -28.74 -20.84 -10.98 -23.30 -30.20 -19.66 -17.94 -14.72 -16.46 -20.97 -19.84 -22.58 -16.32 -15.55 -23.55
[24] -16.30 -25.82 -18.73 -17.64 -17.66 -20.82 -22.95 -28.22 -27.99 -15.54 -12.37 -22.33 -22.70 -17.86 -16.00 -20.73 -15.45 -20.05 -17.95 -17.70 -25.93 -12.28 -20.13
[47] -17.40 -21.59

As the output shows, combining the dollar operator with the dataset name and a variable name extracts a single variable from data, which was itself derived from quakes. The subset holds the observations of the lat variable and is stored in a variable named ds. It contains all the elements except those at index positions 10 to 20, which are listed inside the parentheses after the minus sign.

> df = data$stations[-(11:19)]
> df
 [1]  41  15  43  19  11  12  43  15  35  19  18  17  12  18  22  57  15  18  79  25  17  21  30  42  10  10  13  17  17  17  67  21  49  22  17  16  51  27  12  33
[41]  18  27  65  17  10  22  35  34  14 119

Here the subset holds the observations of the stations variable and is stored in a variable named df. It contains all the elements except those at index positions 11 to 19, which are listed inside the parentheses after the minus sign.

> dn = data$depth[-(5:10)]
> dn
 [1] 562 650  42 626 583 249 554 600 139 306  50 590 570 598 576 211 512 125 431 537 155 498 582 328 553  50 292 349  48 600 206 574 585 577  42  75  71  60 291 125
[41]  69 614 108 575 409 243 642  45 470 248 244  40 165

Here the subset holds the observations of the depth variable and is stored in a variable named dn. It contains all the elements except those at index positions 5 to 10, which are listed inside the parentheses after the minus sign.

> da = data$mag[-(5:10)]
> da
 [1] 4.8 4.2 5.4 4.1 4.4 4.6 4.4 4.4 6.1 4.3 6.0 4.5 4.4 4.4 4.5 4.2 4.4 4.7 5.4 4.0 4.6 5.2 4.5 4.4 4.6 4.7 4.8 4.0 4.5 4.3 4.5 4.6 4.1 5.0 4.7 4.9 4.3 4.5 4.2 5.2
[41] 4.8 4.0 4.7 4.3 4.3 4.9 4.0 4.2 4.4 4.7 4.5 4.5 6.0

Here the subset holds the observations of the mag variable and is stored in a variable named da. It contains all the elements except those at index positions 5 to 10, which are listed inside the parentheses after the minus sign.
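A negative index is shorthand for "every position except these". The sketch below spells that out with setdiff and compares the two results. It assumes data was built as quakes[-(40:980), ] with all five columns; keep is an illustrative name.

```r
# Assumption: data keeps all five columns of quakes.
data <- quakes[-(40:980), ]

# Negative index: drop positions 5 to 10.
da <- data$mag[-(5:10)]

# Equivalent positive index: every position that is not in 5:10.
keep <- setdiff(seq_along(data$mag), 5:10)

print(length(data$mag))               # 59
print(length(da))                     # 53
print(identical(da, data$mag[keep]))  # TRUE
```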

Double square brackets operator

The double square brackets operator creates a subset containing either all the observations of a single variable of a dataset or, combined with a further index, a single observation of that variable. The variable can be selected by its index position or by its name. The double square brackets operator can be used with a data frame.

> data[['long']]
 [1] 181.62 181.03 184.10 181.66 181.96 184.31 166.10 181.93 181.74 179.59 180.69 167.00 182.11 181.66 169.92 184.95 165.96 181.50 179.78 180.31 181.16 166.32 180.16
[24] 182.00 180.28 181.49 167.51 180.79 181.47 182.37 179.24 166.74 185.05 180.80 186.00 179.33 169.23 181.28 181.40 169.33 176.78 186.10 179.82 186.04 169.41 182.30
[47] 181.70 166.32 180.08 185.25

As we can see, the above code snippet created a subset containing the single variable long. The argument inside the double square brackets is the name of the variable.

> data[[3]]
 [1] 562 650  42 626 649 195  82 194 211 622 583 249 554 600 139 306  50 590 570 598 576 211 512 125 431 537 155 498 582 328 553  50 292 349  48 600 206 574 585 230
[41] 263  96 511  94 246  56 329  70 493 129

As we can see, the above code snippet created a subset containing the single variable depth. The argument inside the double square brackets is the index position of the depth variable.

> data[[3]][2]
[1] 650

As we can see, the above code snippet created a subset containing a single observation of the variable depth. The double square brackets hold the index position of the column, and the single square brackets that follow hold the index position of the observation within that column.
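The following sketch checks that selecting by name and by position give the same vector, and that the chained form reaches the same cell as ordinary row and column indexing. It assumes data was built as quakes[-(40:980), ] with all five columns; by_name, by_position and one_value are illustrative names.

```r
# Assumption: data keeps all five columns of quakes.
data <- quakes[-(40:980), ]

by_name     <- data[["depth"]]  # select the column by name
by_position <- data[[3]]        # depth is the 3rd column

print(identical(by_name, by_position))  # TRUE

# Second observation of the third column, reached in two ways.
one_value <- data[[3]][2]
print(one_value)   # 650
print(data[2, 3])  # 650
```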

Single square brackets operator

The single square brackets operator creates a subset containing all the observations of one or more variables of a dataset. We will now look at some examples of how to use the single square brackets operator with negative numerical values to create subsets of a dataset as follows,

> data[-(30:980),]
      lat   long depth mag stations
1  -20.42 181.62   562 4.8       41
2  -20.62 181.03   650 4.2       15
3  -26.00 184.10    42 5.4       43
4  -17.97 181.66   626 4.1       19
5  -20.42 181.96   649 4.0       11
6  -19.68 184.31   195 4.0       12
7  -11.70 166.10    82 4.8       43
8  -28.11 181.93   194 4.4       15
9  -28.74 181.74   211 4.7       35
10 -17.47 179.59   622 4.3       19
11 -21.44 180.69   583 4.4       13
12 -12.26 167.00   249 4.6       16
13 -18.54 182.11   554 4.4       19
14 -21.00 181.66   600 4.4       10
15 -20.70 169.92   139 6.1       94
16 -15.94 184.95   306 4.3       11
17 -13.64 165.96    50 6.0       83
18 -17.83 181.50   590 4.5       21
19 -23.50 179.78   570 4.4       13
20 -22.63 180.31   598 4.4       18
21 -20.84 181.16   576 4.5       17
22 -10.98 166.32   211 4.2       12
23 -23.30 180.16   512 4.4       18
24 -30.20 182.00   125 4.7       22
25 -19.66 180.28   431 5.4       57
26 -17.94 181.49   537 4.0       15
27 -14.72 167.51   155 4.6       18
28 -16.46 180.79   498 5.2       79
29 -20.97 181.47   582 4.5       25

As the output shows, the single square brackets operator with negative numerical values created a subset of data containing all the columns but excluding the rows at index positions 30 to 980. Because data has only 59 rows, this drops rows 30 to 59 and keeps the first 29.
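Negative positions that lie beyond the end of the object are simply ignored. The sketch below, which assumes data was built as quakes[-(40:980), ], checks that excluding 30 to 980 from a 59-row data frame has the same effect as excluding 30 to 59 (first29 and same_thing are illustrative names).

```r
# Assumption: data keeps all five columns of quakes and has 59 rows.
data <- quakes[-(40:980), ]
print(nrow(data))     # 59

# Positions 60 to 980 do not exist in data, so only rows 30 to 59 are dropped.
first29    <- data[-(30:980), ]
same_thing <- data[-(30:59), ]

print(nrow(first29))  # 29
print(isTRUE(all.equal(first29, same_thing)))  # TRUE
```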

> data[-c(3,1,4)]
       long stations
1    181.62       41
2    181.03       15
3    184.10       43
4    181.66       19
5    181.96       11
6    184.31       12
7    166.10       43
8    181.93       15
9    181.74       35
10   179.59       19
11   180.69       13
12   167.00       16
13   182.11       19
14   181.66       10
15   169.92       94
16   184.95       11
17   165.96       83
18   181.50       21
19   179.78       13
20   180.31       18
21   181.16       17
22   166.32       12
23   180.16       18
24   182.00       22
25   180.28       57
26   181.49       15
27   167.51       18
28   180.79       79
29   181.47       25
30   182.37       17
31   179.24       21
32   166.74       30
33   185.05       42
34   180.80       10
35   186.00       10
36   179.33       13
37   169.23       17
38   181.28       17
39   181.40       17
981  181.67       67
982  170.56       21
983  183.60       49
984  183.50       22
985  187.15       17
986  166.93       16
987  171.66       51
988  170.30       27
989  181.30       12
990  184.53       33
991  181.42       18
992  181.42       27
993  183.86       65
994  181.37       17
995  188.10       10
996  179.54       22
997  167.06       35
998  184.20       34
999  187.80       14
1000 170.56      119

As the output shows, the single square brackets operator with negative numerical values created a subset containing all the rows but excluding the columns at index positions 3, 1 and 4 (depth, lat and mag).

> data[-c(2,4,1)]
     depth stations
1      562       41
2      650       15
3       42       43
4      626       19
5      649       11
6      195       12
7       82       43
8      194       15
9      211       35
10     622       19
11     583       13
12     249       16
13     554       19
14     600       10
15     139       94
16     306       11
17      50       83
18     590       21
19     570       13
20     598       18
21     576       17
22     211       12
23     512       18
24     125       22
25     431       57
26     537       15
27     155       18
28     498       79
29     582       25
30     328       17
31     553       21
32      50       30
33     292       42
34     349       10
35      48       10
36     600       13
37     206       17
38     574       17
39     585       17
981    577       67
982     42       21
983     75       49
984     71       22
985     60       17
986    291       16
987    125       51
988     69       27
989    614       12
990    108       33
991    575       18
992    409       27
993    243       65
994    642       17
995     45       10
996    470       22
997    248       35
998    244       34
999     40       14
1000   165      119

The above code drops the columns whose index positions are listed after the minus sign inside the single square brackets, creating a subset that excludes the columns at index positions 2, 4 and 1 (long, mag and lat).
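The order in which the excluded positions are written does not affect the result, since the remaining columns always keep their original order. A small sketch, assuming data was built as quakes[-(40:980), ] (a and b are illustrative names):

```r
# Assumption: data keeps all five columns of quakes.
data <- quakes[-(40:980), ]

a <- data[-c(2, 4, 1)]  # positions in the order used above
b <- data[-c(1, 2, 4)]  # same positions, sorted

print(names(a))         # "depth" "stations"
print(identical(a, b))  # TRUE
```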

The difference between the double square brackets operator and the single square brackets operator is the number of variables they select. The [[ operator extracts a single variable and its observations, whereas [ creates a subset of one or more variables whose type is the same as that of the dataset. For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.

The [[ form allows only a single element to be selected using integer or character indices, whereas [ allows indexing by vectors. Note though that for a list or other recursive object, the index can be a vector and each element of the vector is applied in turn to the list, the selected component, the selected component of that component, and so on. The result is still a single element.
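The difference in return type, and the recursive indexing described above, can be seen in a short sketch. It uses the built-in rock data frame and a made-up nested list (nested, as_frame, as_vector and inner are illustrative names).

```r
# [ keeps the container type, [[ extracts the element itself.
as_frame  <- rock["area"]
as_vector <- rock[["area"]]

print(class(as_frame))   # "data.frame"
print(class(as_vector))  # "integer"

# For a list, a vector index inside [[ is applied level by level.
nested <- list(a = list(b = 1:3))
inner  <- nested[[c("a", "b")]]  # same as nested[["a"]][["b"]]
print(inner)             # 1 2 3
```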

Now we will discuss how to use the operators described above to create subsets with a specified number of variables. The next section uses negative numerical values to create such subsets from several datasets.

Creating subsets using negative numerical values

The single square brackets operator can create a subset containing more than one variable. To do so, we list the required variables inside the single square brackets.

A subset using negative numerical values is created by writing the dataset name followed by single square brackets that contain a minus sign and the index positions to leave out. Such a subset contains only those variables and observations whose index positions are not listed inside the square brackets. In this way we can give the index numbers of the columns that we want to exclude from the resulting subset.
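Two practical points are worth a sketch here, using the built-in rock data frame. First, the minus sign works only with positions, so excluding columns by name needs a different idiom; the %in% approach shown is one common choice rather than the only one. Second, with the row and column form of indexing, a result that is left with a single column is simplified to a vector unless drop = FALSE is given. All variable names are illustrative.

```r
# Excluding by name: build a logical index instead of negating names.
drop_cols <- c("peri", "shape")
by_name   <- rock[!(names(rock) %in% drop_cols)]
by_pos    <- rock[-c(2, 3)]
print(identical(names(by_name), names(by_pos)))  # TRUE

# Leaving one column: the [rows, cols] form simplifies to a vector.
as_vector <- rock[, -(2:4)]
as_frame  <- rock[, -(2:4), drop = FALSE]
print(is.data.frame(as_vector))  # FALSE
print(is.data.frame(as_frame))   # TRUE
```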

Now we will use the predefined rock dataset, a data frame with four variables and 48 observations, to create subsets using negative numerical values. The same technique is then applied to mtcars, another predefined dataset available in R. The structure of the rock dataset is as follows,

> str(rock)
'data.frame':   48 obs. of  4 variables:
 $ area : int  4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
 $ peri : num  2792 3893 3931 3869 3949 ...
 $ shape: num  0.0903 0.1486 0.1833 0.1171 0.1224 ...
 $ perm : num  6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...

A subset of the rock dataset using negative numerical values is created as follows,

> rock[-c(2,3)]
    area   perm
1   4990    6.3
2   7002    6.3
3   7558    6.3
4   7352    6.3
5   7943   17.1
6   7979   17.1
7   9333   17.1
8   8209   17.1
9   8393  119.0
10  6425  119.0
11  9364  119.0
12  8624  119.0
13 10651   82.4
14  8868   82.4
15  9417   82.4
16  8874   82.4
17 10962   58.6
18 10743   58.6
19 11878   58.6
20  9867   58.6
21  7838  142.0
22 11876  142.0
23 12212  142.0
24  8233  142.0
25  6360  740.0
26  4193  740.0
27  7416  740.0
28  5246  740.0
29  6509  890.0
30  4895  890.0
31  6775  890.0
32  7894  890.0
33  5980  950.0
34  5318  950.0
35  7392  950.0
36  7894  950.0
37  3469  100.0
38  1468  100.0
39  3524  100.0
40  5267  100.0
41  5048 1300.0
42  1016 1300.0
43  5605 1300.0
44  8793 1300.0
45  3475  580.0
46  1651  580.0
47  5514  580.0
48  9718  580.0

The above code drops the columns whose index positions are listed after the minus sign inside the single square brackets. The result is a subset with all the rows but without the columns at index positions 2 and 3 (peri and shape).

The structure of the mtcars dataset is as follows,

> str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

A subset of the mtcars dataset using negative numerical values is created as follows,

> mtcars[-c(6,5,3,8)]
                     mpg cyl  hp  qsec am gear carb
Mazda RX4           21.0   6 110 16.46  1    4    4
Mazda RX4 Wag       21.0   6 110 17.02  1    4    4
Datsun 710          22.8   4  93 18.61  1    4    1
Hornet 4 Drive      21.4   6 110 19.44  0    3    1
Hornet Sportabout   18.7   8 175 17.02  0    3    2
Valiant             18.1   6 105 20.22  0    3    1
Duster 360          14.3   8 245 15.84  0    3    4
Merc 240D           24.4   4  62 20.00  0    4    2
Merc 230            22.8   4  95 22.90  0    4    2
Merc 280            19.2   6 123 18.30  0    4    4
Merc 280C           17.8   6 123 18.90  0    4    4
Merc 450SE          16.4   8 180 17.40  0    3    3
Merc 450SL          17.3   8 180 17.60  0    3    3
Merc 450SLC         15.2   8 180 18.00  0    3    3
Cadillac Fleetwood  10.4   8 205 17.98  0    3    4
Lincoln Continental 10.4   8 215 17.82  0    3    4
Chrysler Imperial   14.7   8 230 17.42  0    3    4
Fiat 128            32.4   4  66 19.47  1    4    1
Honda Civic         30.4   4  52 18.52  1    4    2
Toyota Corolla      33.9   4  65 19.90  1    4    1
Toyota Corona       21.5   4  97 20.01  0    3    1
Dodge Challenger    15.5   8 150 16.87  0    3    2
AMC Javelin         15.2   8 150 17.30  0    3    2
Camaro Z28          13.3   8 245 15.41  0    3    4
Pontiac Firebird    19.2   8 175 17.05  0    3    2
Fiat X1-9           27.3   4  66 18.90  1    4    1
Porsche 914-2       26.0   4  91 16.70  1    5    2
Lotus Europa        30.4   4 113 16.90  1    5    2
Ford Pantera L      15.8   8 264 14.50  1    5    4
Ferrari Dino        19.7   6 175 15.50  1    5    6
Maserati Bora       15.0   8 335 14.60  1    5    8
Volvo 142E          21.4   4 109 18.60  1    4    2

The above code drops the columns whose index positions are listed after the minus sign inside the single square brackets. The result is a subset with all the rows but without the columns at index positions 6, 5, 3 and 8 (wt, drat, disp and vs).
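Before dropping columns by position, it can help to look up which names those positions refer to. A brief sketch on mtcars (drop_pos, dropped_names and kept are illustrative names):

```r
drop_pos <- c(6, 5, 3, 8)

# Names of the columns that are about to be excluded.
dropped_names <- names(mtcars)[drop_pos]
print(dropped_names)  # "wt" "drat" "disp" "vs"

kept <- mtcars[-drop_pos]
print(names(kept))    # "mpg" "cyl" "hp" "qsec" "am" "gear" "carb"
print(dim(kept))      # 32 7
```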

Summary

In this article, I demonstrated how to create subsets of datasets using negative numerical values in order to extract relevant data for analysis. The dollar, double square brackets and single square brackets operators were applied to the quakes, rock and mtcars datasets, with code snippets and their outputs.