How To Perform Stratified Sampling On A Dataset In R

Introduction

 
In this article, I am going to demonstrate how to create samples, that is, subsets of a dataset, using the stratified sampling method and the strata function in R. Sampling is the process of selecting a subset from a whole population. The sample is a small portion of the data that aims to represent the characteristics of the whole population.
 
It is a shortcut for investigating the whole population. Suppose there is a dataset of 1000 observations and you want to extract a subset of them. There are different methods for extracting a subset from a dataset; which one is suitable depends on the data and the business requirement.
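As a point of comparison before stratification is introduced, the simplest way to extract a subset is a simple random sample. The sketch below uses only base R; the data frame `population` and its columns are hypothetical and the values are simulated.

```r
# Hypothetical population of 1000 observations (values are simulated)
set.seed(42)  # fix the seed so the draw is reproducible
population <- data.frame(id = 1:1000, value = rnorm(1000))

# Draw 100 row indices without replacement, then subset the data frame
sample_rows <- sample(nrow(population), size = 100)
simple_sample <- population[sample_rows, ]

nrow(simple_sample)  # number of sampled observations
```

Every observation has the same chance of being selected here, which is exactly what stratified sampling relaxes.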
 
We use sampling in our day-to-day life. For example, if you visit a doctor, he or she will take a small sample of blood to check the condition of your whole body.
 

Need for stratified sampling method

 
Suppose you want to conduct a survey about a product. Ideally, you would ask the whole population, but that could be expensive because it takes too much time and many resources. In such cases we use sampling techniques, because a well-chosen sample identifies a segment of people who can represent the characteristics of the whole population.
 

Stratified Sampling

 
A stratum is a subset of the population that shares at least one common characteristic. Stratified sampling is performed by,
  • Identifying the relevant strata and their actual representation in the population.
  • Using random sampling to select a sufficient number of subjects from each stratum.
Stratified sampling is often used when one or more of the strata in the population have a low incidence relative to the other strata. Stratified sampling can reduce sampling error, particularly when the units within a stratum are similar to one another.
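The two steps above can be sketched in base R without any package. This is only an illustration under assumed choices: the stratification variable (`cyl` in the built-in mtcars dataset) and the 25% sampling fraction are picked for the example, and the names `strata_list`, `fraction` and `sampled` are hypothetical.

```r
set.seed(1)  # reproducible draw

# Step 1: identify the strata and their sizes in the population
strata_list <- split(mtcars, mtcars$cyl)
sapply(strata_list, nrow)  # population size of each stratum

# Step 2: draw a simple random sample inside every stratum
fraction <- 0.25  # assumed sampling fraction, applied to each stratum
sampled <- lapply(strata_list, function(stratum) {
  n_h <- ceiling(fraction * nrow(stratum))  # at least one unit per stratum
  stratum[sample(nrow(stratum), n_h), , drop = FALSE]
})

stratified <- do.call(rbind, sampled)
table(stratified$cyl)  # stratum sample sizes are 3, 2 and 4
```

Using the same fraction in every stratum keeps each stratum's share of the sample close to its share of the population.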
 
Syntax for stratified sampling with equal/unequal probabilities,
    strata(x, stratanames = NULL, size,
           method = c("srswor", "srswr", "poisson", "systematic"),
           pik, description = FALSE)
Arguments for the strata function
 
x
 
A data frame or a matrix; its number of rows is n, the population size.
 
stratanames
 
Vector of stratification variables.
 
size
 
Vector of stratum sample sizes (in the order in which the strata are given in the input data set).
 
method
 
Method used to select units. The implemented methods are,
  • Simple random sampling without replacement ("srswor")
  • Simple random sampling with replacement ("srswr")
  • Poisson sampling ("poisson")
  • Systematic sampling ("systematic")
The default is "srswor".
 
pik
 
Vector of inclusion probabilities or auxiliary information used to compute them; this argument is only used for unequal probability sampling (Poisson and systematic). If auxiliary information is provided, the function uses the inclusionprobabilities function to compute these probabilities. If the method is "srswr" and the sample size is larger than the population size, this vector is normalized to one.
 
description
 
A message is printed if its value is TRUE; the message gives the number of selected units and the number of units in the population. By default, the value is FALSE.
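To make the pik argument more concrete, the sketch below shows one common way auxiliary information can be turned into inclusion probabilities: making them proportional to the auxiliary variable and scaling them so that they sum to the sample size. It is a base R illustration under the assumption that no resulting probability reaches 1, and it does not reproduce any special handling the inclusionprobabilities function may apply. The variable `income` and the sample size `n` are hypothetical.

```r
set.seed(7)
income <- 1000 * runif(30)  # hypothetical auxiliary variable for 30 units
n <- 5                      # desired sample size

# Inclusion probabilities proportional to the auxiliary variable
pik <- n * income / sum(income)

# The simple formula is only valid while every probability stays below 1
stopifnot(all(pik < 1))

sum(pik)  # sums to the sample size n (up to floating-point error)
```

Units with a larger auxiliary value get a proportionally larger chance of selection.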
 
Value
 
The function produces an object, which contains the following information,
 
ID_unit
 
The identifier of the selected units.
 
Stratum
 
The unit stratum.
 
Prob
 
The final unit inclusion probability.
 
Some documentation also describes a data frame containing information on the strata in the frame (this layout appears to belong to the SamplingStrata package rather than to the object returned by the strata function of the sampling package). That strata data frame contains a row per stratum with the following variables,
 
  • Stratum Identifier of the stratum (numeric)
  • N Number of population units in the stratum (numeric)
  • X1 Value of first auxiliary variable X1 in the stratum (factor)
  • Xi Value of i-th auxiliary variable Xi in the stratum (factor)
  • Xk Value of last auxiliary variable Xk in the stratum (factor)
  • M1 Mean in the stratum of the first variable Y1 (numeric)
  • Mj Mean in the stratum of the j-th variable Yj (numeric)
  • Mn Mean in the stratum of the last variable Yn (numeric)
  • S1 Standard deviation in the stratum of the first variable Y1 (numeric)
  • Sj Standard deviation in the stratum of the j-th variable Yj (numeric)
  • Sn Standard deviation in the stratum of the last variable Yn (numeric)
  • Cens Flag (1 indicates a take-all stratum, 0 a sampling stratum) (numeric) Default = 0
  • Cost Cost per interview in each stratum. Default = 1 (numeric)
  • DOM1 Value of domain to which the stratum belongs (factor or numeric)
Examples
 
The following example generates artificial data: a data frame with 235 rows and 3 columns (state, region, income). The variable "state" has 2 categories ('nc' and 'sc'), and the variable "region" has 3 categories (1, 2 and 3). The sampling frame is stratified by region within state, and the income variable is randomly generated. The code first computes the population stratum sizes; there are 5 cells with non-zero values, so 5 samples are drawn, one in each stratum.
 
The sample stratum sizes are 10, 5, 10, 4 and 6, respectively. The first call uses the 'srswor' method (equal probability, without replacement), extracts the observed data and shows the result in a contingency table. The second call uses the 'systematic' method (unequal probability, without replacement), with selection probabilities computed from the variable 'income'; it again extracts the observed data and shows the result in a contingency table.
    library(sampling)

    # Build the artificial population: 165 'nc' rows followed by 70 'sc' rows
    m <- rbind(matrix(rep("nc", 165), 165, 1, byrow = TRUE),
               matrix(rep("sc", 70), 70, 1, byrow = TRUE))
    m <- cbind.data.frame(m, c(rep(1, 100), rep(2, 50), rep(3, 15),
                          rep(1, 30), rep(2, 40)), 1000 * runif(235))
    names(m) <- c("state", "region", "income")

    # Population stratum sizes
    table(m$region, m$state)
    #      nc  sc
    #   1 100  30
    #   2  50  40
    #   3  15   0

    # Equal probability, without replacement
    s <- strata(m, c("region", "state"), size = c(10, 5, 10, 4, 6), method = "srswor")
    data.frame(income = m[s$ID_unit, "income"], s)
    table(s$region, s$state)

    # Unequal probability (systematic); probabilities are derived from income
    s <- strata(m, c("region", "state"), size = c(10, 5, 10, 4, 6),
                method = "systematic", pik = m$income)
    data.frame(income = m[s$ID_unit, "income"], s)
    table(s$region, s$state)
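To give an intuition for what systematic sampling with unequal probabilities does, the following base R sketch walks along the cumulated inclusion probabilities with a random start and a step of one. It illustrates the general idea for a single stratum and is not the implementation used by the sampling package; the function `systematic_pps` and the variables `income`, `n` and `chosen` are hypothetical names.

```r
set.seed(123)
income <- 1000 * runif(40)       # hypothetical auxiliary variable, one stratum
n <- 6                           # sample size for this stratum
pik <- n * income / sum(income)  # inclusion probabilities, assumed to be < 1

systematic_pps <- function(pik) {
  n <- round(sum(pik))           # sample size implied by the probabilities
  start <- runif(1)              # random start in (0, 1)
  cum <- cumsum(pik)             # cumulated inclusion probabilities
  points <- start + 0:(n - 1)    # selection points, one unit apart
  # A unit is selected when a point falls inside its cumulated interval
  selected <- findInterval(points, cum) + 1
  pmin(selected, length(pik))    # guard against floating-point overshoot
}

chosen <- systematic_pps(pik)
length(chosen)                   # one selected unit per selection point
```

Because every probability is below 1 and the points are exactly one unit apart, no unit can be selected twice.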
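Now we will use the mtcars dataset to demonstrate stratified sampling, with the transmission type (am) as the stratification variable.
    install.packages("sampling")  # run once to install the package
    library(sampling)
    data = mtcars
    data
    names(data)
    # am = 1 appears first in mtcars, so it gets size 11; am = 0 gets size 10
    stratas = strata(data, c("am"), size = c(11, 10), method = "srswor")
    stratified_data = getdata(data, stratas)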
Below is the code for taking a look at the structure of the stratified_data variable, followed by its output.
    str(stratified_data)
    'data.frame':   21 obs. of  14 variables:
     $ mpg    : num  21 21 22.8 32.4 30.4 33.9 27.3 30.4 15.8 19.7 ...
     $ cyl    : num  6 6 4 4 4 4 4 4 8 6 ...
     $ disp   : num  160 160 108 78.7 75.7 71.1 79 95.1 351 145 ...
     $ hp     : num  110 110 93 66 52 65 66 113 264 175 ...
     $ drat   : num  3.9 3.9 3.85 4.08 4.93 4.22 4.08 3.77 4.22 3.62 ...
     $ wt     : num  2.62 2.88 2.32 2.2 1.61 ...
     $ qsec   : num  16.5 17 18.6 19.5 18.5 ...
     $ vs     : num  0 0 1 1 1 1 1 1 0 0 ...
     $ gear   : num  4 4 4 4 4 4 4 5 5 5 ...
     $ carb   : num  4 4 1 1 2 1 1 2 4 6 ...
     $ am     : num  1 1 1 1 1 1 1 1 1 1 ...
     $ ID_unit: int  1 2 3 18 19 20 26 28 29 30 ...
     $ Prob   : num  0.846 0.846 0.846 0.846 0.846 ...
     $ Stratum: int  1 1 1 1 1 1 1 1 1 1 ...
Below is the code to print the stratified_data variable, followed by its output.
    print(stratified_data)
                         mpg cyl  disp  hp drat    wt  qsec vs gear carb am ID_unit      Prob Stratum
    Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0    4    4  1       1 0.8461538       1
    Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0    4    4  1       2 0.8461538       1
    Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1    4    1  1       3 0.8461538       1
    Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1    4    1  1      18 0.8461538       1
    Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1    4    2  1      19 0.8461538       1
    Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1    4    1  1      20 0.8461538       1
    Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1    4    1  1      26 0.8461538       1
    Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1    5    2  1      28 0.8461538       1
    Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0    5    4  1      29 0.8461538       1
    Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0    5    6  1      30 0.8461538       1
    Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0    5    8  1      31 0.8461538       1
    Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1    3    1  0       4 0.5263158       2
    Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0    3    2  0       5 0.5263158       2
    Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0    3    4  0       7 0.5263158       2
    Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1    4    2  0       8 0.5263158       2
    Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1    4    2  0       9 0.5263158       2
    Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1    4    4  0      10 0.5263158       2
    Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0    3    3  0      14 0.5263158       2
    Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0    3    4  0      16 0.5263158       2
    Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0    3    4  0      17 0.5263158       2
    AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0    3    2  0      23 0.5263158       2
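The Prob values in this output can be checked by hand: under simple random sampling without replacement, the inclusion probability of a unit is its stratum's sample size divided by the stratum's population size. A small base R check, using the sizes 11 and 10 from the call above (the names `N_h`, `n_h` and `prob` are introduced only for this illustration):

```r
# Population size of each stratum, in the order the strata appear in mtcars
N_h <- table(mtcars$am)[c("1", "0")]  # 13 cars with am = 1, 19 with am = 0
n_h <- c(11, 10)                      # stratum sample sizes from the call

# Inclusion probability under srswor: n_h / N_h
prob <- n_h / as.vector(N_h)
round(prob, 7)  # 0.8461538 0.5263158
```

These are the two values that appear in the Prob column for stratum 1 and stratum 2.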
Now we will stratify the mtcars dataset by the number of gears (gear) instead. The first row of mtcars has gear = 4, so the sizes 10, 7 and 3 apply to gear = 4, gear = 3 and gear = 5, respectively.
    stratas = strata(data, c("gear"), size = c(10, 7, 3), method = "srswor")
    stratified_data = getdata(data, stratas)
Below is the code for taking a look at the structure of the stratified_data variable, followed by its output.
    str(stratified_data)
    'data.frame':   20 obs. of  14 variables:
     $ mpg    : num  21 22.8 24.4 22.8 19.2 32.4 30.4 33.9 27.3 21.4 ...
     $ cyl    : num  6 4 4 4 6 4 4 4 4 4 ...
     $ disp   : num  160 108 147 141 168 ...
     $ hp     : num  110 93 62 95 123 66 52 65 66 109 ...
     $ drat   : num  3.9 3.85 3.69 3.92 3.92 4.08 4.93 4.22 4.08 4.11 ...
     $ wt     : num  2.88 2.32 3.19 3.15 3.44 ...
     $ qsec   : num  17 18.6 20 22.9 18.3 ...
     $ vs     : num  0 1 1 1 1 1 1 1 1 1 ...
     $ am     : num  1 1 0 0 0 1 1 1 1 1 ...
     $ carb   : num  4 1 2 2 4 1 2 1 1 2 ...
     $ gear   : num  4 4 4 4 4 4 4 4 4 4 ...
     $ ID_unit: int  2 3 8 9 10 18 19 20 26 32 ...
     $ Prob   : num  0.833 0.833 0.833 0.833 0.833 ...
     $ Stratum: int  1 1 1 1 1 1 1 1 1 1 ...
Below is the code to print the stratified_data variable, followed by its output.
    print(stratified_data)
                         mpg cyl  disp  hp drat    wt  qsec vs am carb gear ID_unit      Prob Stratum
    Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4       2 0.8333333       1
    Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    1    4       3 0.8333333       1
    Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    2    4       8 0.8333333       1
    Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    2    4       9 0.8333333       1
    Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4      10 0.8333333       1
    Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    1    4      18 0.8333333       1
    Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    2    4      19 0.8333333       1
    Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    1    4      20 0.8333333       1
    Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    1    4      26 0.8333333       1
    Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    2    4      32 0.8333333       1
    Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    2    3       5 0.4666667       2
    Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3      12 0.4666667       2
    Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3      14 0.4666667       2
    Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    4    3      15 0.4666667       2
    Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    4    3      16 0.4666667       2
    Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    1    3      21 0.4666667       2
    Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    4    3      24 0.4666667       2
    Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    2    5      27 0.6000000       3
    Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    2    5      28 0.6000000       3
    Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    8    5      31 0.6000000       3
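One way the Prob column is commonly used is to build sampling weights: each sampled unit can be given the weight 1 / Prob, so that the weights of the sample add up to the population size. The sketch below reproduces a gear-stratified sample in base R only, so it does not depend on the sampling package; the names `sizes`, `pieces`, `sample_df` and `weight` are hypothetical, and the rows drawn will differ from the output above.

```r
set.seed(2024)

# Stratum sample sizes keyed by gear value, matching the sizes used above
sizes <- c("4" = 10, "3" = 7, "5" = 3)

pieces <- lapply(names(sizes), function(g) {
  stratum <- mtcars[mtcars$gear == as.numeric(g), ]
  picked <- stratum[sample(nrow(stratum), sizes[[g]]), ]
  picked$Prob <- sizes[[g]] / nrow(stratum)  # n_h / N_h under srswor
  picked
})
sample_df <- do.call(rbind, pieces)

# Each sampled car represents 1 / Prob cars of the population
sample_df$weight <- 1 / sample_df$Prob
sum(sample_df$weight)  # equals nrow(mtcars), i.e. 32, up to floating-point error
```

A sum of weights that matches the population size is a quick sanity check that the stratum sizes and probabilities are consistent.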
 

Conclusion

 
In this article, I demonstrated how to create samples using the stratified sampling technique with the strata function in R. I also explained the arguments of the strata function and provided code snippets for each step.