How To Perform Stratified Sampling On Dataset In R

Abhishek Yadav

Introduction

In this article, I demonstrate how to draw samples from a dataset using the stratified sampling method and the strata function in R. Sampling is the process of selecting a subset from a whole population. The subset, or sample, is a small portion of the data that aims to represent the characteristics of the whole population.

Sampling is a shortcut for investigating the whole population. Suppose there is a dataset of 1000 observations and you want to extract a subset of them. There are different methods for doing this, and which one is suitable depends on the data and the business requirement.

We use sampling in our day-to-day life. For example, when you visit a doctor, he/she takes a small sample of blood for a check-up of your whole body.

Need for stratified sampling method

Suppose you want to run a survey for a product. Ideally you would ask the whole population, but that can be expensive because it takes too much time and too many resources. In such cases we use sampling techniques, because sampling helps identify a segment of people who can represent the characteristics of the whole population.

Stratified Sampling

A stratum is a subset of the population that shares at least one common characteristic. Stratified sampling is performed by:

Identifying the relevant strata and their actual representation in the population.
Using random sampling to select a sufficient number of subjects from each stratum.

Stratified sampling is often used when one or more of the strata in the population have a low incidence relative to the other strata. Because every stratum is guaranteed to be represented, stratified sampling can reduce sampling error compared with a simple random sample of the same size.
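The two steps above can be sketched in base R without any package. This is only a minimal illustration, not how the strata function is implemented; the helper name stratified_sample, the seed and the choice of cyl as the stratification variable are my own assumptions for the example.

```r
# Base-R sketch of stratified sampling (illustrative, not the sampling package)
set.seed(42)  # arbitrary seed, used only to make the draw repeatable

# Hypothetical helper: draws n_per_stratum[[name]] rows from each stratum
stratified_sample <- function(df, stratum_col, n_per_stratum) {
  # Step 1: identify the strata by splitting the row indices on the column
  idx_by_stratum <- split(seq_len(nrow(df)), df[[stratum_col]])
  # Step 2: simple random sampling without replacement inside each stratum
  picked <- lapply(names(idx_by_stratum), function(s) {
    idx <- idx_by_stratum[[s]]
    idx[sample.int(length(idx), n_per_stratum[[s]])]
  })
  df[sort(unlist(picked)), , drop = FALSE]
}

# Example: 3 cars from each cylinder class of the built-in mtcars data
smp <- stratified_sample(mtcars, "cyl", c("4" = 3, "6" = 3, "8" = 3))
table(smp$cyl)
```

Each cylinder class contributes exactly 3 rows, however many cars it has in the full data.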

Syntax for stratified sampling with equal/unequal probabilities

strata(data, stratanames = NULL, size,
       method = c("srswor", "srswr", "poisson", "systematic"),
       pik, description = FALSE)

Arguments for strata function

data

A data frame or a matrix; its number of rows is n, the population size.

stratanames

Vector of stratification variables.

size

Vector of stratum sample sizes (in the order in which the strata are given in the input data set).

method

The method used to select units. The following methods are implemented,
Simple random sampling without replacement ("srswor"), which is the default
Simple random sampling with replacement ("srswr")
Poisson sampling ("poisson")
Systematic sampling ("systematic")

pik

Vector of inclusion probabilities or auxiliary information used to compute them; this argument is only used for unequal probability sampling (Poisson and systematic). If auxiliary information is provided, the function uses the inclusionprobabilities function to compute these probabilities. If the method is "srswr" and the sample size is larger than the population size, this vector is normalized to one.

description

A message is printed if its value is TRUE; the message gives the number of selected units and the number of units in the population. By default, the value is FALSE.
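To make the pik argument more concrete, here is a minimal sketch of how inclusion probabilities proportional to an auxiliary variable can be computed by hand for a single stratum. The income values and the sample size are made up for illustration, and the sketch ignores the case where a computed probability would exceed 1, which needs extra handling.

```r
# Sketch: inclusion probabilities proportional to an auxiliary variable
income <- c(200, 400, 600, 800)  # made-up auxiliary values for 4 units
n <- 2                           # desired sample size in this stratum

# Each unit's probability is proportional to its share of the total,
# scaled so that the probabilities add up to the sample size
pik <- n * income / sum(income)
pik       # 0.2 0.4 0.6 0.8
sum(pik)  # 2
```

Units with a larger income are more likely to be selected, and the probabilities add up to the sample size.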

Value

The function produces an object, which contains the following information,

ID_unit

The identifier of the selected units.

Stratum

The unit stratum.

Prob

The final unit inclusion probability.

Note that the variables below are not columns of the object returned by strata in the sampling package. They appear to describe a separate data frame of per-stratum information (the layout matches the strata data frame used by the SamplingStrata package), which contains a row per stratum with the following variables,

Stratum Identifier of the stratum (numeric)
N Number of population units in the stratum (numeric)
X1 Value of first auxiliary variable X1 in the stratum (factor)
Xi Value of i-th auxiliary variable Xi in the stratum (factor)
Xk Value of last auxiliary variable Xk in the stratum (factor)
M1 Mean in the stratum of the first variable Y1 (numeric)
Mj Mean in the stratum of the j-th variable Yj (numeric)
Mn Mean in the stratum of the last variable Yn (numeric)
S1 Standard deviation in the stratum of the first variable Y1 (numeric)
Sj Standard deviation in the stratum of the j-th variable Yj (numeric)
Sn Standard deviation in the stratum of the last variable Yn (numeric)
Cens Flag (1 indicates a take-all stratum, 0 a sampling stratum) (numeric) Default = 0
Cost Cost per interview in each stratum. Default = 1 (numeric)
DOM1 Value of domain to which the stratum belongs (factor or numeric)

Examples

This example generates artificial data (a 235 x 3 data frame with the columns state, region and income). The variable "state" has 2 categories ('nc' and 'sc'). The variable "region" has 3 categories (1, 2 and 3). The sampling frame is stratified by region within state. The income variable is randomly generated. The code first computes the population stratum sizes; there are 5 cells with non-zero values. It then draws 5 samples (1 sample in each stratum).

The sample stratum sizes are 10, 5, 10, 4 and 6 respectively. In the first call the method is 'srswor' (equal probability, without replacement); the observed data are extracted and the result is shown using a contingency table. In the second call the method is 'systematic' (unequal probability, without replacement), and the selection probabilities are computed using the variable 'income'; again the observed data are extracted and the result is shown using a contingency table.

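library(sampling)  # install once with install.packages("sampling")

# Build the sampling frame: 165 'nc' rows followed by 70 'sc' rows
m <- rbind(matrix(rep("nc", 165), 165, 1, byrow = TRUE),
           matrix(rep("sc", 70), 70, 1, byrow = TRUE))
m <- cbind.data.frame(m, c(rep(1, 100), rep(2, 50), rep(3, 15),
                           rep(1, 30), rep(2, 40)), 1000 * runif(235))
names(m) <- c("state", "region", "income")

# Population stratum sizes
table(m$region, m$state)
#    nc  sc
# 1 100  30
# 2  50  40
# 3  15   0

# Equal probability, without replacement
s <- strata(m, c("region", "state"), size = c(10, 5, 10, 4, 6), method = "srswor")
# The selected row numbers are stored in the ID_unit column
data.frame(income = m[s$ID_unit, "income"], s)
table(s$region, s$state)

# Unequal probability (systematic), probabilities computed from income
s <- strata(m, c("region", "state"), size = c(10, 5, 10, 4, 6),
            method = "systematic", pik = m$income)
data.frame(income = m[s$ID_unit, "income"], s)
table(s$region, s$state)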

Now we will use the mtcars dataset to demonstrate stratified sampling, with the transmission type (am) as the stratification variable.

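install.packages("sampling")  # only needed once
library(sampling)
data = mtcars
data
names(data)
# Stratify by am: 11 units from the first stratum, 10 from the second
stratas = strata(data, c("am"), size = c(11, 10), method = "srswor")
# Attach the sampled rows of data to the sampling information
stratified_data = getdata(data, stratas)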
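Since the sizes must follow the order in which the strata appear in the data, it helps to check that order before calling strata. The following base R sketch lists the values of am in order of first appearance, together with the number of cars in each; the variable names am_order and pop_sizes are only illustrative.

```r
# Values of am in the order in which they first appear in mtcars
am_order <- unique(mtcars$am)
am_order  # 1 0

# Population size of each stratum, kept in that same order
pop_sizes <- table(factor(mtcars$am, levels = am_order))
pop_sizes  # 13 cars with am = 1, 19 cars with am = 0
```

In mtcars the first row has am = 1, so size = c(11, 10) asks for 11 of the 13 cars with am = 1 and 10 of the 19 cars with am = 0.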

Below is the code for taking a look at the structure of the stratified_data variable.

> str(stratified_data)
'data.frame': 21 obs. of 14 variables:
$ mpg : num 21 21 22.8 32.4 30.4 33.9 27.3 30.4 15.8 19.7 ...
$ cyl : num 6 6 4 4 4 4 4 4 8 6 ...
$ disp : num 160 160 108 78.7 75.7 71.1 79 95.1 351 145 ...
$ hp : num 110 110 93 66 52 65 66 113 264 175 ...
$ drat : num 3.9 3.9 3.85 4.08 4.93 4.22 4.08 3.77 4.22 3.62 ...
$ wt : num 2.62 2.88 2.32 2.2 1.61 ...
$ qsec : num 16.5 17 18.6 19.5 18.5 ...
$ vs : num 0 0 1 1 1 1 1 1 0 0 ...
$ gear : num 4 4 4 4 4 4 4 5 5 5 ...
$ carb : num 4 4 1 1 2 1 1 2 4 6 ...
$ am : num 1 1 1 1 1 1 1 1 1 1 ...
$ ID_unit: int 1 2 3 18 19 20 26 28 29 30 ...
$ Prob : num 0.846 0.846 0.846 0.846 0.846 ...
$ Stratum: int 1 1 1 1 1 1 1 1 1 1 ...

Below is the code to print the stratified_data variable.

> print(stratified_data)
mpg cyl disp hp drat wt qsec vs gear carb am ID_unit Prob Stratum
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 4 4 1 1 0.8461538 1
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 4 4 1 2 0.8461538 1
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 4 1 1 3 0.8461538 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 4 1 1 18 0.8461538 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 4 2 1 19 0.8461538 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 4 1 1 20 0.8461538 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 4 1 1 26 0.8461538 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 5 2 1 28 0.8461538 1
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 5 4 1 29 0.8461538 1
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 5 6 1 30 0.8461538 1
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 5 8 1 31 0.8461538 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 3 1 0 4 0.5263158 2
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 3 2 0 5 0.5263158 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 3 4 0 7 0.5263158 2
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 4 2 0 8 0.5263158 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 4 2 0 9 0.5263158 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 4 4 0 10 0.5263158 2
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 3 3 0 14 0.5263158 2
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 3 4 0 16 0.5263158 2
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 3 4 0 17 0.5263158 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 3 2 0 23 0.5263158 2
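The Prob column can be checked by hand. Assuming simple random sampling without replacement, the inclusion probability of a unit is the stratum sample size divided by the stratum population size. A small sketch in base R, using the same sample sizes as above (the names N_h and n_h are my own):

```r
# Population sizes of the two am strata in mtcars
N_h <- c(am1 = sum(mtcars$am == 1), am0 = sum(mtcars$am == 0))  # 13 and 19

# Sample sizes requested in the call to strata
n_h <- c(am1 = 11, am0 = 10)

# Inclusion probability per stratum under "srswor"
prob <- round(n_h / N_h, 7)
prob  # am1 = 0.8461538, am0 = 0.5263158
```

These are the same two values that appear in the Prob column of the output.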

Next, we stratify the same mtcars dataset by the number of gears (gear), which gives three strata.

stratas = strata(data, c("gear"), size = c(10, 7, 3), method = "srswor")
stratified_data = getdata(data, stratas)

Below is the code for taking a look at the structure of the stratified_data variable.

> str(stratified_data)
'data.frame': 20 obs. of 14 variables:
$ mpg : num 21 22.8 24.4 22.8 19.2 32.4 30.4 33.9 27.3 21.4 ...
$ cyl : num 6 4 4 4 6 4 4 4 4 4 ...
$ disp : num 160 108 147 141 168 ...
$ hp : num 110 93 62 95 123 66 52 65 66 109 ...
$ drat : num 3.9 3.85 3.69 3.92 3.92 4.08 4.93 4.22 4.08 4.11 ...
$ wt : num 2.88 2.32 3.19 3.15 3.44 ...
$ qsec : num 17 18.6 20 22.9 18.3 ...
$ vs : num 0 1 1 1 1 1 1 1 1 1 ...
$ am : num 1 1 0 0 0 1 1 1 1 1 ...
$ carb : num 4 1 2 2 4 1 2 1 1 2 ...
$ gear : num 4 4 4 4 4 4 4 4 4 4 ...
$ ID_unit: int 2 3 8 9 10 18 19 20 26 32 ...
$ Prob : num 0.833 0.833 0.833 0.833 0.833 ...
$ Stratum: int 1 1 1 1 1 1 1 1 1 1 ...

Below is the code to print the stratified_data variable.

> print(stratified_data)
mpg cyl disp hp drat wt qsec vs am carb gear ID_unit Prob Stratum
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2 0.8333333 1
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 1 4 3 0.8333333 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 2 4 8 0.8333333 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 2 4 9 0.8333333 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10 0.8333333 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 1 4 18 0.8333333 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 2 4 19 0.8333333 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 1 4 20 0.8333333 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 1 4 26 0.8333333 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 2 4 32 0.8333333 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 2 3 5 0.4666667 2
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12 0.4666667 2
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14 0.4666667 2
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 4 3 15 0.4666667 2
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 4 3 16 0.4666667 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 1 3 21 0.4666667 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 4 3 24 0.4666667 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 2 5 27 0.6000000 3
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 2 5 28 0.6000000 3
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 8 5 31 0.6000000 3

The output above shows the 20 sampled rows: 10 from the first stratum (gear = 4), 7 from the second (gear = 3) and 3 from the third (gear = 5).

Conclusion

In this article, I demonstrated how to create samples using the stratified sampling technique and the strata function in R. The arguments of the strata function were explained, and code snippets were provided for each example.