How To Create Samples Of Dataset In R

Abhishek Yadav

Introduction

In this article, I demonstrate how to create samples, that is, subsets of a dataset, using the sample() function in R. Sampling is the process of selecting a subset from a whole population. The sample is a small portion of the data that is meant to represent the characteristics of the whole population.

It is a shortcut for investigating the whole population. Suppose a dataset has 1000 observations and you want to extract a subset of them. There are different methods for doing so, and which one is suitable depends on the data and the business requirement.

We use sampling in day-to-day life as well. For example, when you visit a doctor, he or she takes a small sample of blood to check the condition of your whole body.

Need for sampling technique

Suppose you want to run a survey for a product. Ideally you would ask the whole population, but that can be expensive because it takes too much time and too many resources.

In such cases we use sampling techniques, because a well-chosen sample identifies a segment of people who can represent the characteristics of the whole population.

Random Samples

The sample() function takes a sample of the specified size from the elements of x, either with or without replacement. The syntax is as follows,

sample(x, size, replace = FALSE, prob = NULL)
sample.int(n, size = n, replace = FALSE, prob = NULL,
           useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e7))

Arguments of the sample function

x

Either a vector of one or more elements from which to choose, or a positive integer.

n

A positive number, the number of items to choose from.

size

A non-negative integer giving the number of items to choose.

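replace

A logical value indicating whether sampling is done with replacement. With the default replace = FALSE, each element can be drawn at most once; with replace = TRUE, the same element can be drawn several times.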

prob

A vector of probability weights for obtaining the elements of the vector being sampled.

useHash

A logical value indicating whether the hash version of the algorithm should be used. Can only be used for replace = FALSE, prob = NULL, and size <= n/2, and really should be used for large n, as useHash=FALSE will use memory proportional to n.

If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls such as sample(x).

Otherwise x can be any R object for which length and subsetting by integers make sense: S3 or S4 methods for these operations will be dispatched as appropriate. For sample the default for size is the number of items inferred from the first argument, so that sample(x) generates a random permutation of the elements of x (or 1:x).
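The length-one convenience described above is easy to trip over when the length of x varies. The sketch below shows the behaviour and one possible way around it. The helper name resample is an illustrative choice and not part of base R; the seed value is arbitrary.

```r
# sample(x) on a length-one numeric x >= 1 samples from 1:x, not from x itself.
set.seed(42)                      # arbitrary seed, only for repeatability
perm   <- sample(10)              # a random permutation of 1:10
single <- sample(c(10))           # also a permutation of 1:10, NOT just 10

# Hypothetical helper: always sample from the elements of x,
# whatever its length, by sampling positions instead of values.
resample <- function(x, ...) x[sample.int(length(x), ...)]

safe <- resample(c(10))           # the only possible result is 10

print(length(single))             # 10
print(safe)                       # 10
```

Sampling positions with sample.int(length(x)) avoids the special case because the first argument is then always a count, never the data itself.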

It is allowed to ask for size = 0 samples with n = 0 or a length-zero x, but otherwise n > 0 or positive length(x) is required. Non-integer positive numerical values of n or x will be truncated to the next smallest integer, which has to be no larger than .Machine$integer.max.

The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero. If replace is true, Walker's alias method (Ripley, 1987) is used when there are more than 200 reasonably probable values: this gives results incompatible with those from R < 2.2.0.

If replace is false, these probabilities are applied sequentially, that is the probability of choosing the next item is proportional to the weights amongst the remaining items. The number of nonzero weights must be at least size in this case.

sample.int is a bare interface in which both n and size must be supplied as integers. Argument n can be larger than the largest integer of type integer, up to the largest representable integer in type double. Only uniform sampling is supported. Two random numbers are used to ensure uniform sampling of large integers.
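The rules for prob and sample.int can be tried out with a short sketch. The vector fruits and the seed values are illustrative assumptions, not taken from the article.

```r
fruits <- c("apple", "banana", "cherry")   # illustrative data

# Only the relative sizes of the weights matter: c(1, 1, 2) and
# c(0.25, 0.25, 0.5) describe the same distribution.
set.seed(7)                                # arbitrary seed
a <- sample(fruits, size = 20, replace = TRUE, prob = c(1, 1, 2))
set.seed(7)
b <- sample(fruits, size = 20, replace = TRUE, prob = c(0.25, 0.25, 0.5))
print(identical(a, b))

# Without replacement, an element with weight zero is never chosen,
# so at most two items can be drawn from these weights.
two <- sample(fruits, size = 2, replace = FALSE, prob = c(1, 1, 0))
print(two)

# sample.int() draws directly from 1:n.
idx <- sample.int(n = 100, size = 5)
print(idx)
```

Asking for size = 3 with the weights c(1, 1, 0) would fail, because only two weights are nonzero.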

Extracting Data Using Sampling function

print(sample(1:3))
print(sample(1:3, size=3, replace=FALSE)) # same as previous line
print(sample(c(2,5,3), size=4, replace=TRUE))
print(sample(1:2, size=10, prob=c(1,3), replace=TRUE))
[1] 3 1 2
[1] 2 1 3
[1] 2 5 2 2
[1] 2 2 2 1 1 2 2 2 1 2

By default sample() randomly reorders the elements passed as the first argument, which means that the default size is the length of the passed vector. replace=FALSE makes sure that no element is drawn twice. The last line uses a weighted random distribution instead of a uniform one: with prob=c(1,3), on average one out of four numbers is 1 and three out of four are 2.
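A small check of these defaults, written as a sketch: resetting the seed before each call lets the short and the explicit form be compared directly, and a large weighted draw shows the 1-to-3 ratio. The seed value and the sample size of 10000 are arbitrary choices.

```r
x <- 1:3

set.seed(2024)                 # arbitrary seed so both calls start from the same state
a <- sample(x)

set.seed(2024)
b <- sample(x, size = length(x), replace = FALSE)

print(identical(a, b))         # TRUE: the two calls are the same request
print(sort(a))                 # 1 2 3: a permutation keeps every element exactly once

# Weighted draw: 2 is three times as likely as 1.
set.seed(2024)
w <- sample(1:2, size = 10000, prob = c(1, 3), replace = TRUE)
print(mean(w == 2))            # close to 0.75
```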

Arguments of sample function

size

This is the length of the returned vector. If replace is FALSE, size must be no bigger than the length of the first argument.

replace

If this is TRUE, a sample may contain an element several times while another element might not occur at all.

print(sample(c(2,5,3), size=3, replace=FALSE))
print(sample(c(2,5,3), size=3, replace=TRUE))
[1] 2 3 5
[1] 2 3 3

Allowing some elements to occur more than once lets you get a sample longer than the first argument.
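The interaction between size and replace can be seen in a minimal sketch. The seed is arbitrary, and tryCatch is used only so that the failing call does not stop the script.

```r
x <- c(2, 5, 3)

# With replacement, the sample may be longer than x.
set.seed(10)                                   # arbitrary seed
long <- sample(x, size = 8, replace = TRUE)
print(length(long))                            # 8

# Without replacement, asking for more items than x holds is an error.
result <- tryCatch(
  sample(x, size = 8, replace = FALSE),
  error = function(e) "error"
)
print(result)                                  # "error"
```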

Creating samples of dataset

To perform statistical analysis, we often need samples of a dataset. A sample is simply a subset of the dataset, and it can be created with the predefined sample() function in R. To create a sample, pass a vector to sample(); its arguments specify how many items to draw and how to draw them.

For example, to draw 8 random numbers from the range 3 to 10, we can use sample() as follows,

> sample(3:10, 8, replace = TRUE)
[1] 3 9 4 5 7 4 6 8

As we can see from the code above, sample() returns 8 numbers that fall in the range of 3 to 10. Because replace is TRUE, the same number can be returned several times.

The sample() function returns randomly generated numbers, so each time the same call is executed it produces different output. In most cases this is exactly what we want. When code is being developed and tested, however, it is often useful to get the same values on every run. To generate the same values every time, we can set a seed value with the set.seed() function.

The random number generator can be put back into a known state by passing a seed value to set.seed(). R generates pseudo-random numbers rather than truly random numbers: an algorithm produces a sequence of numbers that looks random, called a pseudo-random sequence. The same pseudo-random sequence is produced whenever the generator starts from the same seed value. If set.seed() is not called, R continues from the current state of the random number generator.

If no seed exists at the beginning of a session, R creates one to initialize the random number generator. After that, every call to a random function continues from the next value in the random number generator stream.

The starting value of the seed is set with set.seed(), which takes an integer value as its argument, as follows,

> set.seed(1)
> sample(1:6, 10, replace = TRUE)
[1] 2 3 4 6 2 6 6 4 4 1

If we draw another sample without setting the seed again, different values are generated, as follows,

> sample(1:6, 10, replace = TRUE)
[1] 2 2 5 3 5 3 5 6 3 5

To restore the random number generator to its earlier state, we can call set.seed() with the value 1 again, as follows,

> set.seed(1)
> sample(1:6, 10, replace = TRUE)
[1] 2 3 4 6 2 6 6 4 4 1

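As we can see from the above output, calling set.seed(1) again produces results identical to the earlier output generated after set.seed(1). The exact numbers depend on the R version: R 3.6.0 changed the default sampling algorithm, so newer versions print different values for the same seed.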
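The same three steps can be checked programmatically instead of comparing printed output by eye. This is a sketch; the variable names first, second and third are illustrative.

```r
set.seed(1)
first <- sample(1:6, 10, replace = TRUE)

# No reseeding here: the generator continues from where it stopped.
second <- sample(1:6, 10, replace = TRUE)

set.seed(1)
third <- sample(1:6, 10, replace = TRUE)    # stream restarted from the same seed

print(identical(first, third))              # TRUE
print(identical(first, second))
```

Comparing with identical() holds regardless of which concrete numbers a given R version produces for seed 1.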

Now we will use the predefined iris dataset of R to generate a sample of its rows. Here the replace argument can be left out, because FALSE is its default value, as follows,

> set.seed(123)
> index <- sample(nrow(iris), 5)
> index
[1] 44 118 61 130 138
> iris[index, ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
44           5.0         3.5          1.6         0.6     setosa
118          7.7         3.8          6.7         2.2  virginica
61           5.0         2.0          3.5         1.0 versicolor
130          7.2         3.0          5.8         1.6  virginica
138          6.4         3.1          5.5         1.8  virginica
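The row-sampling pattern above generalizes to any data frame. The sketch below repeats it and adds a small helper that draws a given fraction of the rows; sample_fraction is a hypothetical name introduced here for illustration, not a base R function.

```r
data(iris)                                   # built-in dataset with 150 rows

set.seed(123)                                # same seed as in the article
index <- sample(nrow(iris), 5)               # 5 distinct row numbers from 1:150
subset_rows <- iris[index, ]

print(nrow(subset_rows))                     # 5
print(anyDuplicated(index))                  # 0: no row is picked twice

# Hypothetical helper: draw a given fraction of the rows of a data frame
# without replacement.
sample_fraction <- function(df, fraction) {
  n <- floor(nrow(df) * fraction)
  df[sample(nrow(df), n), , drop = FALSE]
}

tenth <- sample_fraction(iris, 0.10)
print(nrow(tenth))                           # 15
```

Sampling row numbers and then indexing with them keeps all columns of each selected observation together.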

Conclusion

In this article, I demonstrated how to create samples using the sample() function in R, explained its arguments, and provided code snippets for each step.