Introduction
R ships with many predefined functions for the statistical analysis of data, such as mean() and median(). (R's built-in mode() function is not a statistical function: it reports the storage type of an object, not the most frequent value.) These functions take a vector as input and return the result. In this article, I will demonstrate how to calculate the mean of the variables of a dataset.
Calculating mean
The mean of a variable in a dataset is obtained by summing all the observations of that variable and dividing by the total number of observations. R provides a predefined function called mean(), which calculates the mean of a numeric vector such as a single variable (column) of a dataset.
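The definition above can be checked by hand. The following is a minimal sketch, assuming the built-in mtcars dataset; the names manual_mean and builtin_mean are illustrative only.

```r
# Mean computed from its definition: sum of observations / number of observations.
x <- mtcars$mpg                      # one numeric column, 32 observations

manual_mean  <- sum(x) / length(x)   # definition applied by hand
builtin_mean <- mean(x)              # predefined function

print(manual_mean)
print(all.equal(manual_mean, builtin_mean))  # TRUE: both approaches agree
```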
The mean() function can be called in several ways, where x is a numeric vector such as one variable of a dataset,
- mean(x)
- mean(x, trim = 0.1)
- mean(x, na.rm = TRUE)
To calculate the mean, I will use mtcars, one of the predefined datasets that ships with R, and calculate the mean of several of its variables.
- > data = mtcars
- > data
- mpg cyl disp hp drat wt qsec vs am gear carb
- Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
- Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
- Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
- Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
- Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
- Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
- Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
- Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
- Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
- Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
- Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
- Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
- Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
- Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
- Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
- Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
- Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
- Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
- Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
- Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
- Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
- Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
- AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
- Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
- Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
- Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
- Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
- Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
- Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
- Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
- Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
- Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Now we will calculate the mean of individual variables of the mtcars dataset.
- df = mtcars
- mean(df$mpg)
- > mean(df$mpg)
- [1] 20.09062
In the above code, the mean of the mpg variable of the mtcars dataset is calculated. The dataset is assigned to the variable df, and the mpg variable is passed as the argument to the predefined mean() function.
- mean(df$cyl)
- > mean(df$cyl)
- [1] 6.1875
In the above code, the mean of the cyl variable of the mtcars dataset is calculated by passing the cyl variable as the argument to mean().
- mean(df$disp)
- > mean(df$disp)
- [1] 230.7219
In the above code, the mean of the disp variable of the mtcars dataset is calculated by passing the disp variable as the argument to mean().
- mean(df$hp)
- > mean(df$hp)
- [1] 146.6875
In the above code, the mean of the hp variable of the mtcars dataset is calculated by passing the hp variable as the argument to mean().
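The column-by-column calls can also be condensed into one step. The following is a minimal sketch, assuming every column of the data frame is numeric (which holds for mtcars); sapply() and colMeans() are base R functions that the walkthrough itself does not use, and col_means and col_means2 are illustrative names. In current R versions, passing the whole data frame to mean() does not work: it returns NA with a warning.

```r
df <- mtcars

# mean() expects a vector, not a data frame; this returns NA with a warning.
whole_df <- suppressWarnings(mean(df))
print(whole_df)

# Apply mean() to each column in turn; the result is a named numeric vector.
col_means <- sapply(df, mean)

# colMeans() gives the same result for an all-numeric data frame.
col_means2 <- colMeans(df)

print(round(col_means[c("mpg", "cyl", "disp", "hp")], 4))
print(all.equal(col_means, col_means2))
```

sapply() is the more flexible of the two, since extra arguments such as na.rm = TRUE can be passed through to mean().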
We can also calculate the mean of a standalone vector as follows:
- a <- c(9, 6, 2, 43, 21, 3, 55, -31, 9, -5, 15)
- vec <- mean(a)
- print(vec)
It will generate the following output,
- > a <- c(9, 6, 2, 43, 21, 3, 55, -31, 9, -5, 15)
- > vec <- mean(a)
- > print(vec)
- [1] 11.54545
Using the above code, we created a vector named a containing 11 values. The vector is passed as the argument to the mean() function, and the resulting mean is assigned to the variable vec.
Trim argument
The trim argument of the mean() function takes a fraction between 0 and 0.5. The observations are sorted in ascending order, that fraction of them is removed from each end, and the mean of the remaining observations is calculated. This reduces the influence of extreme values on the result.
Let us call the mean() function with the trim argument as follows,
- > df1 = data$mpg
- > df1
- [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
- > calc <- mean(df1,trim=0.3)
- > calc
- [1] 19.17857
As we can see, with trim = 0.3 the 32 observations are sorted and floor(32 × 0.3) = 9 values are removed from each end of the mpg variable, so the mean is calculated from the remaining 14 values.
For comparison, the mean of mpg obtained without the trim argument is 20.09062, as calculated earlier.
We can also calculate the mean of a vector with the trim argument as follows,
- a <- c(9, 6, 2, 43, 21, 3, 55, -31, 9, -5, 15)
- res <- mean(a, trim = 0.2)
- print(res)
It will generate the following output,
- > a <- c(9, 6, 2, 43, 21, 3, 55, -31, 9, -5, 15)
- > res <- mean(a, trim = 0.2)
- > res
- [1] 9.285714
We created a vector named a and calculated its trimmed mean. In the mean() function, the trim argument is set to 0.2, which removes floor(11 × 0.2) = 2 values from each end of the sorted vector (the two smallest and the two largest values) before the mean is calculated.
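The trimming step can be reproduced by hand to see which values are dropped. The following is a minimal sketch; trimmed_mean_manual is a hypothetical helper written for illustration, not part of R, and it assumes a trim value below 0.5 and no missing values.

```r
# Hypothetical helper (illustration only): a hand-rolled trimmed mean.
trimmed_mean_manual <- function(x, trim) {
  stopifnot(trim >= 0, trim < 0.5)
  n <- length(x)
  k <- floor(n * trim)              # number of values dropped from EACH end
  sorted <- sort(x)                 # ascending order
  mean(sorted[(k + 1):(n - k)])     # mean of what remains in the middle
}

a <- c(9, 6, 2, 43, 21, 3, 55, -31, 9, -5, 15)

print(floor(length(a) * 0.2))       # 2 values dropped from each end
print(sort(a))                      # -31 and -5, then 43 and 55 are dropped
print(trimmed_mean_manual(a, 0.2))
print(mean(a, trim = 0.2))          # built-in result for comparison

# The same check for the mpg example: 9 values are dropped from each end.
print(floor(length(mtcars$mpg) * 0.3))
print(all.equal(trimmed_mean_manual(mtcars$mpg, 0.3),
                mean(mtcars$mpg, trim = 0.3)))
```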
Calculating mean by removing missing values
If the observations of a variable contain missing values (NA), calculating the mean returns NA. To introduce a missing value into a variable for demonstration, we can use the syntax below,
- > data[2,4] = NA
- > df2 = data$hp
- > df2
- [1] 110 NA 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
As we can see, the second observation of the hp variable in the dataset named data is now NA, R's marker for a missing value. Calculating the mean of the hp variable will therefore return NA.
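The NA result described above can be confirmed directly. The following is a minimal sketch that repeats the assignment on a fresh copy of mtcars; anyNA() and is.na() are base R functions that are not used elsewhere in this article.

```r
data <- mtcars            # work on a copy so mtcars itself stays intact
data[2, 4] <- NA          # row 2, column 4 (hp) becomes missing
df2 <- data$hp

print(mean(df2))          # NA: one missing value makes the whole mean NA
print(anyNA(df2))         # TRUE: the vector contains at least one NA
print(which(is.na(df2)))  # 2: position of the missing observation
```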
Removal of missing values
We can calculate the mean of a variable that contains missing values by passing the na.rm = TRUE parameter to the mean() function. Setting na.rm to TRUE (written in capitals, since R is case-sensitive) indicates that NA values should be removed before the mean is calculated.
The below code will remove missing values as follows,
- > rs2 = mean(df2,na.rm = TRUE)
- > rs2
- [1] 147.871
The same applies to a vector that contains an NA value,
- a <- c(9, 6, 2, 43, 21, 3, 55, -31, 9, -5, NA)
- avg <- mean(a)
- print(avg)
The above code will return the following output,
- > a <- c(9, 6, 2, 43, 21, 3, 55, -31, 9, -5, NA)
- > avg <- mean(a)
- > print(avg)
- [1] NA
Removing NA values and calculating the mean
- Res1 <- mean(a, na.rm = TRUE)
- print(Res1)
The above code will generate the following output,
- > Res1 <- mean(a,na.rm = TRUE)
- > Res1
- [1] 11.2
As we can see, the vector named a contains an NA value, so calculating its mean returns NA. Passing the parameter na.rm = TRUE removes the NA from the vector before the mean is calculated, which gives 11.2.
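To make the effect of na.rm concrete, the NA can also be filtered out by hand. The following is a minimal sketch; the names with_flag and by_hand are illustrative only.

```r
a <- c(9, 6, 2, 43, 21, 3, 55, -31, 9, -5, NA)

with_flag <- mean(a, na.rm = TRUE)   # let mean() drop the NA
by_hand   <- mean(a[!is.na(a)])      # drop the NA first, then take the mean

print(with_flag)
print(all.equal(with_flag, by_hand)) # TRUE: both give the same result

# The divisor is the number of non-missing values, not the full length.
print(length(a))                     # 11 elements in total
print(sum(!is.na(a)))                # 10 of them are used for the mean
```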
Summary
In this article, I demonstrated how to calculate the mean of the variables of a dataset with the mean() function. I also demonstrated how to trim observations with the trim argument and how to remove missing values with na.rm = TRUE, with code snippets for each case.