A Deep Dive Into R Programming

Introduction

 
R has become a leading choice for data science professionals and statisticians, and its popularity for data analysis has grown substantially over the years. R is a GNU project. It was initially developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and the source code of the R software environment is written primarily in C, Fortran, and R itself. The founders named the language R after the first letter of their first names. R is closely related to the S language developed at Bell Labs and can be considered a different implementation of S. There are important differences between the two, but much of the code written for S runs unaltered in R.
 
 
Some of the features of R are:
  • Like any other programming language, R has well-defined programming constructs, including variables, conditional statements, loops, functions, and data types.
  • R provides data structures such as vectors, matrices, arrays, and data frames, which are used for statistical analysis and for creating graphs.
  • R supports object-oriented programming.
  • R has mature, effective data handling and storage facilities. We can import data from CSV, MS Excel, and other data sources, hold it in memory, and analyze it without requiring an external database.
  • A large number of tools for data analysis are available within the R environment.
  • R can generate statistical graphs, which help in deriving business intelligence. It has advanced graphics and plotting abilities. R Plot is a window available in R Tools for Visual Studio that displays these graphics.
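To make the constructs and data structures listed above concrete, here is a minimal sketch. The names scores, students, and count_above, and the values in them, are invented for illustration and are not part of the article's data set.

```r
# A numeric vector: the basic building block in R
scores <- c(72, 85, 91)

# A 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame: columns of equal length that may have different types
students <- data.frame(
  Name = c("A", "B", "C"),   # hypothetical names
  Java = scores,
  stringsAsFactors = FALSE
)

# A function that uses a loop and a conditional statement
count_above <- function(values, threshold) {
  count <- 0
  for (v in values) {
    if (v > threshold) {
      count <- count + 1
    }
  }
  count
}

print(dim(m))                           # rows and columns of the matrix
print(nrow(students))                   # number of rows in the data frame
print(count_above(students$Java, 80))   # how many scores exceed 80
```

In everyday R code, the loop would usually be replaced by the vectorised sum(values > threshold); it is spelled out here only to show the constructs.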
     
     
Defining R and listing its features can feel abstract, so let’s get our hands dirty.
 
This article is divided into two main sections:
  • Setting up the R environment and R Tools in the Visual Studio IDE
  • Understanding the power of R: analyzing data and deriving conclusions using R
By the end, you will know how to set up the R environment locally and will be ready to get started with R programming. Let’s head to the first section of the article.
 
Setting Up R Environment and R Tools in Visual Studio
 
 
R Tools for Visual Studio (RTVS) was released in March as a public preview. It lets you work with R in Visual Studio. However, there is a prerequisite for setting up R Tools in Visual Studio: an R language engine must be installed on the local machine. Otherwise, we get the error shown below:
 
 
In order to set up the environment, we will:
  • Install Microsoft R Open, which is an R language engine.
  • Install R Tools for Visual Studio, which will help us work with data using R.
Let’s start with Microsoft R Open, which installs the R language engine on the local machine. You can download Microsoft R Open from here.
 
 
Depending on your platform, you can choose the appropriate R Open installer to download. I have downloaded the executable for the Windows 7 platform.
 
 
Double-click the executable and click Next.
 
 
Click Next. This will start installing Microsoft R Open.
 
 
Specify the destination location or leave the default location as it is.
 
 
Select the components required to be installed. Ideally, leave all of them checked.
 
 
Click Next.
 
 
This will start the installation. Wait for the installation to complete.
 
 
Click Finish. This completes the installation of the R language engine on the local machine. Once it is done, we get the R console, where we can write and run R code.
 
 
Double-click the icon to open the R console.
 
 
We can test it with ordinary arithmetic operations. However, the console offers limited interactivity. So, let’s spin up the Visual Studio IDE and install R Tools for Visual Studio (RTVS), which is a flexible and mature environment for R programming. You can get the executable from here.
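For the arithmetic check mentioned above, a few expressions like the following can be typed into the R console. This is only a sketch of what such a smoke test might look like; the values are arbitrary.

```r
2 + 3 * 4        # multiplication binds tighter than addition, so this is 14
7 %/% 2          # integer division gives 3
7 %% 2           # remainder gives 1
sqrt(16)         # built-in math functions work too; this gives 4

x <- c(1, 2, 3)  # arithmetic is vectorised: it applies to every element
doubled <- x * 2
print(doubled)
```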
 
 
Click the executable and run it.
 
 
For a smooth installation, close any active Visual Studio sessions.
 
 
Click Install.
 
 
This will start the installation of R Tools in Visual Studio.
 
 
The setup is completed. Let’s head to Visual Studio to check out the new addition.
 
 
In the menu bar, next to the Test menu, there is a new R Tools menu. Click Data Science Settings so that the session opens in the data scientist profile.
 
 
Click Yes. This resets the Visual Studio layout to the one shown below:
 
 
We have the R Interactive Window on the left, where we will do the programming. The Variable Explorer at the top right is where we can inspect loaded variables and import data from an external data source. R Plot, at the bottom right, displays graphical output. We can switch back to the default Visual Studio layout once we are done with R programming.
 
Understand the Power of R - Analyze and derive a conclusion from data using R
 
Let’s use R to dig into bulk data and answer specific questions. Here, I am using dummy student details in CSV format as the input, and I will try to answer some questions about the data. We will enter the commands in the R Interactive Window on the left side of the window.
 
First, we have to set the working directory, which can be done using the setwd function.
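As a minimal sketch of this step, the snippet below changes the working directory and verifies the change. The folder is an assumption: tempdir() stands in for wherever you saved the CSV, so that the snippet runs on any machine.

```r
# Hypothetical data folder; replace tempdir() with your own path
data_dir <- tempdir()

old_dir <- setwd(data_dir)   # setwd() returns the previous directory invisibly
print(getwd())               # confirm where R will now look for files
print(list.files())          # check that the CSV is actually there

setwd(old_dir)               # restore the original working directory
```

On Windows, write paths with forward slashes or doubled backslashes, because a single backslash starts an escape sequence inside an R string.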
 
 
Now, let’s load the data into the R Tools environment in Visual Studio. We can use the read.csv function to import data from a CSV file. I have placed the file R.csv, which contains the student details, in the working directory. The file looks as shown below:
 
 
Now, let’s load the CSV file into the R session. We can read from other data sources, such as MS Excel, as well.
  # Read R.csv from the working directory into a data frame
  StudentDetails <- read.csv("R.csv")
  print(StudentDetails)
Once the commands are executed, the tabular data is printed, as shown below:
 
 
The first column is the sequential row number that R adds when printing. The rest of the columns are loaded as is from the CSV. The Variable Explorer on the right side is populated once the load is completed, and we can browse and explore the data there.
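If you do not have the CSV at hand, the load step can be tried end to end with a small stand-in file. In the sketch below, the file name, the three rows, and the Name column are invented. Only the column names Sex, City, Java, C, Ruby, and Python come from the queries later in this article.

```r
# Write a tiny stand-in for R.csv into a temporary folder (invented rows)
csv_path <- file.path(tempdir(), "sample_students.csv")
writeLines(c(
  "Name,Sex,City,Java,C,Ruby,Python",
  "Asha,Female,Darlington,88,76,91,67",
  "Ben,Male,Leeds,95,81,70,72",
  "Cara,Female,York,95,64,85,90"
), csv_path)

# Keep text columns as plain character vectors instead of factors
sample_students <- read.csv(csv_path, stringsAsFactors = FALSE)

str(sample_students)       # column names and types
head(sample_students, 2)   # first two rows
dim(sample_students)       # number of rows and columns
```

Calling str and head right after a load is a quick way to confirm that the numeric columns were actually read as numbers.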
 
 
Just below the Variable Explorer is R Plot, which is used to display charts.
 
We have the details of 100 students. Now, let’s quickly analyze the student data and answer some questions using R.
 
What is the maximum mark in Java?
 
Java is one of the subjects. Let’s find the maximum score among the 100 students.
  # Highest value in the Java column
  MaxJava <- max(StudentDetails$Java)
  print(paste("The highest score in Java is", MaxJava, sep = ":"))
max is the function used to get the maximum value from a collection.
 
StudentDetails$Java selects the Java column of the StudentDetails data frame. MaxJava is the variable that holds the result. To concatenate strings, we can use the paste function, which has the syntax shown below:
 
paste("First String", "Second String", sep = "JoiningCharacter")
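Here is a short sketch of how the sep argument changes the result. The value 95 is a made-up stand-in for MaxJava, and the variable names are only for illustration.

```r
max_java <- 95   # hypothetical value standing in for MaxJava

# Explicit separator, as used in this article
with_colon <- paste("The highest score in Java is", max_java, sep = ":")

# Without sep, paste() joins the pieces with a single space
with_space <- paste("The highest score in Java is", max_java)

# paste0() joins the pieces with no separator at all
no_sep <- paste0("Java", "Toppers")

print(with_colon)   # "The highest score in Java is:95"
print(with_space)   # "The highest score in Java is 95"
print(no_sep)       # "JavaToppers"
```

Note that R is case-sensitive, so the function must be written as paste, not Paste.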
 
 
Count of all those who got the max mark in Java
  # Rows whose Java score equals the maximum Java score
  JavaToppers <- subset(StudentDetails, Java == max(StudentDetails$Java))
  # Number of such rows
  ToppersCount <- nrow(JavaToppers)
  print(paste("Total number of top scorers in Java", ToppersCount, sep = ":"))
subset returns the rows of the main data set that match a condition (here, the maximum mark in Java). We then use nrow to get the number of rows in that subset. ToppersCount is the variable that holds the final value.
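The same count can also be obtained without building the subset first, by summing a logical vector. The sketch below uses a small invented data frame named marks in place of StudentDetails and checks both approaches against each other.

```r
# Invented stand-in for StudentDetails
marks <- data.frame(
  Name = c("Asha", "Ben", "Cara"),
  Java = c(88, 95, 95),
  stringsAsFactors = FALSE
)

# TRUE where the student has the top Java score
is_top <- marks$Java == max(marks$Java)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of toppers
toppers_count <- sum(is_top)

# The subset/nrow route used in this article gives the same number
same_count <- nrow(subset(marks, Java == max(Java)))

print(toppers_count)
print(same_count)
```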
 
 
Details of all those who got the max mark in Java
  # Full rows of the students who share the top Java score
  JavaToppers <- subset(StudentDetails, Java == max(StudentDetails$Java))
  print(JavaToppers)
Here, we use the subset function to get the rows of the main data set that match the condition and display them as is. JavaToppers is the variable that holds the result.
 
 
Average score of a subject
  # Arithmetic mean of the Python column
  MeanPython <- mean(StudentDetails$Python)
  print(paste("The Average score in Python is", MeanPython, sep = ":"))
Here, we have used the mean function to calculate the average of StudentDetails$Python (i.e., the Python column of the StudentDetails data set). MeanPython is the variable that holds the final value.
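One thing to watch for with real data is missing values: mean returns NA if any score is missing. The sketch below uses a made-up vector, python_scores, to show the na.rm argument and rounding for display. Whether the demo CSV contains missing values at all is not something this article states.

```r
# Hypothetical scores; NA marks a missing value
python_scores <- c(67, 72, NA, 90)

raw_mean <- mean(python_scores)                  # NA, because one value is missing
mean_python <- mean(python_scores, na.rm = TRUE) # drops the NA before averaging

# Round to two decimals for a tidier message
print(paste("The Average score in Python is", round(mean_python, 2), sep = ":"))
```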
 
 
Male Female Classification
  # Male students
  MaleRows <- subset(StudentDetails, Sex == "Male")
  MaleCount <- nrow(MaleRows)
  print(paste("Number of Male Students", MaleCount, sep = " - "))
  # Female students
  FemaleRows <- subset(StudentDetails, Sex == "Female")
  FemaleCount <- nrow(FemaleRows)
  print(paste("Number of Female Students", FemaleCount, sep = " - "))
Here, we have used the subset function to get the rows that match a condition. Afterward, we have used nrow to get the count of rows within each subset.
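When only the counts are needed, the table function tallies every distinct value in one call. This is a sketch on an invented vector named sex, standing in for StudentDetails$Sex.

```r
# Invented stand-in for StudentDetails$Sex
sex <- c("Female", "Male", "Female", "Male", "Male")

# One count per distinct value
sex_counts <- table(sex)
print(sex_counts)

# Individual counts can be picked out by name
male_count <- sex_counts[["Male"]]
female_count <- sex_counts[["Female"]]
print(paste("Number of Male Students", male_count, sep = " - "))
print(paste("Number of Female Students", female_count, sep = " - "))
```

This also surfaces unexpected values, such as a misspelled category, which two separate subset calls would silently miss.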
 
 
Students from the city of Darlington
  # Rows where the City column is exactly "Darlington"
  DarlingtonStudents <- subset(StudentDetails, City == "Darlington")
  print(DarlingtonStudents)
Just like the queries shown above, we have used subset here as well, except that the condition is different.
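Conditions can also be combined. The sketch below, on an invented data frame named people, uses & to require two conditions at once and the select argument to keep only some columns. The Name column and all the rows are assumptions for illustration.

```r
# Invented stand-in for StudentDetails
people <- data.frame(
  Name = c("Asha", "Ben", "Cara", "Dev"),
  Sex  = c("Female", "Male", "Female", "Male"),
  City = c("Darlington", "Leeds", "York", "Darlington"),
  stringsAsFactors = FALSE
)

# Female students from Darlington, showing only two columns
darlington_females <- subset(
  people,
  City == "Darlington" & Sex == "Female",
  select = c(Name, City)
)
print(darlington_females)
```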
 
 
Find the sum of subjects and list 3 overall toppers
  # Total score per student across the four subjects
  StudentDetails$Sum <- StudentDetails$Java + StudentDetails$C + StudentDetails$Ruby + StudentDetails$Python
  # Sort by total, highest first, and keep the first three rows
  head(StudentDetails[order(StudentDetails$Sum, decreasing = TRUE), ], n = 3)
Here, we are summing up the scores in Java, C, Ruby, and Python and assigning the result to a new column, Sum, which is not present in the imported table.
 
Assigning a value to StudentDetails$Sum creates a new column in the table and stores the values in it. Finally, we order the table in descending order of Sum and use the head function to get the top rows; n = 3 fetches only the first three rows.
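When there are many subject columns, adding them one by one gets tedious. A possible alternative is rowSums over a vector of column names. This is a sketch on an invented data frame named marks; the rows and the Name column are made up.

```r
# Invented stand-in for StudentDetails
marks <- data.frame(
  Name   = c("Asha", "Ben", "Cara", "Dev"),
  Java   = c(88, 95, 95, 60),
  C      = c(76, 81, 64, 70),
  Ruby   = c(91, 70, 85, 65),
  Python = c(67, 72, 90, 80),
  stringsAsFactors = FALSE
)

# Columns to add up, kept in one place so they are easy to change
subjects <- c("Java", "C", "Ruby", "Python")

# Row-wise total across the chosen columns
marks$Sum <- rowSums(marks[, subjects])

# Highest totals first, top three rows only
top3 <- head(marks[order(marks$Sum, decreasing = TRUE), ], n = 3)
print(top3)
```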
 
 
Group By Subject Toppers
  # Top four rows per subject, highest score first
  JavaToppers <- head(StudentDetails[order(StudentDetails$Java, decreasing = TRUE), ], n = 4)
  RubyToppers <- head(StudentDetails[order(StudentDetails$Ruby, decreasing = TRUE), ], n = 4)
  PythonToppers <- head(StudentDetails[order(StudentDetails$Python, decreasing = TRUE), ], n = 4)
  print("The details of Java Toppers:")
  print(JavaToppers)
  print("The details of Ruby Toppers:")
  print(RubyToppers)
  print("The details of Python Toppers:")
  print(PythonToppers)
Here, we are sorting the Java scores in descending order, getting the first four rows using the head function, and assigning them to the JavaToppers variable. We do the same for the other subjects.
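Since the three statements differ only in the column name, the pattern can be wrapped in a small helper and applied to each subject in turn. The helper top_n_by, the data frame marks, and its rows are all hypothetical and only sketch the idea.

```r
# Invented stand-in for StudentDetails
marks <- data.frame(
  Name   = c("Asha", "Ben", "Cara", "Dev"),
  Java   = c(88, 95, 95, 60),
  Ruby   = c(91, 70, 85, 65),
  Python = c(67, 72, 90, 80),
  stringsAsFactors = FALSE
)

# Hypothetical helper: top n rows of data, ordered by one column, highest first
top_n_by <- function(data, column, n = 4) {
  head(data[order(data[[column]], decreasing = TRUE), ], n = n)
}

subjects <- c("Java", "Ruby", "Python")

# One data frame of toppers per subject, stored in a named list
toppers <- lapply(subjects, function(s) top_n_by(marks, s, n = 2))
names(toppers) <- subjects

for (s in subjects) {
  print(paste("The details of", s, "Toppers:"))
  print(toppers[[s]])
}
```

Adding another subject then only means extending the subjects vector.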
 
 
Thus, we have seen how R can be used for data analysis. This is just a kick-start: R is powerful and can work with complex data. I am attaching the CSV I used for the demo along with the article. Feel free to get your hands dirty and play around with it.

