R Tutorial (1)


It’s been a while since we last took a look at the R programming language. While I don’t see R becoming my main programming language (I’ll always be a Pythonista at heart), I decided it would still be nice to have R in my arsenal for statistical computing. Also, it’s always a fun challenge to learn and slowly work toward mastering a new language. This post will serve as my personal source of reference.

It’s also worth mentioning that this document was written in R Markdown, which seems to be a mix of markdown and Jupyter Notebook. It is very similar to Jupyter in that it allows users to interweave text with bits of code—perfect for blogging purposes. I’m still getting used to RStudio and R Markdown, but we will see how it goes. Let’s jump right in!

Setup

There are several basic commands that are useful when setting up and working on an R project. For example, to obtain the location of the current working directory, simply type

getwd()

## [1] "/Users/jaketae/Documents/Jake Tae/R"

We can also set the working directory. I don’t want to change the working directory here, so instead I will execute a dummy command.

# setwd("some_location")
setwd(getwd())

To see the list of variables stored in the environment, use ls(), which is R’s version of the eponymous Linux command.

ls()

## character(0)

To remove all stored variables,

rm(list=ls())

Basics

This section is titled “Basics”, but we are going to skip over basic arithmetic operations, just because they are boring. Here, I document certain perks of the R language that may be useful to know about.

R differs slightly from other programming languages in how slicing works: both the lower and upper bounds of a slice are inclusive.

x <- c(1:10)
x

##  [1]  1  2  3  4  5  6  7  8  9 10
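
To see the inclusive bounds in action, here is a quick sketch: slicing from index 2 to index 5 returns four elements, with both endpoints included.

x[2:5]

## [1] 2 3 4 5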

We can identify the type of an object with the class function and its length with the length function.

class(x)

## [1] "integer"

length(x)

## [1] 10

If some data is encoded as zeros and ones (say, a dummy variable), and we want R to interpret it as categorical rather than numeric, we can cast it using as.factor.

a <- c(0, 1, 0, 1, 1)
class(a)

## [1] "numeric"

as.factor(a)

## [1] 0 1 0 1 1
## Levels: 0 1

R is powerful because it supports vectorized operations by default, much like NumPy in Python. For example,

x + 10

##  [1] 11 12 13 14 15 16 17 18 19 20

Notice that all elements were modified despite the absence of an explicit for loop. By the same token, R supports boolean-based indexing, which is also related to its vectorized nature.

x > 5

##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
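
Passing that logical vector back into the brackets keeps only the elements for which the condition is TRUE. A minimal sketch:

x[x > 5]

## [1]  6  7  8  9 10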

One important point to note about vectors is that they cannot hold objects of different classes. For example, you will see that R coerces every element to a character when values of different data types are passed to c.

v <- c(T, 1, 2, 3, 'character')
v

## [1] "TRUE"      "1"         "2"         "3"         "character"

Data Frames

Let’s look at some sample data. Boston is a data frame that contains housing values and related statistics for Boston suburbs. For instructive purposes, we’ll be fiddling with this toy dataset. We will assign it to a variable so that we can refer to it conveniently.

library(MASS)
table <- Boston

Let’s take a look at the summary of the dataset.

summary(table)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Sometimes, however, the information retrieved by str may be more useful.

str(table)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The head command is a handy little tool that gives us a quick peek at the data.

head(table, 5)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2

Equivalently, we could have sliced the table.

table[1:5, ]

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2

The dollar sign is a key piece of syntax in R that makes extracting columns from data frames extremely easy.

head(table$crim)

## [1] 0.00632 0.02731 0.02729 0.03237 0.06905 0.02985

We can calculate the mean of a specified column as well.

mean(table$crim)

## [1] 3.613524
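
Other summary functions work the same way; for instance, the median of the crime rate column, matching the value reported by summary above.

median(table$crim)

## [1] 0.25651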

Plot

The easiest way to create a plot is to use the plot function. Let’s begin by considering a plot of the sine function.

x <- seq(-pi, pi, 0.1)
y <- sin(x)
plot(x, y)

Let’s improve this plot with some visual additions.

plot(x, y, main="Sine", xlab = "x", ylab="sin(x)", type="l", col="skyblue")

That looks slightly better.
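
We can also layer additional curves on top of an existing plot with the lines function and label them with legend. Here is a minimal sketch that overlays the cosine function on the same axes.

plot(x, y, main="Sine and Cosine", xlab="x", ylab="value", type="l", col="skyblue")
lines(x, cos(x), col="tomato")
legend("topright", legend=c("sin(x)", "cos(x)"), col=c("skyblue", "tomato"), lty=1)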

Plotting can also be performed with data frames. cars is a built-in dataset in R that we will use here for demonstrative purposes.

plot(cars)

We can also create a pairplot, which shows the distributional relationship between each pair of columns in the table. Intuitively, I understand it as something like a visual analogue of a symmetric matrix, with each cell showing the distribution according to the row and column variables.

pairs(table)

Note that the plot function is versatile. We can specify which columns to plot, as well as set the labels of the plot to be created. For example,

with(
  table,
  plot(medv, 
       crim, 
       main="Crime Rate versus Median House Value", 
       xlab="median value of owner-occupied", 
       ylab="crime rate")
)

Equivalently, we could have used this command:

plot(crim~medv, data=table, main="Crime Rate versus Median House Value", xlab="median value of owner-occupied", ylab="crime rate")

Apply Functions

lapply

Let’s start with what I think is the easiest one: lapply. In Python terms, this would be something like np.vectorize. Here is a very quick demo with a dummy example.

movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")
movies_lower <- unlist(lapply(movies, tolower))
movies_lower

## [1] "spyderman" "batman"    "vertigo"   "chinatown"

The unlist function was used to change the list into a vector. The gist of lapply is that it receives as input a data frame, list, or vector, and applies the given function to each element of that iterable. A similar effect could be achieved with a loop (see the sketch below), but the vectorized nature of lapply makes it a more attractive option.
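
For comparison, here is a rough sketch of the same operation written as an explicit for loop; the result is identical, just more verbose.

# pre-allocate the output vector, then fill it element by element
movies_lower <- character(length(movies))
for (i in seq_along(movies)) {
  movies_lower[i] <- tolower(movies[i])
}
movies_lower

## [1] "spyderman" "batman"    "vertigo"   "chinatown"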

sapply

sapply does essentially the same thing as unlist(lapply(X, FUN)), with the added convenience that the result is named after the input elements. In other words,

movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")
movies_lower <- sapply(movies, tolower)
movies_lower

##   SPYDERMAN      BATMAN     VERTIGO   CHINATOWN 
## "spyderman"    "batman"   "vertigo" "chinatown"

Note that we can apply sapply to data frames as well. For instance,

sapply(table, mean)

##         crim           zn        indus         chas          nox 
##   3.61352356  11.36363636  11.13677866   0.06916996   0.55469506 
##           rm          age          dis          rad          tax 
##   6.28463439  68.57490119   3.79504269   9.54940711 408.23715415 
##      ptratio        black        lstat         medv 
##  18.45553360 356.67403162  12.65306324  22.53280632

In this case, mean is applied to each column in table.
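
sapply also accepts anonymous functions, which is handy for one-off computations. As a quick sketch, the following computes the range (maximum minus minimum) of every column and returns it as a named numeric vector.

# range of each column in the Boston data we loaded earlier
sapply(table, function(col) max(col) - min(col))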

apply

The apply function is a vectorized way of processing tabular data. If you are familiar with Pandas, you will quickly notice that Pandas shamelessly borrowed this function from R. Let’s take a look at what apply can do.

apply(X=cars, MARGIN=2, FUN=mean, na.rm=TRUE)

## speed  dist 
## 15.40 42.98

Notice that apply basically ran down the data and computed the mean of each available numerical column. The na.rm=TRUE is an optional argument that is passed on to FUN, which here is mean. Without this specification, any column containing missing values would yield NA instead of a number.

Of course, we can try other functions instead of mean. This time, let’s try using the quantile function.

apply(table, MARGIN=2, quantile, probs=c(0.25, 0.5, 0.75), na.rm=TRUE)

##         crim   zn indus chas   nox     rm    age      dis rad tax ptratio
## 25% 0.082045  0.0  5.19    0 0.449 5.8855 45.025 2.100175   4 279   17.40
## 50% 0.256510  0.0  9.69    0 0.538 6.2085 77.500 3.207450   5 330   19.05
## 75% 3.677083 12.5 18.10    0 0.624 6.6235 94.075 5.188425  24 666   20.20
##        black  lstat   medv
## 25% 375.3775  6.950 17.025
## 50% 391.4400 11.360 21.200
## 75% 396.2250 16.955 25.000

And with that, we get an instant quartile summary for each numerical column in the data.

If you’re thinking that apply is similar to the sapply and lapply functions we’ve looked at so far, you’re not wrong. apply, at least to me, seems to be a more complex command capable of both row- and column-based vectorization, controlled by the MARGIN argument. It is also different in that it can only be applied to tabular data, not lists or vectors (otherwise, the MARGIN argument would be unnecessary).
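
To illustrate the row-wise case, here is a minimal sketch that sums the two columns of cars for each row; MARGIN=1 sweeps across rows, whereas MARGIN=2 sweeps down columns as before.

# sum of speed and dist for the first few rows
head(apply(cars, MARGIN=1, FUN=sum))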

tapply

tapply is slightly trickier than the ones we have seen above, as it is not just a vectorized operation applied to a single set of data. Instead, tapply is capable of splitting data up into groups according to a second variable. Let’s see what this means with an example:

tapply(iris$Sepal.Width, iris$Species, mean)

##     setosa versicolor  virginica 
##      3.428      2.770      2.974

As you can see, tapply segments the Sepal.Width column according to Species, then returns the mean for each segment. This is going to be incredibly useful for identifying hidden patterns in data.
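
Any summary function can be plugged in here. For instance, swapping mean for length counts the observations per group, which for iris returns 50 flowers of each species.

tapply(iris$Sepal.Width, iris$Species, length)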

Charts

In this section, we will take a look at how to create charts and visualizations using only the graphics functions loaded by default in R.

Bar Plot

Bar plots can be created using (yes, you guessed it) the barplot command. Let’s remind ourselves that a bar plot visualizes the frequency of each category of a categorical variable.

barplot(table(iris$Species))

One peculiarity that you might have noticed is that we wrapped the column in a call to the table function. This is because barplot expects a frequency table as input. To get an idea of what this frequency table looks like, let’s create a relative frequency table.

freq <- table(iris$Species) / length(iris$Species)
freq

## 
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333

Now let’s try prettifying the bar plot with some small customizations. Note that the las=1 argument rotates the tick labels on the y-axis so that they read horizontally.

barplot(freq, main="Percentage of Iris Species", xlab="Species", ylab="%", las=1)

Pie Chart

It’s really easy to move from a bar plot to a pie chart, since they are just different ways of visualizing the same information. In particular, we can use the pie command.

pie(freq, main="Percentage of Iris Species")
box()

Box Plot

A box plot is a way of visualizing the five-number summary, which to recap consists of the minimum, first quartile, median, third quartile, and maximum of a given dataset. Let’s quickly draw a vanilla box plot using the boxplot command, with some minimal labeling.

boxplot(cars$speed, main="Box Plot Demo", ylim=c(0, 30), ylab="Speed", las=1)
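
If we want the numbers behind the box rather than the picture, the same information can be pulled up directly; a small sketch using the built-in fivenum function:

# the five-number summary underlying the box plot
fivenum(cars$speed)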

We can get a bit more sophisticated by segmenting the data by some other axis, much like we did for tapply. This can be achieved in R by the ~ operator. Concretely,

boxplot(iris$Sepal.Length~iris$Species, xlab="Species", ylab="Sepal Length", main="Sepal Length by Iris Species", las=1)

Just as a reminder, this is what we get with the tapply function. Notice that the results shown by the box plot are more inclusive in that they also convey the IQR rather than just a single summary statistic like the mean.

tapply(iris$Sepal.Length, iris$Species, mean)

##     setosa versicolor  virginica 
##      5.006      5.936      6.588

Histogram

Creating histograms is not so much different from the other types of visualizations we have seen so far. To create a histogram, we can use the hist command.

hist(table$medv, freq=FALSE, ylim=c(0, 0.07), main='Median Value of Housing Prices', xlab='Median Value', las=1)

The freq argument specifies whether the y-axis shows raw counts (freq=TRUE) or densities (freq=FALSE).

We can also add a density curve over the histogram to get an approximation of the distribution of the data.

hist(table$medv, freq=FALSE, ylim=c(0, 0.07), main='Median Value of Housing Prices', xlab='Median Value', las=1)
lines(density(table$medv), col=2, lwd=2)

Scatter Plot

Scatter plots can be created in R via the plot command.

Let’s check if there exists a linear relationship between the variables of interest in the cars data frame.

cor(cars$speed, cars$dist)

## [1] 0.8068949

A Pearson correlation of roughly 0.81 suggests that there does appear to be a linear relationship. Let’s verify that this is indeed the case by creating a scatter plot.

plot(cars$speed, cars$dist, xlab='Speed', ylab='Dist', main='Speed vs Dist', las=1)
lines(smooth.spline(cars$speed, cars$dist))

Note that we have already seen this graph previously, when we were discussing the basics of plotting in an earlier section. Several modifications have been made to that graph, namely specifying which variables go on the x and y axes, as well as some labeling and titling. We’ve also added a smoothing spline, which can be thought of as a flexible regression curve that captures the pattern in the data.
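
If a straight regression line is preferred over a spline, a minimal sketch is to fit a simple linear model with lm and overlay it with abline.

fit <- lm(dist ~ speed, data=cars)
plot(cars$speed, cars$dist, xlab='Speed', ylab='Dist', main='Speed vs Dist', las=1)
abline(fit, col="red", lwd=2)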

Conclusion

This tutorial got very long, but hopefully it gave you a review (or a preview) of what the R programming language is like and what you can do with it. As it is mainly a statistical computing language, it is geared towards many aspects of data science, and it is no coincidence that R is one of the most widely used languages in this field, arguably second only to Python.

In the upcoming R tutorials, we will take a look at some other commands that might be useful for data analysis. Stay tuned for more!
