R Tutorial (1)
It’s been a while since we last took a look at the R programming language. While I don’t see R becoming my main programming language (I’ll always be a Pythonista at heart), I decided it would still be nice to have R in my arsenal for statistical computing. Also, it’s always a fun challenge to learn and try to slowly master a new language. This post will serve as my personal source of reference.
It’s also worth mentioning that this document was written in R Markdown, which seems to be a mix of markdown and Jupyter Notebook. It is very similar to Jupyter in that it allows users to interweave text with bits of code—perfect for blogging purposes. I’m still getting used to RStudio and R Markdown, but we will see how it goes. Let’s jump right in!
Setup
There are several basic commands that are useful when setting up and working on an R project. For example, to obtain the location of the current working directory, simply type
getwd()
## [1] "/Users/jaketae/Documents/Jake Tae/R"
We can also set the working directory. I don’t want to change the working directory here, so instead I will execute a dummy command.
# setwd("some_location")
setwd(getwd())
To see the list of variables stored in the environment, use ls(), which is just R’s version of the Linux ls command.
ls()
## character(0)
To remove all stored variables,
rm(list=ls())
Basics
This section is titled “Basics”, but we are going to skip over basic arithmetic operations, just because they are boring. Here, I document certain quirks of the R language that may be useful to know about.
R is slightly different from other programming languages in that slicing works differently, i.e. both the lower and upper bound are inclusive.
x <- c(1:10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
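To make the inclusive bounds concrete, here is a minimal sketch that slices the same vector. The names middle and without_first are only illustrative. Note also that R indexes from 1, and that a negative index drops an element rather than counting from the end as in Python.

```r
x <- c(1:10)

# Both bounds are included: this returns elements 2, 3 and 4
middle <- x[2:4]
middle

# A negative index removes that position instead of counting from the end
without_first <- x[-1]
without_first
```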
We can identify the type of an object with the class function and its length with the length function.
class(x)
## [1] "integer"
length(x)
## [1] 10
If some data is one-hot encoded, and we want R to interpret it as categorical (here, binary) instead of numeric, we can cast it using as.factor.
a <- c(0, 1, 0, 1, 1)
class(a)
## [1] "numeric"
as.factor(a)
## [1] 0 1 0 1 1
## Levels: 0 1
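As a small follow-up sketch, the snippet below inspects the resulting factor. The variable names f, codes and recovered are just for illustration. One thing worth knowing is that calling as.numeric directly on a factor returns the internal level codes, not the original values.

```r
a <- c(0, 1, 0, 1, 1)
f <- as.factor(a)

class(f)    # "factor"
levels(f)   # the distinct categories, stored as strings

# as.numeric on a factor gives the level codes (1 for "0", 2 for "1")
codes <- as.numeric(f)

# Going through as.character first recovers the original values
recovered <- as.numeric(as.character(f))
```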
R is powerful because it supports vectorized operations by default, much like NumPy in Python. For example,
x + 10
## [1] 11 12 13 14 15 16 17 18 19 20
Notice that all elements were modified despite the absence of an explicit for loop. By the same token, R supports boolean-based indexing, which is also related to its vectorized nature.
x > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
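The logical vector above can be used directly as an index. The following is a minimal sketch; mask, big, n_big and evens are illustrative names, not part of the original post.

```r
x <- c(1:10)

# A logical vector of the same length as x
mask <- x > 5

# Keep only the elements where the mask is TRUE
big <- x[mask]

# TRUE counts as 1, so sum() gives the number of matches
n_big <- sum(mask)

# The condition can also be written inline
evens <- x[x %% 2 == 0]
```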
One important point to note about vectors is that they cannot hold objects of different classes. For example, you will see that R casts all objects to become characters when different data types are passed as arguments.
v <- c(T, 1, 2, 3, 'character')
v
## [1] "TRUE" "1" "2" "3" "character"
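The coercion follows a hierarchy (logical, then integer, then numeric, then character), so the result depends on the most general type present. Here is a short sketch of that behaviour, with a list shown as the container to reach for when mixed types really are needed. The variable names are illustrative.

```r
# The most general type among the arguments wins
k1 <- class(c(TRUE, 1L))        # "integer"
k2 <- class(c(TRUE, 1L, 2.5))   # "numeric"
k3 <- class(c(1, "a"))          # "character"

# A list keeps each element's own type
mixed <- list(TRUE, 1, "character")
kinds <- sapply(mixed, class)
kinds
```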
Data Frames
Let’s look at some sample data. Boston is a data frame that contains housing prices in Boston suburbs. For instructive purposes, we’ll be fiddling with this toy dataset. We will save it in memory to prevent R from loading it each time.
library(MASS)
table <- Boston
Let’s take a look at the summary of the dataset.
summary(table)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Sometimes, however, the information retrieved by str may be more useful.
str(table)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The head command is a handy little tool that gives us a peek at the first few rows of the data.
head(table, 5)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
Equivalently, we could have sliced the table.
table[1:5, ]
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
The dollar sign is a key syntax in R that makes data extraction from tables extremely easy.
head(table$crim)
## [1] 0.00632 0.02731 0.02729 0.03237 0.06905 0.02985
We can calculate the mean of a specified column as well.
mean(table$crim)
## [1] 3.613524
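The dollar sign is not the only way to pull out a column. As a rough sketch, the snippet below shows a few equivalent forms together with row filtering. It uses the built-in cars dataset so that it runs without loading MASS, and df, m1, m2, m3 and fast are illustrative names.

```r
df <- cars

# Three equivalent ways to extract a single column as a vector
m1 <- mean(df$speed)
m2 <- mean(df[["speed"]])
m3 <- mean(df[, "speed"])

# Rows can be filtered with a logical condition on a column
fast <- df[df$speed > 20, ]
head(fast)
```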
Plot
The easiest way to create a plot is to use the plot function. Let’s begin by considering a plot of the sine function.
x <- seq(-pi, pi, 0.1)
y <- sin(x)
plot(x, y)
Let’s improve this plot with some visual additions.
plot(x, y, main="Sine", xlab = "x", ylab="sin(x)", type="l", col="skyblue")
That looks slightly better.
Plotting can also be performed with data frames. cars is a built-in dataset in R that we will use here for demonstrative purposes.
plot(cars)
We can also create a pairplot, which shows the relationship between each pair of columns in the table. Intuitively, I understand it as something like a visual analogue of a symmetric matrix, with each cell showing a scatter plot of the row variable against the column variable.
pairs(table)
Note that the plot function is versatile. We can specify which columns to plot, as well as set the labels of the plot to be created. For example,
with(
table,
plot(medv,
crim,
main="Crime Rate versus Median House Value",
xlab="median value of owner-occupied",
ylab="crime rate")
)
Equivalently, we could have used this command:
plot(crim~medv, data=table, main="Crime Rate versus Median House Value", xlab="median value of owner-occupied", ylab="crime rate")
Apply Functions
lapply
Let’s start with what I think is the easiest one: lapply. In Python terms, this would be something like np.vectorize. Here is a very quick demo with a dummy example.
movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")
movies_lower <- unlist(lapply(movies, tolower))
movies_lower
## [1] "spyderman" "batman" "vertigo" "chinatown"
The unlist function was used to change the list into a vector. The gist of lapply is that it receives as input some data frame, list or vector, applies the given function to each element of that iterable, and returns the results as a list. A similar effect could be achieved with a loop, but lapply expresses the same idea more concisely, which makes it a more attractive option.
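To see what lapply is doing, here is a minimal sketch that compares it with an explicit loop. The names as_list and out are illustrative. It also shows that tolower already accepts a whole vector, so for this particular function no apply call is strictly needed.

```r
movies <- c("SPYDERMAN", "BATMAN", "VERTIGO", "CHINATOWN")

# lapply always returns a list, one element per input
as_list <- lapply(movies, tolower)
class(as_list)

# The same result written as an explicit loop
out <- character(length(movies))
for (i in seq_along(movies)) {
  out[i] <- tolower(movies[i])
}

same_as_loop <- identical(out, unlist(as_list))

# tolower is itself vectorized over character vectors
same_as_direct <- identical(tolower(movies), out)
```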
sapply
sapply does roughly the same thing as unlist(lapply(X, FUN)), except that it also keeps the input values as names. In other words,
movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")
movies_lower <- sapply(movies, tolower)
movies_lower
## SPYDERMAN BATMAN VERTIGO CHINATOWN
## "spyderman" "batman" "vertigo" "chinatown"
Note that we can apply sapply to data frames as well. For instance,
sapply(table, mean)
## crim zn indus chas nox
## 3.61352356 11.36363636 11.13677866 0.06916996 0.55469506
## rm age dis rad tax
## 6.28463439 68.57490119 3.79504269 9.54940711 408.23715415
## ptratio black lstat medv
## 18.45553360 356.67403162 12.65306324 22.53280632
In this case, mean is applied to each column in table.
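Because sapply decides the shape of its result based on what the function returns, it can help to know about vapply, which takes an explicit template for each result. The following is a small sketch on the built-in cars dataset; col_means and ranges are illustrative names.

```r
# vapply checks that every result is a single number
col_means <- vapply(cars, mean, FUN.VALUE = numeric(1))
col_means

# When the function returns more than one value per column,
# sapply simplifies the result to a matrix instead of a vector
ranges <- sapply(cars, range)
ranges
```

The FUN.VALUE argument is a template of the expected result, so vapply fails early if a column produces something unexpected.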
apply
The apply function is a vectorized way of processing tabular data. If you are familiar with Pandas, you will quickly notice that Pandas shamelessly borrowed this function from R. Let’s take a look at what apply can do.
apply(X=cars, MARGIN=2, FUN=mean, na.rm=TRUE)
## speed dist
## 15.40 42.98
Notice that apply basically ran down the data and computed the mean of each available numerical column. The na.rm=TRUE is an optional argument that is passed on to FUN, which is mean. Without this specification, mean returns NA for any column that contains missing data.
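The cars dataset has no missing values, so the effect of na.rm is easier to see on a small made-up matrix. This is only a sketch with hypothetical data; m, with_na and without_na are illustrative names. Note that mean does not raise an error on missing values; it returns NA.

```r
# A toy matrix with one missing value in column a
m <- cbind(a = c(1, 2, NA), b = c(4, 5, 6))

# Without na.rm, the column containing NA has a mean of NA
with_na <- apply(m, 2, mean)

# Extra arguments are forwarded to FUN, so na.rm reaches mean()
without_na <- apply(m, 2, mean, na.rm = TRUE)
```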
Of course, we can try other functions instead of mean. This time, let’s try using the quantile function.
apply(table, MARGIN=2, quantile, probs=c(0.25, 0.5, 0.75), na.rm=TRUE)
## crim zn indus chas nox rm age dis rad tax ptratio
## 25% 0.082045 0.0 5.19 0 0.449 5.8855 45.025 2.100175 4 279 17.40
## 50% 0.256510 0.0 9.69 0 0.538 6.2085 77.500 3.207450 5 330 19.05
## 75% 3.677083 12.5 18.10 0 0.624 6.6235 94.075 5.188425 24 666 20.20
## black lstat medv
## 25% 375.3775 6.950 17.025
## 50% 391.4400 11.360 21.200
## 75% 396.2250 16.955 25.000
And with that, we can receive an instant quartile summary for each numerical column in the data.
If you’re thinking that apply is similar to the sapply and lapply we’ve looked at so far, you’re not wrong. apply, at least to me, seems to be a more complex command capable of both row and column-based vectorization. It is also different in that it can only be applied to tabular data, not lists or vectors (if that were the case, then the MARGIN argument would be unnecessary).
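To illustrate what MARGIN controls, here is a minimal sketch on a tiny matrix, where 1 means "over rows" and 2 means "over columns". The matrix and the names row_sums and col_sums are made up for the example.

```r
# 2 rows, 3 columns, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

# MARGIN = 1 applies the function to each row
row_sums <- apply(m, 1, sum)

# MARGIN = 2 applies the function to each column
col_sums <- apply(m, 2, sum)
```

For sums and means specifically, base R also ships the dedicated helpers rowSums, colSums, rowMeans and colMeans.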
tapply
tapply is slightly trickier than the ones we have seen above, as it is not just a vectorized operation applied to a single set of data. Instead, tapply is capable of splitting data up into categories according to a second axis. Let’s see what this means with an example:
tapply(iris$Sepal.Width, iris$Species, mean)
## setosa versicolor virginica
## 3.428 2.770 2.974
As you can see, tapply segments the Sepal.Width column according to Species, then returns the mean for each segment. This is going to be incredibly useful in identifying hidden patterns in data.
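One way to think about tapply is as a split followed by an sapply. The sketch below spells out those two steps and checks that they agree with the one-line version. The names by_species, means_split and means_tapply are illustrative.

```r
# Step 1: split the values into one vector per species
by_species <- split(iris$Sepal.Width, iris$Species)
names(by_species)

# Step 2: apply the function to each group
means_split <- sapply(by_species, mean)

# tapply does both steps at once
means_tapply <- tapply(iris$Sepal.Width, iris$Species, mean)

agree <- isTRUE(all.equal(as.vector(means_split), as.vector(means_tapply)))
```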
Charts
In this section, we will take a look at how to create charts and visualizations, using only the default loaded library in R.
Bar Plot
Bar plots can be created using–yes, you guessed it–the barplot command. Let’s remind ourselves that a bar plot is a visualization of the frequency of each category of a categorical variable.
barplot(table(iris$Species))
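One peculiarity that you might have noticed is that we wrapped the dataset with table. This is the built-in table function, not the Boston data frame we stored under the same name earlier; R still finds the function when the name is used in a call. We need it because barplot receives a frequency table as input. To get an idea of what this frequency table looks like, let’s create a relative frequency table.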
freq <- table(iris$Species) / length(iris$Species)
freq
##
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
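Dividing by the length works, but base R also has prop.table for exactly this purpose. The sketch below checks that both routes agree and converts the result to percentages; counts, freq_manual, freq_prop and pct are illustrative names.

```r
counts <- table(iris$Species)

# Manual relative frequencies, as in the text
freq_manual <- counts / length(iris$Species)

# prop.table divides each count by the total
freq_prop <- prop.table(counts)

# Scale to percentages if the axis is going to be labelled "%"
pct <- round(100 * freq_prop, 1)
pct
```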
Now let’s try prettifying the bar plot with some small customizations. Note that the las argument controls the orientation of the axis labels; las=1 draws the values on the y-axis horizontally.
barplot(freq, main="Percentage of Iris Species", xlab="Species", ylab="%", las=1)
Pie Chart
It’s really easy to move from a bar plot to a pie chart, since they are just different ways of visualizing the same information. In particular, we can use the pie command.
pie(freq, main="Percentage of Iris Species")
box()
Box Plot
A box plot is a way of visualizing the five number summary, which to recap consists of the minimum, first quartile, median, third quartile, and the maximum of a given dataset. Let’s quickly draw a vanilla box plot using the boxplot command, with some minimal labeling.
boxplot(cars$speed, main="Box Plot Demo", ylim=c(0, 30), ylab="Speed", las=1)
We can get a bit more sophisticated by segmenting the data by some other axis, much like we did for tapply. This can be achieved in R by the ~ operator. Concretely,
boxplot(iris$Sepal.Length~iris$Species, xlab="Species", ylab="Sepal Length", main="Sepal Length by Iris Species", las=1)
Just as a reminder, this is what we get with the tapply function. Notice that the box plot is more informative in that it shows the median and the IQR of each group, rather than just a single summary number such as the mean.
tapply(iris$Sepal.Length, iris$Species, mean)
## setosa versicolor virginica
## 5.006 5.936 6.588
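To get numbers that correspond more closely to what the box plot draws, we can compute the five number summary per group. The following is a sketch using fivenum, which returns the minimum, lower hinge, median, upper hinge and maximum; the row labels and the name five are my own additions.

```r
# One column per species, one row per summary statistic
five <- sapply(split(iris$Sepal.Length, iris$Species), fivenum)
rownames(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
five
```

The hinges are close to the first and third quartiles, though they can differ slightly from what quantile returns with its default settings.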
Histogram
Creating histograms is not so different from the other types of visualizations we have seen so far. To create a histogram, we can use the hist command.
hist(table$medv, freq=FALSE, ylim=c(0, 0.07), main='Median Value of Housing Prices', xlab='Median Value', las=1)
The freq argument specifies whether we want raw counts or densities on the y-axis. With freq=FALSE, the bars are scaled so that the total area of the histogram is one.
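The difference between counts and densities can be inspected without drawing anything, since hist returns an object holding both. Here is a minimal sketch on the built-in cars dataset; h and area are illustrative names.

```r
# plot = FALSE returns the histogram object without drawing it
h <- hist(cars$dist, plot = FALSE)

h$counts    # what freq = TRUE would draw
h$density   # what freq = FALSE would draw

# Densities are scaled so that bar height times bar width sums to one
area <- sum(h$density * diff(h$breaks))
area
```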
We can also add a density curve over the histogram to get an approximation of the distribution of the data.
hist(table$medv, freq=FALSE, ylim=c(0, 0.07), main='Median Value of Housing Prices', xlab='Median Value', las=1)
lines(density(table$medv), col=2, lwd=2)
Scatter Plot
Scatter plots can be created in R via the plot command. Let’s check if there exists a linear relationship between the variables of interest in the cars data frame.
cor(cars$speed, cars$dist)
## [1] 0.8068949
Pearson’s correlation suggests that there does appear to be a linear relationship. Let’s verify that this is indeed the case by creating a scatter plot.
plot(cars$speed, cars$dist, xlab='Speed', ylab='Dist', main='Speed vs Dist', las=1)
lines(smooth.spline(cars$speed, cars$dist))
Note that we have already seen this graph previously, when we were discussing the basics of graphing in an earlier section. Several modifications have been made to that graph, namely specifying the variables that go on the x and y axes, as well as some labeling and titling. We’ve also added a smoothing spline, which can be considered a flexible form of regression line that explains the pattern in the data.
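Since the correlation hints at a linear relationship, a straight-line fit is a natural companion to the spline. This is a sketch using lm, which is not used in the original post; fit, slope and r_squared are illustrative names.

```r
# Ordinary least squares: stopping distance as a linear function of speed
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line
coef(fit)
slope <- coef(fit)[["speed"]]

# For a single predictor, R-squared equals the squared Pearson correlation
r_squared <- summary(fit)$r.squared
r_squared
```

Calling abline(fit) right after the plot command would overlay this fitted line on the scatter plot.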
Conclusion
This tutorial got very long, but hopefully it gave you a review (or a preview) of what the R programming language is like and what you can do with it. As it is mainly a statistical computing language, it is geared towards many aspects of data science, and it is no coincidence that R is one of the most widely used languages in this field, arguably second only to Python.
In the upcoming R tutorials, we will take a look at some other commands that might be useful for data analysis. Stay tuned for more!