R Tutorial (3)
A few days ago, I saw a friend who posted an Instagram story looking for partners to study R with. I jumped at the opportunity without hesitation—based on my experience these past six months, I knew all too well that studying alone is a lonely, difficult process. The hardest part about it is keeping oneself accountable and continuing a long streak without losing momentum. So I said that I’d love to join him and his crew.
What we do as a group is nothing grandiose: we simply keep a log of what we’re studying and answer questions that others might have when they come up in our group chat. While the studying is mostly done on one’s own, the fact that we keep a semi-public record of where we are in terms of our study should hopefully motivate all of us to keep making progress until the end. As for me, my goal is to finish the book R for Data Science, which I had meant to read but never went past chapter 1, mostly because I got carried away by other things.
Enough of the prologue, here’s a summary of what I’ve learned so far by from the book.
Introduction
ggplot2
is a powerful visualization package in R, much like matplotlib
in Python. I’m not proficient enough in ggplot2
to make a direct comparison, but I’ve heard very positive things about EDA with R, so I’m excited to learn and have an additional tool under my belt.
Setup
Let’s first load the tidyverse
library to get started.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Dataset
We will be dealing with the mpg
data frame, which is built into ggplot2
. Since we’ve already loaded ggplot2
via tidyverse
, we can take a look at the data frame simply by typing its name.
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manua… f 21 29 p comp…
## 3 audi a4 2 2008 4 manua… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto(… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto(… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manua… f 18 26 p comp…
We will also be using the diamonds
data set, also from ggplot2
. Let’s take a look.
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Basic Syntax
Let’s cut to the chase and take a look at a very simple example of a ggplot
.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Things might look a bit confusing at first, but here is a brief rundown of the syntax:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
The obvious part is the declaration of data that we do inside the ggplot
function. Here, we simplify specify what data frame we are going to be using. Then, we add <GEOM_FUNCTION>
s to the canvas. This is somewhat akin to calling ax.plot
and ax.scatter
in Python, where <GEOM_FUNCTION>
is like plot
, scatter
, bar
, or other variations,
The mapping = aes(<MAPPINGS>)
is in a sense a set structure in R. As the name implies, mapping
maps various visualization attributes to the data. These attributes include basic things like x
and y
, as well as other aspects like color
, alpha
, or shape
. For example, we might do something like
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
In this case, the problem is that R only supports 6 different ticks or shapes by default, but we have 7 classes, making it impossible to render every data point. Nonetheless, it demonstrates how we can toggle additional options within mapping
and aes
.
Scoping
We can also make use of scoping to reduce redundanceis. For example, consider the following graph declaration.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This is cool, but notice that we are writing repated code for the mappings. Instead, we can use global scoping under the ggplot
function and streamline the code as follows:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This is no rocket science: all that happened is that we moved the mapping arguments upward to ggplot
, so that we no longer have to specify the mapping for each GEOM_FUNCTION
as we had done previously. This not only helps save time, but is also easier to maintain and read.
Facets
We can also create subplots that separate out each plot for an axis or dimension of data. This can sound a bit abstract at first, and indeed I did have some trouble understanding what faceting meant when I first read the relevant portion, but it’s surprisingly simple. The executive summary is that facets can be considered as a row of plots extracted from a pair plot. Enough talking, let’s take a look.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class)
As you can see, instead of having all data points in one graph, facetting allows us to divide up the data according to some axis, such as ~ class
in this case. This might help us discover hidden trends that are not as obvious if the data were to be viewed in aggregate.
We can also facet according to multiple axes instead of just one. The syntax is not so different from the previous example. The biggest difference is that instead of using facet_wrap
, we use facet_grid
.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
Here, we see the distribution of hwy
according to two axes, drv
and cyl
. Intuiting these facet graphs can get a bit more difficult as we start faceting around multiple axes, but simply think of it this way: instead of considering the data as a whole, we segment the data into certain groups according to their respective axeses or categories.
Colors
Fanciness is definitely not what defines a good visualization, but some degree of vibrance certainly helps portray information, if used correctly. Let’s experiment with some colors.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
By specifying fill
, we see that, as expected, the fill of each bar in the bar plot have been painted according to cut
. This is good, but it doesn’t exactly add new information. We can perhaps get a bit more creative and add an additional dimension of information by specifying fill
to be something other than cut
, which is already handled by
x
. For instance,
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
Now things look a bit more interesting. Here, we not only see information on count
, but we also see the composition or distribution of clarity
level for each count
. This has certainly added a layer of information.
Position
We can also specify a positional arguments to modify the looks of the graph a bit further according to our tastes and needs. For example,position = "fill"
makes the graphs such that it will fill the canvas.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
This information is informative in that it tells us that the higher the clarity of a dimaon, the more likely it is to be in a certain grade of cut. Namely, the yellow IF
clarity diamonds seems to belong to ideal
the most.
ggplot(data = diamonds) + geom_bar(
mapping = aes(x = cut, fill = clarity),
position = "dodge"
)
There are other interesting options as well. For example, position = "doge"
places overlapping objects next to each other. Some other interesting options for scatter plots include "jitter"
. For the purposes of this notebook, however, we won’t go over every option there is: it suffices to demonstrate the role and functionality of the position
argument in R.
Coordinates
The default coordinate system for ggplot2
is, as is the case with many other visualization packages, a cartesian coordinate. However, we can often apply transformations to alter the coordinate system. For example, consider the following bar plot:
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = F,
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar
We applied some miscellaneous touches to the configuration of the graph, but the gist of it is what we have already seen: a bar plot with coloring.
How can we make this graph more interesting? One way is to apply various transformations to the coordinates of the graph. For instance, let’s try flipping the axes:
bar + coord_flip()
Here, we used the coord_flip
function to literally flip the coordinates of the graph. This transformation can become particularly useful when the text labels of the data we are dealing with get very long.
We can also transform the bar chart into a pie chart by moving to a polar coordinate from the cartesian.
bar + coord_polar()
I personally find this visualization incredibly appealing. Just a comment in passing.
Visualization Syntax
In this section, we’ve looked at various ways of creating visualizations and graphs. Using this accumulated knowledge, we can now update the basic syntax of ggplot2
we’ve discussed in the previous section. Recall our basic template:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
We can now add more bells and whistles to this formula:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
This contains a lot of information that we have dealt with so far, with the exception of the <STAT>
portion, which was dealt in the book but not in this notebook. I decided to leave that portion out because it appears to be a more intricate system that I might be interested as an intermediate user of ggplot2
. As of now, the default statistical transformations configured for each <GEOM_FUNCTION>
should suffice for most use cases.
Conclusion
ggplot2
is a powerful visualization library with many useful functions. Although R’s vanilla plotting functions such as barplot
or hist
, which we explored in this previous
post are useful in their own right, ggplot2
offers more customizability and a wealth of functions that make it much more attractive for production.
I hope to continue this series as I get through R for Data Science with my study buddies. I’ve realized that studying new programming languages, such as C and R, during quarantine period is a good way to stay motivated and productive during what could potentially be dull, grey hours.
See you in the next post!