Open a web browser, go to http://cran.r-project.org, and download and install R
It is also helpful to install RStudio (download from http://rstudio.com)
In R, type install.packages("ggplot2") to install the ggplot2 package.
Download materials from http://tutorials.iq.harvard.edu/R/Rgraphics.zip
Extract the zip file containing the materials to your desktop
Workshop notes are available in .html format. Open a file browser, navigate to your desktop, and open Rgraphics.html
Class Structure and Organization:
This is an intermediate R course:
- ggplot2 graphics only; other packages will not be covered
My goal: by the end of the workshop you will be able to reproduce this graphic from the Economist:
img
Why ggplot2?
Advantages of ggplot2:
- grammar of graphics (Wilkinson, 2005)
That said, there are some things you cannot (or should not) do with ggplot2:
The basic idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:
Housing prices
Let’s look at housing prices.
housing <- read.csv("dataSets/landdata-states.csv")
head(housing[1:5])
State region Date Home.Value Structure.Cost
1 AK West 2010.25 224952 160599
2 AK West 2010.50 225511 160252
3 AK West 2009.75 225820 163791
4 AK West 2010.00 224994 161787
5 AK West 2008.00 234590 155400
6 AK West 2008.25 233714 157458
ggplot2 VS Base Graphics
Compared to base graphics, ggplot2 expects the data to be in a data.frame.
ggplot2 VS Base for simple graphs
Base graphics histogram example:
hist(housing$Home.Value)
img
ggplot2 histogram example:
library(ggplot2)
ggplot(housing, aes(x = Home.Value)) +
geom_histogram()
Base wins!
Base graphics VS ggplot2 for more complex graphs:
Base colored scatter plot example:
plot(Home.Value ~ Date,
data=subset(housing, State == "MA"))
points(Home.Value ~ Date, col="red",
data=subset(housing, State == "TX"))
legend(1975, 400000,
c("MA", "TX"), title="State",
col=c("black", "red"),
pch=c(1, 1))
img
ggplot2 colored scatter plot example:
ggplot(subset(housing, State %in% c("MA", "TX")),
aes(x=Date,
y=Home.Value,
color=State))+
geom_point()
img
ggplot2 wins!
In ggplot land aesthetic means “something you can see”. Examples include:
Each type of geom accepts only a subset of all aesthetics; refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.
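As a minimal sketch of what aes() does, the snippet below builds a mapping on a small made-up data frame (toy and its columns are hypothetical, not part of the workshop data). The point is that aes() only records which variable goes with which aesthetic; nothing is drawn until a geom is added.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = 1:3, y = c(2, 4, 8), g = c("a", "b", "a"))

# aes() records variable-to-aesthetic mappings; it does not draw anything
m <- aes(x = x, y = y, color = g)
print(m)

# The same mapping used as the plot-wide default
p <- ggplot(toy, m) + geom_point()
```

Printing p (or typing it at the console) is what actually renders the plot.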
Geometric objects (geom)
Geometric objects are the actual marks we put on a plot. Examples include:
- geom_point, for scatter plots, dot plots, etc.
- geom_line, for time series, trend lines, etc.
- geom_boxplot, for, well, boxplots!
A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.
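A small sketch of stacking geoms with +, using a made-up data frame (toy is hypothetical). layer_data() is used here only to confirm that each added geom became its own layer.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = 1:5, y = c(1, 4, 9, 16, 25))

base <- ggplot(toy, aes(x = x, y = y))  # data and mappings, no geom yet
one  <- base + geom_point()             # first layer: points
two  <- one + geom_line()               # second layer: a line through them

# layer_data() returns the computed data for the i-th layer
head(layer_data(two, 1))  # the point layer
head(layer_data(two, 2))  # the line layer
```

Layers are drawn in the order they are added, so here the line is drawn on top of the points.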
You can get a list of available geometric objects using the code below:
help.search("geom_", package = "ggplot2")
or simply type geom_<tab> in any good R IDE (such as RStudio or ESS) to see a list of functions starting with geom_.
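If tab completion is not available, one possible alternative is to list the exported names from the attached package. This is only a sketch; the variable name geoms is arbitrary.

```r
library(ggplot2)

# List every exported object in ggplot2 whose name starts with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")
head(geoms)
length(geoms)
```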
Now that we know about geometric objects and aesthetic mapping, we can make a ggplot. geom_point requires mappings for x and y; all others are optional.
hp2001Q1 <- subset(housing, Date == 2001.25)
ggplot(hp2001Q1,
aes(y = Structure.Cost, x = Land.Value)) +
geom_point()
img
ggplot(hp2001Q1,
aes(y = Structure.Cost, x = log(Land.Value))) +
geom_point()
img
A plot constructed with ggplot can have more than one geom. In that case the mappings established in the ggplot() call are plot defaults that can be added to or overridden. Our plot could use a regression line:
hp2001Q1$pred.SC <- predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1))
p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost))
p1 + geom_point(aes(color = Home.Value)) +
geom_line(aes(y = pred.SC))
img
Not all geometric objects are simple shapes–the smooth geom includes a line and a ribbon.
p1 +
geom_point(aes(color = Home.Value)) +
geom_smooth()
img
Each geom accepts a particular set of mappings; for example geom_text() accepts a label mapping.
p1 +
geom_text(aes(label=State), size = 3)
img
## install.packages("ggrepel")
library("ggrepel")
p1 +
geom_point() +
geom_text_repel(aes(label=State), size = 3)
img
Note that variables are mapped to aesthetics with the aes() function, while fixed aesthetics are set outside the aes() call. This sometimes leads to confusion, as in this example:
p1 +
geom_point(aes(size = 2),# incorrect! 2 is not a variable
color="red") # this is fine -- all points red
img
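For contrast, here is a sketch of the intended version, with both fixed aesthetics set outside aes(). It uses a made-up data frame (toy is hypothetical) so that it runs on its own, and layer_data() is used only to inspect what was drawn.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = 1:4, y = c(3, 1, 4, 2))

# Fixed aesthetics go outside aes(): every point gets size 2 and is red
p_fixed <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(size = 2, color = "red")

# Inspect the values actually used for drawing
d <- layer_data(p_fixed)
unique(d$size)
unique(d$colour)
```

Because size is not mapped to a variable here, no size legend is produced.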
Other aesthetics are mapped in the same way as x and y in the previous example.
p1 +
geom_point(aes(color=Home.Value, shape = region))
img
The data for the exercises is available in the dataSets/EconomistData.csv file. Read it in with
dat <- read.csv("dataSets/EconomistData.csv")
head(dat)
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) + geom_point()
X Country HDI.Rank HDI CPI Region
1 1 Afghanistan 172 0.398 1.5 Asia Pacific
2 2 Albania 70 0.739 3.1 East EU Cemt Asia
3 3 Algeria 96 0.698 2.9 MENA
4 4 Angola 148 0.486 2.0 SSA
5 5 Argentina 45 0.797 3.0 Americas
6 6 Armenia 86 0.716 2.6 East EU Cemt Asia
Original sources for these data are http://www.transparency.org/content/download/64476/1031428 http://hdrstats.undp.org/en/indicators/display_cf_xls_indicator.cfm?indicator_id=103106&lang=en
These data consist of Human Development Index and Corruption Perception Index scores for several countries.
Some plot types (such as scatterplots) do not require transformations–each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. require statistical transformations:
Each geom has a default statistic, but these can be changed. For example, the default statistic for geom_histogram is stat_bin:
args(geom_histogram)
args(stat_bin)
function (mapping = NULL, data = NULL, stat = "bin", position = "stack",
..., binwidth = NULL, bins = NULL, na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE)
NULL
function (mapping = NULL, data = NULL, geom = "bar", position = "stack",
..., binwidth = NULL, bins = NULL, center = NULL, boundary = NULL,
breaks = NULL, closed = c("right", "left"), pad = FALSE,
na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
NULL
Arguments to stat_ functions can be passed through geom_ functions. This can be slightly annoying because in order to change an argument you have to first determine which stat the geom uses, then determine the arguments to that stat.
For example, here is the default histogram of Home.Value:
p2 <- ggplot(housing, aes(x = Home.Value))
p2 + geom_histogram()
The binwidth looks reasonable by default, but we can change it by passing the binwidth argument to the stat_bin function:
p2 + geom_histogram(stat = "bin", binwidth=4000)
img
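To illustrate that the argument really belongs to the stat, the sketch below compares passing binwidth through geom_histogram() with calling stat_bin() directly. The data frame toy is made up for the purpose; the comparison uses layer_data() to look at the computed bins.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(v = c(1, 2, 2, 3, 3, 3, 12, 14, 25))
p <- ggplot(toy, aes(x = v))

# binwidth is an argument of stat_bin(); geom_histogram() forwards it
via_geom <- layer_data(p + geom_histogram(binwidth = 10))

# Calling the stat directly (its default geom is "bar") bins the same way
via_stat <- layer_data(p + stat_bin(binwidth = 10))

identical(via_geom$count, via_stat$count)
```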
Sometimes the default statistical transformation is not what you need. This is often the case with pre-summarized data:
housing.sum <- aggregate(housing["Home.Value"], housing["State"], FUN=mean)
rbind(head(housing.sum), tail(housing.sum))
State Home.Value
1 AK 147385.14
2 AL 92545.22
3 AR 82076.84
4 AZ 140755.59
5 CA 282808.08
6 CO 158175.99
46 VA 155391.44
47 VT 132394.60
48 WA 178522.58
49 WI 108359.45
50 WV 77161.71
51 WY 122897.25
ggplot(housing.sum, aes(x=State, y=Home.Value)) +
geom_bar()
Error: stat_count() must not be used with a y aesthetic.
What is the problem with the previous plot? Basically we take binned and summarized data and ask ggplot to bin and summarize it again (remember, geom_bar defaults to stat = "count"); obviously this will not work. We can fix it by telling geom_bar to use a different statistical transformation function:
ggplot(housing.sum, aes(x=State, y=Home.Value)) +
geom_bar(stat="identity")
img
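ggplot2 also provides geom_col(), which draws bars from values already present in the data. The sketch below, on a made-up summary table (means is hypothetical), suggests that it gives the same bar heights as geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical pre-summarized data, used only for illustration
means <- data.frame(State = c("AK", "AL", "AR"),
                    Home.Value = c(150, 90, 80))

p <- ggplot(means, aes(x = State, y = Home.Value))

with_identity <- layer_data(p + geom_bar(stat = "identity"))
with_col      <- layer_data(p + geom_col())

# Both draw bars whose heights are the values already in the data
all.equal(with_identity$y, with_col$y)
```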
… geom_smooth.
… geom_smooth, but use a linear model for the predictions. Hint: see ?stat_smooth.
… geom_line. Hint: change the statistical transformation.
… ?loess.
Aesthetic mapping (i.e., with aes()) only says that a variable should be mapped to an aesthetic. It doesn’t say how that should happen. For example, when mapping a variable to shape with aes(shape = x) you don’t say what shapes should be used. Similarly, aes(color = z) doesn’t say what colors should be used. Describing what colors, shapes, sizes, etc. to use is done by modifying the corresponding scale. In ggplot2, scales include
Scales are modified with a series of functions using a scale_<aesthetic>_<type> naming scheme. Try typing scale_<tab> to see a list of scale modification functions.
The following arguments are common to most scales in ggplot2:
Specific scale functions may have additional arguments; for example, the scale_color_continuous function has arguments low and high for setting the colors at the low and high end of the scale.
Start by constructing a dotplot showing the distribution of home values by Date and State.
p3 <- ggplot(housing,
aes(x = State,
y = Home.Price.Index)) +
theme(legend.position="top",
axis.text=element_text(size = 6))
(p4 <- p3 + geom_point(aes(color = Date),
alpha = 0.5,
size = 1.5,
position = position_jitter(width = 0.25, height = 0)))
img
Now modify the x axis name and the breaks and labels of the color scale:
p4 + scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"))
img
Next change the low and high values to blue and red:
p4 +
scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = "blue", high = "red")
img
library(scales)
p4 +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = muted("blue"), high = muted("red"))
img
ggplot2 has a wide variety of color scales; here is an example using scale_color_gradient2 to interpolate between three different colors.
p4 +
scale_color_gradient2(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = muted("blue"),
high = muted("red"),
mid = "gray60",
midpoint = 1994)
img
Scale | Types | Examples |
---|---|---|
scale_color_ | identity | scale_fill_continuous |
scale_fill_ | manual | scale_color_discrete |
scale_size_ | continuous | scale_size_manual |
 | discrete | scale_size_discrete |
scale_shape_ | discrete | scale_shape_discrete |
scale_linetype_ | identity | scale_shape_manual |
 | manual | scale_linetype_discrete |
scale_x_ | continuous | scale_x_continuous |
scale_y_ | discrete | scale_y_discrete |
 | reverse | scale_x_log10 |
 | log | scale_y_reverse |
 | date | scale_x_date |
 | datetime | scale_y_datetime |
Note that in RStudio you can type scale_ followed by TAB to get the whole list of available scales.
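As one instance of the scale_<aesthetic>_<type> naming scheme, the sketch below uses scale_color_manual() to choose the colors for a discrete variable. The data frame toy, the group labels, and the chosen colors are all made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = 1:4, y = c(2, 3, 5, 4), g = c("a", "b", "a", "b"))

p <- ggplot(toy, aes(x = x, y = y, color = g)) +
  geom_point() +
  # aesthetic = color, type = manual: we supply the colors ourselves
  scale_color_manual(name = "Group",
                     values = c(a = "black", b = "red"))

# Colors actually assigned to each point
layer_data(p)$colour
```

Naming the elements of values ties each color to a specific level rather than relying on level order.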
… ?scale_color_manual.
Faceting is ggplot2 parlance for small multiples.
ggplot2 offers two functions for creating small multiples:
- facet_wrap(): define subsets as the levels of a single grouping variable
- facet_grid(): define subsets as the crossing of two grouping variables
p5 <- ggplot(housing, aes(x = Date, y = Home.Value))
p5 + geom_line(aes(color = State))
img
There are two problems here–there are too many states to distinguish each one by color, and the lines obscure one another.
We can remedy the deficiencies of the previous plot by faceting by state rather than mapping state to color.
(p5 <- p5 + geom_line() +
facet_wrap(~State, ncol = 10))
img
There is also a facet_grid() function for faceting in two dimensions.
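A minimal sketch of two-dimensional faceting, on a made-up data frame with two grouping variables (toy, r, and k are hypothetical names). Rows of panels come from the variable on the left of the formula and columns from the one on the right.

```r
library(ggplot2)

# Hypothetical toy data: two grouping variables crossed 2 x 2
toy <- expand.grid(x = 1:3, r = c("r1", "r2"), k = c("k1", "k2"))
toy$y <- seq_len(nrow(toy))

# Rows of panels are defined by r, columns by k
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_line() +
  facet_grid(r ~ k)

# One panel per combination of r and k
length(unique(layer_data(p)$PANEL))
```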
The ggplot2 theme system handles non-data plot elements such as
Built-in themes include:
- theme_gray() (default)
- theme_bw()
- theme_classic()
p5 + theme_linedraw()
img
p5 + theme_light()
img
Specific theme elements can be overridden using theme(). For example:
p5 + theme_minimal() +
theme(text = element_text(color = "turquoise"))
img
All theme options are documented in ?theme.
You can create new themes, as in the following example:
theme_new <- theme_bw() +
theme(plot.background = element_rect(size = 1, color = "blue", fill = "black"),
text=element_text(size = 12, family = "Serif", color = "ivory"),
axis.text.y = element_text(colour = "purple"),
axis.text.x = element_text(colour = "red"),
panel.background = element_rect(fill = "pink"),
strip.background = element_rect(fill = muted("orange")))
p5 + theme_new
img
The most frequently asked question goes something like this: I have two variables in my data.frame, and I’d like to plot them as separate points, with different color depending on which variable it is. How do I do that?
housing.byyear <- aggregate(cbind(Home.Value, Land.Value) ~ Date, data = housing, mean)
ggplot(housing.byyear,
aes(x=Date)) +
geom_line(aes(y=Home.Value), color="red") +
geom_line(aes(y=Land.Value), color="blue")
#
img
library(tidyr)
home.land.byyear <- gather(housing.byyear,
value = "value",
key = "type",
Home.Value, Land.Value)
ggplot(home.land.byyear,
aes(x=Date,
y=value,
color=type)) +
geom_line()
img
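The tidyr documentation marks gather() as superseded by pivot_longer(). As a sketch of the same wide-to-long reshaping with the newer function, the example below uses a small made-up stand-in for housing.byyear (byyear and its values are hypothetical).

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for housing.byyear, used only for illustration
byyear <- data.frame(Date = c(2000, 2001, 2002),
                     Home.Value = c(100, 110, 125),
                     Land.Value = c(30, 35, 45))

# Wide to long: one row per Date and variable
long <- pivot_longer(byyear,
                     cols = c(Home.Value, Land.Value),
                     names_to = "type",
                     values_to = "value")

# With the data in long form, the variable name can be mapped to color
p <- ggplot(long, aes(x = Date, y = value, color = type)) +
  geom_line()
```

The row order differs from what gather() returns, but the resulting plot is the same.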
Economist Graph
Graph source: http://www.economist.com/node/21541178
Building on the graphics you created in the previous exercises, add the finishing touches to make the result as close as possible to the original Economist graph.
img