You should use ggplot to make most of your figures, because it:
I use ggplot2 to generate almost all my figures, with the exception of some spatial figures and a few specialized plots. In my experience, ggplot2 often chokes on larger raster or shape files. That said, there are some great resources for visualizing spatial data using ggplot2 (e.g., ggmap and an example from Casey O’Hara).
My awesome cheatsheet which I thought was the greatest until….
The official cheatsheet came out!!! They are both great…you should get both of them.
I highly recommend Winston Chang’s book (which provides a great overview of ggplot2) and his website.
rvisualization.com has some really nice examples!
We will use my cheatsheet as a reference to make a scatterplot. We will use a sample dataset from the package gcookbook.
# loading packages and data
#install.packages('gcookbook') # install if you don't have the package
library(ggplot2)
library(gcookbook) # source of example data
library(knitr) # functions for knitting Rmd documents to html
library(RColorBrewer)
hw <- heightweight
kable(head(hw))
sex | ageYear | ageMonth | heightIn | weightLb |
---|---|---|---|---|
f | 11.92 | 143 | 56.3 | 85.0 |
f | 12.92 | 155 | 62.3 | 105.0 |
f | 12.75 | 153 | 63.3 | 108.0 |
f | 13.42 | 161 | 59.0 | 92.0 |
f | 15.92 | 191 | 62.5 | 112.5 |
f | 14.25 | 171 | 62.5 | 112.0 |
In this step, you will use the ggplot and aes functions to assign variables in your dataset to the x and y axes and, if desired, to other aesthetics such as size, color, and labels.
This does not do any plotting! It just tells ggplot how to assign the data.
ggplot(hw, aes(x = ageYear, y = weightLb))
This step tells ggplot what type of figure to make. In this case, we will use geom_point, which draws a scatterplot.
ggplot(hw, aes(x = ageYear, y = weightLb)) +
geom_point()
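One way to convince yourself that the first step only sets up the data and mappings is to count the layers on the plot object before and after adding a geom. This is a minimal sketch: `toy` is a stand-in built from the six rows printed by head(hw) above (so the snippet runs without gcookbook), and peeking at `p$layers` is just a convenient check, not something you need in everyday use.

```r
library(ggplot2)

# Stand-in data: the six rows of heightweight shown by head(hw) above
toy <- data.frame(
  sex      = c("f", "f", "f", "f", "f", "f"),
  ageYear  = c(11.92, 12.92, 12.75, 13.42, 15.92, 14.25),
  heightIn = c(56.3, 62.3, 63.3, 59.0, 62.5, 62.5),
  weightLb = c(85.0, 105.0, 108.0, 92.0, 112.5, 112.0)
)

# Step 1: data and aesthetic mappings only
p_empty <- ggplot(toy, aes(x = ageYear, y = weightLb))
length(p_empty$layers)    # no layers have been added yet

# Step 2: adding a geom creates the first layer
p_points <- p_empty + geom_point()
length(p_points$layers)
```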
Tada!! You now have a basic plot, but you will probably want to modify a few things.
This section describes how to make some routine changes, such as:
ggplot(hw, aes(x = ageYear, y = weightLb)) +
geom_point(size = 4, shape = 15, alpha = 0.3) + # changing point size, shape, transparency
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
stat_smooth(method = lm) + # default is loess spline, can remove 95% confidence interval, se=FALSE
geom_hline(yintercept = 150, color = 'orange', linetype=3, size=1) +
theme_bw()
Changing the color to correspond to a third variable (sex, in this case):
ggplot(hw, aes(x = ageYear, y = weightLb, color=sex)) +
geom_point(size = 4, shape = 15, alpha = 0.3) + # changing point size, shape, transparency
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
stat_smooth(method = lm, size = 1) + # default is loess spline, can remove 95% confidence interval, se=FALSE
geom_hline(yintercept = 150, color = 'orange', linetype=3, size=1) +
theme_bw()
Creating separate plots for males and females using faceting. The arrangement of the plots can be controlled using row_variable ~ column_variable (use a period to indicate no variable):
ggplot(hw, aes(x = ageYear, y = weightLb, color=sex)) +
geom_point(size = 4, shape = 15, alpha = 0.3) + # changing point size, shape, transparency
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
stat_smooth(method = lm, se = FALSE) + # default is loess spline
facet_grid(. ~ sex) +
theme_bw()
Here is a variation on the theme:
ggplot(hw, aes(x = ageYear, y = weightLb, color=sex)) +
geom_point(size = 4, shape = 15, alpha = 0.3) + # changing point size, shape, transparency
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
stat_smooth(method = lm, se = FALSE) + # default is loess spline
facet_grid(sex ~ ., scales = 'free') +
theme_bw()
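To see what the two facet formulas do to the panel layout, here is a small sketch. It uses the made-up pet dataset from the bar plot section of this tutorial (renamed `pets` here) because it contains two categories, and it reads the layout table from `ggplot_build()`, which is only an inspection trick for illustration.

```r
library(ggplot2)

# Toy dataset (same values as the bar plot example in this tutorial)
pets <- expand.grid(pet = c("dog", "cat", "hamster"), gender = c("m", "f"))
pets$size <- c(45, 10, 1, 40, 8, 2)

p <- ggplot(pets, aes(x = pet, y = size)) +
  geom_point()

side_by_side <- p + facet_grid(. ~ gender)   # one row, one column per gender
stacked      <- p + facet_grid(gender ~ .)   # one row per gender, one column

# Grid position (ROW, COL) of every panel in each version
layout_side  <- ggplot_build(side_by_side)$layout$layout[, c("ROW", "COL")]
layout_stack <- ggplot_build(stacked)$layout$layout[, c("ROW", "COL")]
layout_side
layout_stack
```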
ggplot(hw, aes(x = weightLb)) +
geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Ways to modify the figure:
ggplot(hw, aes(x = weightLb)) +
geom_histogram(fill="gray") + ## use 'fill' (color only refers to the outline)
labs(y = "Number of people", x = "weight (lb)") +
theme_bw()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
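The message above suggests choosing the bin width yourself. A minimal sketch of that, where the width of 5 lb is an arbitrary value picked for illustration and `toy` holds only the six weights printed by head(hw) earlier:

```r
library(ggplot2)

# Stand-in data: the six weights shown by head(hw) earlier
toy <- data.frame(weightLb = c(85.0, 105.0, 108.0, 92.0, 112.5, 112.0))

p_hist <- ggplot(toy, aes(x = weightLb)) +
  geom_histogram(binwidth = 5, fill = "gray", color = "black") +  # 5 lb wide bins
  labs(y = "Number of people", x = "weight (lb)") +
  theme_bw()

# Inspect the bins ggplot2 computed (edges and counts)
bins <- ggplot_build(p_hist)$data[[1]]
bins[, c("xmin", "xmax", "count")]
```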
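The box in this figure spans the interquartile range (IQR), from the 25th to the 75th percentile, so it contains the middle 50% of the data. The midline is the median. Each whisker extends to the most extreme observation within \(1.5 \times \text{IQR}\) of the box edge, and values beyond the whiskers are drawn as individual points (outliers).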
In this case, I demonstrate how to assign the plot a name (bp) so we can add elements more easily.
bp <- ggplot(hw, aes(x = sex, y = weightLb)) +
geom_boxplot(fill="gray") +
labs(y = "Weight (lb)", x = "") +
theme_bw()
bp
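To make the box and whisker rule concrete, the pieces can be computed by hand. This is a base R sketch on the six weights printed by head(hw) earlier, using the default settings of `quantile()`; ggplot2's own calculation may differ in small details.

```r
# Weights (lb) from the six rows printed by head(hw) earlier
w <- c(85.0, 105.0, 108.0, 92.0, 112.5, 112.0)

q   <- quantile(w, c(0.25, 0.50, 0.75))  # box edges and median
iqr <- q[[3]] - q[[1]]                   # height of the box

# Fences sit 1.5 box-heights beyond each box edge
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences;
# anything outside the fences would be drawn as an outlier point
whiskers <- range(w[w >= lower_fence & w <= upper_fence])
outliers <- w[w < lower_fence | w > upper_fence]

q          # 25% = 95.25, 50% = 106.5, 75% = 111
whiskers   # 85.0 and 112.5
outliers   # empty: no value falls outside the fences
```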
I like seeing the data used to create the boxplot. This is easy to add as points:
bp +
geom_point() # overlay the raw data (points overlap; jittering comes next)
There are a lot of overlapping points, which makes it difficult to discern the true density of points. I will use the geom_jitter function, which is similar to geom_point, but randomly jitters the points.
bp +
geom_jitter()
This looks pretty good but is more scattered than I would like. The degree of scatter can be controlled:
bp +
geom_jitter(position = position_jitter(width = .05), alpha = 0.5)
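One detail worth knowing is that jitter is applied in both directions unless you say otherwise, so the plotted weights can drift slightly from their real values. The sketch below sets `height = 0` to keep the y values exact and adds a `seed` so the plot looks the same every time it is drawn. The `toy` data (six rows from head(hw)) and the seed value are assumptions made for illustration.

```r
library(ggplot2)

# Stand-in data: six rows of heightweight shown by head(hw) earlier
toy <- data.frame(
  sex      = c("f", "f", "f", "f", "f", "f"),
  weightLb = c(85.0, 105.0, 108.0, 92.0, 112.5, 112.0)
)

p_jit <- ggplot(toy, aes(x = sex, y = weightLb)) +
  geom_boxplot(fill = "gray") +
  geom_jitter(
    # horizontal scatter only, reproducible between runs
    position = position_jitter(width = 0.05, height = 0, seed = 1),
    alpha = 0.5
  ) +
  labs(y = "Weight (lb)", x = "") +
  theme_bw()

# Layer 2 holds the jittered points: y is untouched, x moves by at most 0.05
jittered <- ggplot_build(p_jit)$data[[2]]
jittered[, c("x", "y")]
```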
geom_bar can be used to create a variety of styles depending on the arguments that are used.
## create a dataset:
data <- expand.grid(pet = c('dog', 'cat', 'hamster'), gender=c('m', 'f'))
data$size <- c(45, 10, 1, 40, 8, 2)
ggplot(data, aes(x=pet, y=size, fill=gender)) +
geom_bar(stat="identity")
ggplot(data, aes(x=pet, y=size, fill=gender)) +
geom_bar(stat="identity", position="dodge")
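As a side note, more recent versions of ggplot2 also provide geom_col, which behaves like geom_bar(stat = "identity"). The sketch below (with the toy data renamed `pets`) checks that the two give the same bars and adds a third style, position = "fill", which rescales each stack to proportions.

```r
library(ggplot2)

# Same toy dataset as above, under a different name
pets <- expand.grid(pet = c("dog", "cat", "hamster"), gender = c("m", "f"))
pets$size <- c(45, 10, 1, 40, 8, 2)

p <- ggplot(pets, aes(x = pet, y = size, fill = gender))

stacked_bar <- p + geom_bar(stat = "identity")  # style used in this tutorial
stacked_col <- p + geom_col()                   # shortcut for the same thing
prop_col    <- p + geom_col(position = "fill")  # each stack rescaled to sum to 1

# Compare the bar tops computed for each version
top_bar  <- ggplot_build(stacked_bar)$data[[1]]$ymax
top_col  <- ggplot_build(stacked_col)$data[[1]]$ymax
top_prop <- ggplot_build(prop_col)$data[[1]]$ymax
```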
ggplot(data, aes(x=pet, y=size, color=gender, group=gender)) +
geom_line() +
geom_point(size=5)
It is possible to make your own theme. This is useful when you want your plots to have a consistent appearance, and you don’t want to repeat a lot of code for each figure.
I created a theme that I often use for scatterplots in publications. One issue with the default ggplot2 figures is that when they are saved the axis labels can appear very small. I increased the size of the labels to make them more readable.
I keep my theme on GitHub so I can access it from anywhere.
Another theme idea: I like the general appearance of the figures at rvisualization.com. The background is minimalistic which puts the emphasis on the data. A good project would be creating a theme based on their code.
source('https://raw.githubusercontent.com/OHI-Science/ohiprep/master/src/R/scatterTheme.txt')
ggplot(hw, aes(x = ageYear, y = weightLb)) +
geom_point(size = 4, shape = 15, alpha = 0.3) + # changing point size, shape, transparency
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
stat_smooth(method = lm) +
scatterTheme
## save the figure
ggsave('example.png', width = 6, height = 6)
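If you would like to build a theme of your own, a common pattern is to start from a built-in theme and override a few elements. The sketch below is hypothetical: `my_theme` and its font sizes are made up for illustration and are not the contents of scatterTheme.

```r
library(ggplot2)

# Hypothetical theme: theme_bw() with larger text (sizes chosen arbitrarily)
my_theme <- theme_bw() +
  theme(
    axis.text  = element_text(size = 14),
    axis.title = element_text(size = 16),
    plot.title = element_text(size = 18)
  )

# Stand-in data: six rows of heightweight shown by head(hw) earlier
toy <- data.frame(
  ageYear  = c(11.92, 12.92, 12.75, 13.42, 15.92, 14.25),
  weightLb = c(85.0, 105.0, 108.0, 92.0, 112.5, 112.0)
)

p_theme <- ggplot(toy, aes(x = ageYear, y = weightLb)) +
  geom_point(size = 4, alpha = 0.3) +
  labs(x = "Age (yr)", y = "Weight (lb)") +
  my_theme   # add the theme object like any other plot element
p_theme
```

Calling theme_set(my_theme) at the top of a script makes the theme the default for every plot that follows, which avoids adding it to each figure.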
The ggplot default colors aren’t always the prettiest, and most of the time, you will want to change them.
One thing about ggplot2 that confused me for a while is that both color and fill are used to define color. For the most part, ‘color’ is used to color lines and the outlines of polygons (e.g., histograms, bar plots, shapes 21-25), and ‘fill’ is used to color the interior. However, the other point shapes (0-20) are treated a bit differently: color is used for the entire point, and fill has no effect.
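A quick way to see the two working together is a point shape from the 21-25 range, which has both an outline and an interior. This is a minimal sketch using the six rows printed by head(hw) earlier; the colors are arbitrary.

```r
library(ggplot2)

# Stand-in data: six rows of heightweight shown by head(hw) earlier
toy <- data.frame(
  ageYear  = c(11.92, 12.92, 12.75, 13.42, 15.92, 14.25),
  weightLb = c(85.0, 105.0, 108.0, 92.0, 112.5, 112.0)
)

p_shape <- ggplot(toy, aes(x = ageYear, y = weightLb)) +
  # shape 21 is a circle with an outline (color) and an interior (fill)
  geom_point(shape = 21, size = 4, color = "black", fill = "orange") +
  theme_bw()

# The values ggplot2 will actually draw with
pts <- ggplot_build(p_shape)$data[[1]]
unique(pts[, c("shape", "colour", "fill")])
```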
I recommend using established color palettes, such as those from RColorBrewer.
display.brewer.all()
To select a particular ColorBrewer palette:
myCols <- brewer.pal(11, "Spectral") #choose the number of colors you want from the palette
myCols
## [1] "#9E0142" "#D53E4F" "#F46D43" "#FDAE61" "#FEE08B" "#FFFFBF" "#E6F598"
## [8] "#ABDDA4" "#66C2A5" "#3288BD" "#5E4FA2"
This returns a vector of hexadecimal color codes, one of the color formats R understands (named colors such as 'purple' are another).
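If the hex codes look opaque, base R can translate them: each pair of hex digits after the # is the red, green, or blue intensity on a 0 to 255 scale. A small sketch using the first color of the Spectral palette printed above:

```r
# Decode a hex code into its red, green and blue components
col2rgb("#9E0142")    # red 158, green 1, blue 66

# Go the other way: build a hex code from 0-255 components
rgb(158, 1, 66, maxColorValue = 255)    # "#9E0142"

# Named colors can be decoded the same way
col2rgb("purple")
```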
There are many ways to assign colors in ggplot2 (so many that it can be rather confusing). I am only going to describe the methods I have found work best for me.
The first thing to consider is whether the variable you want to represent with color is discrete (e.g., categories, such as gender or eye color) or continuous (e.g., weight or height).
If you have a discrete variable, the best bets are scale_color_brewer or scale_color_manual (or, alternatively, ‘scale_fill_brewer’ or ‘scale_fill_manual’ if you are trying to color the inside of a polygon shape). scale_color_brewer provides a nice shortcut if you are going to use one of the ColorBrewer palettes and do not care how the colors are assigned. scale_color_manual provides a lot of flexibility for assigning colors to particular categories.
sp <- ggplot(hw, aes(x = ageYear, y = weightLb, color=sex)) +
geom_point(size = 4, shape = 19) +
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
theme_bw()
# using a ColorBrewer palette:
sp +
scale_colour_brewer(palette = "Set1")
In this case, I want females to be purple and males green:
sp +
scale_color_manual(values = c("purple", "darkgreen"),
limits = c("f", "m"),
labels =c("females", "males"))
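An alternative to listing the categories in limits is to name the color vector, so each color is tied to a category by name rather than by position. The sketch below uses the made-up pet dataset from the bar plot section (renamed `pets`), where the factor levels happen to be ordered m then f, to show that the names win over the level order.

```r
library(ggplot2)

# Toy dataset from the bar plot section; factor levels are m, then f
pets <- expand.grid(pet = c("dog", "cat", "hamster"), gender = c("m", "f"))
pets$size <- c(45, 10, 1, 40, 8, 2)

p_named <- ggplot(pets, aes(x = pet, y = size, color = gender)) +
  geom_point(size = 4, shape = 19) +
  # names tie each color to a category, regardless of level order
  scale_color_manual(values = c(f = "purple", m = "darkgreen")) +
  theme_bw()
p_named

# Check which color each row received (rows matched by their unique size)
built <- ggplot_build(p_named)$data[[1]]
col_by_row <- built$colour[match(pets$size, built$y)]
data.frame(gender = pets$gender, colour = col_by_row)
```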
Sometimes you have a continuous variable you want to display as categories. In the following example, instead of mapping color to sex, we map it to height. But, we want height to be displayed as categories (small, medium, large).
# figure out quantiles I want to use:
quantile(hw$heightIn)
##     0%    25%    50%    75%   100%
## 50.500 58.725 61.500 64.300 72.000
# use cut function to make the breaks and labels:
sp <- ggplot(hw, aes(x = ageYear, y = weightLb,
color=cut(heightIn,
breaks = c(-Inf, 58.7, 64.3, Inf),
labels = c("small", "medium", "large")))) +
geom_point(size = 4, shape = 19) +
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
theme_bw()
# Default:
sp
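It can help to run cut on a handful of values first, to confirm the breaks fall where you expect. This sketch applies the same breaks and labels to the six heights printed by head(hw) earlier. By default each interval includes its upper break, so a height of exactly 58.7 would count as small.

```r
# Heights (in) from the six rows printed by head(hw) earlier
h <- c(56.3, 62.3, 63.3, 59.0, 62.5, 62.5)

size_class <- cut(h,
                  breaks = c(-Inf, 58.7, 64.3, Inf),
                  labels = c("small", "medium", "large"))

data.frame(heightIn = h, size_class)   # one small, five medium
table(size_class)                      # unused levels (large) are kept with count 0
```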
# Some changes using scale_color_manual:
# Determine hex codes for ColorBrewer YlOrRd palette
brewer.pal(9, "YlOrRd")
## [1] "#FFFFCC" "#FFEDA0" "#FED976" "#FEB24C" "#FD8D3C" "#FC4E2A" "#E31A1C"
## [8] "#BD0026" "#800026"
sp +
scale_color_manual(values = c('#FED976', '#FD8D3C', '#E31A1C'), # colors
limits =c('small', 'medium', 'large'), # categories that map to colors
name = 'size', # legend title
labels = c('smallish', 'med', 'very large')) # legend category names
For a continuous variable, there are three general options I tend to use, based on whether I want a 2 color gradient palette, a 3 color diverging palette, or a 4+ color gradient.
Here is the default:
sp <- ggplot(hw, aes(x = ageYear, y = weightLb, color=heightIn)) +
geom_point(size = 4, shape = 19) +
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
theme_bw()
sp
A two color gradient palette:
sp +
scale_color_gradient(low = 'yellow', high = 'red')
A three color diverging palette:
sp +
scale_color_gradient2(low = 'yellow', mid = 'grey', high = 'red', midpoint = 65)
## Warning: Non Lab interpolation is deprecated
Multiple color scale:
sp +
scale_color_gradientn(colours = rev(brewer.pal(11, "Spectral")))
If the text that you want to add corresponds to a variable in your data, you should use geom_text.
## add some labels to the hw data (the four names are recycled down all 236 rows)
hw$name <- c("chad", "lee", "pierce", 'niles')
sp <- ggplot(hw, aes(x = ageYear, y = weightLb, color=heightIn)) +
geom_point(size = 4, shape = 19) +
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
scale_color_gradientn(colours = rev(brewer.pal(11, "Spectral"))) +
theme_bw()
# The general command:
sp +
geom_text(aes(label=name))
This always takes a bit of work to get right:
sp +
geom_text(aes(label = name), color = 'black', size = 3, vjust = 1.5) # vjust > 1 moves the label below the point
It is usually better to only display a subset of the points. In some cases, you might be able to simply subset the data:
sp +
geom_text(data=subset(hw, ageYear > 16), aes(x = ageYear, y = weightLb, label = name),
color = 'black', size = 3, vjust = 1.5) # label only the rows with ageYear > 16
Sometimes it is necessary to make a variable with the names you want displayed:
hw$name2 <- NA
hw$name2[c(2,5,10,15)] <- "niles"
kable(head(hw))
sex | ageYear | ageMonth | heightIn | weightLb | name | name2 |
---|---|---|---|---|---|---|
f | 11.92 | 143 | 56.3 | 85.0 | chad | NA |
f | 12.92 | 155 | 62.3 | 105.0 | lee | niles |
f | 12.75 | 153 | 63.3 | 108.0 | pierce | NA |
f | 13.42 | 161 | 59.0 | 92.0 | niles | NA |
f | 15.92 | 191 | 62.5 | 112.5 | chad | niles |
f | 14.25 | 171 | 62.5 | 112.0 | lee | NA |
sp <- ggplot(hw, aes(x = ageYear, y = weightLb, color=heightIn)) +
geom_point(size = 4, shape = 19) +
labs(x = 'Age (yr)', y = "Weight (lb)", title = "Older people weigh more!") +
scale_color_gradientn(colours = rev(brewer.pal(11, "Spectral"))) +
geom_text(aes(label = name2), color = 'black', size = 5, vjust = 1.5) +
theme_bw()
sp
## Warning: Removed 232 rows containing missing values (geom_text).
It is also possible to add text to a particular location on the plot using expressions. For example, we can add the R² value to the plot.
# figure out what the R² value is:
mod <- lm(weightLb ~ ageYear, data=hw)
summary(mod)
##
## Call:
## lm(formula = weightLb ~ ageYear, data = hw)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -36.751 -10.370  -1.970   7.751  49.397
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -6.3356     9.2118  -0.688    0.492
## ageYear       7.8513     0.6699  11.720   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.06 on 234 degrees of freedom
## Multiple R-squared: 0.3699, Adjusted R-squared: 0.3672
## F-statistic: 137.4 on 1 and 234 DF, p-value: < 2.2e-16
sp +
annotate("text", x = 16, y = 60, label = 'R^2==0.37', parse = TRUE, fontface = 'bold')
## Warning: Removed 232 rows containing missing values (geom_text).
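Rather than typing the R² value by hand, the label can be built from the fitted model. The sketch below is one way to do it, using the six rows printed by head(hw) earlier as stand-in data (so the value differs from the 0.37 reported for the full dataset) and an arbitrary label position. The number is wrapped in quotes inside the expression so that trailing zeros are not dropped when the label is parsed.

```r
library(ggplot2)

# Stand-in data: six rows of heightweight shown by head(hw) earlier
toy <- data.frame(
  ageYear  = c(11.92, 12.92, 12.75, 13.42, 15.92, 14.25),
  weightLb = c(85.0, 105.0, 108.0, 92.0, 112.5, 112.0)
)

mod <- lm(weightLb ~ ageYear, data = toy)
r2  <- summary(mod)$r.squared

# Build a plotmath string such as R^2=="0.xx"
r2_label <- sprintf('R^2=="%.2f"', r2)

p_r2 <- ggplot(toy, aes(x = ageYear, y = weightLb)) +
  geom_point(size = 4, shape = 19) +
  annotate("text", x = 15, y = 90, label = r2_label, parse = TRUE) +
  theme_bw()
p_r2
```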
You can also use annotate to add things other than “text”, such as line segments, rectangles, arrows, etc.
qplot: Don’t bother!!! qplot is a ggplot shortcut function that combines the ggplot and geom functions. In practice, it makes using ggplot2 more confusing because you have to learn two ways of creating a plot.
If the ‘color’ aesthetic doesn’t seem to be working correctly, try ‘fill’.
Do not bother with the official Hadley Wickham book on ggplot2 (I love Hadley Wickham too, but this book is out of date and was never that helpful to begin with).
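Use the “group” aesthetic when you want separate lines/analyses for different subsets of data. This is often needed for geom_line plots. Sometimes this aesthetic is necessary, and other times you can get by without it: by default ggplot2 groups the data by the combination of all discrete variables mapped in the plot, so you only need to set it yourself when that default is not the grouping you want. (NOTE: the grouping variable must be categorical)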
When plotting points, use shape=19 (the default appears jagged in bitmap files!)
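The tip about the “group” aesthetic is easiest to see by counting the groups ggplot2 forms. In this sketch (using the made-up pet dataset from the bar plot section, renamed `pets`), the x variable is categorical, so without an explicit group every pet and gender combination becomes its own one-point group and no lines can be drawn. `n_groups` is a small helper written for this illustration, not part of ggplot2.

```r
library(ggplot2)

# Toy dataset from the bar plot section
pets <- expand.grid(pet = c("dog", "cat", "hamster"), gender = c("m", "f"))
pets$size <- c(45, 10, 1, 40, 8, 2)

no_group <- ggplot(pets, aes(x = pet, y = size, color = gender)) +
  geom_line()

with_group <- ggplot(pets, aes(x = pet, y = size, color = gender, group = gender)) +
  geom_line()

# Hypothetical helper: number of distinct groups in the first layer
n_groups <- function(p) length(unique(ggplot_build(p)$data[[1]]$group))

n_groups(no_group)     # one group per pet-by-gender combination
n_groups(with_group)   # one group (one line) per gender
```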