In this session we will use the popular ggplot2 package to visualize data that has already been transformed into a tidy tabular format that is suitable for use in creating plots.
A tidy dataset is a data frame (or table) in which each variable has its own column, each observation has its own row and each value has its own cell.
Load the core packages from the tidyverse, including ggplot2.
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
[Instructor’s note: draw attention to the packages that are loaded and relate these to this and other sections of the course.]
With the many hundreds of R packages available it is inevitable that there will be functions with the same name being used by more than one package.
For example, the filter function in dplyr has masked the filter function from the stats package, which is already loaded as part of base R.
?filter
Navigate to each help section and show how the signatures differ and explain how using the one you hadn’t intended will likely cause your code to fail.
Use RStudio’s code completion to show which filter function we get by default.
To use the filter function from the stats package we need to include the stats namespace prefix as follows:
stats::filter(presidents, rep(1, 3))
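To make the difference between the two functions concrete, here is a minimal sketch that calls each one with its namespace prefix, so there is no ambiguity about which is used. The vector and data frame are made up for this illustration, and the sketch assumes dplyr is installed.

```r
# Toy data, made up for this illustration
x <- c(1, 2, 3, 4, 5)
toy <- data.frame(id = 1:5, value = x)

# stats::filter applies a linear filter to a series; with rep(1, 3) each value
# becomes the sum of itself and its two neighbours (NA at the two ends)
moving_sum <- stats::filter(x, rep(1, 3))
moving_sum

# dplyr::filter keeps the rows of a data frame that satisfy a condition
large_values <- dplyr::filter(toy, value > 3)
large_values
```

Writing the prefix explicitly is also a reasonable habit in scripts where both packages are in use, since the result no longer depends on the order in which packages were loaded.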
Read in the patients data frame using read_tsv from the readr package.
patients <- read_tsv("patient-data-cleaned.txt")
## Parsed with column specification:
## cols(
## ID = col_character(),
## Name = col_character(),
## Sex = col_character(),
## Smokes = col_character(),
## Height = col_double(),
## Weight = col_double(),
## Birth = col_date(format = ""),
## State = col_character(),
## Grade = col_double(),
## Died = col_logical(),
## Score = col_double(),
## Date.Entered.Study = col_date(format = ""),
## Age = col_double(),
## BMI = col_double(),
## Overweight = col_logical()
## )
read_tsv imports tab-delimited files (tsv = tab-separated values).
There is an equivalent function in readr called read_csv for importing CSV files.
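The column specification that read_tsv prints can also be supplied up front. The following is a small sketch under stated assumptions: the file is a temporary one written by the example itself, and its column names and values are invented rather than taken from the patients file.

```r
library(readr)

# Write a small tab-delimited file to a temporary location (made-up values)
tsv_file <- tempfile(fileext = ".txt")
writeLines(
  c("ID\tHeight\tDied", "P001\t182.5\tFALSE", "P002\t164.0\tTRUE"),
  tsv_file
)

# Stating the column types explicitly avoids relying on readr's guesses
toy <- read_tsv(
  tsv_file,
  col_types = cols(
    ID = col_character(),
    Height = col_double(),
    Died = col_logical()
  )
)
toy
```

Supplying col_types also suppresses the "Parsed with column specification" message, which can be convenient in scripts that are run repeatedly.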
readr vs base R equivalents
Why use the readr functions instead of the more familiar read.table, read.delim and read.csv functions from base R?
Let’s have a look at the patients dataset before trying to visualize it.
View in RStudio by clicking on it in the Environment tab or by typing
View(patients)
It is a special type of data frame called a tibble.
class(patients)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Printing a tibble shows each of the column types.
Print the patients data frame at the console.
patients
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Score <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
This dataset required quite a bit of cleaning to get into the state where we can do some proper exploratory data analysis and visualization.
Later, in the part of the course on tidying data, we will look at issues with real-world datasets, the need for cleaning them, and some of the tools in the tidyverse collection of packages that help in doing so.
Reminder: a “tidy” dataset is one in which each variable has its own column and each observation has its own row.
What are the variables in the patients dataset?
What are the observations?
What types of variables do we have?
read_tsv has read these in as <chr>, <dbl>, <date> and <lgl>. What do these mean?
class(patients$Sex)
## [1] "character"
class(patients$Height)
## [1] "numeric"
class(patients$Overweight)
## [1] "logical"
What types should each variable be?
Categorical: Sex, Smokes, State
Continuous: Height, Weight, BMI
Do any need to be changed?
What about Grade?
What about Age?
We also have some columns read in as logical and date variables. Logical variables are treated by ggplot2 as categorical.
Most of the time ggplot2 will treat variables in the way you’d hope, e.g. creating a bar plot showing the number of patients in each grade even though Grade is currently stored as a number. But sometimes this will cause problems, so it’s a good idea to set the variable types correctly right from the outset.
R uses a special type of variable called a factor to handle categorical values. Factors are variables that have a fixed and known set of possible values – these are referred to as the levels of a factor, e.g. Female and Male for the Sex variable in the patients dataset.
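As a quick illustration of what a factor stores, the sketch below uses base R only and a made-up character vector; the variable names are hypothetical and not part of the patients dataset.

```r
# A character vector with repeated categories (made-up values)
smokes <- c("Non-Smoker", "Smoker", "Non-Smoker", "Non-Smoker")

smokes_factor <- factor(smokes)

levels(smokes_factor)      # the fixed set of possible values
as.integer(smokes_factor)  # the integer code stored for each element
table(smokes_factor)       # number of elements at each level
```

Each element is stored as an integer code that points into the vector of levels, which is why the set of possible values is fixed and known.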
Let’s turn some of our columns that should be categorical variables into factors.
patients$Sex <- factor(patients$Sex)
patients$Sex
## [1] Male Male Male Male Female Female Female Female Male Male
## [11] Female Female Male Female Female Male Male Male Male Male
## [21] Male Female Male Female Female Male Male Male Male Male
## [31] Female Male Male Female Male Female Male Female Male Female
## [41] Male Female Female Female Female Female Female Female Male Female
## [51] Female Female Female Female Male Female Female Female Male Male
## [61] Female Male Female Male Male Female Male Female Male Female
## [71] Female Male Male Female Male Female Male Male Female Female
## [81] Female Female Male Female Female Female Female Male Female Male
## [91] Male Male Female Male Female Female Female Female Female Female
## Levels: Female Male
levels(patients$Sex)
## [1] "Female" "Male"
patients
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <fct> <chr> <dbl> <dbl> <date> <chr> <dbl> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Score <dbl>,
## # Date.Entered.Study <date>, Age <dbl>, BMI <dbl>, Overweight <lgl>
Rather than repeat the above line for each of several columns we can use the mutate_at function from dplyr to do this in a single operation. We will cover this later in the course.
patients <- mutate_at(patients, vars(Sex, Smokes, State, Grade, Age), factor)
patients
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <fct> <fct> <dbl> <dbl> <date> <fct> <fct> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Score <dbl>,
## # Date.Entered.Study <date>, Age <fct>, BMI <dbl>, Overweight <lgl>
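In dplyr 1.0.0 and later, mutate_at is superseded by across() used inside mutate(). The sketch below assumes such a version of dplyr is installed (newer than the 0.8.0.1 shown in the start-up messages above) and uses a made-up data frame rather than the patients data.

```r
library(dplyr)

# Toy data frame, made up for this illustration
toy <- data.frame(
  Sex = c("Male", "Female", "Female"),
  Grade = c(2, 1, 2),
  Height = c(182.5, 164.0, 158.3),
  stringsAsFactors = FALSE
)

# Convert the selected columns to factors, leaving the others untouched
toy <- mutate(toy, across(c(Sex, Grade), factor))
str(toy)
```

mutate_at continues to work in current dplyr releases, so the code in this session does not need to change.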
First we’ll create a scatter plot of height vs weight.
ggplot(data = patients) +
geom_point(mapping = aes(x = Height, y = Weight))
We specified three things to create this plot: the data, the type of plot (the geom), and the mapping of variables to visual properties (aesthetics) of the plot.
In this case the type of plot is a geom_point, ggplot2’s function for a scatter plot, and the height and weight variables are mapped to the x and y coordinates, i.e. the positions of points on the scatter plot.
Other aesthetics could include the size, shape and colour of those points.
Look up the geom_point help in RStudio.
?geom_point
In the geom_point help page scroll down to the aesthetics section to see what aesthetics are actually required (x and y).
What about those other aesthetics?
Colours are usually fun – let’s try to use a colour aesthetic.
We need to set a mapping of a variable to colour just like we did for x and y.
ggplot(data = patients) +
geom_point(mapping = aes(x = Height, y = Weight, colour = Sex))
What about colouring based on a continuous variable?
ggplot(data = patients) +
geom_point(mapping = aes(x = Height, y = Weight, colour = BMI))
Let’s try some of the other aesthetics.
ggplot(data = patients) +
geom_point(mapping = aes(x = Height, y = Weight, shape = Sex))
Some aesthetics can only be used with categorical variables, e.g. shape.
ggplot(data = patients) +
geom_point(mapping = aes(x = Height, y = Weight, shape = BMI))
## Error: A continuous variable can not be mapped to shape
Not all aesthetics are useful, at least not for this plot.
ggplot(data = patients) +
geom_point(mapping = aes(x = Height, y = Weight, size = Sex))
## Warning: Using size for a discrete variable is not advised.
The warning hints that we should be using a continuous variable for size instead.
ggplot(data = patients) +
geom_point(mapping = aes(x = Height, y = Weight, size = BMI))
The transparency of the points can be set to reflect a variable using the alpha aesthetic.
ggplot(data = patients) +
geom_point(mapping = aes(x = Height, y = Weight, alpha = BMI))
Colours are more commonly used than shapes, sizes or transparency.
Virtually every aspect of the plots we’ve created can be customized. We will learn how to change the colours used a bit later.
The “gg” in ggplot2 stands for “grammar of graphics”. This grammar is a coherent system for describing and building graphs.
The basic template is:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
This template can be used to create any of the different types of graph possible with ggplot2.
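To show the template filled in for two different geoms, here is a minimal sketch that uses the mpg data frame bundled with ggplot2 instead of the patients data, so that it runs on its own.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y
p_points <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# The same template with a different geom and a single mapping
p_bars <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

p_points
p_bars
```

Only the geom function and the mappings change between the two plots; the overall shape of the call stays the same.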
Look up ggplot2 package help in RStudio to see what other geoms are available (find the ggplot2 package in the Packages tab normally found in the bottom right hand corner).
Let’s create a bar chart of patients’ ages.
geom_bar is the geom we need and it requires a single aesthetic mapping of the categorical variable of interest to x.
ggplot(data = patients) +
geom_bar(mapping = aes(x = Age))
The dark grey bars are a bit ugly - what if we want each bar to be a different colour?
ggplot(data = patients) +
geom_bar(mapping = aes(x = Age, colour = Age))
Colouring the edges wasn’t quite what we had in mind. Look at the help for geom_bar to see what other aesthetic we should have used.
ggplot(data = patients) +
geom_bar(mapping = aes(x = Age, fill = Age))
What happens if we fill with something other than age, e.g. sex?
We get a stacked bar plot.
ggplot(data = patients) +
geom_bar(mapping = aes(x = Age, fill = Sex))
Note the similarity in what we did here to what we did with the scatter plot - there is a common grammar.
What if we want all the bars to be the same colour but not dark grey, e.g. blue?
ggplot(data = patients) +
geom_bar(mapping = aes(x = Age, fill = "blue"))
That doesn’t look right - why not?
You can set the aesthetics to a fixed value but this needs to be outside the mapping.
ggplot(data = patients) +
geom_bar(mapping = aes(x = Age), fill = "blue")
Setting this inside the aes() mapping told ggplot2 to map the fill aesthetic to a variable that has the constant value “blue” for all observations; the colour actually drawn is then chosen by the default fill scale.
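The difference between mapping and setting an aesthetic can be checked by looking at the values ggplot2 computes for the layer. This sketch uses the bundled mpg data frame and the layer_data() helper from ggplot2; it is an illustration only, not part of the patients analysis.

```r
library(ggplot2)

# fill inside aes(): "blue" is treated as data, a variable with a single value
p_mapped <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = "blue"))

# fill outside aes(): "blue" is used directly as the colour of the bars
p_set <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class), fill = "blue")

# layer_data() returns the values ggplot2 will actually draw for a layer
unique(layer_data(p_mapped)$fill)  # a colour picked by the default fill scale
unique(layer_data(p_set)$fill)     # "blue"
```

The mapped version also gains a legend with a single entry labelled "blue", which is usually the first visible sign that a constant has ended up inside aes().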
Consider again the ggplot2 grammar:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Why do we have the ggplot part and geom part joined by a “+” symbol?
The ‘+’ operator has been overridden by ggplot2 to add layers and other components (scales, themes, etc.) to the plot.
We can have multiple geoms in the same plot. Each is a different layer.
Let’s create a plot with two geoms.
ggplot(data = patients) +
geom_point(mapping = aes(x = Height, y = Weight)) +
geom_smooth(mapping = aes(x = Height, y = Weight))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What is geom_smooth doing? Look up the help page.
Note that the shaded area represents the confidence interval around the fitted curve (95% by default).
There is some annoying duplication of code here. We can avoid this by putting the mappings in the ggplot function instead.
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Aesthetics set in the ggplot function are treated as global mappings for the plot and are inherited by all geoms added to the plot.
Let’s say we still want to colour the points by sex.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Well, that wasn’t expected, although it’s pretty neat. We only wanted to apply the colour aesthetic to the points.
We can add to the aesthetics in the geom function to apply them just to that geom.
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_point(mapping = aes(colour = Sex)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Or if you have got your scatter plot just right, and you’d rather not interfere with it while trying out another layer, you can set the inherit.aes option to FALSE in the new layer and give that layer its own mappings.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
geom_smooth(mapping = aes(x = Height, y = Weight), inherit.aes = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
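The effect of global versus per-geom mappings can be seen by counting the groups in the smoothing layer. The sketch below is an illustration on the bundled mpg data frame, with a linear fit chosen only to keep the example quick; layer_data() is the ggplot2 helper that returns the computed data for a layer.

```r
library(ggplot2)

# colour mapped globally: both geoms inherit it, so one fit per drive type
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# colour mapped only in geom_point: a single fit through all of the data
p_local <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = drv)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Number of groups in the smoothing layer (layer 2) of each plot
length(unique(layer_data(p_global, 2)$group))
length(unique(layer_data(p_local, 2)$group))
```

A discrete variable mapped to colour also defines the grouping, which is why the globally mapped version fits a separate model for each category.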
Let’s explore some other types of plots.
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_line()
A line plot doesn’t make much sense for this dataset but is good for time series data.
Use multiple geoms (layering) to have both points and lines.
ggplot(data = patients, mapping = aes(x = Birth, y = BMI)) +
geom_point() +
geom_line()
ggplot2 seems to be doing something clever with dates here.
ggplot(data = patients, mapping = aes(x = Birth, y = BMI, colour = Sex)) +
geom_point() +
geom_line()
ggplot(data = patients) +
geom_histogram(mapping = aes(x = Height))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The message hints that we should choose a more suitable bin width.
ggplot(data = patients) +
geom_histogram(mapping = aes(x = Height), binwidth = 2.5)
Or we can set the number of bins.
ggplot(data = patients) +
geom_histogram(mapping = aes(x = Height), bins = 15)
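To see what binwidth and bins actually control, the computed bins can be inspected with layer_data(). This is a sketch on the bundled mpg data frame, not the patients data, and the particular values of bins and binwidth are arbitrary.

```r
library(ggplot2)

p_bins <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), bins = 15)

p_width <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2.5)

bins_data <- layer_data(p_bins)
width_data <- layer_data(p_width)

# One row per bin, with its boundaries and the number of observations in it
head(bins_data[, c("xmin", "xmax", "count")])

# With binwidth set, every bin spans the same range of x values
unique(round(width_data$xmax - width_data$xmin, 6))

# Every observation is counted in exactly one bin
sum(bins_data$count)
sum(width_data$count)
```

Setting binwidth is often easier to interpret than bins because it is expressed in the units of the variable being plotted.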
These histograms are not aesthetically very pleasing - how about some better aesthetics?
ggplot(data = patients) +
geom_histogram(mapping = aes(x = Height), bins = 15, colour = "firebrick", fill = "grey70")
The help page for geom_histogram talks about an alternative called geom_freqpoly.
When might this be preferred?
ggplot(data = patients) +
geom_freqpoly(mapping = aes(x = BMI, colour = Smokes), bins = 10)
ggplot(data = patients) +
geom_boxplot(mapping = aes(x = Smokes, y = Weight))
See geom_boxplot help to explain how the box and whiskers are constructed.
BMI is perhaps a better thing to compare because the Weight is correlated with Height and Sex.
ggplot(data = patients) +
geom_boxplot(mapping = aes(x = Smokes, y = BMI))
Compare males and females separately.
ggplot(data = patients) +
geom_boxplot(mapping = aes(x = Sex, y = BMI, colour = Smokes))
Can we see the points at the same time?
ggplot(data = patients, mapping = aes(x = Sex, y = BMI)) +
geom_boxplot() +
geom_point()
The geom_point help points to geom_jitter as more suitable when one of the variables is categorical.
ggplot(data = patients, mapping = aes(x = Sex, y = BMI)) +
geom_boxplot() +
geom_jitter()
Reduce the spread/jitter.
ggplot(data = patients, mapping = aes(x = Sex, y = BMI)) +
geom_boxplot() +
geom_jitter(width = 0.25)
A violin plot is a variation on the box plot in which density is shown along the sides of the “violin”.
ggplot(data = patients, mapping = aes(x = Sex, y = BMI)) +
geom_violin()
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, fill = Smokes)) +
geom_violin(trim = FALSE)
Deliberately not covering this here as it is required for a couple of the exercises and we want you to work it out for yourself using the help pages and what you’ve learned so far about the common grammar for all ggplots.
See separate R markdown document.
Let’s build the plot up in stages.
plot <- ggplot(data = patients)
What is the plot object we’ve just created?
Look at the environment pane in RStudio (usually in top right corner). The plot object is a list of 9.
is.list(plot)
## [1] TRUE
It’s actually a special type of list.
class(plot)
## [1] "gg" "ggplot"
What are the 9 things in the list?
names(plot)
## [1] "data" "layers" "scales" "mapping" "theme"
## [6] "coordinates" "facet" "plot_env" "labels"
So far we’ve constructed ggplots from data, layers (geoms) and aesthetic mappings. We’ll be looking at how to customize plots later, including changing the labels, scales and theme.
plot$data
## # A tibble: 100 x 15
## ID Name Sex Smokes Height Weight Birth State Grade Died
## <chr> <chr> <fct> <fct> <dbl> <dbl> <date> <fct> <fct> <lgl>
## 1 AC/A… Mich… Male Non-S… 183. 76.6 1972-02-06 Geor… 2 FALSE
## 2 AC/A… Derek Male Non-S… 179. 80.4 1972-06-15 Colo… 2 FALSE
## 3 AC/A… Todd Male Non-S… 169. 75.5 1972-07-09 New … 2 FALSE
## 4 AC/A… Rona… Male Non-S… 176. 94.5 1972-08-17 Colo… 1 FALSE
## 5 AC/A… Chri… Fema… Non-S… 164. 71.8 1973-06-12 Geor… 2 TRUE
## 6 AC/A… Dana Fema… Smoker 158. 69.9 1973-07-01 Indi… 2 FALSE
## 7 AC/A… Erin Fema… Non-S… 162. 68.8 1972-03-26 New … 1 FALSE
## 8 AC/A… Rach… Fema… Non-S… 166. 70.4 1973-05-11 Colo… 1 FALSE
## 9 AC/A… Rona… Male Non-S… 181. 76.9 1971-12-31 Geor… 1 FALSE
## 10 AC/A… Bryan Male Non-S… 167. 79.1 1973-07-19 New … 2 FALSE
## # … with 90 more rows, and 5 more variables: Score <dbl>,
## # Date.Entered.Study <date>, Age <fct>, BMI <dbl>, Overweight <lgl>
What does our plot look like at this point?
plot
No layers or mapping yet.
plot$mapping
## Aesthetic mapping:
## <empty>
plot$layers
## list()
Let’s add an aesthetic mapping.
plot <- ggplot(data = patients, mapping = aes(x = Height, y = Weight))
plot
ggplot2 has automatically added scales for x and y based on the range of height and weight values. Still nothing plotted as we haven’t yet specified the type of plot (geom) to add as a layer.
plot$mapping
## Aesthetic mapping:
## * `x` -> `Height`
## * `y` -> `Weight`
plot$layers
## list()
plot <- plot + geom_point()
plot
plot$layers
## [[1]]
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
Each geom is associated with a statistical transformation, called a stat in ggplot2. This is an algorithm used to calculate new values for the graph. In this case, we are creating a scatter plot and the x and y values don’t need to be transformed, just plotted on the x and y axes, hence stat_identity which leaves the data unchanged.
But recall the bar charts we created earlier. How did ggplot2 know what height to make each bar? It counted the number of observations for each categorical value (level) using stat_count.
We’ll touch on statistical transformations again a bit later.
A useful feature of ggplot2 is faceting. This allows you to split your plot into subplots, or facets, based on one or more categorical variables. Each subplot displays a subset of the data.
There are two faceting functions, facet_grid and facet_wrap.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Smokes)) +
geom_point()
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_point() +
facet_wrap(~ Smokes)
The ~ Smokes argument provided to facet_wrap is known as a formula in R. Formulae are used to build models in R.
Faceting is perhaps more useful when there are several categories, as seeing the data in separate subplots helps to distinguish between them.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Grade)) +
geom_point()
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Grade)) +
geom_point() +
facet_wrap(~ Grade)
We can tell ggplot2 to fit the subplots onto a single row.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Grade)) +
geom_point() +
facet_wrap(~ Grade, nrow = 1)
Or we can force it to use a specified number of columns.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Grade)) +
geom_point() +
facet_wrap(~ Grade, ncol = 3)
Here you can see that it ‘wraps’ plots across more than one row if necessary.
We can facet on more than one variable.
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_point() +
facet_wrap(~ Sex + Smokes)
Faceting on two variables can also be done as a grid using facet_grid, which arguably produces clearer labelling of the subplots.
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_point() +
facet_grid(Sex ~ Smokes)
Compare this with using a single plot and aesthetics.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Smokes, shape = Sex)) +
geom_point()
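The way the two faceting functions assign observations to panels can be checked on a small example. The data frame below is made up for this sketch (eight invented patients, two in each Sex and Smokes combination); the PANEL column comes from ggplot2's layer_data() helper.

```r
library(ggplot2)

# Toy data, made up for this illustration
toy <- data.frame(
  Height = c(183, 179, 164, 158, 162, 181, 170, 175),
  Weight = c(76.6, 80.4, 71.8, 69.9, 68.8, 76.9, 72.0, 78.3),
  Sex = c("Male", "Male", "Female", "Female", "Female", "Male", "Female", "Male"),
  Smokes = c("Non-Smoker", "Smoker", "Non-Smoker", "Smoker",
             "Non-Smoker", "Non-Smoker", "Smoker", "Smoker")
)

base_plot <- ggplot(data = toy, mapping = aes(x = Height, y = Weight)) +
  geom_point()

p_wrap <- base_plot + facet_wrap(~ Smokes)      # one panel per Smokes value
p_grid <- base_plot + facet_grid(Sex ~ Smokes)  # rows by Sex, columns by Smokes

# PANEL records which subplot each point is drawn in
table(layer_data(p_wrap)$PANEL)
table(layer_data(p_grid)$PANEL)
```

Storing the unfaceted plot in a variable and adding the facet afterwards is a convenient way to compare layouts without repeating the rest of the code.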
It is possible to facet on only one variable, in either the rows or the columns dimension, using facet_grid.
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_point() +
facet_grid(. ~ Sex)
This is the same as using facet_wrap with nrow = 1, so not terribly useful. But if you want the plots to be stacked in a single column with labels down the side then you can do the following.
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_point() +
facet_grid(Sex ~ .)
It is also possible to facet on more than two variables with facet_wrap, although this usually leads to very cluttered plots; it is probably better to stick to one or two variables for faceting.
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_point() +
facet_wrap(~ Sex + Smokes + Age)
We can use other aesthetic mappings for grouping data alongside faceting.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
facet_wrap(~ Smokes)
See separate R markdown document.
ggplot(data = patients, mapping = aes(x = Height, y = Weight)) +
geom_point(mapping = aes(colour = Sex)) +
geom_smooth(method = "lm") +
labs(
title = "Patients from ECSA12 study",
subtitle = "Linear relationship between height and weight",
x = "Height (cm)",
y = "Weight (kg)"
)
Take a look at the x and y scales in the above plot. ggplot2 has chosen the x and y scales and where to put breaks and ticks.
plot <- ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point(mapping = aes(colour = Sex))
names(plot)
## [1] "data" "layers" "scales" "mapping" "theme"
## [6] "coordinates" "facet" "plot_env" "labels"
One of the components of the plot is called scales.
ggplot2 automatically adds default scales behind the scenes equivalent to the following:
plot <- ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
Note that we have three aesthetics and ggplot2 adds a scale for each.
plot$mapping
## Aesthetic mapping:
## * `x` -> `Height`
## * `y` -> `Weight`
## * `colour` -> `Sex`
The x and y variables are continuous hence ggplot2 adds a continuous scale for each. Colour is a discrete variable in this case so ggplot2 adds a discrete scale for colour.
Generalizing, the scales that are required follow the naming scheme:
scale_<NAME_OF_AESTHETIC>_<NAME_OF_SCALE>
Look at the help pages to see what we can change about the y axis.
?scale_y_continuous
We’re going to modify the scales for the following plot.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_y_continuous()
First, we’ll change the breaks, i.e. where ggplot2 puts ticks and numeric labels, on the y axis.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_y_continuous(breaks = seq(60, 100, by = 5))
We can also adjust the extents of the y axis.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_y_continuous(limits = c(60, 100))
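A scale is an object in its own right, so it can be created, inspected and then added to a plot. The sketch below uses the bundled mpg data frame, with breaks and limits chosen to suit its hwy variable rather than the patients' weights; reading the breaks, limits and aesthetics fields directly off the scale object is done here only for illustration.

```r
library(ggplot2)

# Create the scale separately from the plot
y_scale <- scale_y_continuous(
  breaks = seq(10, 50, by = 10),  # positions of ticks and numeric labels
  limits = c(10, 50)              # extent of the axis
)

y_scale$breaks
y_scale$limits
"y" %in% y_scale$aesthetics

# Add the stored scale to a plot like any other component
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  y_scale
p
```

Note that, by default, observations falling outside the limits of a continuous scale are treated as missing and dropped from the plot with a warning, so limits narrower than the data remove points rather than just zooming in.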
We can change the minor breaks, e.g. to add more lines that act as guides. These are shown as thin white lines when using the default theme (we’ll take a look at alternative themes a bit later).
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_y_continuous(limits = c(60, 100), minor_breaks = seq(60, 100, by = 1))
Or we can remove the minor breaks entirely.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_y_continuous(limits = c(60, 100), minor_breaks = NULL)
Maybe we don’t want any breaks at all.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_y_continuous(breaks = NULL)
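By default continuous scales are expanded by 5% of the range on either side. We can add more space round the edges. Note that from ggplot2 3.3.0 onwards expand_scale is deprecated in favour of expansion, which takes the same mult and add arguments.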
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_x_continuous(expand = expand_scale(mult = 0.1)) +
scale_y_continuous(expand = expand_scale(mult = 0.1))
We can draw the axis on the other side of the plot – not sure why you’d want to do this but with ggplot2 just about anything is possible.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_x_continuous(position = "top")
Use scale_colour_manual to manually set the colours for a categorical variable.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_colour_manual(values = c("green", "purple"))
ggplot2 provides a set of colour palettes from ColorBrewer.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = State)) +
geom_point() +
scale_colour_brewer(palette = "Set1")
Look at the help page for scale_colour_brewer to see what other colour palettes are available, then try some.
?scale_colour_brewer
Interestingly, you can set attributes other than the colours at the same time, such as the legend title and labels.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_colour_brewer(palette = "Set1", name = NULL, labels = c("women", "men"))
What you are doing here is setting your own set of mappings from levels in the data to aesthetic values.
You can select the colours used in a gradient for a continuous variable.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = BMI)) +
geom_point() +
scale_colour_gradient(low = "black", high = "blue")
In some cases it might make sense to specify two colour gradients either side of a mid-point.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = BMI)) +
geom_point() +
scale_colour_gradient2(low = "red", mid = "grey60", high = "blue", midpoint = median(patients$BMI))
As before we can override the default labels and other aspects of the BMI scale within the scale function.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = BMI)) +
geom_point() +
scale_colour_gradient(
low = "lightblue", high = "darkblue",
name = "Body mass index",
breaks = seq(20, 30, by = 2.5)
)
Quite often we want to reorder the categories. For example, we might want males to be displayed on the left hand side in the following plot.
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
geom_boxplot() +
geom_jitter(width = 0.25, alpha = 0.4) +
scale_colour_brewer(palette = "Set1")
Categorical variables are treated as special types of variables in R known as factors. A factor has levels, e.g. Female and Male for the Sex variable in the patients dataset.
ggplot2 displays categories in the order given by the levels of the factor. If a character variable is used where a categorical variable is expected, ggplot2 temporarily converts it into a factor in order to create the plot, and the categories appear in alphabetical order.
We can reorder the categories by explicitly creating a factor with the levels set in our preferred order.
patients$Sex
## [1] Male Male Male Male Female Female Female Female Male Male
## [11] Female Female Male Female Female Male Male Male Male Male
## [21] Male Female Male Female Female Male Male Male Male Male
## [31] Female Male Male Female Male Female Male Female Male Female
## [41] Male Female Female Female Female Female Female Female Male Female
## [51] Female Female Female Female Male Female Female Female Male Male
## [61] Female Male Female Male Male Female Male Female Male Female
## [71] Female Male Male Female Male Female Male Male Female Female
## [81] Female Female Male Female Female Female Female Male Female Male
## [91] Male Male Female Male Female Female Female Female Female Female
## Levels: Female Male
patients$Sex <- factor(patients$Sex, levels = c("Male", "Female"))
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
geom_boxplot() +
geom_jitter(width = 0.25, alpha = 0.4) +
scale_colour_brewer(palette = "Set1")
The legend in this plot is not really necessary. We’ll see how to remove this in the next section on themes.
For plots where a legend is required, changing the order of levels in a factor will also change the ordering in the legend.
ggplot(data = patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_colour_brewer(palette = "Set1")
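The reordering of levels can be examined on its own, without a plot. This is a minimal base R sketch using a made-up vector rather than the patients data.

```r
# Made-up values for this illustration
sex <- factor(c("Male", "Female", "Female", "Male"))
levels(sex)            # default order is alphabetical

sex_reordered <- factor(sex, levels = c("Male", "Female"))
levels(sex_reordered)  # order now follows the levels argument

# The values themselves are unchanged; only the ordering of the levels differs
identical(as.character(sex), as.character(sex_reordered))
```

Any value not listed in the levels argument would become NA, so it is worth checking that every category has been included.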
Themes can be used to customize non-data elements of your plot.
The black and white theme produces plots that are more suitable for articles that will be printed.
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
geom_boxplot() +
geom_jitter(width = 0.25, alpha = 0.4) +
scale_colour_brewer(palette = "Set1") +
theme_bw()
What other themes are there? Use RStudio auto-complete to see, try some of these and see how they look. The ggthemes package provides a few more.
library(ggthemes)
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
geom_boxplot() +
geom_jitter(width = 0.25, alpha = 0.4) +
scale_colour_brewer(palette = "Set1") +
theme_gdocs()
The theme function can be used to modify individual components of a theme.
For example, the following will turn off the legend.
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
geom_boxplot() +
geom_jitter(width = 0.25, alpha = 0.4) +
scale_colour_brewer(palette = "Set1") +
theme_minimal() +
theme(legend.position = "none")
Or we could place the legend somewhere else.
ggplot(data = patients, mapping = aes(x = Sex, y = BMI, colour = Sex)) +
geom_boxplot() +
geom_jitter(width = 0.25, alpha = 0.4) +
scale_colour_brewer(palette = "Set1") +
theme_minimal() +
theme(legend.position = "bottom")
Use the help page for the theme function to see what else can be changed (quite a lot!).
?theme
ggplot(patients, mapping = aes(x = Height, y = Weight, colour = Sex)) +
geom_point() +
scale_colour_brewer(palette = "Set1") +
theme_bw() +
theme(
axis.title = element_text(size = 10, colour = "darkblue"),
axis.text = element_text(size = 8),
legend.title = element_blank()
)
Many graphs, like scatterplots, plot the raw values from your dataset; other graphs, like bar charts, calculate new values to plot.
Bar charts, histograms and frequency polygons (geom_bar, geom_histogram, geom_freqpoly) bin your data and then plot bin counts.
Smoothers (geom_smooth) fit a model to your data and then plot predictions from the model.
Boxplots (geom_boxplot) compute a robust summary of the distribution and display a specially formatted box.
The algorithm used to calculate new values for a graph is called a stat.
Take a look at the help page for geom_bar.
?geom_bar
Note that it uses a “count” stat.
Sometimes you might want to override a stat, e.g. when you want to display a bar chart but already have counts.
state_counts <- count(patients, State)
state_counts
## # A tibble: 6 x 2
## State n
## <fct> <int>
## 1 California 10
## 2 Colorado 15
## 3 Georgia 16
## 4 Indiana 18
## 5 New Jersey 20
## 6 New York 21
ggplot(data = state_counts, mapping = aes(x = State, y = n)) +
geom_bar(stat = "identity")
Compare this with the following bar chart for which we have observations for each patient rather than a summary of the numbers of each state.
ggplot(data = patients, mapping = aes(x = State)) +
geom_bar(stat = "count")
By default geom_bar applies the statistical transformation, stat_count, in much the same way as we did above to create the state_counts summary table.
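That the two approaches give the same bar heights can be verified directly. The sketch below uses the bundled mpg data frame and builds the summary table with base R; the names class_counts and n are chosen for this example only.

```r
library(ggplot2)

# Pre-computed counts of cars per class, built with base R
class_counts <- as.data.frame(table(class = mpg$class), responseName = "n")

# stat = "count" (the default): ggplot2 counts the rows itself
p_count <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()

# stat = "identity": the heights are taken directly from the n column
p_identity <- ggplot(data = class_counts, mapping = aes(x = class, y = n)) +
  geom_bar(stat = "identity")

# Bar heights computed for each plot
layer_data(p_count)$count
layer_data(p_identity)$y
```

ggplot2 also provides geom_col(), which is equivalent to geom_bar(stat = "identity") and can make the intent clearer when the counts are already available.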
Earlier we created the following stacked bar chart by setting the fill aesthetic in a geom_bar to a categorical variable. Let’s revisit this.
ggplot(data = patients, mapping = aes(x = Age, fill = Sex)) +
geom_bar()
We can control the way stacking is performed by specifying a position adjustment using the position argument.
For example, to place the bars for the separate categories side-by-side we can use position = "dodge".
ggplot(data = patients, mapping = aes(x = Age, fill = Sex)) +
geom_bar(position = "dodge")
Try some of the other values for the position adjustment, e.g. "fill" and "identity".
Another type of position adjustment is "jitter". This is much more useful for scatter plots than for bar charts. We already used geom_jitter to separate points added to a box plot and using position = "jitter" in geom_point() is similar.
position = "jitter" adds a small amount of random noise to each point and is especially useful for data sets that contain multiple observations with the same x and y values, e.g. where these are discrete. The patients dataset isn’t really a good example of where this would be needed so let’s take a look at the mpg data frame that ggplot2 provides as a test dataset.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()
The mpg data frame has 234 observations but there are clearly not 234 points on this scatter plot; for some combinations of displ and hwy there is more than one observation.
position = "jitter" separates the overlapping points.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(position = "jitter")
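The amount of overlap, and the effect of jittering, can be checked numerically. This sketch assumes a version of ggplot2 in which position_jitter() accepts a seed argument (3.0.0 or later, to the best of my knowledge); the width, height and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# How many rows repeat a displ/hwy combination that has already been seen?
sum(duplicated(mpg[, c("displ", "hwy")]))

# position_jitter() is the function behind position = "jitter"; its arguments
# control the amount of noise, and a seed makes the result reproducible
p_jitter <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.5, seed = 1))

# The x positions that will be drawn now differ from the recorded values
jittered <- layer_data(p_jitter)
summary(jittered$x)
summary(mpg$displ)
```

Because jittering changes where points are drawn, it is best kept small relative to the spacing of the data so that the plot does not misrepresent the values.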
Use ggsave to save the last plot you displayed.
ggsave("myplot.png")
Alternatively, ggsave can be passed an explicit plot object.
plot <- ggplot(data = patients, mapping = aes(x = Smokes, y = BMI, colour = Smokes)) +
geom_boxplot() +
geom_jitter(alpha = 0.4, width = 0.25) +
facet_wrap(~ Sex) +
scale_colour_brewer(palette = "Set1") +
theme(legend.position = "none")
ggsave("myplot.png", plot)
The type of plot file, e.g. pdf, svg, png, etc., and the dimensions can be specified, e.g.
ggsave("myplot.pdf", width = 8, height = 6)
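As a self-contained check that a file is really written, the sketch below saves a plot of the bundled mpg data to a temporary PDF. The use of tempfile() and the choice of inches as units are assumptions made for this example.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# The file extension determines the output format; tempfile() keeps the
# example from writing into the working directory
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 8, height = 6, units = "in")

file.exists(out_file)
```

Passing the plot explicitly, rather than relying on the last plot displayed, makes scripts more predictable when several plots are created in a row.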
See separate R markdown document.