library(tidyverse)
library(ggforce)
# library(tidylog)
# Set the default font to be a bit larger:
theme_set(theme_gray(base_size = 18))
21 R Graphics Exercise
21.1 The Grammer of Graphics
This slide set by Garrick Aden-Buie provides a nice introduction to plotting with ggplot2:
A Gentle Guide to the Grammer of Graphics with ggplot2
21.2 Load Libraries
21.3 Exercise 1
Read in and set up the data set b
, a cleaned version of our simulated data set:
<- read.csv("data/study1.csv")
a $ind <- seq_along(a$t)
a<- a[-c(1001:1004),]
b $g.f <- factor(b$g)
b$geno <- paste(b$all1,b$all2,sep="/") b
Using ggplot and data set b
, plot ind
vs. t
, coloring by case-control status (trait
). What do you observe about the data?
ggplot(data=b, aes(x=ind, y=t, color=trait)) +
geom_point()
21.4 Exercise 2
Using ggplot, plot ind
vs. t
, coloring by case-control status (trait
) and faceting by geno
. What do you observe about the data?
ggplot(data=b, aes(x=ind, y=t, color=trait)) +
geom_point() +
facet_grid(~ geno)
21.5 Facetting
Facetting can be done using the classic interface, where formula notation is used to indicate rows (before the ~
) and columns (after the ~
). According to the facet_grid
documentation
“the dot in the formula is used to indicate there should be no faceting on this dimension (either row or column)”
So this will facet in columns by geno
:
facet_grid(~ geno)
This will facet in rows by geno
:
facet_grid(geno ~ .)
The current recommended notation for facetting instead clearly names the rows and columns that you’d like to facet on.
This will facet in columns by geno
:
facet_grid(cols = vars(geno))
This will facet in rows by geno
:
facet_grid(rows = vars(geno))
Try these various ways of facetting out.
<- ggplot(data=b, aes(x=ind, y=t, color=trait)) +
g geom_point()
+
g facet_grid(. ~ geno)
+ facet_grid(geno ~ .) g
+ facet_grid(cols = vars(geno)) g
+ facet_grid(rows = vars(geno)) g
21.6 Always plot your data
library(tidyverse)
<- read_tsv("data/example.tsv") d
New names:
Rows: 142 Columns: 26
── Column specification
──────────────────────────────────────────────────────── Delimiter: "\t" dbl
(26): x...1, y...2, x...3, y...4, x...5, y...6, x...7, y...8, x...9, y.....
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `x` -> `x...1`
• `y` -> `y...2`
• `x` -> `x...3`
• `y` -> `y...4`
• `x` -> `x...5`
• `y` -> `y...6`
• `x` -> `x...7`
• `y` -> `y...8`
• `x` -> `x...9`
• `y` -> `y...10`
• `x` -> `x...11`
• `y` -> `y...12`
• `x` -> `x...13`
• `y` -> `y...14`
• `x` -> `x...15`
• `y` -> `y...16`
• `x` -> `x...17`
• `y` -> `y...18`
• `x` -> `x...19`
• `y` -> `y...20`
• `x` -> `x...21`
• `y` -> `y...22`
• `x` -> `x...23`
• `y` -> `y...24`
• `x` -> `x...25`
• `y` -> `y...26`
<- rep(c("x","y"), 13)
n1 <- c("","",rep("_",24))
n2 <- c("", "", c(sort(rep(c(1:12), 2))))
n3 names(d) <- paste0(n1,n2,n3)
names(d)
[1] "x" "y" "x_1" "y_1" "x_2" "y_2" "x_3" "y_3" "x_4" "y_4"
[11] "x_5" "y_5" "x_6" "y_6" "x_7" "y_7" "x_8" "y_8" "x_9" "y_9"
[21] "x_10" "y_10" "x_11" "y_11" "x_12" "y_12"
21.7 Similar regression lines
These three data sets have very similar regression lines:
summary(lm(x ~ y, data=d)) %>% coef()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.17563819 2.87986960 19.5063131 9.435087e-42
y -0.03991951 0.05250204 -0.7603419 4.483288e-01
summary(lm(x_1 ~ y_1, data=d)) %>% coef()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.31108156 2.87906158 19.5588319 7.158847e-42
y_1 -0.04269949 0.05249244 -0.8134407 4.173467e-01
summary(lm(x_3 ~ y_3, data=d)) %>% coef()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.18271411 2.87924135 19.5130270 9.107718e-42
y_3 -0.04012859 0.05249468 -0.7644316 4.458966e-01
ggplot(d,aes(x=x,y=y)) + geom_point() +
geom_smooth(method="lm") + ggtitle("Linear regression of y ~ x")
Now try this:
ggplot(d,aes(x=x_1,y=y_1)) + geom_point() +
geom_smooth(method="lm")
ggplot(d,aes(x=x_1,y=y_1)) + geom_point() +
geom_smooth(method="lm") + ggtitle("Linear regression of y_1 ~ x_1")
`geom_smooth()` using formula = 'y ~ x'
And now try this:
ggplot(d,aes(x=x_3,y=y_3)) + geom_point() +
geom_smooth(method="lm")
21.7.1 Always plot your data!
ggplot(d,aes(x=x_3,y=y_3)) + geom_point() +
geom_smooth(method="lm") + ggtitle("Linear regression of y_3 ~ x_3")
`geom_smooth()` using formula = 'y ~ x'
21.8 Always plot your data
<- read_tsv("data/BoxPlots.tsv")
f # Delete the first column
<- f[,-1]
f head(f)
# A tibble: 6 × 5
left lines normal right split
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -9.77 -9.77 -9.76 -9.76 -9.77
2 -9.76 -9.74 -9.72 -9.05 -9.77
3 -9.75 -9.77 -9.68 -8.51 -9.77
4 -9.77 -9.77 -9.64 -8.24 -9.77
5 -9.76 -9.77 -9.6 -8.82 -9.77
6 -9.77 -9.76 -9.56 -8.07 -9.76
Stacking vectors concatenates multiple vectors into a single vector along with a factor indicating where each observation originated.
head(stack(f),2)
values ind
1 -9.769107 left
2 -9.763145 left
Now try this:
ggplot(stack(f), aes(x = ind, y = values)) +
geom_boxplot()
21.9 Identical box plots
These data have essentially identical box plots.
21.10 Boxplots
While the box plots are identical, box plots may not tell the whole story.
Let’s try violin plots instead:
ggplot(stack(f), aes(x = ind, y = values)) +
geom_violin()
A violin plot is a mirrored density plot.
21.11 Non-identical violin plots
21.12 Sina plots
Sidiropoulos N, Sohi SH, Pedersen TL, Porse BT, Winther O, Rapin N, Bagger FO. SinaPlot: An Enhanced Chart for Simple and Truthful Representation of Single Observations Over Multiple Classes. Journal of Computational and Graphical Statistics. Taylor & Francis; 2018 Jul 3;27(3):673–676. DOI: https://doi.org/10.1080/10618600.2017.1366914
library(ggforce)
ggplot(stack(f), aes(x = ind, y = values)) +
geom_violin() + geom_sina()
21.13 Sina plots
21.14 Sina plots
method == "counts"
: The borders are defined by the number of samples that occupy the same bin.
ggplot(stack(f), aes(x = ind, y = values)) +
geom_violin() + geom_sina(method="count")
21.15 Sina plots
21.16 Raincloud plots
Raincloud plots can be created by using the geom_rain
geometry from the ggrain
R package.
“These”raincloud plots” can visualize raw data, probability density, and key summary statistics such as median, mean, and relevant confidence intervals in an appealing and flexible format with minimal redundancy.”
Allen M, Poggiali D, Whitaker K, Marshall TR, Van Langen J, Kievit RA. Raincloud plots: a multi-platform tool for robust data visualization. Wellcome Open Res. 2021 Jan 21;4:63. PMID: 31069261 PMCID: PMC6480976 DOI: https://doi.org/10.12688/wellcomeopenres.15191.2
https://github.com/RainCloudPlots/RainCloudPlots
Try it out.
library(ggrain)
Registered S3 methods overwritten by 'ggpp':
method from
heightDetails.titleGrob ggplot2
widthDetails.titleGrob ggplot2
ggplot(stack(f), aes(x = ind, y = values)) + geom_rain()
21.17 Drawing multiple graphs
Sometimes we’d like to draw multiple plots, looping across variables. Doing this within an R Markdown or Quarto Markdown document using ggplot2
is tricky. See https://dplyr.tidyverse.org/articles/programming.html and https://r4ds.hadley.nz/functions.html#plot-functions for details.
Here’s one way to do this - this example code will generate two scatter plots:
<- c("x","x_1")
x.names <- c("y", "y_1")
y.names for (i in 1:2) {
<- sym(x.names[i])
x.nam <- sym(y.names[i])
y.nam print(ggplot(data=d, aes(x = {{ x.nam }},
y = {{ y.nam }})) +
geom_point())
}
21.18 Writing ggplot functions
See https://r4ds.hadley.nz/functions.html#plot-functions
library(palmerpenguins)
<- function(fill, ...) {
PlDensity ggplot(penguins %>% filter(!is.na(bill_length_mm)),
aes(bill_length_mm, fill = {{ fill }})) +
geom_density(alpha = 0.5) +
facet_wrap(vars(...))
}
Example from: https://twitter.com/yutannihilat_en/status/1574387230025875457?s=20&t=FLbwErwEKQKWtKIGufDLIQ
PlDensity(species)
PlDensity(island, sex) %>% print() %>% suppressWarnings()
PlDensity(sex, island, year) %>% print() %>% suppressWarnings()
21.19 Exercise 3
Consider this example code:
<- function(df, var, binwidth) {
histogram |>
df ggplot(aes({{ var }})) +
geom_histogram(binwidth = binwidth)
}
From: https://twitter.com/hadleywickham/status/1574373127349575680?s=20&t=FLbwErwEKQKWtKIGufDLIQ
When applied to the quantitative trait t
from the data frame b
, this generates this histogram:
histogram(b, t, 0.1)
21.19.1 Exercise
After reading the example above, extend the histogram
function to allow facetting and use it to draw a histogram of the quantitative trait t
facetted by geno
using the data set b
that we set up above.
- See https://r4ds.hadley.nz/functions.html#plot-functions
- Use the
{ var }
approach
Hadley Wickham states:
You have to use the vars() syntax
foo <- function(x) {
ggplot(mtcars) +
aes(x = mpg, y = disp) +
geom_point() +
facet_wrap(vars({{ x }}))
}
Tweet: https://twitter.com/hadleywickham/status/1574380137524887554?s=20&t=FLbwErwEKQKWtKIGufDLIQ
<- function(df, var, binwidth, grp) {
histogram |>
df ggplot(aes({{ var }})) +
geom_histogram(binwidth = binwidth) +
facet_wrap(vars({{ grp }}))
}
histogram(b, t, 0.1)
histogram(b, t, 0.1, geno)
21.20 Source of data
Illustrative data sets from https://www.research.autodesk.com/publications/same-stats-different-graphs/