library(tidyverse)
# library(tidylog)
18 R Tidyverse Exercise
18.1 Load Libraries
Load the tidyverse
packages
18.2 Untidy data
Letβs use the World Health Organization TB data set from the tidyr
package
<- tidyr::who
who dim(who)
[1] 7240 60
head(who[,1:6] %>% filter(!is.na(new_sp_m014)))
# A tibble: 6 Γ 6
country iso2 iso3 year new_sp_m014 new_sp_m1524
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AF AFG 1997 0 10
2 Afghanistan AF AFG 1998 30 129
3 Afghanistan AF AFG 1999 8 55
4 Afghanistan AF AFG 2000 52 228
5 Afghanistan AF AFG 2001 129 379
6 Afghanistan AF AFG 2002 90 476
See the help page for who
for more information about this data set.
In particular, note this description:
βThe data uses the original codes given by the World Health Organization. The column names for columns five through 60 are made by combining new_ to a code for method of diagnosis (rel = relapse, sn = negative pulmonary smear, sp = positive pulmonary smear, ep = extrapulmonary) to a code for gender (f = female, m = male) to a code for age group (014 = 0-14 yrs of age, 1524 = 15-24 years of age, 2534 = 25 to 34 years of age, 3544 = 35 to 44 years of age, 4554 = 45 to 54 years of age, 5564 = 55 to 64 years of age, 65 = 65 years of age or older).β
So new_sp_m014
represents the counts of new TB cases detected by a positive pulmonary smear in males in the 0-14 age group.
18.3 Tidy data
Tidy data: Have each variable in a column.
Question: Are these data tidy?
No these data are not tidy because aspects of the data that should be variables are encoded in the name of the variables.
These aspects are
- test type.
- sex of the subjects.
- age range of the subjects.
Question: How would we make these data tidy?
Consider this portion of the data:
head(who[,1:5] %>% filter(!is.na(new_sp_m014) & new_sp_m014>0), 1)
# A tibble: 1 Γ 5
country iso2 iso3 year new_sp_m014
<chr> <chr> <chr> <dbl> <dbl>
1 Afghanistan AF AFG 1998 30
We would replace the new_sp_m014
with the following four columns:
type sex age n
sp m 014 30
This would place each variable in its own column.
18.4 Gather
<- tibble(
stocks time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
head(stocks)
# A tibble: 6 Γ 4
time X Y Z
<date> <dbl> <dbl> <dbl>
1 2009-01-01 -0.643 1.33 -4.22
2 2009-01-02 -0.284 0.465 7.19
3 2009-01-03 0.578 0.939 3.59
4 2009-01-04 0.529 2.73 0.0280
5 2009-01-05 0.197 -1.46 -4.24
6 2009-01-06 0.0254 0.241 4.87
%>% gather("stock", "price", -time) %>% head() stocks
# A tibble: 6 Γ 3
time stock price
<date> <chr> <dbl>
1 2009-01-01 X -0.643
2 2009-01-02 X -0.284
3 2009-01-03 X 0.578
4 2009-01-04 X 0.529
5 2009-01-05 X 0.197
6 2009-01-06 X 0.0254
18.5 Pivot_longer
%>% pivot_longer(c(X,Y,Z), names_to= "stock", values_to = "price") %>%
stocks head()
# A tibble: 6 Γ 3
time stock price
<date> <chr> <dbl>
1 2009-01-01 X -0.643
2 2009-01-01 Y 1.33
3 2009-01-01 Z -4.22
4 2009-01-02 X -0.284
5 2009-01-02 Y 0.465
6 2009-01-02 Z 7.19
18.6 WHO TB data
Question: How would we convert this to tidy form?
head(who[,1:6] %>% filter(!is.na(new_sp_m014)))
# A tibble: 6 Γ 6
country iso2 iso3 year new_sp_m014 new_sp_m1524
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AF AFG 1997 0 10
2 Afghanistan AF AFG 1998 30 129
3 Afghanistan AF AFG 1999 8 55
4 Afghanistan AF AFG 2000 52 228
5 Afghanistan AF AFG 2001 129 379
6 Afghanistan AF AFG 2002 90 476
<- who %>% pivot_longer(starts_with("new"), names_to = "demo", values_to = "n") %>% filter(!is.na(n))
who.long head(who.long)
# A tibble: 6 Γ 6
country iso2 iso3 year demo n
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
Question: How would we split demo
into variables?
head(who.long)
# A tibble: 6 Γ 6
country iso2 iso3 year demo n
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
Look at the variable naming scheme:
names(who) %>% grep("m014",., value=TRUE)
[1] "new_sp_m014" "new_sn_m014" "new_ep_m014" "newrel_m014"
Question: How should we adjust the demo
strings so as to be able to easily split all of them into the desired variables?
<- who.long %>%
who.long mutate(demo = str_replace(demo, "newrel", "new_rel"))
grep("m014",who.long$demo, value=TRUE) %>% unique()
[1] "new_sp_m014" "new_sn_m014" "new_ep_m014" "new_rel_m014"
Question: After adjusting the demo
strings, how would we then separate them into the desired variables?
Hint: Use separate_wider_position()
and separate_wider_delim()
.
<- who.long %>%
who.long.v1 separate(demo, into = c("new", "type", "sexagerange"), sep="_") %>%
separate(sexagerange, into=c("sex","age_range"), sep=1) %>%
select(-new)
head(who.long.v1)
# A tibble: 6 Γ 8
country iso2 iso3 year type sex age_range n
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan AF AFG 1997 sp m 014 0
2 Afghanistan AF AFG 1997 sp m 1524 10
3 Afghanistan AF AFG 1997 sp m 2534 6
4 Afghanistan AF AFG 1997 sp m 3544 3
5 Afghanistan AF AFG 1997 sp m 4554 5
6 Afghanistan AF AFG 1997 sp m 5564 2
Note that separate()
has been superseded in favour of separate_wider_position()
and separate_wider_delim()
. So here we use those two functions instead of separate()
:
<- who.long %>%
who.long.v2 separate_wider_delim(demo, names = c("new", "type", "sexagerange"), delim="_") %>%
separate_wider_position(sexagerange, widths=c("sex"=1,"age_range"=4), too_few="align_start" ) %>%
select(-new)
head(who.long.v2)
# A tibble: 6 Γ 8
country iso2 iso3 year type sex age_range n
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan AF AFG 1997 sp m 014 0
2 Afghanistan AF AFG 1997 sp m 1524 10
3 Afghanistan AF AFG 1997 sp m 2534 6
4 Afghanistan AF AFG 1997 sp m 3544 3
5 Afghanistan AF AFG 1997 sp m 4554 5
6 Afghanistan AF AFG 1997 sp m 5564 2
18.7 Conclusion
Now our untidy data are tidy.
head(who.long)
# A tibble: 6 Γ 6
country iso2 iso3 year demo n
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
18.8 Acknowledgment
This exercise was modeled, in part, on this exercise:
https://people.duke.edu/\~ccc14/cfar-data-workshop-2018/CFAR_R_Workshop_2018_Exercisees.html