10 R Character Exercise

10.1 Load Libraries

library(tidyverse)
# library(tidylog)
library(knitr)

10.2 Useful RStudio cheatsheet

See the “String manipulation with stringr cheatsheet” at

https://rstudio.github.io/cheatsheets/html/strings.html

10.3 Scenario 1

You are working with three different sets of collaborators: 1) the clinical group that did the field work and generated the anthropometric measurements; 2) the medical laboratory that measured blood pressure in a controlled environment; and 3) the molecular laboratory that generated the genotypes.

clin <- read.table(file = "data/clinical_data.txt", header=TRUE)
kable(clin)
ID height
1 152
104 172
2112 180
2543 163
lab <- read.table(file = "data/lab_data.txt", header = TRUE)
kable(lab)
ID SBP
SG0001 120
SG0104 111
SG2112 125
SG2543 119
geno <- read.table(file = "data/genotype_data.txt", header = TRUE)
kable(geno)
Sample rs1212
TaqMan-SG0001-190601 G/C
TaqMan-SG0104-190602 G/G
TaqMan-SG2112-190603 C/C
TaqMan-Sg2543-190603 C/G

10.4 Discussion Questions

10.4.1 Question 1

The clinical group, which measured height, used integer IDs. The medical laboratory, which measured blood pressure, instead zero-padded each integer ID to four digits and prefixed it with the string ‘SG’ (to distinguish these IDs from those of other studies that also used integer IDs). So ID ‘1’ was mapped to ID ‘SG0001’.

The clin data frame
ID height
1 152
104 172
2112 180
2543 163

Discuss how, using R commands, you would reformat the integer IDs to be in the format “SGXXXX”. Write down your ideas in the next section, and, if you have time, try them out within an R chunk.

Hint: Use the formatC function.

10.4.1.1 Interactive WebR chunk

You can interactively run R within this WebR chunk by clicking the Run code tab. Note that this is a limited version of R which runs within your web browser.

Note

This Run code WebR chunk needs to be run first, before the later ones, as it downloads and reads in the required data files. The WebR chunks should be run in order, as you encounter them, from beginning to end.

10.4.2 Answer 1

# Zero-pad each ID to width 4 (flag = "0"), then prefix it with "SG"
clin$SUBJECT_ID <- paste0("SG", formatC(clin$ID, width = 4, flag = "0"))
kable(clin)
ID height SUBJECT_ID
1 152 SG0001
104 172 SG0104
2112 180 SG2112
2543 163 SG2543
# Alternatively, zero-pad to width 6 and use 'sub' to replace the two leading zeros with "SG":
sub("^00", "SG", formatC(clin$ID, flag = "0", width = 6))
[1] "SG0001" "SG0104" "SG2112" "SG2543"
# Or this can be done using `case_when`, with one condition per number of digits:
case_when(
  clin$ID < 10 ~ paste0("SG000",clin$ID),
  clin$ID < 100 ~ paste0("SG00",clin$ID),
  clin$ID < 1000 ~ paste0("SG0",clin$ID),
  clin$ID < 10000 ~ paste0("SG",clin$ID)
)
[1] "SG0001" "SG0104" "SG2112" "SG2543"
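
As a further alternative to the formatC approach in this answer, base R's sprintf can do the zero-padding and the prefixing in a single call. This is a minimal sketch, assuming the IDs are whole numbers of at most four digits; the clin data frame is rebuilt inline here as a stand-in for reading data/clinical_data.txt.

```r
# Rebuild the clinical data inline (stand-in for data/clinical_data.txt)
clin <- data.frame(ID = c(1L, 104L, 2112L, 2543L),
                   height = c(152, 172, 180, 163))

# "%04d" pads the integer with leading zeros to a width of 4;
# the literal "SG" in the format string supplies the prefix.
clin$SUBJECT_ID <- sprintf("SG%04d", clin$ID)
clin$SUBJECT_ID
# [1] "SG0001" "SG0104" "SG2112" "SG2543"
```

Note that sprintf does not truncate: an ID with more than four digits would be written out at its full width, so a range check on clin$ID may be worth adding if that could occur.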

10.4.3 Question 2

Discuss how, using R commands, you would reformat the “SGXXXX” IDs to be integer IDs. Write down your ideas in the next section, and, if you have time, try them out within an R chunk.

The lab data frame
ID SBP
SG0001 120
SG0104 111
SG2112 125
SG2543 119

Hint: Use either the gsub command or the str_replace_all command from the stringr package.

Warning

To read in and load the data within the WebR environment, be sure to run all of the WebR chunks in order. For example, to usefully run R code in this WebR chunk here, you first need to run the WebR chunk above in Question 1.

10.4.4 Answer 2

lab$ID2 <- as.numeric(gsub("SG","",lab$ID))
kable(lab)
ID SBP ID2
SG0001 120 1
SG0104 111 104
SG2112 125 2112
SG2543 119 2543
# Alternatively, using str_replace_all from stringr (ID2 is reset first)
lab$ID2 <- NA
lab$ID2 <- str_replace_all(lab$ID, pattern = "SG", replacement = "") %>% as.numeric()
kable(lab)
ID SBP ID2
SG0001 120 1
SG0104 111 104
SG2112 125 2112
SG2543 119 2543
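
Both replacements above remove "SG" wherever it occurs in the string. A slightly stricter variant anchors the pattern to the start of the ID and checks that every ID has the expected shape before converting. This is a sketch only, assuming that a valid ID is exactly "SG" followed by four digits; the lab data frame is rebuilt inline as a stand-in for data/lab_data.txt.

```r
# Rebuild the lab data inline (stand-in for data/lab_data.txt)
lab <- data.frame(ID = c("SG0001", "SG0104", "SG2112", "SG2543"),
                  SBP = c(120, 111, 125, 119))

# Assumed shape: "SG" followed by exactly four digits
ok <- grepl("^SG[0-9]{4}$", lab$ID)
stopifnot(all(ok))  # fail early if any ID is malformed

# Remove only a leading "SG", then convert to integer (drops the zero padding)
lab$ID2 <- as.integer(sub("^SG", "", lab$ID))
lab$ID2
# [1]    1  104 2112 2543
```

Checking the format first means that a malformed ID stops the script rather than silently becoming NA during the conversion.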

10.4.5 Question 3

The genotype group used IDs in the style “TaqMan-SG0001-190601”, where the first field is “TaqMan”, the middle field is the “SGXXXX” ID, and the last field is the date of the genotyping experiment.

Discuss how, using R commands, you would extract an “SGXXXX” style ID from the “TaqMan-SG0001-190601” style IDs. Write down your ideas in the next section, and, if you have time, try them out within an R chunk.

Note that one of the IDs has a lower-case ‘g’ in it. How would you correct this using R commands?

The geno data frame
Sample rs1212
TaqMan-SG0001-190601 G/C
TaqMan-SG0104-190602 G/G
TaqMan-SG2112-190603 C/C
TaqMan-Sg2543-190603 C/G

Hint: Use either the str_split_fixed function from the stringr package or the separate function from the tidyr package.

10.4.6 Answer 3

# Split each sample name at the hyphens into a 3-column character matrix
a <- str_split_fixed(geno$Sample, pattern = "-", n = 3)
a
     [,1]     [,2]     [,3]    
[1,] "TaqMan" "SG0001" "190601"
[2,] "TaqMan" "SG0104" "190602"
[3,] "TaqMan" "SG2112" "190603"
[4,] "TaqMan" "Sg2543" "190603"
# The second column holds the ID; upper-case it to correct "Sg2543"
geno$ID <- toupper(a[, 2])
kable(geno)
Sample rs1212 ID
TaqMan-SG0001-190601 G/C SG0001
TaqMan-SG0104-190602 G/G SG0104
TaqMan-SG2112-190603 C/C SG2112
TaqMan-Sg2543-190603 C/G SG2543

The separate function from the tidyr package is also useful:

geno %>% 
  separate(Sample, into=c("Tech","ID2","Suffix"), sep="-") %>% 
  mutate(ID2=toupper(ID2))
    Tech    ID2 Suffix rs1212     ID
1 TaqMan SG0001 190601    G/C SG0001
2 TaqMan SG0104 190602    G/G SG0104
3 TaqMan SG2112 190603    C/C SG2112
4 TaqMan SG2543 190603    C/G SG2543

The separate function has been superseded in favor of separate_wider_delim and separate_wider_position. Here the fields are separated by a delimiter (the hyphen), so separate_wider_delim is the applicable one.

geno %>% 
  separate_wider_delim(cols=Sample, delim = "-", names=c("Tech","ID2","Suffix")) %>% 
  mutate(ID2=toupper(ID2))
# A tibble: 4 × 5
  Tech   ID2    Suffix rs1212 ID    
  <chr>  <chr>  <chr>  <chr>  <chr> 
1 TaqMan SG0001 190601 G/C    SG0001
2 TaqMan SG0104 190602 G/G    SG0104
3 TaqMan SG2112 190603 C/C    SG2112
4 TaqMan SG2543 190603 C/G    SG2543
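
If only the middle field is needed, a regular expression with a capture group can extract it directly, without creating the intermediate columns. This is a minimal base R sketch, assuming every sample name has exactly the form prefix-ID-date; the geno data frame is rebuilt inline as a stand-in for data/genotype_data.txt.

```r
# Rebuild the genotype data inline (stand-in for data/genotype_data.txt)
geno <- data.frame(
  Sample = c("TaqMan-SG0001-190601", "TaqMan-SG0104-190602",
             "TaqMan-SG2112-190603", "TaqMan-Sg2543-190603"),
  rs1212 = c("G/C", "G/G", "C/C", "C/G")
)

# Capture the text between the first and second hyphen, then upper-case it
# to repair the lower-case 'g' in "Sg2543".
geno$ID <- toupper(sub("^[^-]*-([^-]*)-.*$", "\\1", geno$Sample))
geno$ID
# [1] "SG0001" "SG0104" "SG2112" "SG2543"
```

When the pattern does not match, sub returns its input unchanged, so a sample name without two hyphens would pass through as a whole (upper-cased) string. A grepl check with the same pattern would catch such cases.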

10.5 Scenario 2

A replication sample has also been measured, and it uses IDs in the style “RP5XXX”.

joint <- read.table(file = "data/joint_data.txt", header = TRUE)
kable(joint)
ID SBP
SG0001 120
SG0104 111
SG2112 125
SG2543 119
RP5002 121
RP5012 118
RP5113 112
RP5213 142

10.5.1 Question 4

Discuss how you would use R commands to split the ‘joint’ data frame into an ‘SG’-specific piece and an ‘RP’-specific piece. Write down your ideas in the next section, and, if you have time, try them out within an R chunk.

The joint data frame
ID SBP
SG0001 120
SG0104 111
SG2112 125
SG2543 119
RP5002 121
RP5012 118
RP5113 112
RP5213 142

10.5.2 Answer 4

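# Row indices of the IDs that start with each prefix ("^" anchors the match to the start)
grep(pattern = "^SG", joint$ID)
[1] 1 2 3 4
grep(pattern = "^RP", joint$ID)
[1] 5 6 7 8
# Use those indices to subset the rows for each study
joint.SG <- joint[grep(pattern = "^SG", joint$ID), ]
joint.RP <- joint[grep(pattern = "^RP", joint$ID), ]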
kable(joint.SG)
ID SBP
SG0001 120
SG0104 111
SG2112 125
SG2543 119
kable(joint.RP)
ID SBP
5 RP5002 121
6 RP5012 118
7 RP5113 112
8 RP5213 142
# Reset row names
rownames(joint.RP) <- NULL
kable(joint.RP)
ID SBP
RP5002 121
RP5012 118
RP5113 112
RP5213 142
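
The same split can also be done in one step with base R's split function, which would generalize to more than two study prefixes. This is a sketch only, assuming the study prefix is always the first two characters of the ID; the name pieces is hypothetical, and the joint data frame is rebuilt inline as a stand-in for data/joint_data.txt.

```r
# Rebuild the joint data inline (stand-in for data/joint_data.txt)
joint <- data.frame(
  ID  = c("SG0001", "SG0104", "SG2112", "SG2543",
          "RP5002", "RP5012", "RP5113", "RP5213"),
  SBP = c(120, 111, 125, 119, 121, 118, 112, 142)
)

# Group the rows by the first two characters of the ID;
# the result is a named list with one data frame per prefix.
pieces <- split(joint, substr(joint$ID, 1, 2))

# Reset the row names within each piece
pieces <- lapply(pieces, function(d) { rownames(d) <- NULL; d })

names(pieces)
# [1] "RP" "SG"
pieces$RP
```

The individual pieces are then available as pieces$SG and pieces$RP, playing the roles of joint.SG and joint.RP above.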