19 R Recoding Reshaping Exercise
19.1 Key points
Here are some key points regarding recoding and reshaping data in R:
- Count the number of times
ID2
is duplicatedsum(duplicated(b$ID2))
- List all rows with a duplicated
c1
valuef %>% group_by(c1) %>% filter(n()>1)
- Recode data using
left_join
- Pivot data from long to wide
pivot_wider
- Pivot data from wide to long
pivot_longer
- Useful table commands
table()
addmargins(table())
prop.table(table(), margin)
19.2 Download example data
19.3 Recoding data
Different parts of the data may have different ID systems
Clinical ID
Laboratory ID
Genotyping service ID
Need a dictionary or key to translate one type of key into another key
19.4 A dictionary
- A dictionary defines a one-to-one correspondence between keys and values
- keys must be unique
19.5 Recoding data
Good practice to output old IDs as well as new IDs into the output files
Permits checking
Have to convey results back to your collaborators using their ID system
Suppose we have been given these data:
==> study2_ped.txt <==
ID1 sex aff2
1 M 1
2 M 2
==> study2_pheno.txt <==
ID2 sex t
E544 M 3.34263153909733
E853 M 5.35786611210859
==> study2_snp.txt <==
ID2 SNPID2 all1 all2
E544 Aff-S-3212091 A T
E853 Aff-S-1032132 A A
The clinicians used an integer for the Person IDs:
==> study2_ped.txt <==
ID1 sex aff2
1 M 1
2 M 2
The serum assay laboratory used a different set of Person IDs starting with the letter ’E’:
==> study2_pheno.txt <==
ID2 sex t
E544 M 3.34263153909733
E853 M 5.35786611210859
While the genotyping lab also used ’E’ Person IDs, they did not use ’rs’ SNP IDs:
==> study2_snp.txt <==
ID2 SNPID2 all1 all2
E544 Aff-S-3212091 A T
E853 Aff-S-1032132 A A
Keys
- To use these data together, we need to translate Person IDs and SNP IDs so that all of our files are using the same IDs (e.g. so they all speak the same language).
- To do this, we need translation keys:
==> study2_key1.txt <==
ID1 ID2
1 E544
2 E853
==> study2_key2.txt <==
rsID SNPID2
rs35814900 Aff-S-3212091
rs28370510 Aff-S-1032132
19.6 Duplicates
Note above that we said that the keys must be unique.
Question: How would you check in R that the keys are unique? For example, how would you check for duplicates in the ID2 column of ‘study2_key1.txt’?
Answer: In R, the duplicated function can be used to check for duplicates
19.6.1 Counting duplicates
To count the number of duplicated ID’s, we can take advantage of the fact that a TRUE
value behaves as a 1
and a FALSE
value behaves as a 0
when a logicial variable is used in a numeric computation.
19.6.2 Checking for duplicates
How do we return every row that contains a duplicate?
Another way to list all of the rows containing a duplicated ‘c1’ value:
Yet another way to list all of the rows containing a duplicated ‘c1’ value using functions from the ‘tidyverse’ package:
19.6.3 Key points: Duplicates
Here are some key points regarding detecting duplicates:
- Count the number of times
ID2
is duplicatedsum(duplicated(b$ID2))
- List all rows with a duplicated
c1
valuef %>% group_by(c1) %>% filter(n()>1)
19.7 Project 1 Data
In the ds
data frame we have the synthetic yet realistic data we will be using in Project 1.
In the dd
data frame we have the corresponding data dictionary.
19.8 Exercise 1: duplicated values
Skill: Checking for duplicated IDs
Using the ds
data frame from Project 1, check if there are any duplicated sample_id
’s using the duplicated
command. If so, count how many duplicated sample_id
’s there are.
Construct a table of the number of times each sample_id
is duplicated:
Note that it is important to be aware of missing IDs. So when constructing tables of counts using the table
command, the useNA
argument controls if the table includes counts of NA
values.
How many sample_id
’s are NA
’s?
Check if there are any duplicated subject_id
s
We can check if there are any duplicated subject_id
’s by counting how many duplicates there are.
19.9 Checking for duplicates
How do we return every row that contains a duplicate?
This approach only does not return every row that contains a duplicated ID
:
19.10 Counting the number of occurences of the ID
19.11 Count sample_id
duplicates
Using Tidyverse commands, count how many times each sample_id
occcurs in the ds
data frame, reporting the counts in descending order, from highest to lowest.
19.12 Checking for duplicates
Here we list all of the rows containing a duplicated ‘ID’ value using functions from the ‘tidyverse’ package:
19.12.1 How to list all duplicates
Use Tidyverse commands to list (1) all duplicates for sample_id
and (2) all duplicates for subject_id
. Sort the results by the ID.
19.12.2 Sample ID
19.12.3 Subject ID
19.13 Reshaping data
19.13.1 Download example data
19.13.2 Long & wide PLINK data
For large-scale data, PLINK is very useful for converting from ‘long’ to ‘wide’
Long: ped, per, rsID, allele 1, allele 2
Wide: ped, per, SNP1 allele 1, SNP1 allele 2, SNP2 allele 1, SNP2 allele 2, …
CAUTION: When reading long data, PLINK does not warn about multiple genotypes for the same person, but rather uses the final one read in.
Long format data: one row per genotype
19.13.3 reshape
So here we have two SNPs typed per person, and we want to convert from ’long’ format to ’wide’ format:
Wide format data: one row per individual
Check that it worked correctly:
19.13.4 pivot_wider
Using pivot_wider
from tidyverse
:
19.13.5 Key points
Here are some key points regarding reshaping data in R:
- Pivot data from long to wide
pivot_wider
- Pivot data from wide to long
pivot_longer
19.14 Exercise 2: Reshaping data
Skill: Reshaping data
Select only three columns “sample_id”, “Sample_trimester”, “Gestationalage_sample” from the ds
data frame, and then reshape from ‘long’ format to ‘wide’ format using pivot_wider
, taking time as the “Sample_trimester”.
19.14.1 Comment
View b2
via the View(b2)
command in RStudio - it nicely put all the different gestational age observations into one list for each sample_id
x Sample_trimester
combination.
19.15 Exercise 3: Aggregating data
Skill: Aggregating data
Make a table showing the proportion of blacks and whites that are controls and cases.
Construct more readable tables with labels using xtabs
19.15.2 xtabs
table with labels
Create a count cross table using Tidyverse commands
Create a proportion cross table using Tidyverse commands
19.16 Exercise 4: Summarizing within groups
Skill: Summarizing within groups
Apply the summary
command to the “Gestationalage_sample” within each “Sample_trimester” group.
Note: With split(x, f)
, any missing values in f
are dropped together with the corresponding values of x
.
19.17 Recoding data
19.17.1 Recoding data using look-up tables
Approach 1
- Implement our dictionaries using look-up tables
- Use a named vector.
Here’s an example of how to do this:
- Use the information in
study2_key1.txt
to set up a dictionarydictPer
that maps ‘E’ person IDs to integer person IDs.
- Read in study2_pheno.txt, and using
dictPer
, translate the person IDs and write it out again with the translated person IDs.
19.17.2 Recoding data using left joins
Approach 2
- Implement our dictionaries using left joins
Here’s an example of how to do this:
- Read in study2_pheno.txt, and using
dictPer
, translate the person IDs and write it out again with the translated person IDs.
19.18 Exercise 5: Recoding data
Approach 1
- Implement our dictionaries using look-up tables
- Use a named vector.
Skill:: Recoding IDs using a dictionary
Create a new subject ID column named “subjectID” where you have used the DictPer
named vector to recode the original “subject_id” IDs into integer IDs.
Note: If the DictPer
named vector is not available, be sure to load the Project 1 data above using the commands in the ‘Project 1 Data’ section.
Approach 2
- Implement our dictionaries using left joins
19.18.1 Comment
I usually prefer to use a merge command like left_join
to merge in the new IDs into my data frame.
19.19 Exercise 6: Filtering rows
Skill: Filtering rows.
Create a data frame tri1
containing the records for Trimester 1, and a second data frame tri2
containing the records for Trimester 2.
19.20 Exercise 7: Selecting columns
Skill: Selecting columns
Update tri1
and tri2
to only contain the three columns “sample_id”, “Sample_trimester”, “Gestationalage_sample”
19.15.1 Comment:
The
margin
parameter of theprop.table
command has to be specified in order to get the desired answer: “1 indicates rows, 2 indicates columns.