20 R Recoding Reshaping Exercise

20.1 Key points

Here are some key points regarding recoding and reshaping data in R:

Count the number of times ID2 is duplicated
- sum(duplicated(b$ID2))
List all rows with a duplicated c1 value
- f %>% group_by(c1) %>% filter(n()>1)
Recode data using left_join
Pivot data from long to wide
- pivot_wider
Pivot data from wide to long
- pivot_longer
Useful table commands
- table()
- addmargins(table())
- prop.table(table(), margin)

20.2 Download example data

20.3 Recoding data

Different parts of the data may have different ID systems
Clinical ID
Laboratory ID
Genotyping service ID
Need a dictionary or key to translate one type of key into another key

20.4 A dictionary

A dictionary defines a one-to-one correspondence between keys and values
- keys must be unique

20.5 Recoding data

Good practice to output old IDs as well as new IDs into the output files
Permits checking
Have to convey results back to your collaborators using their ID system

Suppose we have been given these data:

    ==> study2_ped.txt <== 
    ID1     sex     aff2
    1       M       1
    2       M       2
    ==> study2_pheno.txt <==
    ID2     sex     t
    E544    M       3.34263153909733
    E853    M       5.35786611210859
    ==> study2_snp.txt <==
    ID2     SNPID2          all1    all2
    E544    Aff-S-3212091   A       T
    E853    Aff-S-1032132   A       A

The clinicians used an integer for the Person IDs:

    ==> study2_ped.txt <== 
    ID1     sex     aff2
    1       M       1
    2       M       2

The serum assay laboratory used a different set of Person IDs starting with the letter ’E’:

    ==> study2_pheno.txt <==
    ID2     sex     t
    E544    M       3.34263153909733
    E853    M       5.35786611210859

While the genotyping lab also used ’E’ Person IDs, they did not use ’rs’ SNP IDs:

    ==> study2_snp.txt <==
    ID2     SNPID2          all1    all2
    E544    Aff-S-3212091   A       T
    E853    Aff-S-1032132   A       A

Keys

To use these data together, we need to translate Person IDs and SNP IDs so that all of our files are using the same IDs (e.g. so they all speak the same language).
To do this, we need translation keys:

==> study2_key1.txt <==
ID1     ID2
1       E544
2       E853

==> study2_key2.txt <==
rsID    SNPID2
rs35814900      Aff-S-3212091
rs28370510      Aff-S-1032132

20.6 Duplicates

Note above that we said that the keys must be unique.

Question: How would you check in R that the keys are unique? For example, how would you check for duplicates in the ID2 column of ‘study2_key1.txt’?

Answer: In R, the duplicated function can be used to check for duplicates

20.6.1 Counting duplicates

To count the number of duplicated ID’s, we can take advantage of the fact that a TRUE value behaves as a 1 and a FALSE value behaves as a 0 when a logicial variable is used in a numeric computation.

20.6.2 Checking for duplicates

How do we return every row that contains a duplicate?

Another way to list all of the rows containing a duplicated ‘c1’ value:

Yet another way to list all of the rows containing a duplicated ‘c1’ value using functions from the ‘tidyverse’ package:

20.6.3 Key points: Duplicates

Here are some key points regarding detecting duplicates:

Count the number of times ID2 is duplicated
- sum(duplicated(b$ID2))
List all rows with a duplicated c1 value
- f %>% group_by(c1) %>% filter(n()>1)

20.7 Project 1 Data

In the ds data frame we have the synthetic yet realistic data we will be using in Project 1.

In the dd data frame we have the corresponding data dictionary.

20.8 Exercise 1: duplicated values

Skill: Checking for duplicated IDs

Using the ds data frame from Project 1, check if there are any duplicated sample_id’s using the duplicated command. If so, count how many duplicated sample_id’s there are.

Expand to see solution

Construct a table of the number of times each sample_id is duplicated:

Expand to see solution

Note that it is important to be aware of missing IDs. So when constructing tables of counts using the table command, the useNA argument controls if the table includes counts of NA values.

How many sample_id’s are NA’s?

Check if there are any duplicated subject_ids

Expand to see solution

We can check if there are any duplicated subject_id’s by counting how many duplicates there are.

20.9 Checking for duplicates

How do we return every row that contains a duplicate?

This approach only does not return every row that contains a duplicated ID:

20.10 Counting the number of occurences of the ID

20.11 Count `sample_id` duplicates

Using Tidyverse commands, count how many times each sample_id occcurs in the ds data frame, reporting the counts in descending order, from highest to lowest.

Expand to see solution

20.12 Checking for duplicates

Here we list all of the rows containing a duplicated ‘ID’ value using functions from the ‘tidyverse’ package:

20.12.1 How to list all duplicates

Use Tidyverse commands to list (1) all duplicates for sample_id and (2) all duplicates for subject_id. Sort the results by the ID.

Expand to see solution

20.12.2 Sample ID

20.12.3 Subject ID

20.13 Reshaping data

20.13.1 Download example data

20.13.2 Long & wide PLINK data

For large-scale data, PLINK is very useful for converting from ‘long’ to ‘wide’
Long: ped, per, rsID, allele 1, allele 2
Wide: ped, per, SNP1 allele 1, SNP1 allele 2, SNP2 allele 1, SNP2 allele 2, …

CAUTION: When reading long data, PLINK does not warn about multiple genotypes for the same person, but rather uses the final one read in.

Long format data: one row per genotype

20.13.3 `reshape`

So here we have two SNPs typed per person, and we want to convert from ’long’ format to ’wide’ format:

Wide format data: one row per individual

Check that it worked correctly:

20.13.4 `pivot_wider`

Using pivot_wider from tidyverse:

20.13.5 Key points

Here are some key points regarding reshaping data in R:

Pivot data from long to wide
- pivot_wider
Pivot data from wide to long
- pivot_longer

20.14 Exercise 2: Reshaping data

Skill: Reshaping data

Select only three columns “sample_id”, “Sample_trimester”, “Gestationalage_sample” from the ds data frame, and then reshape from ‘long’ format to ‘wide’ format using pivot_wider, taking time as the “Sample_trimester”.

Expand to see solution

20.14.1 Comment

View b2 via the View(b2) command in RStudio - it nicely put all the different gestational age observations into one list for each sample_id x Sample_trimester combination.

20.15 Exercise 3: Aggregating data

Skill: Aggregating data

Make a table showing the proportion of blacks and whites that are controls and cases.

Expand to see solution

20.15.1 Comment:

The margin parameter of the prop.table command has to be specified in order to get the desired answer: “1 indicates rows, 2 indicates columns.

Construct more readable tables with labels using xtabs

Expand to see solution

20.15.2 `xtabs` table with labels

Create a count cross table using Tidyverse commands

Expand to see solution

Create a proportion cross table using Tidyverse commands

Expand to see solution

20.16 Exercise 4: Summarizing within groups

Skill: Summarizing within groups

Apply the summary command to the “Gestationalage_sample” within each “Sample_trimester” group.

Expand to see solution

Note: With split(x, f), any missing values in f are dropped together with the corresponding values of x.

20.17 Recoding data

20.17.1 Recoding data using look-up tables

Approach 1

Implement our dictionaries using look-up tables
- Use a named vector.

Here’s an example of how to do this:

Use the information in study2_key1.txt to set up a dictionary dictPer that maps ‘E’ person IDs to integer person IDs.

Read in study2_pheno.txt, and using dictPer, translate the person IDs and write it out again with the translated person IDs.

20.17.2 Recoding data using left joins

Approach 2

Implement our dictionaries using left joins

Here’s an example of how to do this:

Read in study2_pheno.txt, and using dictPer, translate the person IDs and write it out again with the translated person IDs.

20.18 Exercise 5: Recoding data

Approach 1

Implement our dictionaries using look-up tables
- Use a named vector.

Skill:: Recoding IDs using a dictionary

Create a new subject ID column named “subjectID” where you have used the DictPer named vector to recode the original “subject_id” IDs into integer IDs.

Note: If the DictPer named vector is not available, be sure to load the Project 1 data above using the commands in the ‘Project 1 Data’ section.

Expand to see solution

Approach 2

Implement our dictionaries using left joins

20.18.1 Comment

I usually prefer to use a merge command like left_join to merge in the new IDs into my data frame.

Expand to see solution

20.19 Exercise 6: Filtering rows

Skill: Filtering rows.

Create a data frame tri1 containing the records for Trimester 1, and a second data frame tri2 containing the records for Trimester 2.

Expand to see solution

20.20 Exercise 7: Selecting columns

Skill: Selecting columns

Update tri1 and tri2 to only contain the three columns “sample_id”, “Sample_trimester”, “Gestationalage_sample”

Expand to see solution

20.1 Key points

20.2 Download example data

20.3 Recoding data

20.4 A dictionary

20.5 Recoding data

20.6 Duplicates

20.6.1 Counting duplicates

20.6.2 Checking for duplicates

20.6.3 Key points: Duplicates

20.7 Project 1 Data

20.8 Exercise 1: duplicated values

20.9 Checking for duplicates

20.10 Counting the number of occurences of the ID

20.11 Count sample_id duplicates

20.12 Checking for duplicates

20.12.1 How to list all duplicates

20.12.2 Sample ID

20.12.3 Subject ID

20.13 Reshaping data

20.13.1 Download example data

20.13.2 Long & wide PLINK data

20.13.3 reshape

20.13.4 pivot_wider

20.13.5 Key points

20.14 Exercise 2: Reshaping data

20.14.1 Comment

20.15 Exercise 3: Aggregating data

20.15.1 Comment:

20.15.2 xtabs table with labels

20.16 Exercise 4: Summarizing within groups

20.17 Recoding data

20.17.1 Recoding data using look-up tables

20.17.2 Recoding data using left joins

20.18 Exercise 5: Recoding data

20.18.1 Comment

20.19 Exercise 6: Filtering rows

20.20 Exercise 7: Selecting columns

20.11 Count `sample_id` duplicates

20.13.3 `reshape`

20.13.4 `pivot_wider`

20.15.2 `xtabs` table with labels