19  R Recoding Reshaping Exercise

19.1 Key points

Here are some key points regarding recoding and reshaping data in R:

  • Count the number of times ID2 is duplicated
    • sum(duplicated(b$ID2))
  • List all rows with a duplicated c1 value
    • f %>% group_by(c1) %>% filter(n()>1)
  • Recode data using left_join
  • Pivot data from long to wide
    • pivot_wider
  • Pivot data from wide to long
    • pivot_longer
  • Useful table commands
    • table()
    • addmargins(table())
    • prop.table(table(), margin)

19.2 Load Libraries

19.3 Project 1 Data

In the ds data frame we have the synthetic yet realistic data we will be using in Project 1.

In the dd data frame we have the corresponding data dictionary.

19.4 Exercise 1: duplicated values

Skill: Checking for duplicated IDs

Check if there are any duplicated sample_id’s using the duplicated command. If so, count how many duplicated sample_id’s there are.

Construct a table of the number of times each sample_id is duplicated:

Note that it is important to be aware of missing IDs. So when constructing tables of counts using the table command, the useNA argument controls if the table includes counts of NA values.

How many sample_id’s are NA’s?

Check if there are any duplicated subject_ids

We can check if there are any duplicated subject_id’s by counting how many duplicates there are.

19.5 Checking for duplicates

How do we return every row that contains a duplicate?

This approach only does not return every row that contains a duplicated ID:

19.6 Counting the number of occurences of the ID

19.7 Count sample_id duplicates

Using Tidyverse commands, count how many times each sample_id occcurs in the ds data frame, reporting the counts in descending order, from highest to lowest.

19.8 Checking for duplicates

Here we list all of the rows containing a duplicated ‘ID’ value using functions from the ‘tidyverse’ package:

19.8.1 How to list all duplicates

Use Tidyverse commands to list (1) all duplicates for sample_id and (2) all duplicates for subject_id. Sort the results by the ID.

19.8.2 Sample ID

19.8.3 Subject ID

19.9 Exercise 2: Reshaping data

Skill: Reshaping data

Select only three columns “sample_id”, “Sample_trimester”, “Gestationalage_sample”, and then reshape from ‘long’ format to ‘wide’ format using pivot_wider, taking time as the “Sample_trimester”.

19.9.1 Comment

View b2 via the View(b2) command in RStudio - it nicely put all the different gestational age observations into one list for each sample_id x Sample_trimester combination.

19.10 Exercise 3: Aggregating data

Skill: Aggregating data

Make a table showing the proportion of blacks and whites that are controls and cases.

19.10.1 Comment:

The margin parameter of the prop.table command has to be specified in order to get the desired answer: “1 indicates rows, 2 indicates columns.

Construct more readable tables with labels using xtabs

19.10.2 xtabs table with labels

Create a count cross table using Tidyverse commands

Create a proportion cross table using Tidyverse commands

19.11 Exercise 4: Summarizing within groups

Skill: Summarizing within groups

Apply the summary command to the “Gestationalage_sample” within each “Sample_trimester” group.

Note: With split(x, f), any missing values in f are dropped together with the corresponding values of x.

19.12 Exercise 5: Recoding data

Approach 1

  • Implement our dictionaries using look-up tables
    • Use a named vector.

Skill:: Recoding IDs using a dictionary

Create a new subject ID column named “subjectID” where you have used the DictPer named vector to recode the original “subject_id” IDs into integer IDs.

19.13 Recoding data

Approach 2

  • Implement our dictionaries using left joins

19.13.1 Comment

I usually prefer to use a merge command like left_join to merge in the new IDs into my data frame.

19.14 Exercise 6: Filtering rows

Skill: Filtering rows.

Create a data frame tri1 containing the records for Trimester 1, and a second data frame tri2 containing the records for Trimester 2.

19.15 Exercise 7

Skill: Selecting columns

Update tri1 and tri2 to only contain the three columns “sample_id”, “Sample_trimester”, “Gestationalage_sample”