19 R Recoding and Reshaping Exercise
19.1 Key points
Here are some key points regarding recoding and reshaping data in R:
- Count the number of times `ID2` is duplicated: `sum(duplicated(b$ID2))`
- List all rows with a duplicated `c1` value: `f %>% group_by(c1) %>% filter(n() > 1)`
- Recode data using `left_join`
- Pivot data from long to wide: `pivot_wider`
- Pivot data from wide to long: `pivot_longer`
- Useful table commands: `table()`, `addmargins(table())`, `prop.table(table(), margin)`
19.2 Load Libraries
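The chapter's setup chunk is not shown; presumably something like the following loads the packages used below (assuming the tidyverse meta-package covers them):

```r
# Load the tidyverse (dplyr, tidyr, etc.) used throughout this chapter
library(tidyverse)
```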
19.3 Project 1 Data
In the `ds` data frame we have the synthetic yet realistic data we will be using in Project 1. In the `dd` data frame we have the corresponding data dictionary.
19.4 Exercise 1: duplicated values
Skill: Checking for duplicated IDs
Check if there are any duplicated `sample_id`'s using the `duplicated` command. If so, count how many duplicated `sample_id`'s there are.
Construct a table of the number of times each `sample_id` is duplicated:
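A sketch of both counts on a toy stand-in for `ds` (only the `sample_id` column is assumed here):

```r
# Toy stand-in for the ds data frame
ds <- data.frame(sample_id = c("A1", "A2", "A2", "A3", "A3", "A3"))

# Number of rows that repeat an earlier sample_id
sum(duplicated(ds$sample_id))

# Counts of counts: how many IDs occur once, twice, three times, ...
table(table(ds$sample_id))
```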
Note that it is important to be aware of missing IDs. So when constructing tables of counts using the `table` command, the `useNA` argument controls whether the table includes counts of `NA` values.
How many `sample_id`'s are `NA`'s?
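One way to see the `NA` count, sketched on a toy vector:

```r
ids <- c("A1", NA, "A2", NA, NA)

# useNA = "ifany" adds an <NA> cell when missing values are present
table(ids, useNA = "ifany")

# Direct count of missing IDs
sum(is.na(ids))
```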
Check if there are any duplicated `subject_id`'s. We can check whether there are any duplicated `subject_id`'s by counting how many duplicates there are.
19.5 Checking for duplicates
How do we return every row that contains a duplicate?
This approach alone does not return every row that contains a duplicated ID:
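The reason is that `duplicated()` flags only the second and later occurrences, so the first copy of each duplicated ID is missed. Combining it with `fromLast = TRUE` catches every row:

```r
x <- c("A1", "A2", "A2", "A3")

duplicated(x)                                   # misses the first "A2"
duplicated(x) | duplicated(x, fromLast = TRUE)  # flags both "A2" rows
```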
19.6 Counting the number of occurrences of the ID
19.7 Count sample_id duplicates
Using Tidyverse commands, count how many times each `sample_id` occurs in the `ds` data frame, reporting the counts in descending order, from highest to lowest.
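A sketch with `dplyr::count()` on toy data; `sort = TRUE` gives descending order:

```r
library(dplyr)

ds <- data.frame(sample_id = c("A1", "A2", "A2", "A3", "A3", "A3"))

ds %>% count(sample_id, sort = TRUE)
```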
19.8 Checking for duplicates
Here we list all of the rows containing a duplicated `ID` value using functions from the tidyverse package:
19.8.1 How to list all duplicates
Use Tidyverse commands to list (1) all duplicates for `sample_id` and (2) all duplicates for `subject_id`. Sort the results by the ID.
19.8.2 Sample ID
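A sketch for `sample_id` (toy data stands in for the real `ds`):

```r
library(dplyr)

ds <- data.frame(sample_id = c("S1", "S2", "S2", "S3"))

# group_by + filter(n() > 1) keeps every row of a duplicated ID
ds %>%
  group_by(sample_id) %>%
  filter(n() > 1) %>%
  arrange(sample_id)
```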
19.8.3 Subject ID
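The same pattern for `subject_id`, again on toy data:

```r
library(dplyr)

ds <- data.frame(subject_id = c("P1", "P1", "P2", "P3"))

ds %>%
  group_by(subject_id) %>%
  filter(n() > 1) %>%
  arrange(subject_id)
```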
19.9 Exercise 2: Reshaping data
Skill: Reshaping data
Select only the three columns "sample_id", "Sample_trimester", and "Gestationalage_sample", and then reshape from 'long' format to 'wide' format using `pivot_wider`, taking "Sample_trimester" as the time variable.
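A sketch with made-up values for the three columns named in the exercise:

```r
library(dplyr)
library(tidyr)

ds <- data.frame(
  sample_id             = c("S1", "S1", "S2"),
  Sample_trimester      = c(1, 2, 1),
  Gestationalage_sample = c(12.1, 24.3, 11.8)
)

b2 <- ds %>%
  select(sample_id, Sample_trimester, Gestationalage_sample) %>%
  pivot_wider(names_from  = Sample_trimester,
              values_from = Gestationalage_sample)
b2
```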
19.9.1 Comment
View `b2` via the `View(b2)` command in RStudio: it nicely puts all the different gestational age observations into one list for each `sample_id` × `Sample_trimester` combination.
19.10 Exercise 3: Aggregating data
Skill: Aggregating data
Make a table showing the proportion of Black and White subjects that are controls and cases.
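A sketch with `prop.table`; the race and case/control column names are not given in the text, so plain vectors stand in:

```r
race   <- c("Black", "Black", "White", "White", "White")
status <- c("case", "control", "case", "case", "control")

# margin = 1: proportions within each row (here, within each race)
prop.table(table(race, status), margin = 1)
```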
Construct more readable tables with labels using `xtabs`.
19.10.2 xtabs table with labels
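With `xtabs`, the formula interface labels each dimension with the variable name, which makes the printed table easier to read; a sketch on toy data:

```r
d <- data.frame(
  Race   = c("Black", "Black", "White", "White"),
  Status = c("case", "control", "case", "control")
)

# Each dimension of the table is labelled Race / Status automatically
xtabs(~ Race + Status, data = d)
```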
Create a count cross table using Tidyverse commands
Create a proportion cross table using Tidyverse commands
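Both cross tables can be sketched with `count()` plus `pivot_wider()` (toy column names again):

```r
library(dplyr)
library(tidyr)

d <- data.frame(
  race   = c("Black", "Black", "White", "White", "White"),
  status = c("case", "control", "case", "case", "control")
)

# Count cross table: one row per race, one column per status
d %>%
  count(race, status) %>%
  pivot_wider(names_from = status, values_from = n)

# Proportion cross table: proportions within each race
d %>%
  count(race, status) %>%
  group_by(race) %>%
  mutate(prop = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = status, values_from = prop)
```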
19.11 Exercise 4: Summarizing within groups
Skill: Summarizing within groups
Apply the `summary` command to "Gestationalage_sample" within each "Sample_trimester" group.
Note: With `split(x, f)`, any missing values in `f` are dropped together with the corresponding values of `x`.
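A sketch with `split()` + `lapply()` on toy values; note how the `NA` trimester is dropped, as the note above describes:

```r
ga  <- c(12.1, 24.3, 11.8, 25.0, 13.0)
tri <- c(1, 2, 1, 2, NA)

# One summary() per trimester group; the NA trimester (and its ga value)
# is silently dropped by split()
lapply(split(ga, tri), summary)
```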
19.12 Exercise 5: Recoding data
Approach 1
- Implement our dictionaries using look-up tables
- Use a named vector.
Skill: Recoding IDs using a dictionary
Create a new subject ID column named "subjectID" where you have used the `DictPer` named vector to recode the original "subject_id" IDs into integer IDs.
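A sketch of the named-vector approach; how `DictPer` is built in the course materials is not shown, so one plausible construction is included here:

```r
ds <- data.frame(subject_id = c("P7", "P3", "P7"))

# One plausible way to build DictPer: names are the old IDs,
# values are the new integer IDs
DictPer <- setNames(seq_along(unique(ds$subject_id)), unique(ds$subject_id))

# Name-based indexing into the named vector does the recode
ds$subjectID <- unname(DictPer[ds$subject_id])
ds
```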
19.13 Recoding data
Approach 2
- Implement our dictionaries using left joins
19.13.1 Comment
I usually prefer to use a merge command like `left_join` to merge the new IDs into my data frame.
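A sketch of the `left_join` approach, with the dictionary held as a small look-up data frame:

```r
library(dplyr)

ds     <- data.frame(subject_id = c("P7", "P3", "P7"))
lookup <- data.frame(subject_id = c("P3", "P7"), subjectID = c(1L, 2L))

# left_join keeps every row of ds and merges in the matching new ID
ds %>% left_join(lookup, by = "subject_id")
```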
19.14 Exercise 6: Filtering rows
Skill: Filtering rows.
Create a data frame `tri1` containing the records for Trimester 1, and a second data frame `tri2` containing the records for Trimester 2.
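A `filter()` sketch (toy data; the real `ds` has more columns):

```r
library(dplyr)

ds <- data.frame(Sample_trimester      = c(1, 2, 1, 2),
                 Gestationalage_sample = c(12.1, 24.3, 11.8, 25.0))

tri1 <- ds %>% filter(Sample_trimester == 1)
tri2 <- ds %>% filter(Sample_trimester == 2)
```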
19.15 Exercise 7
Skill: Selecting columns
Update `tri1` and `tri2` to contain only the three columns "sample_id", "Sample_trimester", and "Gestationalage_sample".
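A `select()` sketch on a toy `tri1` with one extra column to drop; `tri2` would be handled the same way:

```r
library(dplyr)

tri1 <- data.frame(sample_id = "S1", Sample_trimester = 1,
                   Gestationalage_sample = 12.1, extra = "drop me")

tri1 <- tri1 %>% select(sample_id, Sample_trimester, Gestationalage_sample)
```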
19.10.1 Comment
The `margin` parameter of the `prop.table` command has to be specified in order to get the desired answer: 1 indicates rows, 2 indicates columns.