20 R Merging Exercise
20.1 Merging Best Practice
- Always be careful when merging.
- Always check for duplicated IDs before doing the merge.
- Always check that your ID columns do not contain any missing values.
- Check that the values in the ID columns (e.g., the keys) match.
- You can use an anti_join to check this.
- Inconsistencies in the values of the keys can be hard to fix.
- Always check the dimensions, before and after the merge, to make sure the merged object has the expected number of rows and columns.
- Always explicitly name the keys you are merging on.
- When using tidyverse join commands, load the tidylog R package in order to turn on very useful additional feedback.
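As a rough illustration of the checks listed above, here is a minimal sketch using base R. The data frames x and y, their columns, and their values are made up for this example and are not part of the Project 1 data.

```r
# Toy stand-ins for two tables about to be merged (hypothetical names and values)
x <- data.frame(subject_id = c("S1", "S2", "S3", "S4"),
                age = c(31, 28, 35, 40))
y <- data.frame(subject_id = c("S1", "S2", "S2", NA),
                trait = c(0.4, 1.2, 1.3, 0.9))

# 1. Duplicated keys: which IDs occur more than once in each table?
x_dups <- x$subject_id[duplicated(x$subject_id)]
y_dups <- y$subject_id[duplicated(y$subject_id)]

# 2. Missing keys: how many IDs are NA in each table?
x_missing <- sum(is.na(x$subject_id))
y_missing <- sum(is.na(y$subject_id))

# 3. Dimensions before the merge, to compare against the merged result later
dim(x)
dim(y)
```

In this toy example the problems are all in y: one repeated ID and one missing ID. Either would be worth resolving before attempting the merge.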
20.2 Load Libraries
20.3 Input data
Let’s load the synthetic simulated Project 1 data and associated data dictionary:
20.4 Select a subset of subject-level fields
Set up a data frame ‘a’ that has these subject-level fields: “subject_id”, “maternal_age_delivery”, “case_control_status”, and “prepregnancy_BMI”.
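One way this step might look with dplyr is sketched below. The data frame ds, its sample_id column, and all of its values are hypothetical stand-ins for the loaded Project 1 data; only the four selected column names come from the exercise.

```r
library(dplyr)

# Hypothetical stand-in for the loaded data: one row per sample,
# with subject-level fields repeated for each sample from a subject
ds <- data.frame(
  sample_id = c("a1", "a2", "b1", "c1", "c2"),
  subject_id = c("S1", "S1", "S2", "S3", "S3"),
  maternal_age_delivery = c(31, 31, 28, 35, 35),
  case_control_status = c("case", "case", "control", "case", "case"),
  prepregnancy_BMI = c(22.5, 22.5, 24.1, 26.3, 26.3)
)

# Keep only the subject-level fields
a <- ds %>%
  select(subject_id, maternal_age_delivery, case_control_status, prepregnancy_BMI)

dim(a)
```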
20.5 Unique records
The data were given to us in a way that repeated subject-level information, once for each sample from each individual subject.
From your data frame ‘a’, select only the unique records, creating data frame b.
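A minimal sketch of this step, using a small made-up version of data frame ‘a’ (the values are hypothetical):

```r
# Hypothetical data frame 'a': subject-level rows repeated once per sample
a <- data.frame(
  subject_id = c("S1", "S1", "S2", "S3", "S3"),
  maternal_age_delivery = c(31, 31, 28, 35, 35),
  case_control_status = c("case", "case", "control", "case", "case"),
  prepregnancy_BMI = c(22.5, 22.5, 24.1, 26.3, 26.3)
)

dim(a)

# unique() applied to a data frame drops rows that are exact repeats
# of an earlier row across all columns
b <- unique(a)

dim(b)
```

Comparing the dimensions before and after shows how many repeated records were dropped.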
20.5.1 Comment
It is better to apply unique to the whole data frame, not just to the subject_id column, as that ensures that you are selecting whole records that are unique across all of their columns.
Note that the dplyr R package provides the distinct command, which keeps only unique/distinct rows from a data frame. It is faster than the unique command.
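To make the point about whole records concrete, here is a small sketch with made-up values in which one subject has two conflicting records. Working on whole rows keeps both versions visible, whereas looking only at the subject_id column would hide the conflict.

```r
library(dplyr)

# Hypothetical data: subject S2 appears with two different BMI values
a <- data.frame(
  subject_id = c("S1", "S1", "S2", "S2"),
  prepregnancy_BMI = c(22.5, 22.5, 24.1, 27.0)
)

# distinct() keeps rows that are unique across all columns,
# so both versions of the S2 record are retained
b <- distinct(a)

# Looking only at the ID column hides the conflicting S2 records
n_ids <- length(unique(a$subject_id))

nrow(b)
n_ids
```

When the number of distinct rows exceeds the number of distinct IDs, at least one subject has inconsistent subject-level information.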
20.6 Check that the subject_id’s are no longer duplicated
Are the subject_id’s unique?
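A few base R ways to answer this question are sketched below. The data frame is a made-up example in which one ID is deliberately still duplicated, so that the checks have something to find.

```r
# Hypothetical data frame 'b' in which subject S2 still appears twice
b <- data.frame(
  subject_id = c("S1", "S2", "S3", "S2"),
  prepregnancy_BMI = c(22.5, 24.1, 26.3, 27.0)
)

# TRUE if at least one subject_id occurs more than once
any_dups <- any(duplicated(b$subject_id))

# If the IDs are unique, this equals nrow(b)
n_unique <- length(unique(b$subject_id))

# Which IDs are duplicated, and what do their records look like?
dup_ids <- unique(b$subject_id[duplicated(b$subject_id)])
dup_records <- b[b$subject_id %in% dup_ids, ]

any_dups
n_unique == nrow(b)
dup_records
```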
20.7 Create random integer IDs
Create a new column ID containing randomly chosen integer IDs; this is necessary to de-identify the data. To do this, use the sample command, sampling integers from 1 to the number of rows in data frame b.
This could also be done using the sample.int() function.
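Both approaches are sketched below on a made-up data frame b. The seed value and the second column name ID2 are arbitrary choices for this illustration.

```r
set.seed(20)  # arbitrary seed, used here only to make the example reproducible

# Hypothetical subject-level data frame
b <- data.frame(subject_id = c("S1", "S2", "S3", "S4", "S5"))

# Sample the integers 1..nrow(b) without replacement,
# so that every subject gets a different ID
b$ID <- sample(1:nrow(b), size = nrow(b), replace = FALSE)

# sample.int(n) draws a random permutation of 1..n directly
b$ID2 <- sample.int(nrow(b))

b
```

Sampling without replacement matters here: with replace = TRUE two subjects could receive the same ID, which would defeat the purpose of using it as a key.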
20.8 Merge in new phenotype information
The PI has sent you new trait data for your subjects.
Carefully merge this in using tidyverse commands.
If you notice any problems with this merge, prepare a report for the PI detailing what you noticed and what you’d like to ask the PI about.
20.9 Always be careful when merging.
- Always check for duplicated IDs before doing the merge.
- Always check that your ID columns do not contain any missing values.
- Check that the values in the ID columns (e.g., the keys) match.
- Can use an ‘anti_join’ to check this.
- Inconsistencies in the values of the keys can be hard to fix.
- Always check the dimensions to make sure the merged object has the expected number of rows and columns.
- Always explicitly name the keys you are merging on.
- If you don’t name them, then the join command will use all variables in common across x and y.
- When using tidyverse join commands, load the tidylog R package in order to turn on very useful additional feedback.
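The point about explicitly naming keys can be illustrated with a small sketch. The data frames x and y and the shared site column are invented for this example.

```r
library(dplyr)

# Two hypothetical tables that share subject_id and also a 'site' column
x <- data.frame(subject_id = c("S1", "S2", "S3"),
                site = c("A", "A", "B"),
                age = c(31, 28, 35))
y <- data.frame(subject_id = c("S1", "S2", "S3"),
                site = c("A", "B", "B"),
                trait = c(0.4, 1.2, 0.7))

# No 'by': dplyr joins on every shared column (subject_id AND site)
# and prints a message saying which columns it used
implicit <- inner_join(x, y)

# Explicit key: join on subject_id only; the two site columns
# are kept side by side as site.x and site.y
explicit <- inner_join(x, y, by = "subject_id")

nrow(implicit)
nrow(explicit)
names(explicit)
```

In the implicit join, subject S2 silently drops out because its site value differs between the two tables. The explicit join keeps S2 and exposes the disagreement in site.x and site.y, where it can be reviewed.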
20.10 Merge in new phenotype information
Carefully merge in the new data using tidyverse commands. As this is subject-level information, it should be merged into the subject-level data frame b, which was created above by selecting only the unique records from data frame ‘a’.
If you notice any problems with this merge, prepare a report for the PI detailing what you noticed and what you’d like to ask the PI about.
Here we load the tidylog R package, which will result in useful feedback when tidyverse commands are executed.
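A minimal sketch of such a merge follows. The data frame new_pheno, its new_trait column, and all values are hypothetical placeholders for the PI's data. The requireNamespace() guard is an addition for this sketch so that it still runs where tidylog is not installed.

```r
library(dplyr)

# Load tidylog if it is available; it wraps dplyr verbs such as left_join()
# and prints a summary of what each call did
if (requireNamespace("tidylog", quietly = TRUE)) {
  library(tidylog)
}

# Hypothetical subject-level data and new phenotype data
b <- data.frame(subject_id = c("S1", "S2", "S3"),
                ID = c(2L, 3L, 1L))
new_pheno <- data.frame(subject_id = c("S1", "S2", "S4"),
                        new_trait = c(0.4, 1.2, 0.9))

# Dimensions before the merge
dim(b)
dim(new_pheno)

# Keep every row of b and explicitly name the key
merged <- left_join(b, new_pheno, by = "subject_id")

# Dimensions after the merge: same rows as b, one extra column
dim(merged)
merged
```

Here subject S3 has no match in new_pheno, so its new_trait is NA, and subject S4 from new_pheno does not appear in the result at all. Both are the kind of observation worth raising with the PI.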
20.11 Further checks
When merging data based on an ID shared in common, it is not only important to check for duplicated IDs, but it is also important to check for overlap of the two ID sets.
Check if the set of subject_id IDs in your data frame b fully overlaps the set of subject_id IDs in the new data set. If there is not full overlap, document which IDs do not overlap.
Hint: Use an anti_join.
anti_join() returns all rows from x without a match in y.
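As a sketch of how the hint might be applied, the example below runs anti_join() in both directions on made-up data. The data frame new_pheno and all values are hypothetical.

```r
library(dplyr)

# Hypothetical subject-level data and new phenotype data
b <- data.frame(subject_id = c("S1", "S2", "S3"),
                ID = c(2L, 3L, 1L))
new_pheno <- data.frame(subject_id = c("S1", "S2", "S4"),
                        new_trait = c(0.4, 1.2, 0.9))

# Rows of b whose subject_id has no match in new_pheno
in_b_only <- anti_join(b, new_pheno, by = "subject_id")

# Rows of new_pheno whose subject_id has no match in b
in_new_only <- anti_join(new_pheno, b, by = "subject_id")

in_b_only
in_new_only
```

Checking both directions matters because anti_join() is not symmetric: each call reports only the unmatched rows of its first argument. If both results have zero rows, the two ID sets overlap fully.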