21 R Merging Exercise

21.1 Merging Best Practice

Always be careful when merging.
Always check for duplicated IDs before doing the merge.
Always check that your ID columns do not contain any missing values.
Check that the values in the ID columns (i.e., the keys) match.
- Can use an anti_join to check this.
- Inconsistencies in the values of the keys can be hard to fix.
Always check the dimensions, before and after the merge, to make sure the merged object has the expected number of rows and columns.
Always explicitly name the keys you are merging on.
When using tidyverse join commands, load the tidylog R package in order to turn on very useful additional feedback.

21.2 Load Libraries

21.3 Input data

Let’s load the synthetic simulated Project 1 data and associated data dictionary:

21.4 Select a subset of subject-level fields

Set up a data frame ‘a’ that has these subject-level fields: “subject_id”, “maternal_age_delivery”, “case_control_status”, and “prepregnancy_BMI”.
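As a minimal sketch of this step, the code below uses dplyr's select() on a small made-up data frame. The object name `ds`, the `sample_id` column, and all of the values are assumptions for illustration only; they stand in for the Project 1 data, which is not reproduced here.

```r
library(dplyr)

# Hypothetical toy stand-in for the Project 1 data: one row per sample,
# so subject-level fields repeat across a subject's samples.
ds <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  subject_id = c("A", "A", "B", "C"),
  maternal_age_delivery = c(30, 30, 25, 41),
  case_control_status = c("case", "case", "control", "control"),
  prepregnancy_BMI = c(22.1, 22.1, 27.5, 24.0)
)

# Keep only the four subject-level fields
a <- ds %>%
  select(subject_id, maternal_age_delivery, case_control_status, prepregnancy_BMI)

dim(a)  # 4 rows, 4 columns in this toy example
```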

21.5 Unique records

The data were given to us with the subject-level information repeated once for each sample from each subject.

From your data frame ‘a’ select only the unique records, creating data frame b.

21.5.1 Comment

It is better to apply unique to the whole data frame, not just to the subject_id column, as that ensures that you are selecting whole records that are unique across all of their columns.

Note that the dplyr R package provides the distinct command, which keeps only unique/distinct rows from a data frame. It is faster than the unique command.
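To make the comparison concrete, here is a small sketch that applies both unique() and distinct() to a made-up version of data frame ‘a’. The values are assumptions for illustration, not the exercise data.

```r
library(dplyr)

# Hypothetical toy version of data frame 'a': subject A appears twice
a <- data.frame(
  subject_id = c("A", "A", "B", "C"),
  maternal_age_delivery = c(30, 30, 25, 41),
  case_control_status = c("case", "case", "control", "control"),
  prepregnancy_BMI = c(22.1, 22.1, 27.5, 24.0)
)

b <- unique(a)             # base R: drops rows duplicated across all columns
b_distinct <- distinct(a)  # dplyr: same rows kept

nrow(a)  # 4
nrow(b)  # 3
```

Because whole rows are compared, a subject whose repeated records disagree in any column would survive this step as two rows, which is what the duplicate check in the next section is meant to catch.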

21.6 Check that the `subject_id`’s are now not duplicated

Are the subject_id’s unique?
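One possible way to answer this is sketched below, using duplicated() for a yes/no answer and dplyr's count() to list any offending IDs. The toy data frame is an assumption, built so that one subject has two conflicting records.

```r
library(dplyr)

# Hypothetical toy 'b': subject C has two records that disagree on BMI,
# so both rows survive unique()/distinct()
b <- data.frame(
  subject_id = c("A", "B", "C", "C"),
  prepregnancy_BMI = c(22.1, 27.5, 24.0, 24.9)
)

# TRUE means at least one subject_id appears more than once
any(duplicated(b$subject_id))

# List the IDs that appear more than once, with their counts
dup_ids <- b %>%
  count(subject_id) %>%
  filter(n > 1)
dup_ids
```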

21.7 Create random integer IDs

Create a new column ID containing randomly chosen integer IDs; this is necessary to de-identify the data. To do this, use the sample command, sampling integers from 1 to the number of rows in data frame b.
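A minimal sketch of this step on a made-up data frame ‘b’ follows. Sampling without replacement from 1 to nrow(b) gives each row a different integer. The call to set.seed() is an assumption added only to make the sketch reproducible.

```r
# Only so that this sketch gives the same result each time it is run
set.seed(1)

# Hypothetical toy 'b'
b <- data.frame(subject_id = c("A", "B", "C"))

# Draw each integer from 1 to nrow(b) exactly once, in random order
b$ID <- sample(1:nrow(b), size = nrow(b), replace = FALSE)

b
```

Note that anyone who knows the seed and the row order can regenerate the same IDs, so for real de-identification the seed should be treated with the same care as the ID mapping itself.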

This could also be done using the sample.int() function:
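A short sketch of that alternative, again on a made-up data frame: sample.int(n) returns a random permutation of the integers 1 to n.

```r
# Hypothetical toy 'b'
b <- data.frame(subject_id = c("A", "B", "C", "D"))

# Random permutation of 1..nrow(b), one integer per row
b$ID <- sample.int(nrow(b))

b
```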

21.8 Merge in new phenotype information

The PI has sent you new trait data for your subjects.

Carefully merge this in using tidyverse commands.

If you notice any problems with this merge, prepare a report for the PI detailing what you noticed and what you’d like to ask the PI about.

21.9 Always be careful when merging.

Always check for duplicated IDs before doing the merge.
Always check that your ID columns do not contain any missing values.
Check that the values in the ID columns (i.e., the keys) match.
- Can use an ‘anti_join’ to check this.
- Inconsistencies in the values of the keys can be hard to fix.
Always check the dimensions to make sure the merged object has the expected number of rows and columns.
Always explicitly name the keys you are merging on.
- If you don’t name them, then the join command will use all variables in common across x and y.
When using tidyverse join commands, load the tidylog R package in order to turn on very useful additional feedback.
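The first two checks in this list can be sketched as a small helper that counts missing and duplicated key values. The function name `check_keys`, the object name `new_pheno`, and the toy values are all hypothetical and are not part of the exercise materials.

```r
# Hypothetical toy data: the new phenotype table has a repeated ID and a missing ID
b <- data.frame(
  subject_id = c("A", "B", "C"),
  prepregnancy_BMI = c(22.1, 27.5, 24.0)
)
new_pheno <- data.frame(
  subject_id = c("A", "B", "B", NA),
  trait = c(1.2, 0.7, 0.9, 1.5)
)

# Hypothetical helper: count missing and duplicated values in a key column
check_keys <- function(df, key) {
  k <- df[[key]]
  c(
    n_missing = sum(is.na(k)),
    n_duplicated = sum(duplicated(k[!is.na(k)]))
  )
}

check_keys(b, "subject_id")          # no missing, no duplicated keys
check_keys(new_pheno, "subject_id")  # 1 missing, 1 duplicated key
```

Running the helper on both tables before the join makes it clear which side a problem comes from.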

21.10 Merge in new phenotype information

Carefully merge in the new data using tidyverse commands. As this is subject-level information, it should be merged into the subject-level data frame b, which was created above by selecting only the unique records from data frame ‘a’.

If you notice any problems with this merge, prepare a report for the PI detailing what you noticed and what you’d like to ask the PI about.

Here we load the tidylog R package, which will result in useful feedback when tidyverse commands are executed.
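A minimal sketch of such a merge is shown below, with the key named explicitly and the dimensions checked before and after. The object name `new_pheno` and the toy values are assumptions. tidylog is loaded only if it is installed, so the sketch still runs without it.

```r
library(dplyr)

# Load tidylog if available; it wraps dplyr verbs and prints a summary of each join
if (requireNamespace("tidylog", quietly = TRUE)) library(tidylog)

# Hypothetical toy data: subject C has no new trait data, subject D is not in 'b'
b <- data.frame(subject_id = c("A", "B", "C"), ID = c(2L, 3L, 1L))
new_pheno <- data.frame(subject_id = c("A", "B", "D"), trait = c(1.2, 0.7, 1.5))

dim(b)  # dimensions before the merge

# Name the key explicitly rather than relying on all shared column names
merged <- left_join(b, new_pheno, by = "subject_id")

dim(merged)  # same number of rows as b, one extra column
merged
```

In this toy case the left join keeps all three rows of b, leaves `trait` missing for subject C, and silently drops subject D, which is the kind of discrepancy worth reporting to the PI.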

21.11 Further checks

When merging data based on an ID shared in common, it is not only important to check for duplicated IDs, but it is also important to check for overlap of the two ID sets.

Check if the set of subject_id IDs in your dataframe b fully overlaps the set of subject_id IDs in the new data set. If there is not full overlap, document which IDs do not overlap.

Hint: Use an anti_join.

anti_join() returns all rows from x without a match in y.
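Because anti_join() is one-directional, checking for full overlap takes two calls, one in each direction. The sketch below uses made-up data; the object name `new_pheno` is an assumption.

```r
library(dplyr)

# Hypothetical toy data
b <- data.frame(subject_id = c("A", "B", "C"), ID = c(2L, 3L, 1L))
new_pheno <- data.frame(subject_id = c("A", "B", "D"), trait = c(1.2, 0.7, 1.5))

# Subjects in b with no match in the new data
in_b_only <- anti_join(b, new_pheno, by = "subject_id")

# Subjects in the new data with no match in b
in_new_only <- anti_join(new_pheno, b, by = "subject_id")

in_b_only$subject_id    # "C"
in_new_only$subject_id  # "D"
```

If both results have zero rows, the two ID sets overlap fully; otherwise these two data frames document exactly which IDs do not.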
