20  R Merging Exercise

20.1 Merging Best Practice

  • Always be careful when merging.
  • Always check for duplicated IDs before doing the merge.
  • Always check that your ID columns do not contain any missing values.
  • Check that the values in the ID columns (i.e., the keys) match.
    • You can use anti_join() to check this.
  • Inconsistencies in the values of the keys can be hard to fix.
  • Always check the dimensions, before and after the merge, to make sure the merged object has the expected number of rows and columns.
  • Always explicitly name the keys you are merging on.
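The checks in this list can be sketched on a pair of toy tables. This is a minimal sketch, not the exercise solution: the data frames `x` and `y` and the `trait` column are hypothetical stand-ins, and only `subject_id` and `maternal_age_delivery` come from the exercise.

```r
library(dplyr)

# Hypothetical toy tables; x, y, and trait are illustrative names only.
x <- data.frame(subject_id = c("S1", "S2", "S3"),
                maternal_age_delivery = c(31, 27, 35))
y <- data.frame(subject_id = c("S1", "S2", "S4"),
                trait = c(0.2, 1.5, -0.7))

# 1. Count duplicated keys in each table.
dup_x <- sum(duplicated(x$subject_id))
dup_y <- sum(duplicated(y$subject_id))

# 2. Count missing keys in each table.
na_x <- sum(is.na(x$subject_id))
na_y <- sum(is.na(y$subject_id))

# 3. Rows of x whose key has no match in y.
unmatched <- anti_join(x, y, by = "subject_id")

# 4. Merge with an explicitly named key, then compare dimensions.
merged <- left_join(x, y, by = "subject_id")
dim(x)
dim(merged)
```

With unique keys on both sides, a left join should leave the number of rows unchanged and add only the new columns from `y`.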

20.2 Load Libraries

20.3 Input data

Let’s load the synthetic simulated Project 1 data and associated data dictionary:

20.4 Select a subset of subject-level fields

Set up a data frame ‘a’ that has these subject-level fields: “subject_id”, “maternal_age_delivery”, “case_control_status”, and “prepregnancy_BMI”.
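One way to do this is with dplyr's select. The sketch below uses a small hypothetical data frame `ds` in place of the Project 1 data; the `sample_id` column and all values are assumptions made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the Project 1 data: one row per sample,
# so subject-level values repeat. sample_id is an assumed column name.
ds <- data.frame(
  sample_id = c("A1", "A2", "B1"),
  subject_id = c("S1", "S1", "S2"),
  maternal_age_delivery = c(31, 31, 27),
  case_control_status = c("case", "case", "control"),
  prepregnancy_BMI = c(24.1, 24.1, 22.8)
)

# Keep only the four subject-level fields.
a <- ds %>%
  select(subject_id, maternal_age_delivery,
         case_control_status, prepregnancy_BMI)
```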

20.5 Unique records

The data were given to us in a way that repeated subject-level information, once for each sample from each individual subject.

From your data frame ‘a’, select only the unique records, creating data frame ‘b’.

20.5.1 Comment

It is better to apply unique to the whole data frame, not just to the subject_id column, as that ensures that you are selecting whole records that are unique across all of their columns.

Note that the dplyr R package provides the distinct command, which keeps only unique/distinct rows from a data frame. It is faster than the unique command.
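As a small illustration, both approaches can be applied to a whole data frame. The toy values below are hypothetical; only the column names come from the exercise.

```r
library(dplyr)

# Hypothetical subject-level fields, repeated once per sample.
a <- data.frame(
  subject_id = c("S1", "S1", "S2"),
  maternal_age_delivery = c(31, 31, 27),
  case_control_status = c("case", "case", "control"),
  prepregnancy_BMI = c(24.1, 24.1, 22.8)
)

# dplyr: keep rows that are distinct across all columns.
b <- distinct(a)

# Base R equivalent, applied to the whole data frame.
b_base <- unique(a)

nrow(a)
nrow(b)
```

Because whole rows are compared, a subject whose repeated rows disagree in any column would still appear more than once in `b`, which is exactly what the next check is meant to reveal.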

20.6 Check that the subject_id values are no longer duplicated

Are the subject_id values unique?
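A few equivalent ways to answer this are sketched below on a hypothetical `b`. In this toy example subject S2 has two conflicting BMI values, so it survives de-duplication twice.

```r
library(dplyr)

# Hypothetical de-duplicated data frame with one inconsistent subject.
b <- data.frame(
  subject_id = c("S1", "S2", "S2"),
  prepregnancy_BMI = c(24.1, 22.8, 23.0)
)

# TRUE if any subject_id occurs more than once.
any_dup <- any(duplicated(b$subject_id))

# TRUE only if every subject_id is unique.
all_unique <- length(unique(b$subject_id)) == nrow(b)

# List the offending IDs together with how often they occur.
dup_ids <- b %>%
  count(subject_id) %>%
  filter(n > 1)
dup_ids
```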

20.7 Create random integer IDs

Create a new column ID containing randomly chosen integer IDs; this is necessary to de-identify the data. To do this, use the sample command, sampling integers from 1 to the number of rows in data frame b. Sample without replacement (the default for sample) so that every subject receives a distinct ID.
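A minimal sketch of this step, assuming a toy `b`. The seed is an assumption added only so the sketch is reproducible; it is not part of the exercise.

```r
# Hypothetical subject-level data frame.
b <- data.frame(subject_id = c("S1", "S2", "S3", "S4"))

# Assumed seed, used only to make this sketch reproducible.
set.seed(20)

# Draw a random permutation of 1..nrow(b); replace = FALSE is the default,
# stated explicitly here so that no ID can be assigned twice.
b$ID <- sample(seq_len(nrow(b)), size = nrow(b), replace = FALSE)

# Confirm the new IDs are unique.
any(duplicated(b$ID))
```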

20.8 Merge in new phenotype information

The PI has sent you new trait data for your subjects.

Carefully merge this in using tidyverse commands.

If you notice any problems with this merge, prepare a report for the PI detailing what you noticed and what you’d like to ask the PI about.

20.9 Always be careful when merging.

  • Always check for duplicated IDs before doing the merge.
  • Always check that your ID columns do not contain any missing values.
  • Check that the values in the ID columns (i.e., the keys) match.
    • You can use anti_join() to check this.
    • Inconsistencies in the values of the keys can be hard to fix.
  • Always check the dimensions to make sure the merged object has the expected number of rows and columns.
  • Always explicitly name the keys you are merging on.
    • If you don’t name them, then the join command will use all variables in common across x and y.

20.10 Merge in new phenotype information

Carefully merge in the new data using tidyverse commands. Because this is subject-level information, merge it into the subject-level data frame ‘b’, which you created above by selecting only the unique records from data frame ‘a’.

If you notice any problems with this merge, prepare a report for the PI detailing what you noticed and what you’d like to ask the PI about.
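The sketch below shows one way such a merge and its checks might look. The `new_pheno` table, its `trait` column, and all values are hypothetical; the duplicated subject is planted deliberately to show how a problem surfaces in the dimension check, and is not a statement about the real data.

```r
library(dplyr)

# Hypothetical subject-level data and new trait data.
b <- data.frame(subject_id = c("S1", "S2", "S3"),
                prepregnancy_BMI = c(24.1, 22.8, 27.3))
new_pheno <- data.frame(subject_id = c("S1", "S2", "S2"),
                        trait = c(0.2, 1.5, 1.7))

# Check the new data for duplicated keys before merging.
dup_ids <- new_pheno %>%
  count(subject_id) %>%
  filter(n > 1)

# Merge with the key named explicitly.
merged <- left_join(b, new_pheno, by = "subject_id")

# Compare dimensions before and after the merge.
dim(b)
dim(merged)
```

Here the merged data frame has more rows than `b` because S2 appears twice in the new data, and S3 receives a missing `trait` because it has no match. Both are the kind of finding to document for the PI.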

20.11 Further checks

When merging data on a shared ID, it is important to check not only for duplicated IDs but also for overlap between the two ID sets.

Check whether the set of subject_id values in your data frame ‘b’ fully overlaps the set of subject_id values in the new data set. If there is not full overlap, document which IDs do not overlap.

Hint: Use an anti_join.

anti_join(x, y) returns all rows from x without a match in y.
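Because anti_join is not symmetric, running it in both directions gives the full picture of the non-overlap. This is a sketch on hypothetical data; `new_pheno` and `trait` are illustrative names.

```r
library(dplyr)

# Hypothetical subject-level data and new trait data.
b <- data.frame(subject_id = c("S1", "S2", "S3"),
                prepregnancy_BMI = c(24.1, 22.8, 27.3))
new_pheno <- data.frame(subject_id = c("S1", "S2", "S4"),
                        trait = c(0.2, 1.5, -0.7))

# Subjects in b that are absent from the new data.
only_in_b <- anti_join(b, new_pheno, by = "subject_id")

# Subjects in the new data that are absent from b.
only_in_new <- anti_join(new_pheno, b, by = "subject_id")

only_in_b$subject_id
only_in_new$subject_id
```

If both results have zero rows, the two ID sets overlap fully.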