23 R Exploratory Data Analysis Exercise

24 Exploratory Data Analysis

Goal: Understand your data
Ask questions
- Understand each phenotype
- Understand how each phenotype varies
- Understand how the phenotypes are related to each other
- Understand how the data are organized
Plot often, plot everything!
Are the data consistent with your expectations?
- “Unusual” data may be erroneous.

24.1 Project 1 Data

In the ds data frame we have the synthetic yet realistic data we will be using in Project 1.

In the dd data frame we have the corresponding data dictionary.

24.2 Explore Project 1 data

Let’s explore the Project 1 data set:

ds = data set
dd = data dictionary

Project 1 Questions

Which of the measurements are sample-specific?
Which are subject-specific?
How to structure the data for sharing via dbGaP?

24.3 Dimensions and Names

What are the names of the variables in our data files?
What are the dimensions of our data?

24.4 Dimensions

Task: Examine the names and dimensions of our data and data dictionary.

Tip

Here the commands names(), dim(), head(), and tail() are useful.

Another useful command is glimpse(). Its man page describes it as:

glimpse() is like a transposed version of print(): columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It’s a little like str() applied to a data frame but it tries to show you as much data as possible.

Expand to see solution

24.4.1 Data `ds`

24.4.2 Data dictionay `dd`

24.5 Arrangement

How are the data arranged?
- Is it in tidy format?
- Is it one row per sample or per subject?
- Were subjects sampled more than once?

24.5.1 Samples or subjects

Is it one row per sample or per subject?

Question: How would you figure out the answer to this question?

Expand to see solution

24.5.2 Unique values

To figure out which phenotypes vary within subjects, it would be helpful to answer this question:

Question: How can we figure out the number of unique values in each column of our ds data frame?

Tip

The R commands unique() and length() could be useful here.

Write a simple function to do this for a single selected column of the ds data frame, and then figure out how to apply your function to every column of your data frame.

Expand to see solution

Regarding this sapply(ds,function(x) {length(unique(x))}) approach, sapply is defined as:

sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)

where X is an R object, and FUN is the function to be applied to each element of X. When X is a data frame, sapply feeds each column of X into the FUN function, one by one. Here’s an example where we first apply the max function to each column of df, and then apply the min function to each column of df:

> df
  a b c d
1 3 4 1 2
2 1 1 4 4
3 4 2 2 1
4 2 3 3 3
> sapply(df, max)
a b c d 
4 4 4 4 
> sapply(df, min)
a b c d 
1 1 1 1

So we know that when using sapply on a data frame, its first argument is the name of the data frame (ds) and its second argument is the function that we wish to apply to each column of the data frame.

Here, in the sapply(ds, function(x) {length(unique(x))}) command we are defining our own function as:

function(x) {
  length(unique(x))
}

When applied by sapply to each column of ds, this function will return the number of unique values in each column.

The same output can also be generated using the map function from the purrr R package.

A similar related question is: How would you count the number of subjects who have more than one distinct measure at each of the phenotypes?

Task: For each phenotype, total the count the number of subjects who have more than one distinct measure.

Tip

One approach for doing this would be to take the phenotype and group by subject_id and count distinct values within those subject-specific groups, and then add up the total number of subjects who have more than one distinct value.

Expand to see solution

The variables with a count of zero here are those where no more than 1 distinct value was observed for each subject. These are likely subject-level variables. Indeed, for the majority of these, the variable names are consistent with them being subject-level variables instead of variables that are measured every time a sample was taken.

Here’s another solution for this problem, written by Tianze (Vincent) Luo, a student in our Fall 2024 HuGen 2071 course:

24.5.3 Subject-level data set

For submission to dbGaP, we need to separate the phenotype data into subject-level data and sample-level data.

The dbGaP submission guide here

https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/#12-what-data-must-be-included-in

discusses the distinction between sample-level phenotypes and subject-level phenotypes as follows:

For the Subject Phenotypes, it would be data relevant to the individual person. For the Sample Attributes, it would be data relevant to the sample derived from the person. For instance, do not list the RACE variable in the Sample Attributes, since RACE is stable for a person across samples. However, for variables like TREATMENT, if the person was only treated once, and data was collected, then TREATMENT could belong in the Subject Phenotypes table. However, if TREATMENT was completed multiple times, and each time a sample was extracted, then it would be better for TREATMENT to be tracked in the Sample Attributes table.

Hint: Remember that in the original study design, a blood sample was to be taken once per trimester. Each of these blood samples was assigned a sample ID, and for each of these, the trimester (1, 2, or 3) and the gestational age of the baby at the time of sampling was recorded. When the synthetic data were generated, not everyone ended up with exactly 3 samples.

Task: Construct a subject-level data set ds.subj

How would you construct a subject-level data set?

Expand to see solution

We need to drop the sample-specific measures, retaining only subject-level measures, and then select unique records:

But there is a duplicated record where race differs but all other attributes are identical, so we filter one of those two records out:

Note: When you notice a duplicated record like that, you should report the discrepancy back to the person primarily in charge of the data, so they can correct their source data.

24.6 Coding

How are the data coded?
- Are they coded correctly?
- Which are categorical and which are continuous?
- Are they coded consistently with the data dictionary?
- Is there a data dictionary?
- Do we need to skip rows when reading the data in?

24.6.1 Recode for understandability

Using the subject-level data set ds.subj, let’s recode case_control_status from 0 and 1 into a new PE_status variable coded as control and case.

First, look up the coding used for the case_control_status variable in the Data Dictionary dd:

Task:: Using the subject-level data set ds.subj, recode case_control_status from 0 and 1 into a new PE_status variable coded as control and case.

Tip

Coding case_control_status as a factor might be helpful here.

Another approach would be to use mutate and case_when commands from the Tidyverse.

Expand to see solution

So the data dictionary gives the meaning of the 0 and 1 codes:

“0: normotensive control; 1: preeclampsia case”

Recoding could also be done using Tidyverse function:

24.7 Missing data

What is the pattern of missing data?
- How are missing data coded?
- Is there a single missing data code?

Here we could use plot_missing from the DataExplorer R package.

https://boxuancui.github.io/DataExplorer/index.html

Task:: Try out plot_missing on the subject-level data set ds.subj.

Expand to see solution

It is kind of unusual to have no missing data in a real data set.

To see what the output might look like when there is some missing data, let’s introduce some using the createNAs function from this StackOverflow entry:

https://stackoverflow.com/questions/39513837/add-exact-proportion-of-random-missing-values-to-data-frame

When there is some missing data, in addition to applying plot_missing and profile_missing, you could also apply functions from the ‘VIM’ R package, which has a number of commands that are useful for visualizing missing data patterns.

https://cran.r-project.org/web/packages/VIM/vignettes/VIM.html

24.8 Distribution

What is the distribution of each of our phenotypes?
- Are data skewed?
- What is the range of values?
- Is the range of values realistic?

Potentially useful DataExplorer commands to use in this context include:

plot_bar    Plot bar chart
plot_density    Plot density estimates
plot_histogram  Plot histogram
plot_qq Plot QQ plot

Task:: Try out these commands.

Expand to see solution

24.9 Variation

How do our data vary and co-vary?
- Do multiple measures agree with each other?
- Are there sex-specific or age-specific differences?
- Are there trait-specific differences?

Task:: As it is of interest to examine how our traits vary by pre-eclampsia case/control status, we can explore this by using the by="PE_status" argument within the DataExplorer commands to break down the plots drawn in the previous section by PE_status.

Also try creating boxplots using the plot_boxplot command.

Expand to see solution

24.9.1 Bar plots

24.9.2 Box plots

24.9.3 QQ plots

24.9.4 Correlation

For plotting correlation matrices, DataExplorer provides the plot_correlation command.

Task:: Try plot_correlation out, on the subset of numeric columns.

Expand to see solution

24.9.5 `ggpairs`

Use ggpairs from the GGally R package.

Task:: Try it out - apply ggpairs to ds1.

Expand to see solution

Task:: Redraw the ggpairs plot, using the mapping argument to color by PE_status.

To figure out how do this, look at the examples in the ?ggpairs function documentation.

Hint

This cannot be done using the ds1 object because that does not contain any PE_status information.

Expand to see solution

24.9.6 `ggcorr`

Note

The ggcorr function from the GGally R package can also be used to make a correlation matrix plot.

24.10 `DataExplorer`

We can quickly create a report using the create_report function from the DataExplorer R package

create_report(ds.subj)

See

https://boxuancui.github.io/DataExplorer/

24.11 `dataMaid`

The dataMaid R package can also be used to create an exporatory data analysis report.

library(dataMaid)
makeDataReport(ds.subj, output="html")

See

https://www.jstatsoft.org/article/view/v090i06

24.12 `SmartEDA`

The SmartEDA R package also has a command to create an exploratory data analysis report - this command is ExpReport.

library(SmartEDA)
ExpReport(ds.subj, op_file="SmartEDAReport.html")
ExpReport(ds.subj, Target="PE_status", Rc="control", op_file="SmartEDAReportII.html")

For more information, see https://cran.r-project.org/web/packages/SmartEDA/vignettes/SmartEDA.html

24 Exploratory Data Analysis

24.1 Project 1 Data

24.2 Explore Project 1 data

24.3 Dimensions and Names

24.4 Dimensions

24.4.1 Data ds

24.4.2 Data dictionay dd

24.5 Arrangement

24.5.1 Samples or subjects

24.5.2 Unique values

24.5.3 Subject-level data set

24.6 Coding

24.6.1 Recode for understandability

24.7 Missing data

24.8 Distribution

24.9 Variation

24.9.1 Bar plots

24.9.2 Box plots

24.9.3 QQ plots

24.9.4 Correlation

24.9.5 ggpairs

24.9.6 ggcorr

24.10 DataExplorer

24.11 dataMaid

24.12 SmartEDA

24.4.1 Data `ds`

24.4.2 Data dictionay `dd`

24.9.5 `ggpairs`

24.9.6 `ggcorr`

24.10 `DataExplorer`

24.11 `dataMaid`

24.12 `SmartEDA`