23 R Exploratory Data Analysis Exercise
24 Exploratory Data Analysis
- Goal: Understand your data
- Ask questions
- Understand each phenotype
- Understand how each phenotype varies
- Understand how the phenotypes are related to each other
- Understand how the data are organized
- Plot often, plot everything!
- Are the data consistent with your expectations?
- “Unusual” data may be erroneous.
24.1 Project 1 Data
In the ds
data frame we have the synthetic yet realistic data we will be using in Project 1.
In the dd
data frame we have the corresponding data dictionary.
24.2 Explore Project 1 data
Let’s explore the Project 1 data set:
- ds = data set
- dd = data dictionary
Project 1 Questions
- Which of the measurements are sample-specific?
- Which are subject-specific?
- How to structure the data for sharing via dbGaP?
24.3 Dimensions and Names
- What are the names of the variables in our data files?
- What are the dimensions of our data?
24.4 Dimensions
Task: Examine the names and dimensions of our data and data dictionary.
Here the commands names()
, dim()
, head()
, and tail()
are useful.
Another useful command is glimpse()
. Its man page describes it as:
glimpse() is like a transposed version of print(): columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It’s a little like str() applied to a data frame but it tries to show you as much data as possible.
24.4.1 Data ds
24.4.2 Data dictionay dd
24.5 Arrangement
- How are the data arranged?
- Is it in tidy format?
- Is it one row per sample or per subject?
- Were subjects sampled more than once?
24.5.1 Samples or subjects
Is it one row per sample or per subject?
Question: How would you figure out the answer to this question?
24.5.2 Unique values
To figure out which phenotypes vary within subjects, it would be helpful to answer this question:
Question: How can we figure out the number of unique values in each column of our ds
data frame?
The R commands unique()
and length()
could be useful here.
Write a simple function to do this for a single selected column of the ds
data frame, and then figure out how to apply your function to every column of your data frame.
Regarding this sapply(ds,function(x) {length(unique(x))})
approach, sapply
is defined as:
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
where X
is an R object, and FUN
is the function to be applied to each element of X
. When X
is a data frame, sapply
feeds each column of X
into the FUN
function, one by one. Here’s an example where we first apply the max
function to each column of df
, and then apply the min
function to each column of df
:
> df
a b c d
1 3 4 1 2
2 1 1 4 4
3 4 2 2 1
4 2 3 3 3
> sapply(df, max)
a b c d
4 4 4 4
> sapply(df, min)
a b c d
1 1 1 1
So we know that when using sapply
on a data frame, its first argument is the name of the data frame (ds
) and its second argument is the function that we wish to apply to each column of the data frame.
Here, in the sapply(ds, function(x) {length(unique(x))})
command we are defining our own function as:
function(x) {
length(unique(x))
}
When applied by sapply
to each column of ds
, this function will return the number of unique values in each column.
The same output can also be generated using the map
function from the purrr
R package.
A similar related question is: How would you count the number of subjects who have more than one distinct measure at each of the phenotypes?
Task: For each phenotype, total the count the number of subjects who have more than one distinct measure.
One approach for doing this would be to take the phenotype and group by subject_id
and count distinct values within those subject-specific groups, and then add up the total number of subjects who have more than one distinct value.
The variables with a count of zero here are those where no more than 1 distinct value was observed for each subject. These are likely subject-level variables. Indeed, for the majority of these, the variable names are consistent with them being subject-level variables instead of variables that are measured every time a sample was taken.
Here’s another solution for this problem, written by Tianze (Vincent) Luo, a student in our Fall 2024 HuGen 2071 course:
24.5.3 Subject-level data set
For submission to dbGaP, we need to separate the phenotype data into subject-level data and sample-level data.
The dbGaP submission guide here
https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/#12-what-data-must-be-included-in
discusses the distinction between sample-level phenotypes and subject-level phenotypes as follows:
For the Subject Phenotypes, it would be data relevant to the individual person. For the Sample Attributes, it would be data relevant to the sample derived from the person. For instance, do not list the RACE variable in the Sample Attributes, since RACE is stable for a person across samples. However, for variables like TREATMENT, if the person was only treated once, and data was collected, then TREATMENT could belong in the Subject Phenotypes table. However, if TREATMENT was completed multiple times, and each time a sample was extracted, then it would be better for TREATMENT to be tracked in the Sample Attributes table.
Hint: Remember that in the original study design, a blood sample was to be taken once per trimester. Each of these blood samples was assigned a sample ID, and for each of these, the trimester (1, 2, or 3) and the gestational age of the baby at the time of sampling was recorded. When the synthetic data were generated, not everyone ended up with exactly 3 samples.
Task: Construct a subject-level data set ds.subj
How would you construct a subject-level data set?
We need to drop the sample-specific measures, retaining only subject-level measures, and then select unique records:
But there is a duplicated record where race
differs but all other attributes are identical, so we filter one of those two records out:
Note: When you notice a duplicated record like that, you should report the discrepancy back to the person primarily in charge of the data, so they can correct their source data.
24.6 Coding
- How are the data coded?
- Are they coded correctly?
- Which are categorical and which are continuous?
- Are they coded consistently with the data dictionary?
- Is there a data dictionary?
- Do we need to skip rows when reading the data in?
24.6.1 Recode for understandability
Using the subject-level data set ds.subj
, let’s recode case_control_status
from 0
and 1
into a new PE_status
variable coded as control
and case
.
First, look up the coding used for the case_control_status
variable in the Data Dictionary dd
:
Task:: Using the subject-level data set ds.subj
, recode case_control_status
from 0
and 1
into a new PE_status
variable coded as control
and case
.
Coding case_control_status
as a factor might be helpful here.
Another approach would be to use mutate
and case_when
commands from the Tidyverse.
So the data dictionary gives the meaning of the 0
and 1
codes:
“0: normotensive control; 1: preeclampsia case”
Recoding could also be done using Tidyverse function:
24.7 Missing data
- What is the pattern of missing data?
- How are missing data coded?
- Is there a single missing data code?
Here we could use plot_missing
from the DataExplorer
R package.
https://boxuancui.github.io/DataExplorer/index.html
Task:: Try out plot_missing
on the subject-level data set ds.subj
.
It is kind of unusual to have no missing data in a real data set.
To see what the output might look like when there is some missing data, let’s introduce some using the createNAs
function from this StackOverflow entry:
When there is some missing data, in addition to applying plot_missing
and profile_missing
, you could also apply functions from the ‘VIM’ R package, which has a number of commands that are useful for visualizing missing data patterns.
https://cran.r-project.org/web/packages/VIM/vignettes/VIM.html
24.8 Distribution
- What is the distribution of each of our phenotypes?
- Are data skewed?
- What is the range of values?
- Is the range of values realistic?
Potentially useful DataExplorer
commands to use in this context include:
plot_bar Plot bar chart
plot_density Plot density estimates
plot_histogram Plot histogram
plot_qq Plot QQ plot
Task:: Try out these commands.
24.9 Variation
- How do our data vary and co-vary?
- Do multiple measures agree with each other?
- Are there sex-specific or age-specific differences?
- Are there trait-specific differences?
Task:: As it is of interest to examine how our traits vary by pre-eclampsia case/control status, we can explore this by using the by="PE_status"
argument within the DataExplorer
commands to break down the plots drawn in the previous section by PE_status
.
Also try creating boxplots using the plot_boxplot
command.
24.9.1 Bar plots
24.9.2 Box plots
24.9.3 QQ plots
24.9.4 Correlation
For plotting correlation matrices, DataExplorer
provides the plot_correlation
command.
Task:: Try plot_correlation
out, on the subset of numeric columns.
24.9.5 ggpairs
Use ggpairs
from the GGally
R package.
Task:: Try it out - apply ggpairs
to ds1
.
Task:: Redraw the ggpairs
plot, using the mapping
argument to color by PE_status
.
To figure out how do this, look at the examples in the ?ggpairs
function documentation.
This cannot be done using the ds1
object because that does not contain any PE_status
information.
24.9.6 ggcorr
The ggcorr
function from the GGally
R package can also be used to make a correlation matrix plot.
24.10 DataExplorer
We can quickly create a report using the create_report
function from the DataExplorer
R package
create_report(ds.subj)
See
24.11 dataMaid
The dataMaid
R package can also be used to create an exporatory data analysis report.
library(dataMaid)
makeDataReport(ds.subj, output="html")
See
24.12 SmartEDA
The SmartEDA
R package also has a command to create an exploratory data analysis report - this command is ExpReport
.
library(SmartEDA)
ExpReport(ds.subj, op_file="SmartEDAReport.html")
ExpReport(ds.subj, Target="PE_status", Rc="control", op_file="SmartEDAReportII.html")
For more information, see https://cran.r-project.org/web/packages/SmartEDA/vignettes/SmartEDA.html