4 Active Learning and Readings
4.1 Introduction and Overview
4.1.1 Learning Objectives
- Review the syllabus
- Describe bioinformatics and genetic/genomic data
- Describe dbGaP, an important genomic data repository
4.1.2 Required Reading
Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L, Popova N, Pretel S, Ziyabari L, Lee M, Shao Y, Wang ZY, Sirotkin K, Ward M, Kholodov M, Zbicz K, Beck J, Kimelman M, Shevelev S, Preuss D, Yaschenko E, Graeff A, Ostell J, Sherry ST. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet. 2007 Oct;39(10):1181-6. doi: 10.1038/ng1007-1181. PMID: 17898773; PMCID: PMC2031016. https://pubmed.ncbi.nlm.nih.gov/17898773/
4.1.3 Suggested Readings
Barnes (2007) Chapter 1 Carey MA, Papin JA. Ten simple rules for biologists learning to program. PLoS Comput Biol. 2018;14(1):e1005871. https://doi.org/10.1371/journal.pcbi.1005871
Dudley JT, Butte AJ. A quick guide for developing effective bioinformatics programming skills. PLoS Comput Biol. 2009;5(12):e1000589. https://doi.org/10.1371/journal.pcbi.1000589
4.2 GitHub
4.2.1 Learning Objectives
- To learn how to use GitHub
- To learn how to use GitHub Classroom
- To learn how to use GitHub within RStudio
4.2.2 Online Lecture
GitHub Introduction: https://danieleweeks.github.io/HuGen2071/gitIntro.html
4.2.3 Active Learning
Version Control with git and GitHub (Sections 4.1 - 4.4): https://learning.nceas.ucsb.edu/2020-11-RRCourse/session-4-version-control-with-git-and-github.html
4.2.4 Required Readings
GitHub Classroom Guide for Students
To set up GitHub Classroom, please follow the steps to set up RStudio, R, and git in this detailed guide: https://github.com/jfiksel/github-classroom-for-students
Choose your GitHub user name carefully, as later in your career you may end up using it in a professional context.
Be sure to generate an SSH key so you don’t need to enter your password every time you interact with GitHub.
Do not clone your repository onto a OneDrive or other cloud folder, as git does not work properly on cloud drives. Cloud drive systems typically maintain their own backup copies and this confuses git.
4.2.5 Suggested Readings
Happy Git and GitHub for the useR. https://happygitwithr.com/
Perez-Riverol Y, Gatto L, Wang R, et al. Ten Simple Rules for Taking Advantage of Git and GitHub. PLoS Comput Biol. 2016;12(7):e1004947. https://doi.org/10.1371/journal.pcbi.1004947
Version Control with Git: https://swcarpentry.github.io/git-novice/
Using Git from RStudio: https://ucsbcarpentry.github.io/2020-08-10-Summer-GitBash/24-supplemental-rstudio/index.html
4.3 R: Basics
4.3.1 Learning Objectives
- To become familiar with the R language and concepts
- To learn how to read and write data with R
- To learn control flow: choices and loops
4.3.2 Online Lectures
R Basics: https://danieleweeks.github.io/HuGen2071/RBasicsLecture.html
4.3.3 Active Learning:
4.3.4 Suggested Readings
Buffalo (2015) Chapter 8 ‘R Language Basics’ (Available online through PittCat+)
Read the first four sections, up to the end of ‘Vectors, Vectorization, and Indexing’
https://datacarpentry.org/R-genomics/01-intro-to-R.html
Supplementary Reading: Spector (2008) Chapters 1 & 2 (Available online through PittCat+; link in syllabus)
4.4 R: Factors, Dates, Subscripting
4.4.1 Learning Objectives
- To learn how to subset data with R
- To learn how to handle factors and dates with R
- To learn how to manipulate characters with R
4.4.2 Online Lecture
R: factors, subscripting: https://danieleweeks.github.io/HuGen2071/RFactors.html
4.4.3 Active Learning:
Subsetting: https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting.html. This uses the gapminder data from here.
Factors: https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors.html. This uses data from this Zip file.
4.4.4 Suggested Readings
Buffalo (2015) Chapter 8 ‘R Language Basics’ (Available online through PittCat+)
Read the ‘Factors and classes in R’ subsection at the end of the ‘Vectors, Vectorization, and Indexing’ section.
Read the ‘Exploring Data Through Slicing and Dicing: Subsetting Dataframes’ section.
Read the ‘Working with Strings’ section.
https://datacarpentry.org/R-ecology-lesson/02-starting-with-data.html
Supplementary Readings: Spector (2008) Chapters 4, 5, 6
4.5 R: Character Manipulation
4.5.1 Learning Objectives
- To learn how to handle character data in R
- To learn how to use regular expressions in R
4.5.2 Active Learning
Regular expressions: https://csiro-data-school.github.io/regex/08-r-regexs/index.html
4.5.3 Required Readings
Read the chapter on “Strings” in “R for Data Science”: https://r4ds.hadley.nz/strings
4.5.4 Suggested Readings
See the “String manipulation with stringr cheatsheet” at https://rstudio.github.io/cheatsheets/html/strings.html
Buffalo (2015) Chapter 8 ‘R Language Basics’ (Available online through PittCat+)
Read the ‘Working with Strings’ section at the end of the “Working with and Visualizing Data in R” section.
Read the chapter on “Strings” in “R for Data Science”: https://r4ds.hadley.nz/strings
Read the chapter on “Regular expressions” in “R for Data Science”: https://r4ds.hadley.nz/regexps
Supplementary Reading: Spector (2008) Chapter 7
4.6 R: Loops and Flow Control
4.6.1 Learning Objectives
- To learn how to implement loops in R
- To learn how to control flow in R
- To learn how to vectorize operations
4.6.2 Online Lectures
Loops in R: https://danieleweeks.github.io/HuGen2071/RLoops.html
4.6.3 Active Learning:
Flow control and loops: https://swcarpentry.github.io/r-novice-gapminder/07-control-flow.html
Loops in R, Part I: https://danieleweeks.github.io/HuGen2071/loops.html
Vectorization: https://swcarpentry.github.io/r-novice-gapminder/09-vectorization.html
4.7 R: Functions and Packages, Debugging R
4.7.1 Learning Objectives
- To learn how to write R functions and packages
- To learn how to debug R code
4.7.2 Active Learning:
https://swcarpentry.github.io/r-novice-gapminder/10-functions.html
4.7.3 Suggested Readings
Functions Explained: https://swcarpentry.github.io/r-novice-gapminder/10-functions.html
Buffalo (2015) Chapter 8: Read the section ‘Digression: Debugging R Code’
4.8 R: Tidyverse
4.8.1 Learning Objectives
- To learn how to use the pipe operator
- To learn how to use Tidyverse functions
4.8.2 Active Learning:
https://datacarpentry.org/genomics-r-intro/05-dplyr.html
The data file used in this is the combined_tidy_vcf.csv
file that can be downloaded from here.
4.8.3 Suggested Readings
Introduction to the Tidyverse: Manipulating tibbles with dplyr https://uomresearchit.github.io/r-day-workshop/04-dplyr/
Supplementary Reading: Buffalo (2015) Chapter 8: section ‘Exploring Dataframes with dplyr’
4.9 R: Recoding and Reshaping Data
4.9.1 Learning Objectives
- To learn how to reformat and reshape data in R
4.9.2 Active Learning:
Reshaping data https://sscc.wisc.edu/sscc/pubs/dwr/reshape-tidy.html
Recoding data: Pay particular attention to the Recoding values
and Creating new variables
sections
https://librarycarpentry.org/lc-r/03-data-cleaning-and-transformation.html
4.9.3 Suggested Readings
Supplementary Reading: Spector (2008) Chapters 8 & 9
4.10 R: Merging Data
4.10.1 Learning Objectives
- To learn how to use the R ‘merge’ command
- To learn how to use the R Tidyverse join commands
4.10.2 Active Learning:
https://mikoontz.github.io/data-carpentry-week/lesson_joins.html
continents.RDA
data set used near the end of this Active Learning exercise: https://mikoontz.github.io/data-carpentry-week/data/continents.RDA
4.10.3 Required Reading
Tidy Animated Verbs https://www.garrickadenbuie.com/project/tidyexplain/
4.10.4 Suggested Readings
https://mikoontz.github.io/data-carpentry-week/lesson_joins.html#practice_with_joins_using_gapminder
Supplementary Reading: Buffalo (2015) Chapter 8 ‘Merging and Combining Data’. Spector (2008) Chapter 9.
4.11 R: Traditional Graphics & Advanced Graphics
4.11.1 Learning Objectives
- To learn the basic graphics commands of R
- To learn the R graphing package ggplot2
4.11.2 Active Learning:
Data visualization with ggplot2: https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html
To create the required data for this “Data visualization with ggplot2” exercise, run this code:
library(tidyverse)
download.file(url = "https://ndownloader.figshare.com/files/2292169",
destfile = "portal_data_joined.csv")
surveys <- read_csv("portal_data_joined.csv")
surveys_complete <- surveys %>%
filter(!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
!is.na(sex))
4.11.3 Suggested Readings
Plotting with ggplot2 https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html
Supplementary Reading: Wickham (2009) Chapters 2 & 3
4.12 R: Exploratory Data Analysis
4.12.1 Learning Objectives
- To learn how to summarize data frames
- To learn how to visualize missing data patterns
- To learn how to visualize covariation
4.12.2 Active Learning
Exploratory analysis of RNAseq count data https://tavareshugo.github.io/data-carpentry-rnaseq/02_rnaseq_exploratory.html
4.12.3 Readings
Missing value visualization with tidyverse in R https://towardsdatascience.com/missing-value-visualization-with-tidyverse-in-r-a9b0fefd2246
Suggested Reading: Buffalo (2015) Chapter 8 Sections: Exploring Data Visually with ggplot2 I: Scatterplots and Densities Exploring Data Visually with ggplot2 II: Smoothing Binning Data with cut() and Bar Plots with ggplot2 Using ggplot2 Facets.
4.13 R: Genomic Ranges; Interactive Graphics
4.13.1 Learning Objectives - Genomic Ranges
- To learn about Genomic Ranges
- To learn to use Genomic Ranges to annotate SNPs of interest
4.13.2 Preparation - Genomic Ranges
Before class, install these BioConductor packages: (1) TxDb.Hsapiens.UCSC.hg19.knownGene
, and (2) org.Hs.eg.db
To install these, use these commands:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
BiocManager::install("org.Hs.eg.db")
4.13.3 Required Reading - Genomic Ranges
An Introduction to Bioconductor’s Packages for Working with Range Data
https://github.com/vsbuffalo/genomicranges-intro/blob/master/notes.md
4.13.4 Active Learning - Genomic Ranges
Working with genomics ranges
https://carpentries-incubator.github.io/bioc-project/07-genomic-ranges.html
4.13.5 Suggested Readings - Genomic Ranges
In “Bioinformatics Data Skills”, see Chapter 9 “Working with Range Data”
Bioinformatics Data Skills
Editor: Vince Buffalo
Publisher: O’Reilly
Web access: link
Hello Ranges: An Introduction to Analyzing Genomic Ranges in R.
link
4.13.6 Learning Objectives - Interactive Graphics
- To learn how to use interactive and dynamic graphics to explore your data more thoroughly
- To learn to use plotly
4.13.7 Required Reading - Interactive Graphics
Create interactive ggplot2 graphs with plotly https://www.littlemissdata.com/blog/interactiveplots
4.14 Suggested Reading - Interactive Graphics
Wickham (2009) Chapters 2 & 3
4.15 Data Quality Checking and Filters
4.15.1 Learning Objectives
- To learn the principles of data cleaning
- To practice applying data cleaning principles
- To learn how to check genotype data for quality
4.15.2 Active Learning
To see an example of quality control for SNP genotyping using Illumina genotyping microarrays, please read through this example report:
https://khp-informatics.github.io/COPILOT/README_summary_report.html
For more details, see this Current Protocols paper, which is long and detailed, but you can get most of the main points by concentrating on the Figures:
Patel H, Lee S-H, Breen G, Menzel S, Ojewunmi O, Dobson RJB. The COPILOT Raw Illumina Genotyping QC Protocol. Current Protocols. 2022;2(4):e373. PMID: 35452565 DOI: https://doi.org/10.1002/cpz1.373
4.15.3 Suggested Readings
Kässens JC, Wienbrandt L, Ellinghaus D. BIGwas: Single-command quality control and association testing for multi-cohort and biobank-scale GWAS/PheWAS data. GigaScience. 2021 Jun 1;10(6):giab047. PMID: 34184051 PMCID: PMC8239664 DOI: https://doi.org/10.1093/gigascience/giab047
Brandenburg J-T, Clark L, Botha G, Panji S, Baichoo S, Fields C, Hazelhurst S. H3AGWAS: a portable workflow for genome wide association studies. BMC Bioinformatics. 2022 Nov 19;23(1):498. PMID: 36402955 PMCID: PMC9675212 DOI: https://doi.org/10.1186/s12859-022-05034-w
Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT. Data quality control in genetic case-control association studies. Nat Protoc. 2010 Sep;5(9):1564–1573. DOI: https://doi.org/10.1038/nprot.2010.116
Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, Boehm F, Caporaso NE, Cornelis MC, Edenberg HJ, Gabriel SB, Harris EL, Hu FB, Jacobs KB, Kraft P, Landi MT, Lumley T, Manolio TA, McHugh C, Painter I, Paschall J, Rice JP, Rice KM, Zheng X, Weir BS, GENEVA Investigators. Quality control and quality assurance in genotypic data for genome-wide association studies. Genetic epidemiology. 2010 Sep;34(6):591–602. PMID: 20718045 DOI: https://doi.org/10.1002/gepi.20516
4.16 Unix: Basics
4.16.1 Learning Objectives
- To learn basic Unix commands
4.16.2 Preparation
Watch the online lecture and do the Active Learning before class
Do the Unix setup homework assignment
4.16.3 Online Lecture
Unix Basics: https://danieleweeks.github.io/HuGen2071/unix_basics.html
4.16.4 Active Learning
Software Carpentry Unix Shell intro parts 1-3 https://swcarpentry.github.io/shell-novice/
4.16.5 Required Reading
See Active Learning.
4.16.6 Suggested Reading
Buffalo (2015) Chapter 2. Setting up and managing a bioinformatics project.
Buffalo (2015)Chapter 3. Remedial Unix Shell (beginning of chapter up to and not including “working with streams and redirection”)
Terminus, a web-based game for learning and practicing basic Unix commands https://web.mit.edu/mprat/Public/web/Terminus/Web/main.html
“Chapter 43: Redirecting Input and Output” in Unix Power Tools, 3rd Edition by Jerry Peek, Shelley Powers, Tim O’Reilly, Mike Loukides. Published by O’Reilly Media, Inc. https://pitt.primo.exlibrisgroup.com/permalink/01PITT_INST/e8h8hp/alma9998520758606236
4.17 Unix: Streams, Pipes, Scripts
4.17.1 Learning Objectives
- To learn how streams operate in Unix
- To learn out to pass streamed data from program to program in Unix
- To learn how to interact with running processes
- To learn how to write a script that can run in Unix
- To learn about the cluster and how to submit jobs there
4.17.2 Preparation
- Watch the online lecture and do the Active Learning before class
4.17.3 Online Lecture
Unix: Streams, Pipes, Scripts: https://danieleweeks.github.io/HuGen2071/unix_streams_pipes_scripts.html
4.17.4 Active Learning
Software Carpentry Unix Shell intro parts 4 and 6 https://swcarpentry.github.io/shell-novice/
4.17.5 Required Reading
See Active Learning.
4.17.6 Suggested Reading
Buffalo (2015)Chapter 3. Remedial Unix Shell (from “working with streams and redirection” to and not including “command substitution”)
4.18 Genetic Data Structures
4.18.1 Learning Objectives
By the end of the learning activities on this topic, students will be able to:
- Implement a method for storing relationship information about individuals.
- Implement a method for distinguishing between social gender, biological sex, and sex chromosome complement.
- Create a data set that can store basic pedigree information “by hand.”
4.18.2 Readings
Bennett RL, Steinhaus KA, Uhrich SB, O’Sullivan CK, Resta RG, Lochner-Doyle D, Markel DS, Vincent V, Hamanishi J. Recommendations for standardized human pedigree nomenclature. J Genet Couns. 1995 Dec;4(4):267-79. https://doi.org/10.1007/BF01408073. PMID: 24234481.
Bennett RL, French KS, Resta RG, Doyle DL. Standardized human pedigree nomenclature: update and assessment of the recommendations of the National Society of Genetic Counselors. J Genet Couns. 2008 Oct;17(5):424-33. https://doi.org/10.1007/s10897-008-9169-9. Epub 2008 Sep 16. PMID: 18792771.
Bennett RL, French KS, Resta RG, Austin J. Practice resource-focused revision: Standardized pedigree nomenclature update centered on sex and gender inclusivity: A practice resource of the National Society of Genetic Counselors. J Genet Couns. 2022 Sep 15. https://doi.org/10.1002/jgc4.1621. Epub ahead of print. PMID: 36106433.
Montañez A. Beyond XX and XY: The Extraordinary Complexity of Sex Determination. Sci Am. 2017 Sep;317(3):50. https://doi.org/10.1038/scientificamerican0917-50.
Montañez A. Visualizing Sex as a Spectrum. Sci Am blog. 2017 Aug 29. https://www.scientificamerican.com/blog/sa-visual/visualizing-sex-as-a-spectrum/.
4.19 PLINK I
4.19.1 Learning Objectives
- Describe PLINK formats
- Create PLINK datafiles
- Use PLINK to perform genetic association testing
4.19.2 Readings
Marees AT, de Kluiver H, Stringer S, Vorspan F, Curis E, Marie-Claire C, Derks EM. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res. 2018 Jun;27(2):e1608. PMID: 29484742 PMCID: PMC6001694 DOI: https://doi.org/10.1002/mpr.1608
https://github.com/MareesAT/GWA_tutorial/
Introduction to PLINK (22n14-rlm-Introduction_to_PLINK.pdf, which is included in this lecture’s repository).
4.20 PLINK II
4.20.1 Learning Objectives
- To learn how to use PLINK to manipulate data files
4.21 PLINK Computer Lab
4.21.1 Learning Objectives
- To practice using PLINK to manipulate data files
4.22 Unix: Data Manipulation
4.22.1 Learning Objectives
- To learn Unix tools like sed and awk that can be used to manipulate data
4.22.2 Preparation
- Watch the online lecture and do the Active Learning before class
4.22.3 Online Lecture
Unix Data Manipulation: https://danieleweeks.github.io/HuGen2071/unix_data_manipulation.html
See Required Reading.
4.22.4 Active Learning
See Required Reading.
4.22.5 Required Reading
Buffalo (2015)Chapter 7. Unix Data Tools (Beginning of chapter up to and including “Finding Unique values in Uniq”)
4.22.6 Suggested Reading
None.
4.23 Unix: Miscellaneous
4.23.1 Learning Objectives
- To learn to string programs together to process data
- To learn how to parallelize functions in Unix
4.23.2 Preparation
- Watch the online lecture and do the Active Learning before class
4.23.3 Online Lecture
Unix Miscellaneous: https://danieleweeks.github.io/HuGen2071/unix_miscellaneous.html
4.23.4 Active Learning
See Required Reading.
4.23.5 Required Reading
Buffalo (2015)Chapter 7. Unix Data Tools (“Join” through the end of the chapter)
4.23.6 Suggested Reading
None.
4.24 Unix: Scripting
4.24.1 Learning Objectives - Unix: Scripting
- To learn how to use control structures in Unix scripting
- To learning how to use variables in Unix
4.24.2 Preparation - Unix: Scripting
- Do the Active Learning before class - the lecture will assume you have; otherwise you will have difficulty with the in-class exercises
4.24.3 Active Learning - Unix: Scripting
Software Carpentry Unix Shell intro parts 5 and 7 https://swcarpentry.github.io/shell-novice/
4.24.4 Required Reading - Unix: Scripting
See Active Learning.
4.24.5 Suggested Reading - Unix: Scripting
Buffalo (2015)Chapter 3. Remedial Unix Shell (“command substitution” through the end of the chapter.)
Buffalo (2015)Chapter 12. Bioinformatics Shell Scripting (entire chapter)
4.25 VCF, bcftools, vcftools
4.25.1 Learning Objectives
- To learn about VCF data format
- To learn about bcftools and vcftools for manipulating VCF files
4.26 SAM & samtools
4.26.1 Learning Objectives
- To learn about SAM data format for sequence data
- To learn about samtools to manipulate SAM data files
4.26.2 Readings
Buffalo Chapter 11 “Working with Alignment Data”
Data Wrangling and Processing for Genomics https://data-lessons.github.io/wrangling-genomics/
Relevant links: The Sequence Alignment/Map Format Specification http://samtools.github.io/hts-specs/
4.27 Genetic Data in R, GDS
4.27.1 Learning Objectives - Genetic Data in R, GDS
- To learn about data structures in R for storing genetic data
- To learn about the GDS format
4.27.2 Preparation - Genetic Data in R, GDS
See Required Reading.
4.27.3 Active Learning - Genetic Data in R, GDS
None. See Required Reading.
4.27.4 Required Reading - Genetic Data in R, GDS
Zheng X, Gogarten SM, Lawrence M, Stilp A, Conomos MP, Weir BS, Laurie C, Levine D. SeqArray-a storage-efficient high-performance data format for WGS variant calls. Bioinformatics. 2017 Aug 1;33(15):2251-2257. doi: 10.1093/bioinformatics/btx145. PMID: 28334390; PMCID: PMC5860110. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860110/
4.27.5 Suggested Reading - Genetic Data in R, GDS
None