library(tidyverse)
library(knitr)
library(pander)
<- "data/trait.qassoc.gz"
ASSOC.file <- read_table(ASSOC.file)
b qplot(BP,-1.0*log10(P),main="Association p-values for the quantitative trait",data=b) + facet_wrap(~ CHR)+
geom_hline(yintercept=5,col='red') +
geom_hline(yintercept=4,col='green')
25 GRanges Exercise
In an example genome-wide association scan, we had several hits:
Here are our top hits:
<- b[order(b$P),c("CHR","BP","SNP","P")]
top.hits row.names(top.hits) <- NULL
pander(head(top.hits),caption="Top hits from our genome-wide association scan")
CHR | BP | SNP | P |
---|---|---|---|
6 | 1942538 | rs3800143 | 4.608e-11 |
11 | 4913057 | rs10836914 | 1.295e-07 |
11 | 4913314 | rs12577475 | 1.587e-07 |
6 | 1926650 | rs11242725 | 3.785e-07 |
6 | 1880964 | rs3800116 | 1.475e-06 |
6 | 1915129 | rs3778552 | 1.646e-06 |
Now we’d like to annotate each of these SNPs in terms of nearby or overlapping genes.
While we could do this using online databases like UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway), we could also use BioConductor R packages to write our own functions to retrieve this information into a nice tabular format, for each SNP, listing any gene that they might be in, including the SNP’s position and the gene boundaries.
25.1 Introductory Background
An Introduction to Bioconductor’s Packages for Working with Range Data
https://github.com/vsbuffalo/genomicranges-intro/blob/master/notes.md
25.2 Active Learning
Working with genomics ranges
https://carpentries-incubator.github.io/bioc-project/07-genomic-ranges.html
25.3 Suggested readings
In “Bioinformatics Data Skills”, see Chapter 9 “Working with Range Data”
Bioinformatics Data Skills
Editor: Vince Buffalo
Publisher: O’Reilly
Web access: link
Hello Ranges: An Introduction to Analyzing Genomic Ranges in R.
link
25.4 Install the needed Bioconductor libraries (one time only)
First, we need to figure out which genomic build was used in our data set. I usually do this by looking up the base pair positions of a couple of SNPs by hand in the ‘SNP’ data base. So if we go there and search for rs3800143
, we end up on this web page:
https://www.ncbi.nlm.nih.gov/snp/rs3800143#variant_details
which shows a Build GRCh37.p13 position of 1942538 on chromosome 6. So it looks like Build GRCh37.p13 was used, which is also known as Build hg19.
To double check, we can check that the given position for rs10836914 matches the Build GRCh37.p13 position.
So to determine the gene boundaries in Build hg19, we need to download and install the TxDb.Hsapiens.UCSC.hg19.knownGene
library from BioConductor.
So we search for it on BioConductor and that it can be installed by issuing these commands at the R prompt:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
You will also need to install another library, org.Hs.eg.db
:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
25.5 Construct the gene list
Using the annotation embedded in TxDb.Hsapiens.UCSC.hg19.knownGene
, construct a GRangesList
object that contains a list of all the genes.
Hint: Use transcriptsBy
.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
<- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb class(txdb)
[1] "TxDb"
attr(,"package")
[1] "GenomicFeatures"
The transcripts(txdb)
is a GenomicRanges
object:
transcripts(txdb)
GRanges object with 82960 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 11874-14409 + | 1 uc001aaa.3
[2] chr1 11874-14409 + | 2 uc010nxq.1
[3] chr1 11874-14409 + | 3 uc010nxr.1
[4] chr1 69091-70008 + | 4 uc001aal.1
[5] chr1 321084-321115 + | 5 uc001aaq.2
... ... ... ... . ... ...
[82956] chrUn_gl000237 1-2686 - | 82956 uc011mgu.1
[82957] chrUn_gl000241 20433-36875 - | 82957 uc011mgv.2
[82958] chrUn_gl000243 11501-11530 + | 82958 uc011mgw.1
[82959] chrUn_gl000243 13608-13637 + | 82959 uc022brq.1
[82960] chrUn_gl000247 5787-5816 - | 82960 uc022brr.1
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
We now group the transcripts by gene, and so create a GRangesList
object:
<- transcriptsBy(txdb, "gene")
tx.by.gene tx.by.gene
GRangesList object of length 23459:
$`1`
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr19 58858172-58864865 - | 70455 uc002qsd.4
[2] chr19 58859832-58874214 - | 70456 uc002qsf.2
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
$`10`
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr8 18248755-18258723 + | 31944 uc003wyw.1
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
$`100`
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr20 43248163-43280376 - | 72132 uc002xmj.3
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
...
<23456 more elements>
The names of the list elements are Entrez gene IDs.
25.6 Construct a GRange containing our top SNP
Construct a GRange containing the top SNP. Note that the chromosome name needs to be in the same style as is seen in tx.by.gene
. That is, the chromosome name needs to be “chr6” instead of ‘6’.
<- with(top.hits[1,], GRanges(seqnames=paste0("chr",CHR),
top.snp IRanges(start=BP, width=1),
rsid=SNP, P=P))
top.snp
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | rsid P
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr6 1942538 * | rs3800143 4.608e-11
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
25.7 Search for match
Now use findOverlaps
and subsetByOverlaps
to find any genes that overlap our top SNP.
<- findOverlaps(tx.by.gene,top.snp)
top.snp.overlaps top.snp.overlaps
Hits object with 1 hit and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 8841 1
-------
queryLength: 23459 / subjectLength: 1
<- subsetByOverlaps(tx.by.gene,top.snp)
hits hits
GRangesList object of length 1:
$`2762`
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr6 1624035-2176225 - | 25755 uc021ykn.1
[2] chr6 1624035-2245868 - | 25756 uc003mtq.3
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
So our top SNP rs3800143 overlaps two transcripts in a single gene with the same start position of 1624035.
However, the transcript name is not very human-readable, as it uses an Entrez Gene ID 2762.
25.8 Convert Entrez Gene IDs to Gene Names
Convert the integer Entrez Gene ID 2762 to a human-readable Gene Name.
This can be done using the org.Hs.eg.db
R library.
To convert Entrez Gene IDs to Gene Names, we can use another R database:
library(org.Hs.eg.db)
columns(org.Hs.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
names(hits)
[1] "2762"
<- select(org.Hs.eg.db, keys=names(hits), columns=c("ENTREZID", "SYMBOL", "GENENAME"), keytype="ENTREZID") gene.name
'select()' returned 1:1 mapping between keys and columns
gene.name
ENTREZID SYMBOL GENENAME
1 2762 GMDS GDP-mannose 4,6-dehydratase
So our top SNP rs3800143 overlaps the gene with the Entrez Gene ID 2762, which is also known as GMDS (GDP-mannose 4,6-dehydratase).
25.9 Question 1
Read in the data:
<- "data/trait.qassoc.gz"
ASSOC.file <- read_table(ASSOC.file) b
Warning: Missing column names filled in: 'X10' [10]
── Column specification ────────────────────────────────────────────────────────
cols(
CHR = col_double(),
SNP = col_character(),
BP = col_double(),
NMISS = col_double(),
BETA = col_double(),
SE = col_double(),
R2 = col_double(),
T = col_double(),
P = col_double(),
X10 = col_logical()
)
head(b)
# A tibble: 6 × 10
CHR SNP BP NMISS BETA SE R2 T P X10
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 6 rs9392298 203878 1573 0.110 0.0984 0.000801 1.12 0.262 NA
2 6 rs9405486 204072 1569 0.118 0.0982 0.000913 1.20 0.232 NA
3 6 rs7762550 204909 1574 0.0384 0.0685 0.000199 0.560 0.576 NA
4 6 rs1418706 205878 1575 -0.00783 0.0717 0.00000757 -0.109 0.913 NA
5 6 rs6920539 206528 1574 0.119 0.0955 0.000984 1.24 0.214 NA
6 6 rs9502959 206599 1575 -0.0317 0.0538 0.000220 -0.589 0.556 NA
Task: Using GRanges and associated Bioconductor tools, figure out if our top SNP, as ranked by the P-value (column P
) from data frame b
lies within an intron or exon.
For this, the functions exonsBy
and intronsByTranscript
might be useful.
25.10 Answer 1
The top SNP rs3800143 lies in an intron of GMDS (GDP-mannose 4,6-dehydratase), Gene ID: 2762.
Build hg19 (Human Feb. 2009 (GRCh37/hg19)) gene boundaries from UCSC Genes:
GMDS (uc003mtq.3) - chr6:1624035-2245868 - Homo sapiens GDP-mannose 4,6-dehydratase (GMDS), transcript variant 1, mRNA.
Does our top SNP overlap any exons?
<- with(top.hits[1,], GRanges(seqnames=paste0("chr",CHR),
top.snp IRanges(start=BP, width=1),
rsid=SNP, P=P))
top.snp
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | rsid P
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr6 1942538 * | rs3800143 4.608e-11
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
<- exonsBy(txdb, "gene")
exons.by.gene exons.by.gene
GRangesList object of length 23459:
$`1`
GRanges object with 15 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr19 58858172-58858395 - | 250809 <NA>
[2] chr19 58858719-58859006 - | 250810 <NA>
[3] chr19 58859832-58860494 - | 250811 <NA>
[4] chr19 58860934-58862017 - | 250812 <NA>
[5] chr19 58861736-58862017 - | 250813 <NA>
... ... ... ... . ... ...
[11] chr19 58868951-58869015 - | 250821 <NA>
[12] chr19 58869318-58869652 - | 250822 <NA>
[13] chr19 58869855-58869951 - | 250823 <NA>
[14] chr19 58870563-58870689 - | 250824 <NA>
[15] chr19 58874043-58874214 - | 250825 <NA>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
$`10`
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr8 18248755-18248855 + | 113603 <NA>
[2] chr8 18257508-18258723 + | 113604 <NA>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
...
<23457 more elements>
<- findOverlaps(exons.by.gene,top.snp)
top.snp.overlaps top.snp.overlaps
Hits object with 0 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
-------
queryLength: 23459 / subjectLength: 1
<- subsetByOverlaps(exons.by.gene,top.snp)
hits hits
GRangesList object of length 0:
<0 elements>
From the output above, we find that our top SNP overlaps 0 sets of exons grouped by transcripts.
Does our top SNP overlap any introns?
<- intronsByTranscript(txdb)
introns.by.transcript
<- findOverlaps(introns.by.transcript,top.snp)
top.snp.overlaps top.snp.overlaps
Hits object with 2 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 25755 1
[2] 25756 1
-------
queryLength: 82960 / subjectLength: 1
<- subsetByOverlaps(introns.by.transcript,top.snp)
hits hits
GRangesList object of length 2:
$`25755`
GRanges object with 10 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr6 1624467-1624706 -
[2] chr6 1624776-1726649 -
[3] chr6 1726747-1742701 -
[4] chr6 1742821-1930336 -
[5] chr6 1930465-1960100 -
[6] chr6 1960206-1961007 -
[7] chr6 1961201-2116004 -
[8] chr6 2116115-2117702 -
[9] chr6 2117791-2124920 -
[10] chr6 2124966-2176165 -
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
$`25756`
GRanges object with 10 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr6 1624467-1624706 -
[2] chr6 1624776-1726649 -
[3] chr6 1726747-1742701 -
[4] chr6 1742821-1930336 -
[5] chr6 1930465-1960100 -
[6] chr6 1960206-1961007 -
[7] chr6 1961201-2116004 -
[8] chr6 2116115-2117702 -
[9] chr6 2117791-2124920 -
[10] chr6 2124966-2245554 -
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
From the output above, we find that our top SNP overlaps 2 sets of introns grouped by transcripts.
25.11 Question 2
While we have figure out above which gene our top SNP is in, what we worked out above involves very specific code. Now let’s try to generalize what we did above:
Task: Using GRanges and associated Bioconductor tools, write a function that takes as input a ranked list of the SNPS, and returns a nice table that lists the top N of these SNPs and any gene that they might be in, including the SNP position and the gene boundaries.
# snp.list = the ranked list of the top SNPs
# N = the number of top SNPs to annotate
snp.table <- function(snp.list, N=15) {
}
Apply your snp.table
function to the top 15 SNPs in our example data set.
Use the subsetByOverlaps
followed by the select(org.Hs.eg.db, ...
approach described above in the Search for match
and Convert Entrez Gene IDs to Gene Names
sections above.
25.12 Answer 2
It is important to understand what the output of the findOverlaps
function means.
top.snp.gene <- findOverlaps(gene.bounds, top.snp)
top.snp.gene
returns:
Hits object with 10 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 3269 13
[2] 8841 5
[3] 8841 11
[4] 8841 7
[5] 8841 6
[6] 8841 4
[7] 8841 1
[8] 8841 8
[9] 17049 14
[10] 18258 9
-------
queryLength: 23459 / subjectLength: 15
Here, the ‘subjectHits’ is an index into top.snp
and the queryHits
is an index into gene.bounds
.
So the first line indicates that the 13th SNP in top.snp
overlaps the gene that is in the 3,269 slot of the gene.bounds
object.
# snp.list = the ranked list of the top SNPs
# N = the number of top SNPs to annotate
<- function(snp.list, N=15) {
snp.table require(org.Hs.eg.db)
require(TxDb.Hsapiens.UCSC.hg19.knownGene)
<- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- transcriptsBy(txdb, "gene")
tx.by.gene # Find the gene boundaries
<- reduce(tx.by.gene)
gene.bounds # Set up a GRange with the first N top SNPs
<- with(snp.list[1:N, ], GRanges(seqnames = paste0("chr", CHR),
top.snp IRanges(start = BP, width = 1),
SNP = SNP, P = P))
# Find overlaps and hits
<- findOverlaps(gene.bounds, top.snp)
top.snp.gene <- subsetByOverlaps(gene.bounds, top.snp)
hits # The SNP hits
<- data.frame(SNP.ID = subjectHits(top.snp.gene),
snp.info SNP.chr = seqnames(top.snp[subjectHits(top.snp.gene)]),
ranges(top.snp[subjectHits(top.snp.gene)]),
mcols(top.snp[subjectHits(top.snp.gene)]))
# The Gene hits
<- data.frame(gene.ID = queryHits(top.snp.gene),
gene.info seqnames(gene.bounds[queryHits(top.snp.gene)]),
ranges(gene.bounds[queryHits(top.snp.gene)]))
# Reduce gene.info to distinct entries
<- gene.info %>% dplyr::select(-group, -group.1) %>% distinct()
gene.info # Construct a key linking SNPs to Genes
<- data.frame(gene.ID = queryHits(top.snp.gene),
key SNP.ID = subjectHits(top.snp.gene))
<- key %>% left_join(gene.info, by = "gene.ID") %>%
snp.gene left_join(snp.info, by = "SNP.ID")
<- select(org.Hs.eg.db, keys = names(hits),
gene.name columns = c("ENTREZID", "SYMBOL", "GENENAME"), keytype = "ENTREZID")
<- snp.gene %>% dplyr::rename(ENTREZID = group_name) %>%
snp.gene left_join(gene.name, by = "ENTREZID")
<- snp.gene %>% arrange(P)
snp.gene <- left_join(snp.list[1:N, ], snp.gene, by = c("SNP"))
snp.gene <- snp.gene %>%
snp.gene ::select(CHR, BP, SNP, P.x, SYMBOL, GENENAME, start.x, end.x, width.x)
dplyr<- snp.gene %>%
snp.gene ::rename(P = P.x, start = start.x, end = end.x, width = width.x)
dplyrreturn(snp.gene)
}
<- snp.table(top.hits, 15)) %>% kable(digits=15) (t
Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
'select()' returned 1:1 mapping between keys and columns
CHR | BP | SNP | P | SYMBOL | GENENAME | start | end | width |
---|---|---|---|---|---|---|---|---|
6 | 1942538 | rs3800143 | 4.608e-11 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 |
11 | 4913057 | rs10836914 | 1.295e-07 | NA | NA | NA | NA | NA |
11 | 4913314 | rs12577475 | 1.587e-07 | NA | NA | NA | NA | NA |
6 | 1926650 | rs11242725 | 3.785e-07 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 |
6 | 1880964 | rs3800116 | 1.475e-06 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 |
6 | 1915129 | rs3778552 | 1.646e-06 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 |
6 | 1908210 | rs2875711 | 2.550e-06 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 |
6 | 1955398 | rs9378664 | 2.809e-06 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 |
6 | 143144885 | rs926372 | 5.704e-06 | HIVEP2 | HIVEP zinc finger 2 | 143072604 | 143266338 | 193735 |
11 | 4912192 | rs10836912 | 8.977e-06 | NA | NA | NA | NA | NA |
6 | 1899671 | rs3800122 | 1.022e-05 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 |
11 | 4910224 | rs2570591 | 1.656e-05 | NA | NA | NA | NA | NA |
11 | 6340706 | rs1051992 | 2.356e-05 | CAVIN3 | caveolae associated protein 3 | 6340176 | 6341740 | 1565 |
6 | 3317016 | rs4959804 | 2.821e-05 | SLC22A23 | solute carrier family 22 member 23 | 3269207 | 3456793 | 187587 |
11 | 4915072 | rs1816448 | 2.834e-05 | NA | NA | NA | NA | NA |
dim(t)
[1] 15 9
Here’s an alternate way to construct the SNP table which is an incomplete example because it doesn’t include the gene boundaries. However, it is simpler than the function above because it avoids the construction of a key and the left_joins used above by looping through each SNP, one by one:
<- function(snp.list, N=15) {
snp.table2 require(TxDb.Hsapiens.UCSC.hg19.knownGene)
require(org.Hs.eg.db)
<- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- transcriptsBy(txdb, "gene")
tx.by.gene <- data.frame()
gene.name
for (i in 1:N) {
1] <- snp.list[i,]$CHR
gene.name[i, 2] <- snp.list[i,]$BP
gene.name[i, 3] <- snp.list[i,]$SNP
gene.name[i, 4] <- snp.list[i,]$P
gene.name[i, <- with(snp.list[i,], GRanges(seqnames=paste0("chr", CHR),
top.snp IRanges(start=BP, width=1),
rsid=SNP, P=P))
<- subsetByOverlaps(tx.by.gene, top.snp)
hits <- select(org.Hs.eg.db, keys=names(hits), columns=c("SYMBOL", "GENENAME"), keytype="ENTREZID")
genename 5] <- ifelse(nrow(genename) == 0, NA, genename[[2]])
gene.name[i, 6] <- ifelse(nrow(genename) == 0, NA, genename[[3]])
gene.name[i,
}colnames(gene.name) <- c("CHR", "BP", "SNP", "P", "Symbol", "Gene name")
return(gene.name)
}
<- snp.table2(top.hits, 15)) %>% kable(digits=15) (t
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
CHR | BP | SNP | P | Symbol | Gene name |
---|---|---|---|---|---|
6 | 1942538 | rs3800143 | 4.608e-11 | GMDS | GDP-mannose 4,6-dehydratase |
11 | 4913057 | rs10836914 | 1.295e-07 | NA | NA |
11 | 4913314 | rs12577475 | 1.587e-07 | NA | NA |
6 | 1926650 | rs11242725 | 3.785e-07 | GMDS | GDP-mannose 4,6-dehydratase |
6 | 1880964 | rs3800116 | 1.475e-06 | GMDS | GDP-mannose 4,6-dehydratase |
6 | 1915129 | rs3778552 | 1.646e-06 | GMDS | GDP-mannose 4,6-dehydratase |
6 | 1908210 | rs2875711 | 2.550e-06 | GMDS | GDP-mannose 4,6-dehydratase |
6 | 1955398 | rs9378664 | 2.809e-06 | GMDS | GDP-mannose 4,6-dehydratase |
6 | 143144885 | rs926372 | 5.704e-06 | HIVEP2 | HIVEP zinc finger 2 |
11 | 4912192 | rs10836912 | 8.977e-06 | NA | NA |
6 | 1899671 | rs3800122 | 1.022e-05 | GMDS | GDP-mannose 4,6-dehydratase |
11 | 4910224 | rs2570591 | 1.656e-05 | NA | NA |
11 | 6340706 | rs1051992 | 2.356e-05 | CAVIN3 | caveolae associated protein 3 |
6 | 3317016 | rs4959804 | 2.821e-05 | SLC22A23 | solute carrier family 22 member 23 |
11 | 4915072 | rs1816448 | 2.834e-05 | NA | NA |
dim(t)
[1] 15 6
25.13 Question 3
Task: Using GRanges and associated Bioconductor tools, write a function that takes as input a ranked list of the SNPS, and returns a nice table that lists the top N of these SNPs and the closest flanking genes on both sides, including the SNP position and the gene boundaries.
# snp.list = the ranked list of the top SNPs
# N = the number of top SNPs to annotate
snp.flank <- function(snp.list, N=15) {
}
Apply your snp.flank
function to the top 15 SNPs in our example data set.
Note that the purpose here is to think more about how to creatively use GRanges than to search for nice annotation packages.
The GenomicRanges
package has a useful function nearest
that can be used to find the nearest gene.
Note that if we want to augment information about the nearest gene with the genes that precede and follow each top SNP, we could use functions like this:
genes <- genes(txdb, columns = "gene_id")
precede(top.snp, genes)
follow(top.snp, genes)
25.14 Answer 3
Note that we need to set up the txdb
object within the function itself to avoid creating a dependence on the global txdb
object. Although, it might be more efficient to create the txdb
object once and pass it in via a function parameter instead of recreating it each time the function is called.
Here we use require
statements to indicate dependence on certain libraries having been loaded, but if this were part of an R package, we’d take care of library dependencies at the package level instead of inside of specific functions. Usually when writing functions, we assume that all of the required packages have been loaded, so we tend not to use the library
or the require
command within functions.
# snp.list = the ranked list of the top SNPs
# N = the number of top SNPs to annotate
<- function(snp.list, N=15) {
snp.flank require(org.Hs.eg.db)
require(TxDb.Hsapiens.UCSC.hg19.knownGene)
<- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- transcriptsBy(txdb, "gene")
tx.by.gene # Find the gene boundaries
<- reduce(tx.by.gene)
gene.bounds # Set up a GRange with the first N top SNPs
<- with(snp.list[1:N, ], GRanges(seqnames = paste0("chr", CHR),
top.snp IRanges(start = BP, width = 1),
SNP = SNP, P = P))
# Find overlaps and hits
<- findOverlaps(gene.bounds, top.snp)
top.snp.gene <- subsetByOverlaps(gene.bounds, top.snp)
hits # The SNP hits
<- data.frame(SNP.ID = subjectHits(top.snp.gene),
snp.info SNP.chr = seqnames(top.snp[subjectHits(top.snp.gene)]),
ranges(top.snp[subjectHits(top.snp.gene)]),
mcols(top.snp[subjectHits(top.snp.gene)]))
# The Gene hits
<- data.frame(gene.ID = queryHits(top.snp.gene),
gene.info seqnames(gene.bounds[queryHits(top.snp.gene)]),
ranges(gene.bounds[queryHits(top.snp.gene)]))
# Reduce gene.info to distinct entries
<- gene.info %>% dplyr::select(-group, -group.1) %>% distinct()
gene.info # Construct a key linking SNPs to Genes
<- data.frame(gene.ID = queryHits(top.snp.gene),
key SNP.ID = subjectHits(top.snp.gene))
<- key %>% left_join(gene.info, by = "gene.ID") %>%
snp.gene left_join(snp.info, by = "SNP.ID")
<- select(org.Hs.eg.db, keys = names(hits),
gene.name columns = c("ENTREZID", "SYMBOL", "GENENAME"), keytype = "ENTREZID")
<- snp.gene %>% dplyr::rename(ENTREZID = group_name) %>%
snp.gene left_join(gene.name, by = "ENTREZID")
<- snp.gene %>% arrange(P)
snp.gene <- left_join(snp.list[1:N, ], snp.gene, by = c("SNP"))
snp.gene <- snp.gene %>%
snp.gene ::select(CHR, BP, SNP, P.x, SYMBOL, GENENAME, start.x, end.x, width.x)
dplyr<- snp.gene %>%
snp.gene ::rename(P = P.x, start = start.x, end = end.x, width = width.x)
dplyr# Now find the nearest genes
<- genes(txdb, columns = "gene_id")
genes # Find the Nearest gene
<- nearest(top.snp, genes)
nearest_gene_index <- unlist(genes[nearest_gene_index]$gene_id)
EntrezID <- select(org.Hs.eg.db, keys = EntrezID, keytype = "ENTREZID",
lookup columns = c("SYMBOL", "GENENAME"))
<- lookup$SYMBOL
symbol <- lookup$GENENAME
genename <- bind_cols(SNP = mcols(top.snp)$SNP, EntrezID = EntrezID,
nearest symbol = symbol, genename = genename,
start = unlist(start(gene.bounds[EntrezID, ])),
end = unlist(end(gene.bounds[EntrezID, ])))
if (all.equal(mcols(top.snp)$SNP, nearest$SNP)) {
$Nearest <- nearest$symbol
snp.gene$N.GENENAME <- nearest$genename
snp.gene$N.start <- nearest$start
snp.gene$N.end <- nearest$end
snp.gene
}# Find the preceding genes
<- precede(top.snp, genes)
precede_gene_index <- unlist(genes[precede_gene_index]$gene_id)
EntrezID <- select(org.Hs.eg.db, keys = EntrezID, keytype = "ENTREZID",
lookup columns = c("SYMBOL", "GENENAME"))
<- lookup$SYMBOL
symbol <- lookup$GENENAME
genename <- bind_cols(SNP = mcols(top.snp)$SNP, EntrezID = EntrezID,
precede symbol = symbol, genename = genename,
start = unlist(start(gene.bounds[EntrezID, ])),
end = unlist(end(gene.bounds[EntrezID, ])))
if (all.equal(mcols(top.snp)$SNP, precede$SNP)) {
$Precede <- precede$symbol
snp.gene$P.GENENAME <- precede$genename
snp.gene$P.start <- precede$start
snp.gene$P.end <- precede$end
snp.gene
}# Find the following genes
<- follow(top.snp, genes)
follow_gene_index <- unlist(genes[follow_gene_index]$gene_id)
EntrezID <- select(org.Hs.eg.db, keys = EntrezID, keytype = "ENTREZID",
lookup columns = c("SYMBOL", "GENENAME"))
<- lookup$SYMBOL
symbol <- lookup$GENENAME
genename <- bind_cols(SNP = mcols(top.snp)$SNP, EntrezID = EntrezID,
follow symbol = symbol, genename = genename,
start = unlist(start(gene.bounds[EntrezID, ])),
end = unlist(end(gene.bounds[EntrezID, ])))
if (all.equal(mcols(top.snp)$SNP, follow$SNP)) {
$Follow <- follow$symbol
snp.gene$F.GENENAME <- follow$genename
snp.gene$F.start <- follow$start
snp.gene$F.end <- follow$end
snp.gene
}return(snp.gene)
}
<- snp.flank(top.hits, 15)) %>% kable(digits=15) (t
Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
'select()' returned 1:1 mapping between keys and columns
403 genes were dropped because they have exons located on both strands
of the same reference sequence or on more than one reference sequence,
so cannot be represented by a single genomic range.
Use 'single.strand.genes.only=FALSE' to get all the genes in a
GRangesList object, or use suppressMessages() to suppress this message.
'select()' returned many:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
CHR | BP | SNP | P | SYMBOL | GENENAME | start | end | width | Nearest | N.GENENAME | N.start | N.end | Precede | P.GENENAME | P.start | P.end | Follow | F.GENENAME | F.start | F.end |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 1942538 | rs3800143 | 4.608e-11 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | GMDS-DT | GMDS divergent transcript | 2245987 | 2481403 | FOXC1 | forkhead box C1 | 1610681 | 1614129 |
11 | 4913057 | rs10836914 | 1.295e-07 | NA | NA | NA | NA | NA | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 | OR51A7 | olfactory receptor family 51 subfamily A member 7 | 4928600 | 4929538 | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 |
11 | 4913314 | rs12577475 | 1.587e-07 | NA | NA | NA | NA | NA | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 | OR51A7 | olfactory receptor family 51 subfamily A member 7 | 4928600 | 4929538 | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 |
6 | 1926650 | rs11242725 | 3.785e-07 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | GMDS-DT | GMDS divergent transcript | 2245987 | 2481403 | FOXC1 | forkhead box C1 | 1610681 | 1614129 |
6 | 1880964 | rs3800116 | 1.475e-06 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | GMDS-DT | GMDS divergent transcript | 2245987 | 2481403 | FOXC1 | forkhead box C1 | 1610681 | 1614129 |
6 | 1915129 | rs3778552 | 1.646e-06 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | GMDS-DT | GMDS divergent transcript | 2245987 | 2481403 | FOXC1 | forkhead box C1 | 1610681 | 1614129 |
6 | 1908210 | rs2875711 | 2.550e-06 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | GMDS-DT | GMDS divergent transcript | 2245987 | 2481403 | FOXC1 | forkhead box C1 | 1610681 | 1614129 |
6 | 1955398 | rs9378664 | 2.809e-06 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | GMDS-DT | GMDS divergent transcript | 2245987 | 2481403 | FOXC1 | forkhead box C1 | 1610681 | 1614129 |
6 | 143144885 | rs926372 | 5.704e-06 | HIVEP2 | HIVEP zinc finger 2 | 143072604 | 143266338 | 193735 | HIVEP2 | HIVEP zinc finger 2 | 143072604 | 143266338 | LOC153910 | uncharacterized LOC153910 | 142847592 | 142959026 | LINC01277 | long intergenic non-protein coding RNA 1277 | 143287559 | 143358719 |
11 | 4912192 | rs10836912 | 8.977e-06 | NA | NA | NA | NA | NA | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 | OR51A7 | olfactory receptor family 51 subfamily A member 7 | 4928600 | 4929538 | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 |
6 | 1899671 | rs3800122 | 1.022e-05 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | 621834 | GMDS | GDP-mannose 4,6-dehydratase | 1624035 | 2245868 | GMDS-DT | GMDS divergent transcript | 2245987 | 2481403 | FOXC1 | forkhead box C1 | 1610681 | 1614129 |
11 | 4910224 | rs2570591 | 1.656e-05 | NA | NA | NA | NA | NA | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 | OR51A7 | olfactory receptor family 51 subfamily A member 7 | 4928600 | 4929538 | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 |
11 | 6340706 | rs1051992 | 2.356e-05 | CAVIN3 | caveolae associated protein 3 | 6340176 | 6341740 | 1565 | CAVIN3 | caveolae associated protein 3 | 6340176 | 6341740 | SMPD1 | sphingomyelin phosphodiesterase 1 | 6411644 | 6416228 | CCKBR | cholecystokinin B receptor | 6280904 | 6293357 |
6 | 3317016 | rs4959804 | 2.821e-05 | SLC22A23 | solute carrier family 22 member 23 | 3269207 | 3456793 | 187587 | SLC22A23 | solute carrier family 22 member 23 | 3269207 | 3456793 | TUBB2B | tubulin beta 2B class IIb | 3224495 | 3227968 | PSMG4 | proteasome assembly chaperone 4 | 3259162 | 3268300 |
11 | 4915072 | rs1816448 | 2.834e-05 | NA | NA | NA | NA | NA | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 | OR51A7 | olfactory receptor family 51 subfamily A member 7 | 4928600 | 4929538 | OR51T1 | olfactory receptor family 51 subfamily T member 1 | 4903049 | 4904113 |
dim(t)
[1] 15 21