25  GRanges Exercise

In an example genome-wide association scan, we had several hits:

library(tidyverse)
library(knitr)
library(pander)
ASSOC.file <- "data/trait.qassoc.gz"
b <- read_table(ASSOC.file)
qplot(BP,-1.0*log10(P),main="Association p-values for the quantitative trait",data=b) + facet_wrap(~ CHR)+
  geom_hline(yintercept=5,col='red') +    
  geom_hline(yintercept=4,col='green')

Here are our top hits:

top.hits <- b[order(b$P),c("CHR","BP","SNP","P")]
row.names(top.hits) <- NULL
pander(head(top.hits),caption="Top hits from our genome-wide association scan")
Top hits from our genome-wide association scan
CHR BP SNP P
6 1942538 rs3800143 4.608e-11
11 4913057 rs10836914 1.295e-07
11 4913314 rs12577475 1.587e-07
6 1926650 rs11242725 3.785e-07
6 1880964 rs3800116 1.475e-06
6 1915129 rs3778552 1.646e-06

Now we’d like to annotate each of these SNPs in terms of nearby or overlapping genes.

While we could do this using online databases like UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway), we could also use BioConductor R packages to write our own functions to retrieve this information into a nice tabular format, for each SNP, listing any gene that they might be in, including the SNP’s position and the gene boundaries.

25.1 Introductory Background

An Introduction to Bioconductor’s Packages for Working with Range Data

https://github.com/vsbuffalo/genomicranges-intro/blob/master/notes.md

25.2 Active Learning

Working with genomics ranges

https://carpentries-incubator.github.io/bioc-project/07-genomic-ranges.html

25.3 Suggested readings

In “Bioinformatics Data Skills”, see Chapter 9 “Working with Range Data”

Bioinformatics Data Skills
Editor: Vince Buffalo
Publisher: O’Reilly
Web access: link

Hello Ranges: An Introduction to Analyzing Genomic Ranges in R.
link

25.4 Install the needed Bioconductor libraries (one time only)

First, we need to figure out which genomic build was used in our data set. I usually do this by looking up the base pair positions of a couple of SNPs by hand in the ‘SNP’ data base. So if we go there and search for rs3800143, we end up on this web page:

https://www.ncbi.nlm.nih.gov/snp/rs3800143#variant_details

which shows a Build GRCh37.p13 position of 1942538 on chromosome 6. So it looks like Build GRCh37.p13 was used, which is also known as Build hg19.

To double check, we can check that the given position for rs10836914 matches the Build GRCh37.p13 position.

So to determine the gene boundaries in Build hg19, we need to download and install the TxDb.Hsapiens.UCSC.hg19.knownGene library from BioConductor.

So we search for it on BioConductor and that it can be installed by issuing these commands at the R prompt:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")

You will also need to install another library, org.Hs.eg.db:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("org.Hs.eg.db")

25.5 Construct the gene list

Using the annotation embedded in TxDb.Hsapiens.UCSC.hg19.knownGene, construct a GRangesList object that contains a list of all the genes.

Hint: Use transcriptsBy.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
class(txdb)
[1] "TxDb"
attr(,"package")
[1] "GenomicFeatures"

The transcripts(txdb) is a GenomicRanges object:

transcripts(txdb)
GRanges object with 82960 ranges and 2 metadata columns:
                seqnames        ranges strand |     tx_id     tx_name
                   <Rle>     <IRanges>  <Rle> | <integer> <character>
      [1]           chr1   11874-14409      + |         1  uc001aaa.3
      [2]           chr1   11874-14409      + |         2  uc010nxq.1
      [3]           chr1   11874-14409      + |         3  uc010nxr.1
      [4]           chr1   69091-70008      + |         4  uc001aal.1
      [5]           chr1 321084-321115      + |         5  uc001aaq.2
      ...            ...           ...    ... .       ...         ...
  [82956] chrUn_gl000237        1-2686      - |     82956  uc011mgu.1
  [82957] chrUn_gl000241   20433-36875      - |     82957  uc011mgv.2
  [82958] chrUn_gl000243   11501-11530      + |     82958  uc011mgw.1
  [82959] chrUn_gl000243   13608-13637      + |     82959  uc022brq.1
  [82960] chrUn_gl000247     5787-5816      - |     82960  uc022brr.1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

We now group the transcripts by gene, and so create a GRangesList object:

tx.by.gene <- transcriptsBy(txdb, "gene")
tx.by.gene
GRangesList object of length 23459:
$`1`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id     tx_name
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]    chr19 58858172-58864865      - |     70455  uc002qsd.4
  [2]    chr19 58859832-58874214      - |     70456  uc002qsf.2
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$`10`
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |     tx_id     tx_name
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]     chr8 18248755-18258723      + |     31944  uc003wyw.1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$`100`
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |     tx_id     tx_name
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]    chr20 43248163-43280376      - |     72132  uc002xmj.3
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

...
<23456 more elements>

The names of the list elements are Entrez gene IDs.

25.6 Construct a GRange containing our top SNP

Construct a GRange containing the top SNP. Note that the chromosome name needs to be in the same style as is seen in tx.by.gene. That is, the chromosome name needs to be “chr6” instead of ‘6’.

top.snp <- with(top.hits[1,], GRanges(seqnames=paste0("chr",CHR), 
                   IRanges(start=BP, width=1), 
                   rsid=SNP, P=P)) 
top.snp
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |        rsid         P
         <Rle> <IRanges>  <Rle> | <character> <numeric>
  [1]     chr6   1942538      * |   rs3800143 4.608e-11
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

25.7 Search for match

Now use findOverlaps and subsetByOverlaps to find any genes that overlap our top SNP.

top.snp.overlaps <- findOverlaps(tx.by.gene,top.snp)
top.snp.overlaps
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]      8841           1
  -------
  queryLength: 23459 / subjectLength: 1
hits <- subsetByOverlaps(tx.by.gene,top.snp)
hits
GRangesList object of length 1:
$`2762`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames          ranges strand |     tx_id     tx_name
         <Rle>       <IRanges>  <Rle> | <integer> <character>
  [1]     chr6 1624035-2176225      - |     25755  uc021ykn.1
  [2]     chr6 1624035-2245868      - |     25756  uc003mtq.3
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

So our top SNP rs3800143 overlaps two transcripts in a single gene with the same start position of 1624035.

However, the transcript name is not very human-readable, as it uses an Entrez Gene ID 2762.

25.8 Convert Entrez Gene IDs to Gene Names

Convert the integer Entrez Gene ID 2762 to a human-readable Gene Name.

This can be done using the org.Hs.eg.db R library.

To convert Entrez Gene IDs to Gene Names, we can use another R database:

library(org.Hs.eg.db)
columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     
names(hits)
[1] "2762"
gene.name <- select(org.Hs.eg.db, keys=names(hits), columns=c("ENTREZID", "SYMBOL", "GENENAME"), keytype="ENTREZID")
'select()' returned 1:1 mapping between keys and columns
gene.name
  ENTREZID SYMBOL                    GENENAME
1     2762   GMDS GDP-mannose 4,6-dehydratase

So our top SNP rs3800143 overlaps the gene with the Entrez Gene ID 2762, which is also known as GMDS (GDP-mannose 4,6-dehydratase).

25.9 Question 1

Read in the data:

ASSOC.file <- "data/trait.qassoc.gz"
b <- read_table(ASSOC.file)
Warning: Missing column names filled in: 'X10' [10]

── Column specification ────────────────────────────────────────────────────────
cols(
  CHR = col_double(),
  SNP = col_character(),
  BP = col_double(),
  NMISS = col_double(),
  BETA = col_double(),
  SE = col_double(),
  R2 = col_double(),
  T = col_double(),
  P = col_double(),
  X10 = col_logical()
)
head(b)
# A tibble: 6 × 10
    CHR SNP           BP NMISS     BETA     SE         R2      T     P X10  
  <dbl> <chr>      <dbl> <dbl>    <dbl>  <dbl>      <dbl>  <dbl> <dbl> <lgl>
1     6 rs9392298 203878  1573  0.110   0.0984 0.000801    1.12  0.262 NA   
2     6 rs9405486 204072  1569  0.118   0.0982 0.000913    1.20  0.232 NA   
3     6 rs7762550 204909  1574  0.0384  0.0685 0.000199    0.560 0.576 NA   
4     6 rs1418706 205878  1575 -0.00783 0.0717 0.00000757 -0.109 0.913 NA   
5     6 rs6920539 206528  1574  0.119   0.0955 0.000984    1.24  0.214 NA   
6     6 rs9502959 206599  1575 -0.0317  0.0538 0.000220   -0.589 0.556 NA   

Task: Using GRanges and associated Bioconductor tools, figure out if our top SNP, as ranked by the P-value (column P) from data frame b lies within an intron or exon.

For this, the functions exonsBy and intronsByTranscript might be useful.

25.10 Answer 1

The top SNP rs3800143 lies in an intron of GMDS (GDP-mannose 4,6-dehydratase), Gene ID: 2762.

Build hg19 (Human Feb. 2009 (GRCh37/hg19)) gene boundaries from UCSC Genes:

GMDS (uc003mtq.3) - chr6:1624035-2245868 - Homo sapiens GDP-mannose 4,6-dehydratase (GMDS), transcript variant 1, mRNA.

The top SNP rs3800143 lies in an intron of GMDS Does our top SNP overlap any exons?

top.snp <- with(top.hits[1,], GRanges(seqnames=paste0("chr",CHR), 
                   IRanges(start=BP, width=1), 
                   rsid=SNP, P=P)) 
top.snp
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |        rsid         P
         <Rle> <IRanges>  <Rle> | <character> <numeric>
  [1]     chr6   1942538      * |   rs3800143 4.608e-11
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
exons.by.gene <- exonsBy(txdb, "gene")
exons.by.gene
GRangesList object of length 23459:
$`1`
GRanges object with 15 ranges and 2 metadata columns:
       seqnames            ranges strand |   exon_id   exon_name
          <Rle>         <IRanges>  <Rle> | <integer> <character>
   [1]    chr19 58858172-58858395      - |    250809        <NA>
   [2]    chr19 58858719-58859006      - |    250810        <NA>
   [3]    chr19 58859832-58860494      - |    250811        <NA>
   [4]    chr19 58860934-58862017      - |    250812        <NA>
   [5]    chr19 58861736-58862017      - |    250813        <NA>
   ...      ...               ...    ... .       ...         ...
  [11]    chr19 58868951-58869015      - |    250821        <NA>
  [12]    chr19 58869318-58869652      - |    250822        <NA>
  [13]    chr19 58869855-58869951      - |    250823        <NA>
  [14]    chr19 58870563-58870689      - |    250824        <NA>
  [15]    chr19 58874043-58874214      - |    250825        <NA>
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$`10`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |   exon_id   exon_name
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]     chr8 18248755-18248855      + |    113603        <NA>
  [2]     chr8 18257508-18258723      + |    113604        <NA>
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

...
<23457 more elements>
top.snp.overlaps <- findOverlaps(exons.by.gene,top.snp)
top.snp.overlaps
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 23459 / subjectLength: 1
hits <- subsetByOverlaps(exons.by.gene,top.snp)
hits
GRangesList object of length 0:
<0 elements>

From the output above, we find that our top SNP overlaps 0 sets of exons grouped by transcripts.

Does our top SNP overlap any introns?

introns.by.transcript <- intronsByTranscript(txdb)


top.snp.overlaps <- findOverlaps(introns.by.transcript,top.snp)
top.snp.overlaps
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]     25755           1
  [2]     25756           1
  -------
  queryLength: 82960 / subjectLength: 1
hits <- subsetByOverlaps(introns.by.transcript,top.snp)
hits
GRangesList object of length 2:
$`25755`
GRanges object with 10 ranges and 0 metadata columns:
       seqnames          ranges strand
          <Rle>       <IRanges>  <Rle>
   [1]     chr6 1624467-1624706      -
   [2]     chr6 1624776-1726649      -
   [3]     chr6 1726747-1742701      -
   [4]     chr6 1742821-1930336      -
   [5]     chr6 1930465-1960100      -
   [6]     chr6 1960206-1961007      -
   [7]     chr6 1961201-2116004      -
   [8]     chr6 2116115-2117702      -
   [9]     chr6 2117791-2124920      -
  [10]     chr6 2124966-2176165      -
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$`25756`
GRanges object with 10 ranges and 0 metadata columns:
       seqnames          ranges strand
          <Rle>       <IRanges>  <Rle>
   [1]     chr6 1624467-1624706      -
   [2]     chr6 1624776-1726649      -
   [3]     chr6 1726747-1742701      -
   [4]     chr6 1742821-1930336      -
   [5]     chr6 1930465-1960100      -
   [6]     chr6 1960206-1961007      -
   [7]     chr6 1961201-2116004      -
   [8]     chr6 2116115-2117702      -
   [9]     chr6 2117791-2124920      -
  [10]     chr6 2124966-2245554      -
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

From the output above, we find that our top SNP overlaps 2 sets of introns grouped by transcripts.

25.11 Question 2

While we have figure out above which gene our top SNP is in, what we worked out above involves very specific code. Now let’s try to generalize what we did above:

Task: Using GRanges and associated Bioconductor tools, write a function that takes as input a ranked list of the SNPS, and returns a nice table that lists the top N of these SNPs and any gene that they might be in, including the SNP position and the gene boundaries.

# snp.list = the ranked list of the top SNPs
# N = the number of top SNPs to annotate
snp.table <- function(snp.list, N=15) {
  
}

Apply your snp.table function to the top 15 SNPs in our example data set.

Hint

Use the subsetByOverlaps followed by the select(org.Hs.eg.db, ... approach described above in the Search for match and Convert Entrez Gene IDs to Gene Names sections above.

25.12 Answer 2

It is important to understand what the output of the findOverlaps function means.

top.snp.gene <- findOverlaps(gene.bounds, top.snp)
top.snp.gene

returns:

Hits object with 10 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]      3269          13
   [2]      8841           5
   [3]      8841          11
   [4]      8841           7
   [5]      8841           6
   [6]      8841           4
   [7]      8841           1
   [8]      8841           8
   [9]     17049          14
  [10]     18258           9
  -------
  queryLength: 23459 / subjectLength: 15

Here, the ‘subjectHits’ is an index into top.snp and the queryHits is an index into gene.bounds.

So the first line indicates that the 13th SNP in top.snp overlaps the gene that is in the 3,269 slot of the gene.bounds object.

# snp.list = the ranked list of the top SNPs
# N = the number of top SNPs to annotate
snp.table <- function(snp.list, N=15) {
  require(org.Hs.eg.db)
  require(TxDb.Hsapiens.UCSC.hg19.knownGene)
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
  tx.by.gene <- transcriptsBy(txdb, "gene")
  # Find the gene boundaries
  gene.bounds <- reduce(tx.by.gene)
  # Set up a GRange with the first N top SNPs
  top.snp <- with(snp.list[1:N, ], GRanges(seqnames = paste0("chr", CHR),
                                          IRanges(start = BP, width = 1),
                                          SNP = SNP, P = P))
  # Find overlaps and hits
  top.snp.gene <- findOverlaps(gene.bounds, top.snp)
  hits <- subsetByOverlaps(gene.bounds, top.snp)
  # The SNP hits
  snp.info <- data.frame(SNP.ID = subjectHits(top.snp.gene),
                         SNP.chr = seqnames(top.snp[subjectHits(top.snp.gene)]),
                         ranges(top.snp[subjectHits(top.snp.gene)]),
                         mcols(top.snp[subjectHits(top.snp.gene)]))
  # The Gene hits
  gene.info <- data.frame(gene.ID = queryHits(top.snp.gene),
                          seqnames(gene.bounds[queryHits(top.snp.gene)]),
                          ranges(gene.bounds[queryHits(top.snp.gene)]))
  # Reduce gene.info to distinct entries
  gene.info <- gene.info %>% dplyr::select(-group, -group.1) %>% distinct()
  # Construct a key linking SNPs to Genes
  key <- data.frame(gene.ID = queryHits(top.snp.gene),
                    SNP.ID = subjectHits(top.snp.gene))
  snp.gene <- key %>% left_join(gene.info, by = "gene.ID") %>%
    left_join(snp.info, by = "SNP.ID")
  gene.name <- select(org.Hs.eg.db, keys = names(hits),
      columns = c("ENTREZID", "SYMBOL", "GENENAME"), keytype = "ENTREZID")
  snp.gene <- snp.gene %>% dplyr::rename(ENTREZID = group_name) %>%
    left_join(gene.name, by = "ENTREZID")
  snp.gene <- snp.gene %>% arrange(P)
  snp.gene <- left_join(snp.list[1:N, ], snp.gene, by = c("SNP"))
  snp.gene <- snp.gene %>%
    dplyr::select(CHR, BP, SNP, P.x, SYMBOL, GENENAME, start.x, end.x, width.x)
  snp.gene <- snp.gene %>%
    dplyr::rename(P = P.x, start = start.x, end = end.x, width = width.x)
  return(snp.gene)
}
(t <- snp.table(top.hits, 15)) %>% kable(digits=15)
Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
'select()' returned 1:1 mapping between keys and columns
CHR BP SNP P SYMBOL GENENAME start end width
6 1942538 rs3800143 4.608e-11 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834
11 4913057 rs10836914 1.295e-07 NA NA NA NA NA
11 4913314 rs12577475 1.587e-07 NA NA NA NA NA
6 1926650 rs11242725 3.785e-07 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834
6 1880964 rs3800116 1.475e-06 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834
6 1915129 rs3778552 1.646e-06 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834
6 1908210 rs2875711 2.550e-06 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834
6 1955398 rs9378664 2.809e-06 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834
6 143144885 rs926372 5.704e-06 HIVEP2 HIVEP zinc finger 2 143072604 143266338 193735
11 4912192 rs10836912 8.977e-06 NA NA NA NA NA
6 1899671 rs3800122 1.022e-05 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834
11 4910224 rs2570591 1.656e-05 NA NA NA NA NA
11 6340706 rs1051992 2.356e-05 CAVIN3 caveolae associated protein 3 6340176 6341740 1565
6 3317016 rs4959804 2.821e-05 SLC22A23 solute carrier family 22 member 23 3269207 3456793 187587
11 4915072 rs1816448 2.834e-05 NA NA NA NA NA
dim(t)
[1] 15  9

Here’s an alternate way to construct the SNP table which is an incomplete example because it doesn’t include the gene boundaries. However, it is simpler than the function above because it avoids the construction of a key and the left_joins used above by looping through each SNP, one by one:

snp.table2 <- function(snp.list, N=15) {
  require(TxDb.Hsapiens.UCSC.hg19.knownGene)
  require(org.Hs.eg.db)
  
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
  tx.by.gene <- transcriptsBy(txdb, "gene")
  gene.name <- data.frame()
  
  for (i in 1:N) {
    gene.name[i, 1] <- snp.list[i,]$CHR
    gene.name[i, 2] <- snp.list[i,]$BP
    gene.name[i, 3] <- snp.list[i,]$SNP
    gene.name[i, 4] <- snp.list[i,]$P
    top.snp <- with(snp.list[i,], GRanges(seqnames=paste0("chr", CHR), 
                                      IRanges(start=BP, width=1), 
                                      rsid=SNP, P=P))
    hits <- subsetByOverlaps(tx.by.gene, top.snp)
    genename <- select(org.Hs.eg.db, keys=names(hits), columns=c("SYMBOL", "GENENAME"), keytype="ENTREZID")
    gene.name[i, 5] <- ifelse(nrow(genename) == 0, NA, genename[[2]])
    gene.name[i, 6] <- ifelse(nrow(genename) == 0, NA, genename[[3]])
  }
  colnames(gene.name) <- c("CHR", "BP", "SNP", "P", "Symbol", "Gene name")
  return(gene.name)
}

(t <- snp.table2(top.hits, 15)) %>% kable(digits=15)
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
CHR BP SNP P Symbol Gene name
6 1942538 rs3800143 4.608e-11 GMDS GDP-mannose 4,6-dehydratase
11 4913057 rs10836914 1.295e-07 NA NA
11 4913314 rs12577475 1.587e-07 NA NA
6 1926650 rs11242725 3.785e-07 GMDS GDP-mannose 4,6-dehydratase
6 1880964 rs3800116 1.475e-06 GMDS GDP-mannose 4,6-dehydratase
6 1915129 rs3778552 1.646e-06 GMDS GDP-mannose 4,6-dehydratase
6 1908210 rs2875711 2.550e-06 GMDS GDP-mannose 4,6-dehydratase
6 1955398 rs9378664 2.809e-06 GMDS GDP-mannose 4,6-dehydratase
6 143144885 rs926372 5.704e-06 HIVEP2 HIVEP zinc finger 2
11 4912192 rs10836912 8.977e-06 NA NA
6 1899671 rs3800122 1.022e-05 GMDS GDP-mannose 4,6-dehydratase
11 4910224 rs2570591 1.656e-05 NA NA
11 6340706 rs1051992 2.356e-05 CAVIN3 caveolae associated protein 3
6 3317016 rs4959804 2.821e-05 SLC22A23 solute carrier family 22 member 23
11 4915072 rs1816448 2.834e-05 NA NA
dim(t)
[1] 15  6

25.13 Question 3

Task: Using GRanges and associated Bioconductor tools, write a function that takes as input a ranked list of the SNPS, and returns a nice table that lists the top N of these SNPs and the closest flanking genes on both sides, including the SNP position and the gene boundaries.

# snp.list = the ranked list of the top SNPs
# N = the number of top SNPs to annotate
snp.flank <- function(snp.list, N=15) {
 
}

Apply your snp.flank function to the top 15 SNPs in our example data set.

Note that the purpose here is to think more about how to creatively use GRanges than to search for nice annotation packages.

Hint

The GenomicRanges package has a useful function nearest that can be used to find the nearest gene.

Note that if we want to augment information about the nearest gene with the genes that precede and follow each top SNP, we could use functions like this:

genes <- genes(txdb, columns = "gene_id")
precede(top.snp, genes)
follow(top.snp, genes)

25.14 Answer 3

Note that we need to set up the txdb object within the function itself to avoid creating a dependence on the global txdb object. Although, it might be more efficient to create the txdb object once and pass it in via a function parameter instead of recreating it each time the function is called.

Here we use require statements to indicate dependence on certain libraries having been loaded, but if this were part of an R package, we’d take care of library dependencies at the package level instead of inside of specific functions. Usually when writing functions, we assume that all of the required packages have been loaded, so we tend not to use the library or the require command within functions.

# snp.list = the ranked list of the top SNPs
# N = the number of top SNPs to annotate
snp.flank <- function(snp.list, N=15) {
  require(org.Hs.eg.db)
  require(TxDb.Hsapiens.UCSC.hg19.knownGene)
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
  tx.by.gene <- transcriptsBy(txdb, "gene")
  # Find the gene boundaries
  gene.bounds <- reduce(tx.by.gene)
  # Set up a GRange with the first N top SNPs
  top.snp <- with(snp.list[1:N, ], GRanges(seqnames = paste0("chr", CHR),
                                           IRanges(start = BP, width = 1),
                                           SNP = SNP, P = P))
  # Find overlaps and hits
  top.snp.gene <- findOverlaps(gene.bounds, top.snp)
  hits <- subsetByOverlaps(gene.bounds, top.snp)
  # The SNP hits
  snp.info <- data.frame(SNP.ID = subjectHits(top.snp.gene),
                         SNP.chr = seqnames(top.snp[subjectHits(top.snp.gene)]),
                         ranges(top.snp[subjectHits(top.snp.gene)]),
                         mcols(top.snp[subjectHits(top.snp.gene)]))
  # The Gene hits
  gene.info <- data.frame(gene.ID = queryHits(top.snp.gene),
                          seqnames(gene.bounds[queryHits(top.snp.gene)]),
                          ranges(gene.bounds[queryHits(top.snp.gene)]))
  # Reduce gene.info to distinct entries
  gene.info <- gene.info %>% dplyr::select(-group, -group.1) %>% distinct()
  # Construct a key linking SNPs to Genes
  key <- data.frame(gene.ID = queryHits(top.snp.gene),
                    SNP.ID = subjectHits(top.snp.gene))
  snp.gene <- key %>% left_join(gene.info, by = "gene.ID") %>%
    left_join(snp.info, by = "SNP.ID")
  gene.name <- select(org.Hs.eg.db, keys = names(hits),
        columns = c("ENTREZID", "SYMBOL", "GENENAME"), keytype = "ENTREZID")
  snp.gene <- snp.gene %>% dplyr::rename(ENTREZID = group_name) %>%
    left_join(gene.name, by = "ENTREZID")
  snp.gene <- snp.gene %>% arrange(P)
  snp.gene <- left_join(snp.list[1:N, ], snp.gene, by = c("SNP"))
  snp.gene <- snp.gene %>%
    dplyr::select(CHR, BP, SNP, P.x, SYMBOL, GENENAME, start.x, end.x, width.x)
  snp.gene <- snp.gene %>%
    dplyr::rename(P = P.x, start = start.x, end = end.x, width = width.x)
  # Now find the nearest genes
  genes <- genes(txdb, columns = "gene_id")
  # Find the Nearest gene
  nearest_gene_index <- nearest(top.snp, genes)
  EntrezID <- unlist(genes[nearest_gene_index]$gene_id)
  lookup <- select(org.Hs.eg.db, keys = EntrezID, keytype = "ENTREZID",
                   columns = c("SYMBOL", "GENENAME"))
  symbol <- lookup$SYMBOL
  genename <- lookup$GENENAME
  nearest <- bind_cols(SNP = mcols(top.snp)$SNP, EntrezID = EntrezID,
                       symbol = symbol, genename = genename,
                       start = unlist(start(gene.bounds[EntrezID, ])),
                       end = unlist(end(gene.bounds[EntrezID, ])))
  if (all.equal(mcols(top.snp)$SNP, nearest$SNP)) {
    snp.gene$Nearest <- nearest$symbol
    snp.gene$N.GENENAME <- nearest$genename
    snp.gene$N.start <- nearest$start
    snp.gene$N.end <- nearest$end
  }
  # Find the preceding genes
  precede_gene_index <- precede(top.snp, genes)
  EntrezID <- unlist(genes[precede_gene_index]$gene_id)
  lookup <- select(org.Hs.eg.db, keys = EntrezID, keytype = "ENTREZID",
                   columns = c("SYMBOL", "GENENAME"))
  symbol <- lookup$SYMBOL
  genename <- lookup$GENENAME
  precede <- bind_cols(SNP = mcols(top.snp)$SNP, EntrezID = EntrezID,
                       symbol = symbol, genename = genename,
                       start = unlist(start(gene.bounds[EntrezID, ])),
                       end = unlist(end(gene.bounds[EntrezID, ])))
  if (all.equal(mcols(top.snp)$SNP, precede$SNP)) {
    snp.gene$Precede <- precede$symbol
    snp.gene$P.GENENAME <- precede$genename
    snp.gene$P.start <- precede$start
    snp.gene$P.end <- precede$end
  }
  # Find the following genes
  follow_gene_index <- follow(top.snp, genes)
  EntrezID <- unlist(genes[follow_gene_index]$gene_id)
  lookup <- select(org.Hs.eg.db, keys = EntrezID, keytype = "ENTREZID",
                   columns = c("SYMBOL", "GENENAME"))
  symbol <- lookup$SYMBOL
  genename <- lookup$GENENAME
  follow <- bind_cols(SNP = mcols(top.snp)$SNP, EntrezID = EntrezID,
                       symbol = symbol, genename = genename,
                       start = unlist(start(gene.bounds[EntrezID, ])),
                       end = unlist(end(gene.bounds[EntrezID, ])))
  if (all.equal(mcols(top.snp)$SNP, follow$SNP)) {
    snp.gene$Follow <- follow$symbol
    snp.gene$F.GENENAME <- follow$genename
    snp.gene$F.start <- follow$start
    snp.gene$F.end <- follow$end
  }
  return(snp.gene)
}
(t <- snp.flank(top.hits, 15)) %>% kable(digits=15)
Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
'select()' returned 1:1 mapping between keys and columns
  403 genes were dropped because they have exons located on both strands
  of the same reference sequence or on more than one reference sequence,
  so cannot be represented by a single genomic range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a
  GRangesList object, or use suppressMessages() to suppress this message.
'select()' returned many:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
CHR BP SNP P SYMBOL GENENAME start end width Nearest N.GENENAME N.start N.end Precede P.GENENAME P.start P.end Follow F.GENENAME F.start F.end
6 1942538 rs3800143 4.608e-11 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 GMDS-DT GMDS divergent transcript 2245987 2481403 FOXC1 forkhead box C1 1610681 1614129
11 4913057 rs10836914 1.295e-07 NA NA NA NA NA OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113 OR51A7 olfactory receptor family 51 subfamily A member 7 4928600 4929538 OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113
11 4913314 rs12577475 1.587e-07 NA NA NA NA NA OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113 OR51A7 olfactory receptor family 51 subfamily A member 7 4928600 4929538 OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113
6 1926650 rs11242725 3.785e-07 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 GMDS-DT GMDS divergent transcript 2245987 2481403 FOXC1 forkhead box C1 1610681 1614129
6 1880964 rs3800116 1.475e-06 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 GMDS-DT GMDS divergent transcript 2245987 2481403 FOXC1 forkhead box C1 1610681 1614129
6 1915129 rs3778552 1.646e-06 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 GMDS-DT GMDS divergent transcript 2245987 2481403 FOXC1 forkhead box C1 1610681 1614129
6 1908210 rs2875711 2.550e-06 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 GMDS-DT GMDS divergent transcript 2245987 2481403 FOXC1 forkhead box C1 1610681 1614129
6 1955398 rs9378664 2.809e-06 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 GMDS-DT GMDS divergent transcript 2245987 2481403 FOXC1 forkhead box C1 1610681 1614129
6 143144885 rs926372 5.704e-06 HIVEP2 HIVEP zinc finger 2 143072604 143266338 193735 HIVEP2 HIVEP zinc finger 2 143072604 143266338 LOC153910 uncharacterized LOC153910 142847592 142959026 LINC01277 long intergenic non-protein coding RNA 1277 143287559 143358719
11 4912192 rs10836912 8.977e-06 NA NA NA NA NA OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113 OR51A7 olfactory receptor family 51 subfamily A member 7 4928600 4929538 OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113
6 1899671 rs3800122 1.022e-05 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 621834 GMDS GDP-mannose 4,6-dehydratase 1624035 2245868 GMDS-DT GMDS divergent transcript 2245987 2481403 FOXC1 forkhead box C1 1610681 1614129
11 4910224 rs2570591 1.656e-05 NA NA NA NA NA OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113 OR51A7 olfactory receptor family 51 subfamily A member 7 4928600 4929538 OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113
11 6340706 rs1051992 2.356e-05 CAVIN3 caveolae associated protein 3 6340176 6341740 1565 CAVIN3 caveolae associated protein 3 6340176 6341740 SMPD1 sphingomyelin phosphodiesterase 1 6411644 6416228 CCKBR cholecystokinin B receptor 6280904 6293357
6 3317016 rs4959804 2.821e-05 SLC22A23 solute carrier family 22 member 23 3269207 3456793 187587 SLC22A23 solute carrier family 22 member 23 3269207 3456793 TUBB2B tubulin beta 2B class IIb 3224495 3227968 PSMG4 proteasome assembly chaperone 4 3259162 3268300
11 4915072 rs1816448 2.834e-05 NA NA NA NA NA OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113 OR51A7 olfactory receptor family 51 subfamily A member 7 4928600 4929538 OR51T1 olfactory receptor family 51 subfamily T member 1 4903049 4904113
dim(t)
[1] 15 21