Introduction
This vignette provides a step-by-step guide to using the
linkSet
package to analyze Hi-C/HiChIP data. We will use a
toy example to illustrate the main functions and workflows. Our goal is
to identify the enhancer-gene links in this example
We will use the following datasets as input:
- validPairs produced by HiC-Pro.
- Mouse embryo body enhancer data from enhancer atlas website.
- Gene annotation data from TxDb.Mmusculus.UCSC.mm10.knownGene pakcage.
We highly recommend you to use custom data instead of the example data provided in this vignette.
Setup
suppressPackageStartupMessages({
library(linkSet)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
library(Organism.dplyr)
library(InteractionSet)
})
We use our custom function readvalidPairs
to load the
example data. Firstly, we need to load into GInteractions
object.
hic_file <- system.file("extdata", "toyValidatePair",
package="linkSet")
gi <- readvalidPairs(hic_file)
src <- Organism.dplyr::src_organism("TxDb.Mmusculus.UCSC.mm10.knownGene")
#> creating 'src_organism' database...
geneGr <- Organism.dplyr::genes(src,columns = "symbol")
promoterGr <- IRanges::promoters(geneGr,upstream = 10000)
#> Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 3 out-of-bound ranges located on sequences
#> chr4_JH584294_random, chr4_JH584292_random, and chr5_GL456354_random.
#> Note that ranges located on a sequence whose length is unknown (NA) or
#> on a circular sequence are not considered out-of-bound (use
#> seqlengths() and isCircular() to get the lengths and circularity flags
#> of the underlying sequences). You can use trim() to trim these ranges.
#> See ?`trim,GenomicRanges-method` for more information.
file_url <- c("http://www.enhanceratlas.org/data/download/enhancer/mm/Embryo_body.bed")
temp_file <- tempfile(fileext = ".bed")
download.file(file_url, temp_file, method = "curl")
enhancer <- rtracklayer::import(temp_file)
Because the hic data only contains digist end, so we resize the
region to upstream 5kb and downstream 5kb. After that, we use
baitGInteractions
to generate the linkSet
object.
gi <- resize(gi,10000,fix = "center")
ls <- baitGInteractions(gi,geneGr = geneGr,peakGr = enhancer,"symbol")
ls
#> linkSet object with 579 interactions and 2 metadata columns:
#> bait seqnames_oe ranges_oe | anchor1.symbol
#> <character> <Rle> <IRanges> | <character>
#> [1] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [2] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [3] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [4] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [5] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> ... ... ... ... ... . ...
#> [575] Slco5a1 --- chr1 14496641-14497000 | Slco5a1
#> [576] Slco5a1 --- chr1 14496641-14497000 | Slco5a1
#> [577] Slco5a1 --- chr1 14496641-14497000 | Slco5a1
#> [578] Slco5a1 --- chr1 78187171-78187630 | Slco5a1
#> [579] Slco5a1 --- chr1 39924881-39925280 | Slco5a1
#> anchor2.name
#> <character>
#> [1] 12.2266622787328
#> [2] 12.2266622787328
#> [3] 12.2266622787328
#> [4] 12.2266622787328
#> [5] 12.2266622787328
#> ... ...
#> [575] 16.1891332694206
#> [576] 16.1891332694206
#> [577] 16.1891332694206
#> [578] 11.771392591681
#> [579] 14.6843556168742
#> -------
#> regions: 109 ranges and 0 metadata columns
#> seqinfo: 66 sequences (1 circular) from mm10 genome
When we print the linkSet
object, we can see the basic
information of the linkSet
object. By default, we don’t
show the bait region. But you are interested in the bait region, you can
set showBaitRegion = TRUE
.
showLinkSet(ls, baitRegion = TRUE)
#> linkSet object with 579 interactions and 2 metadata columns:
#> bait seqnames_bait ranges_bait seqnames_oe ranges_oe
#> [1] Sulf1 chr1 12692277-12861192 --- chr1 12785091-12785750
#> [2] Sulf1 chr1 12692277-12861192 --- chr1 12785091-12785750
#> [3] Sulf1 chr1 12692277-12861192 --- chr1 12785091-12785750
#> [4] Sulf1 chr1 12692277-12861192 --- chr1 12785091-12785750
#> [5] Sulf1 chr1 12692277-12861192 --- chr1 12785091-12785750
#> ... ... ... ... ... ... ...
#> [575] Slco5a1 chr1 12866549-12992650 --- chr1 14496641-14497000
#> [576] Slco5a1 chr1 12866549-12992650 --- chr1 14496641-14497000
#> [577] Slco5a1 chr1 12866549-12992650 --- chr1 14496641-14497000
#> [578] Slco5a1 chr1 12866549-12992650 --- chr1 78187171-78187630
#> [579] Slco5a1 chr1 12866549-12992650 --- chr1 39924881-39925280
#> | anchor1.symbol anchor2.name
#> [1] | Sulf1 12.2266622787328
#> [2] | Sulf1 12.2266622787328
#> [3] | Sulf1 12.2266622787328
#> [4] | Sulf1 12.2266622787328
#> [5] | Sulf1 12.2266622787328
#> ... . ... ...
#> [575] | Slco5a1 16.1891332694206
#> [576] | Slco5a1 16.1891332694206
#> [577] | Slco5a1 16.1891332694206
#> [578] | Slco5a1 11.771392591681
#> [579] | Slco5a1 14.6843556168742
Diagnose and filter links
Now, we can run diagnoseLinkSet to check the distance distribution and inter/intra interaction percentage.
diagnoseLinkSet(ls)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> linkSet object with 579 interactions and 4 metadata columns:
#> bait seqnames_oe ranges_oe | anchor1.symbol
#> <character> <Rle> <IRanges> | <character>
#> [1] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [2] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [3] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [4] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [5] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> ... ... ... ... ... . ...
#> [575] Slco5a1 --- chr1 14496641-14497000 | Slco5a1
#> [576] Slco5a1 --- chr1 14496641-14497000 | Slco5a1
#> [577] Slco5a1 --- chr1 14496641-14497000 | Slco5a1
#> [578] Slco5a1 --- chr1 78187171-78187630 | Slco5a1
#> [579] Slco5a1 --- chr1 39924881-39925280 | Slco5a1
#> anchor2.name inter_type distance
#> <character> <character> <integer>
#> [1] 12.2266622787328 inter 8686
#> [2] 12.2266622787328 inter 8686
#> [3] 12.2266622787328 inter 8686
#> [4] 12.2266622787328 inter 8686
#> [5] 12.2266622787328 inter 8686
#> ... ... ... ...
#> [575] 16.1891332694206 inter 1567221
#> [576] 16.1891332694206 inter 1567221
#> [577] 16.1891332694206 inter 1567221
#> [578] 11.771392591681 inter 65257801
#> [579] 14.6843556168742 inter 26995481
#> -------
#> regions: 109 ranges and 0 metadata columns
#> seqinfo: 66 sequences (1 circular) from mm10 genome
Intrachromosomal interaction and long distance interaction are likely be noise, so we filter them.
ls <- filterLinks(ls,filter_intra = TRUE, filter_unannotate = TRUE, distance = 50000000)
Duplicated links are associated with contact frequency, so it’s a good idea to count duplicated links.
ls <- countInteractions(ls)
orderLinks(ls,by = "count",decreasing = T)
#> linkSet object with 50 interactions and 5 metadata columns:
#> bait seqnames_oe ranges_oe | anchor1.symbol
#> <character> <Rle> <IRanges> | <character>
#> [1] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [2] Slco5a1 --- chr1 12785091-12785750 | Slco5a1
#> [3] Slco5a1 --- chr1 14496641-14497000 | Slco5a1
#> [4] Ncoa2 --- chr1 12785091-12785750 | Ncoa2
#> [5] Sulf1 --- chr1 14496641-14497000 | Sulf1
#> ... ... ... ... ... . ...
#> [46] Slco5a1 --- chr1 35926611-35926860 | Slco5a1
#> [47] Slco5a1 --- chr1 38683021-38683930 | Slco5a1
#> [48] Slco5a1 --- chr1 34163531-34164100 | Slco5a1
#> [49] Slco5a1 --- chr1 62841301-62841760 | Slco5a1
#> [50] Slco5a1 --- chr1 39924881-39925280 | Slco5a1
#> anchor2.name inter_type distance count
#> <character> <character> <integer> <integer>
#> [1] 12.2266622787328 inter 8686 404
#> [2] 12.2266622787328 inter 144179 20
#> [3] 16.1891332694206 inter 1567221 10
#> [4] 12.2266622787328 inter 471173 8
#> [5] 16.1891332694206 inter 1720086 5
#> ... ... ... ... ...
#> [46] 14.3114832520841 inter 22997136 1
#> [47] 10.9529360911238 inter 25753876 1
#> [48] 11.2903880454209 inter 21234216 1
#> [49] 11.6972535378061 inter 49911931 1
#> [50] 14.6843556168742 inter 26995481 1
#> -------
#> regions: 48 ranges and 0 metadata columns
#> seqinfo: 66 sequences (1 circular) from mm10 genome
We can notice that there is a significant link strength between
Sulf1
and chr1:12785091-12785750
.
Cross gene links and visualization
Enhancers that regulate multiple genes are biologically meaningful.
ls <- crossGeneEnhancer(ls)
ls <- orderLinks(ls,by = "crossFreq",decreasing = T)
ls
#> linkSet object with 50 interactions and 6 metadata columns:
#> bait seqnames_oe ranges_oe | anchor1.symbol
#> <character> <Rle> <IRanges> | <character>
#> [1] Sulf1 --- chr1 12785091-12785750 | Sulf1
#> [2] Slco5a1 --- chr1 12785091-12785750 | Slco5a1
#> [3] Xkr9 --- chr1 12785091-12785750 | Xkr9
#> [4] Ogfrl1 --- chr1 12785091-12785750 | Ogfrl1
#> [5] Adgrb3 --- chr1 12785091-12785750 | Adgrb3
#> ... ... ... ... ... . ...
#> [46] Slco5a1 --- chr1 39758181-39758690 | Slco5a1
#> [47] Slco5a1 --- chr1 38683021-38683930 | Slco5a1
#> [48] Slco5a1 --- chr1 34163531-34164100 | Slco5a1
#> [49] Slco5a1 --- chr1 62841301-62841760 | Slco5a1
#> [50] Slco5a1 --- chr1 39924881-39925280 | Slco5a1
#> anchor2.name inter_type distance count crossFreq
#> <character> <character> <integer> <integer> <integer>
#> [1] 12.2266622787328 inter 8686 404 37
#> [2] 12.2266622787328 inter 144179 20 37
#> [3] 12.2266622787328 inter 899826 2 37
#> [4] 12.2266622787328 inter 10596562 1 37
#> [5] 12.2266622787328 inter 12663171 5 37
#> ... ... ... ... ... ...
#> [46] 11.037989817782 inter 26828836 1 1
#> [47] 10.9529360911238 inter 25753876 1 1
#> [48] 11.2903880454209 inter 21234216 1 1
#> [49] 11.6972535378061 inter 49911931 1 1
#> [50] 14.6843556168742 inter 26995481 1 1
#> -------
#> regions: 48 ranges and 0 metadata columns
#> seqinfo: 66 sequences (1 circular) from mm10 genome
We can use plot_genomic_ranges
to visualize the cross
gene links.
plot_genomic_ranges(ls,showOE = oe(ls)[1])
We can also choose to visualze the bait center region.
plot_genomic_ranges(ls,showBait = "Ncoa2")
plot_genomic_ranges(ls,showBait = "Sulf1")