Skip to contents

Introduction

This vignette provides a step-by-step guide to using the linkSet package to analyze Hi-C/HiChIP data. We will use a toy example to illustrate the main functions and workflows. Our goal is to identify the enhancer-gene links in this example

We will use the following datasets as input:

  1. validPairs produced by HiC-Pro.
  2. Mouse embryo body enhancer data from enhancer atlas website.
  3. Gene annotation data from TxDb.Mmusculus.UCSC.mm10.knownGene pakcage.

We highly recommend you to use custom data instead of the example data provided in this vignette.

Setup

suppressPackageStartupMessages({
  library(linkSet)
  library(TxDb.Mmusculus.UCSC.mm10.knownGene)
  library(org.Mm.eg.db)
  library(Organism.dplyr)
  library(InteractionSet)
})

We use our custom function readvalidPairs to load the example data. Firstly, we need to load into GInteractions object.

hic_file <- system.file("extdata", "toyValidatePair", 
                        package="linkSet")
gi <- readvalidPairs(hic_file)
src <- Organism.dplyr::src_organism("TxDb.Mmusculus.UCSC.mm10.knownGene")
#> creating 'src_organism' database...
geneGr <- Organism.dplyr::genes(src,columns = "symbol")
promoterGr <- IRanges::promoters(geneGr,upstream = 10000)
#> Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 3 out-of-bound ranges located on sequences
#>   chr4_JH584294_random, chr4_JH584292_random, and chr5_GL456354_random.
#>   Note that ranges located on a sequence whose length is unknown (NA) or
#>   on a circular sequence are not considered out-of-bound (use
#>   seqlengths() and isCircular() to get the lengths and circularity flags
#>   of the underlying sequences). You can use trim() to trim these ranges.
#>   See ?`trim,GenomicRanges-method` for more information.
file_url <- c("http://www.enhanceratlas.org/data/download/enhancer/mm/Embryo_body.bed")
temp_file <- tempfile(fileext = ".bed")
download.file(file_url, temp_file, method = "curl")
enhancer <- rtracklayer::import(temp_file)

Because the hic data only contains digist end, so we resize the region to upstream 5kb and downstream 5kb. After that, we use baitGInteractions to generate the linkSet object.

gi <- resize(gi,10000,fix = "center")
ls <- baitGInteractions(gi,geneGr = geneGr,peakGr = enhancer,"symbol")
ls
#> linkSet object with 579 interactions and 2 metadata columns:
#>                bait     seqnames_oe         ranges_oe | anchor1.symbol
#>         <character>           <Rle>         <IRanges> |    <character>
#>     [1]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     [2]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     [3]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     [4]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     [5]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     ...         ... ...         ...               ... .            ...
#>   [575]     Slco5a1 ---        chr1 14496641-14497000 |        Slco5a1
#>   [576]     Slco5a1 ---        chr1 14496641-14497000 |        Slco5a1
#>   [577]     Slco5a1 ---        chr1 14496641-14497000 |        Slco5a1
#>   [578]     Slco5a1 ---        chr1 78187171-78187630 |        Slco5a1
#>   [579]     Slco5a1 ---        chr1 39924881-39925280 |        Slco5a1
#>             anchor2.name
#>              <character>
#>     [1] 12.2266622787328
#>     [2] 12.2266622787328
#>     [3] 12.2266622787328
#>     [4] 12.2266622787328
#>     [5] 12.2266622787328
#>     ...              ...
#>   [575] 16.1891332694206
#>   [576] 16.1891332694206
#>   [577] 16.1891332694206
#>   [578]  11.771392591681
#>   [579] 14.6843556168742
#>   -------
#>   regions: 109 ranges and 0 metadata columns
#>   seqinfo: 66 sequences (1 circular) from mm10 genome

When we print the linkSet object, we can see the basic information of the linkSet object. By default, we don’t show the bait region. But you are interested in the bait region, you can set showBaitRegion = TRUE.

showLinkSet(ls, baitRegion = TRUE)
#> linkSet object with 579 interactions and 2 metadata columns:
#>          bait seqnames_bait       ranges_bait     seqnames_oe         ranges_oe
#>   [1]   Sulf1          chr1 12692277-12861192 ---        chr1 12785091-12785750
#>   [2]   Sulf1          chr1 12692277-12861192 ---        chr1 12785091-12785750
#>   [3]   Sulf1          chr1 12692277-12861192 ---        chr1 12785091-12785750
#>   [4]   Sulf1          chr1 12692277-12861192 ---        chr1 12785091-12785750
#>   [5]   Sulf1          chr1 12692277-12861192 ---        chr1 12785091-12785750
#>   ...     ...           ...               ... ...         ...               ...
#> [575] Slco5a1          chr1 12866549-12992650 ---        chr1 14496641-14497000
#> [576] Slco5a1          chr1 12866549-12992650 ---        chr1 14496641-14497000
#> [577] Slco5a1          chr1 12866549-12992650 ---        chr1 14496641-14497000
#> [578] Slco5a1          chr1 12866549-12992650 ---        chr1 78187171-78187630
#> [579] Slco5a1          chr1 12866549-12992650 ---        chr1 39924881-39925280
#>       | anchor1.symbol     anchor2.name
#>   [1] |          Sulf1 12.2266622787328
#>   [2] |          Sulf1 12.2266622787328
#>   [3] |          Sulf1 12.2266622787328
#>   [4] |          Sulf1 12.2266622787328
#>   [5] |          Sulf1 12.2266622787328
#>   ... .            ...              ...
#> [575] |        Slco5a1 16.1891332694206
#> [576] |        Slco5a1 16.1891332694206
#> [577] |        Slco5a1 16.1891332694206
#> [578] |        Slco5a1  11.771392591681
#> [579] |        Slco5a1 14.6843556168742

Now, we can run diagnoseLinkSet to check the distance distribution and inter/intra interaction percentage.

diagnoseLinkSet(ls)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#> linkSet object with 579 interactions and 4 metadata columns:
#>                bait     seqnames_oe         ranges_oe | anchor1.symbol
#>         <character>           <Rle>         <IRanges> |    <character>
#>     [1]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     [2]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     [3]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     [4]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     [5]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>     ...         ... ...         ...               ... .            ...
#>   [575]     Slco5a1 ---        chr1 14496641-14497000 |        Slco5a1
#>   [576]     Slco5a1 ---        chr1 14496641-14497000 |        Slco5a1
#>   [577]     Slco5a1 ---        chr1 14496641-14497000 |        Slco5a1
#>   [578]     Slco5a1 ---        chr1 78187171-78187630 |        Slco5a1
#>   [579]     Slco5a1 ---        chr1 39924881-39925280 |        Slco5a1
#>             anchor2.name  inter_type  distance
#>              <character> <character> <integer>
#>     [1] 12.2266622787328       inter      8686
#>     [2] 12.2266622787328       inter      8686
#>     [3] 12.2266622787328       inter      8686
#>     [4] 12.2266622787328       inter      8686
#>     [5] 12.2266622787328       inter      8686
#>     ...              ...         ...       ...
#>   [575] 16.1891332694206       inter   1567221
#>   [576] 16.1891332694206       inter   1567221
#>   [577] 16.1891332694206       inter   1567221
#>   [578]  11.771392591681       inter  65257801
#>   [579] 14.6843556168742       inter  26995481
#>   -------
#>   regions: 109 ranges and 0 metadata columns
#>   seqinfo: 66 sequences (1 circular) from mm10 genome

Intrachromosomal interaction and long distance interaction are likely be noise, so we filter them.

ls <- filterLinks(ls,filter_intra = TRUE, filter_unannotate = TRUE, distance = 50000000)

Duplicated links are associated with contact frequency, so it’s a good idea to count duplicated links.

ls <- countInteractions(ls)
orderLinks(ls,by = "count",decreasing = T)
#> linkSet object with 50 interactions and 5 metadata columns:
#>               bait     seqnames_oe         ranges_oe | anchor1.symbol
#>        <character>           <Rle>         <IRanges> |    <character>
#>    [1]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>    [2]     Slco5a1 ---        chr1 12785091-12785750 |        Slco5a1
#>    [3]     Slco5a1 ---        chr1 14496641-14497000 |        Slco5a1
#>    [4]       Ncoa2 ---        chr1 12785091-12785750 |          Ncoa2
#>    [5]       Sulf1 ---        chr1 14496641-14497000 |          Sulf1
#>    ...         ... ...         ...               ... .            ...
#>   [46]     Slco5a1 ---        chr1 35926611-35926860 |        Slco5a1
#>   [47]     Slco5a1 ---        chr1 38683021-38683930 |        Slco5a1
#>   [48]     Slco5a1 ---        chr1 34163531-34164100 |        Slco5a1
#>   [49]     Slco5a1 ---        chr1 62841301-62841760 |        Slco5a1
#>   [50]     Slco5a1 ---        chr1 39924881-39925280 |        Slco5a1
#>            anchor2.name  inter_type  distance     count
#>             <character> <character> <integer> <integer>
#>    [1] 12.2266622787328       inter      8686       404
#>    [2] 12.2266622787328       inter    144179        20
#>    [3] 16.1891332694206       inter   1567221        10
#>    [4] 12.2266622787328       inter    471173         8
#>    [5] 16.1891332694206       inter   1720086         5
#>    ...              ...         ...       ...       ...
#>   [46] 14.3114832520841       inter  22997136         1
#>   [47] 10.9529360911238       inter  25753876         1
#>   [48] 11.2903880454209       inter  21234216         1
#>   [49] 11.6972535378061       inter  49911931         1
#>   [50] 14.6843556168742       inter  26995481         1
#>   -------
#>   regions: 48 ranges and 0 metadata columns
#>   seqinfo: 66 sequences (1 circular) from mm10 genome

We can notice that there is a significant link strength between Sulf1 and chr1:12785091-12785750.

Enhancers that regulate multiple genes are biologically meaningful.

ls <- crossGeneEnhancer(ls)
ls <- orderLinks(ls,by = "crossFreq",decreasing = T)
ls
#> linkSet object with 50 interactions and 6 metadata columns:
#>               bait     seqnames_oe         ranges_oe | anchor1.symbol
#>        <character>           <Rle>         <IRanges> |    <character>
#>    [1]       Sulf1 ---        chr1 12785091-12785750 |          Sulf1
#>    [2]     Slco5a1 ---        chr1 12785091-12785750 |        Slco5a1
#>    [3]        Xkr9 ---        chr1 12785091-12785750 |           Xkr9
#>    [4]      Ogfrl1 ---        chr1 12785091-12785750 |         Ogfrl1
#>    [5]      Adgrb3 ---        chr1 12785091-12785750 |         Adgrb3
#>    ...         ... ...         ...               ... .            ...
#>   [46]     Slco5a1 ---        chr1 39758181-39758690 |        Slco5a1
#>   [47]     Slco5a1 ---        chr1 38683021-38683930 |        Slco5a1
#>   [48]     Slco5a1 ---        chr1 34163531-34164100 |        Slco5a1
#>   [49]     Slco5a1 ---        chr1 62841301-62841760 |        Slco5a1
#>   [50]     Slco5a1 ---        chr1 39924881-39925280 |        Slco5a1
#>            anchor2.name  inter_type  distance     count crossFreq
#>             <character> <character> <integer> <integer> <integer>
#>    [1] 12.2266622787328       inter      8686       404        37
#>    [2] 12.2266622787328       inter    144179        20        37
#>    [3] 12.2266622787328       inter    899826         2        37
#>    [4] 12.2266622787328       inter  10596562         1        37
#>    [5] 12.2266622787328       inter  12663171         5        37
#>    ...              ...         ...       ...       ...       ...
#>   [46]  11.037989817782       inter  26828836         1         1
#>   [47] 10.9529360911238       inter  25753876         1         1
#>   [48] 11.2903880454209       inter  21234216         1         1
#>   [49] 11.6972535378061       inter  49911931         1         1
#>   [50] 14.6843556168742       inter  26995481         1         1
#>   -------
#>   regions: 48 ranges and 0 metadata columns
#>   seqinfo: 66 sequences (1 circular) from mm10 genome

We can use plot_genomic_ranges to visualize the cross gene links.

plot_genomic_ranges(ls,showOE = oe(ls)[1])

We can also choose to visualze the bait center region.

plot_genomic_ranges(ls,showBait = "Ncoa2")

plot_genomic_ranges(ls,showBait = "Sulf1")