---
title: "Hi-C Workflow with linkSet"
author:
  - name: Gilbert Han
    affiliation: Your Affiliation
    email: GilbertHan1011@gmail.com
date: "`r Sys.Date()`"
package: linkSet
output:
  BiocStyle::html_document:
    toc: true
    toc_float: 
      collapsed: false
      smooth_scroll: true
    toc_depth: 2
    number_sections: true
    theme: flatly
    highlight: tango
    fig_width: 7
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{Hi-C Workflow with linkSet}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.wide = TRUE,
  fig.align = "center",
  fig.retina = 2,
  warning = FALSE,
  message = FALSE,
  cache = FALSE,
  echo = TRUE
)
suppressPackageStartupMessages({
  library(BiocStyle)
})
```

`r BiocStyle::doc_date()`

`r BiocStyle::pkg_ver('linkSet')`



# Introduction
This vignette provides a step-by-step guide to using the `r Biocpkg("linkSet")` package to analyze Hi-C/HiChIP data. We will use a toy example to illustrate the main functions and workflows.
Our goal is to identify the enhancer-gene links in this example.

We will use the following datasets as input:

1. validPairs produced by [HiC-Pro](https://github.com/nservant/HiC-Pro).
2. Mouse embryo body enhancer data from enhancer atlas [website](http://www.enhanceratlas.org/).
3. Gene annotation data from `r Biocpkg("TxDb.Mmusculus.UCSC.mm10.knownGene")` package.

We highly recommend you to use custom data instead of the example data provided in this vignette.

# Setup

```{r installation, eval = FALSE}
if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
```

```{r load-packages}
suppressPackageStartupMessages({
  library(linkSet)
  # Load additional packages only if available
  if (requireNamespace("TxDb.Mmusculus.UCSC.mm10.knownGene", quietly = TRUE)) {
    library(TxDb.Mmusculus.UCSC.mm10.knownGene)
  }
  if (requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
    library(org.Mm.eg.db)
  }
  if (requireNamespace("Organism.dplyr", quietly = TRUE)) {
    library(Organism.dplyr)
  }
  if (requireNamespace("InteractionSet", quietly = TRUE)) {
    library(InteractionSet)
  }
  if (requireNamespace("rtracklayer", quietly = TRUE)) {
    library(rtracklayer)
  }
})
```

We use our custom function `readvalidPairs` to load the [example data](https://data.4dnucleome.org/files-processed/4DNFI9DUBGCW/). 
Firstly, we need to load into [GInteractions object](https://bioconductor.org/packages/release/bioc/vignettes/InteractionSet/inst/doc/GInteractions.html).
```{r setup-data}
hic_file <- system.file("extdata", "toyExample.pair.gz",
  package = "linkSet"
)
gi <- readvalidPairs(hic_file, format = "pair")
promoterGr <- tryCatch(
  {
    withTxDb("mm10", function(src) {
      genes <- Organism.dplyr::genes(src, columns = "symbol")
      IRanges::promoters(genes, upstream = 10000)
    })
  },
  error = function(e) {
    # Fallback approach using TxDb directly
    if (requireNamespace("TxDb.Mmusculus.UCSC.mm10.knownGene", quietly = TRUE) &&
      requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
      txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene::TxDb.Mmusculus.UCSC.mm10.knownGene
      gene_ranges <- GenomicFeatures::genes(txdb)

      # Add gene symbols
      gene_ids <- names(gene_ranges)
      symbols <- AnnotationDbi::select(org.Mm.eg.db::org.Mm.eg.db,
        keys = gene_ids,
        columns = "SYMBOL",
        keytype = "ENTREZID"
      )
      symbol_map <- setNames(symbols$SYMBOL, symbols$ENTREZID)
      mcols(gene_ranges)$symbol <- symbol_map[names(gene_ranges)]
      genes <- gene_ranges[!is.na(mcols(gene_ranges)$symbol)]

      IRanges::promoters(genes, upstream = 10000)
    } else {
      stop("Required packages not available for mm10 annotation")
    }
  }
)
file_path <- system.file("extdata", "Embryo_body.bed.gz", package = "linkSet")
enhancer <- rtracklayer::import(file_path, format = "BED")
```

Because the hic data only contains digist end, so we resize the region to upstream 5kb and downstream 5kb.
After that, we use `baitGInteractions` to generate the `linkSet` object.
```{r create-linkset}
gi <- resize(gi, 10000, fix = "center")
ls <- baitGInteractions(gi, promoterGr, enhancer, geneSymbol = "symbol")
ls
```

When we print the `linkSet` object, we can see the basic information of the `linkSet` object.
By default, we don't show the bait region. But you are interested in the bait region, you can set `showBaitRegion = TRUE`.
```{r show-linkset}
showLinkSet(ls, baitRegion = TRUE)
```

# Diagnose and filter links
Now, we can run diagnoseLinkSet to check the distance distribution and inter/intra interaction percentage.
First, let's annotate the promoter regions with genome information:
```{r annotate-and-diagnose}
ls <- suppressWarnings(annotatePromoter(ls, genome = "mm10", upstream = 1000))
diagnoseLinkSet(ls)
```

Intrachromosomal interaction and long distance interaction are likely be noise, so we filter them.
```{r filter-links}
ls <- countInteractibility(ls)
ls <- filterLinks(ls, filter_intra = TRUE, filter_unannotate = TRUE, distance = 50000000)
```

Duplicated links are associated with contact frequency, so it's a good idea to count duplicated links.
```{r count-and-order}
ls <- countInteractions(ls)
orderLinks(ls, by = "count", decreasing = TRUE)
```

We can notice that there is a significant link strength between `Sulf1` and `chr1:12785091-12785750`.

# Cross gene links and visualization
Enhancers that regulate multiple genes are biologically meaningful.
```{r cross-gene-analysis}
ls <- crossGeneEnhancer(ls)
ls <- orderLinks(ls, by = "crossFreq", decreasing = TRUE)
ls
```
We can use `plot_genomic_ranges` to visualize the cross gene links.
```{r plot-genomic-ranges}
plot_genomic_ranges(ls, showOE = oe(ls)[1])
```

We can also choose to visualze the bait center region.

```{r plot-bait-region}
# Check if regionsBait exists and has names
if (!is.null(regionsBait(ls)) && !is.null(names(regionsBait(ls)))) {
  # Get available bait names from regionsBait
  available_baits <- unique(names(regionsBait(ls)))
  if (length(available_baits) > 0) {
    # Use the first available bait for demonstration
    plot_genomic_ranges(ls, showBait = available_baits[1])
  } else {
    cat("No named bait regions available for plotting")
  }
} else {
  # If no regionsBait, just plot without showBait
  plot_genomic_ranges(ls)
}
```

# Session Information
```{r session-info}
sessionInfo()
```
