---
title: "scGraphVerse Case Study: Zero-Inflated Simulation and GRN Inference"
author: "Francesco Cecere"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{scGraphVerse Simulation Study: Sim & GRN Reconstruction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

scGraphVerse is a modular and extensible R package for constructing, comparing, 
and visualizing gene regulatory networks (GRNs) from single-cell RNAseq data.
It includes several inference algorithms, utilities,
and visualization functions.

Single-cell data presents unique challenges: high sparsity, batch variation, 
and the need to distinguish shared versus condition-specific regulation. 
scGraphVerse helps address these through multi-method inference consensus. 
It starts with SingleCellExperiment, Seurat or matrix-based objects, enabling 
flexible pipelines across diverse experimental setups.

## Why use scGraphVerse?

While several GRN inference packages exist in Bioconductor 
(e.g. SCENIC), most are limited to single-method 
inference or lack support for multi-condition, multi-replicate, and 
multi-method comparisons. scGraphVerse adds value by:

1. Supporting multiple inference algorithms, including GENIE3, GRNBoost2, 
ZILGM, PCzinb, and Joint Random Forests (JRF).
2. Providing joint modeling of GRNs across multiple conditions via JRF
3. Enabling benchmarking simulations from known ground-truth GRNs with 
customizable ZINB-based count matrix generation.
4. Offering community analysis, including consensus GRNs
and databases integration (e.g. PubMed and STRINGdb).


## How scGraphVerse works

The typical scGraphVerse workflow begins with one or more count matrix, 
or objects like SingleCellExperiment or Seurat. These are passed to the core 
function infer_networks(), which runs the selected inference methods and
returns an adjacency matrix or list of matrices.

For benchmarking or teaching purposes, synthetic datasets can be generated 
using zinb_simdata() based on a known adjacency matrix. This supports method
comparison across early, late, and joint integration strategies using 
earlyj() and compare_consensus().

An end-to-end workflow, from network inference to community detection and 
consensus visualization, is provided in the rest of vignettes.

As intended for both high-throughput analysis and intuitive usage, 
scGraphVerse supports custom workflows using its modular functions.

# Simulation study 

In this simulation study, we use **scGraphVerse** to:

1. Define a ground-truth regulatory network from high-confidence interactions.
2. Simulate zero-inflated scRNA-seq count data that respects the ground truth.
3. Infer gene regulatory networks using **GENIE3**.
4. Evaluate performance with ROC curves, precision–recall scores, 
and community similarity.
5. Build consensus networks and perform edge mining.

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    eval = TRUE,
    collapse = TRUE,
    comment = "#>",
    fig.width = 6,
    fig.height = 5
)
library(scGraphVerse)
```

# 1. Load Ground-Truth Network

We use a pre-computed toy adjacency matrix (`toy_adj_matrix`) as our ground
truth network. This matrix represents known regulatory relationships between
genes and serves as the reference for evaluating network inference performance.

```{r ground-truth-network, fig.alt="Network visualization gtruth"}
# Load the toy ground truth adjacency matrix
data(toy_adj_matrix)
adj_truth <- toy_adj_matrix

# Visualize network
if (requireNamespace("igraph", quietly = TRUE) &&
    requireNamespace("ggraph", quietly = TRUE)) {
    gtruth <- igraph::graph_from_adjacency_matrix(adj_truth,mode="undirected")
    ggraph::ggraph(gtruth, layout = "fr") +
        ggraph::geom_edge_link(color = "gray") +
        ggraph::geom_node_point(color = "steelblue") +
        ggplot2::ggtitle(paste0(
            "Ground Truth: ",
            igraph::vcount(gtruth),
            " nodes, ",
            igraph::ecount(gtruth),
            " edges"
        )) +
        ggplot2::theme_minimal()
}
```


# 2. Simulating Zero-Inflated Count Data

Now we'll generate synthetic single-cell count data that respects our 
ground-truth network structure. We simulate three experimental conditions 
(batches) with 50 cells each, using different parameters to model realistic 
batch effects and biological variability. The `zinb_simdata()` function creates 
zero-inflated negative binomial count data where gene expression levels are 
influenced by the regulatory relationships defined in our ground-truth network.

**Key simulation parameters:**
- `mu_range`: Different mean expression levels across conditions
- `theta`: Overdispersion parameter (lower = more variable)  
- `pi`: Zero-inflation probability (dropout rate)
- `depth_range`: Sequencing depth variation per cell

```{r simulate}
# Simulation parameters
nodes <- nrow(adj_truth)
sims <- zinb_simdata(
    n = 50,
    p = nodes,
    B = adj_truth,
    mu_range = list(c(1, 4), c(1, 7)),
    mu_noise = c(1, 3),
    theta = c(1, 0.7),
    pi = c(0.2, 0.2),
    kmat = 2,
    depth_range = c(0.8 * nodes * 3, 1.2 * nodes * 3)
)
# Transpose to cells × genes
count_matrices <- lapply(sims, t)
```

# 3. Inferring Networks with GENIE3

Next, we apply the **GENIE3** algorithm to infer gene regulatory networks from
our simulated count data. GENIE3 uses random forests to predict each gene's
expression based on all other genes, then ranks the importance of potential
regulatory relationships. We process all three batches to capture
condition-specific and shared regulatory patterns.

**Note:** The package now uses Bioconductor data structures:
- `count_matrices` is now a `MultiAssayExperiment` (MAE) object
- Network outputs are stored in `SummarizedExperiment` (SE) objects

**GENIE3 workflow:**
1. Create a MAE object from the count matrices list
2. For each target gene, build random forest model using all other genes as
predictors
3. Extract feature importance scores as edge weights
4. Generate weighted adjacency matrices stored in a SummarizedExperiment
5. Symmetrize the matrices using mean weights to create undirected networks

```{r genie3}
# Create MultiAssayExperiment from count matrices
mae <- create_mae(count_matrices)

# Infer networks
networks_joint <- infer_networks(
    count_matrices_list = mae,
    method = "GENIE3",
    nCores = 1
)

# Generate weighted adjacency matrices (returns SummarizedExperiment)
wadj_se <- generate_adjacency(networks_joint)

# Symmetrize weights (returns SummarizedExperiment)
swadj_se <- symmetrize(wadj_se, weight_function = "mean")
```

# 4. ROC Curve and AUC

Now we evaluate how well our inferred networks recover the ground-truth
regulatory relationships using ROC curve analysis. The ROC curve plots
True Positive Rate (sensitivity) against False Positive Rate (1-specificity)
across different edge weight thresholds. A perfect classifier would achieve
AUC = 1.0, while random guessing gives AUC = 0.5.

**ROC Analysis Steps:**
1. Use continuous edge weights from symmetrized adjacency matrices
2. Compare against binary ground-truth network
3. Calculate Area Under Curve (AUC) as overall performance metric
4. Higher AUC indicates better recovery of true regulatory edges

```{r roc, fig.alt="ROC curve"}
if (requireNamespace("pROC", quietly = TRUE)) {
    roc_res <- plotROC(
        swadj_se,
        adj_truth,
        plot_title = "ROC Curve: GENIE3 Network Inference",
        is_binary = FALSE
    )
    roc_res$plot
    auc_joint <- roc_res$auc
}
```

## 4.1. Precision–Recall and Graph Visualization

To convert our continuous edge weights into binary networks, we need to
determine appropriate cutoff thresholds. The `cutoff_adjacency()` function uses
a data-driven approach: it creates shuffled versions of our count data, infers
networks from these null datasets, and sets the threshold at the 95th
percentile of shuffled edge weights. This ensures we keep only edges that are
much stronger than expected by random chance.

**Cutoff Method:**
1. Generate shuffled count matrices (randomize gene expression profiles)
2. Run GENIE3 on shuffled data to get null distribution of edge weights
3. Set threshold at 95th percentile of shuffled weights
4. Apply threshold to original networks to get binary adjacency matrices
5. Calculate precision metrics: TPR (sensitivity), FPR, Precision, F1-score

```{r cutoff, fig.alt="result Graphs"}
# Binary cutoff at 95th percentile (returns SummarizedExperiment)
binary_se <- cutoff_adjacency(
    count_matrices = mae,
    weighted_adjm_list = swadj_se,
    n = 1,
    method = "GENIE3",
    quantile_threshold = 0.95,
    nCores = 1,
    debug = TRUE
)

# Precision scores
pscores_joint <- pscores(adj_truth, binary_se)
head(pscores_joint)

# Network plot
if (requireNamespace("ggraph", quietly = TRUE)) {
    plotg(binary_se)
}
```

# 5. Consensus Networks and Community Similarity

Since we have three binary networks (one per experimental condition), we can
create a consensus network that captures edges consistently detected across
conditions. The "vote" method includes an edge in the consensus only if it
appears in at least 2 out of 3 individual networks, making it more robust to
condition-specific noise.

**Consensus Building:**
- **Vote method**: Edge included if present in majority of networks (≥2/3)
- **Union method**: Edge included if present in any network (≥1/3)
- **INet method**: Uses weighted evidence combination (more sophisticated)

```{r consensus, fig.alt="consensus Graph"}
# Consensus matrix (returns SummarizedExperiment)
consensus <- create_consensus(binary_se, method = "vote")

# Plot consensus network
if (requireNamespace("ggraph", quietly = TRUE)) {
    plotg(consensus)
}
```

```{r communities-adj-truth, fig.alt="Community detection adj_truth"}
# Compare consensus to truth using classify_edges and plot_network_comparison
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggraph", quietly = TRUE)) {
    # Wrap toy_adj_matrix in a SummarizedExperiment for classify_edges
    adj_truth_se <- build_network_se(list(ground_truth = adj_truth))

    # classify_edges expects SummarizedExperiment objects
    edge_classification <- classify_edges(
        consensus_matrix = consensus,
        reference_matrix = adj_truth_se
    )

    print(edge_classification)

    # Visualize network comparison (show_fp = FALSE to hide false positives)
    comparison_plot <- plot_network_comparison(
        edge_classification = edge_classification,
        show_fp = FALSE
    )

    print(comparison_plot)
}
```

Community detection identifies groups of highly interconnected genes that
likely share biological functions or regulatory mechanisms. We'll detect
communities in both our inferred consensus network and the ground-truth
network, then compare their similarity using several metrics.

**Community Detection Steps:**
1. Apply Louvain algorithm to identify network modules
2. Compare community assignments using multiple similarity measures:
- **Variation of Information (VI)**: Lower values = more similar
- **Normalized Mutual Information (NMI)**: Higher values = more similar
- **Adjusted Rand Index (ARI)**: Higher values = more similar

Now we'll detect communities in our ground-truth network to establish the
reference community structure:

```{r communities-gtruth, fig.alt="Community detection gtruth"}
if (requireNamespace("igraph", quietly = TRUE)) {
    comm_truth <- community_path(adj_truth)
}
```

```{r communities-consensus, fig.alt="Community detection consensus"}
# Detect communities in consensus network
if (requireNamespace("igraph", quietly = TRUE)) {
    comm_cons <- community_path(consensus)
}
```

Finally, we'll quantify how similar the community structures are between our
inferred consensus network and the ground-truth network using the broken-down
functions:

```{r community-similarity, fig.alt="Community similarity metrics"}
# Calculate community metrics (VI, NMI, ARI)
if (requireNamespace("igraph", quietly = TRUE)) {
    comm_metrics <- compute_community_metrics(comm_truth, list(comm_cons))
    print(comm_metrics)

    # Calculate topology metrics (modularity, density, etc.)
    # Note: compute_topology_metrics expects community_path outputs
    # Returns a list with $topology_measures and $control_topology
    topo_metrics <- compute_topology_metrics(comm_truth, list(comm_cons))
    print(topo_metrics)

    # Visualize community comparison
    if (requireNamespace("fmsb", quietly = TRUE)) {
        plot_community_comparison(
            community_metrics = comm_metrics,
            topology_measures = topo_metrics$topology_measures,
            control_topology = topo_metrics$control_topology
        )
    }
}
```

## 5.1. Edge Mining

Edge mining provides a detailed analysis of network reconstruction performance
by categorizing each predicted edge as True Positive (TP), False Positive (FP),
True Negative (TN), or False Negative (FN). This gives us insight into which
specific regulatory relationships were successfully recovered. The function now
returns a list of dataframes matching the documentation.

**Edge Mining Analysis:**
- **True Positives (TP)**: Correctly predicted edges that exist in ground truth
- **False Positives (FP)**: Incorrectly predicted edges not in ground truth
- **False Negatives (FN)**: Missed edges that exist in ground truth
- **True Negatives (TN)**: Correctly identified non-edges

```{r edge-mining}
em <- edge_mining(
    consensus,
    ground_truth = adj_truth,
    query_edge_types = "TP"
)
head(em[[1]])
```

```{r sessioninfo}
sessionInfo()
```