---
title: "Overview to curatedPCaData"
output: 
  BiocStyle::html_document
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Overview to curatedPCaData}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    warning = FALSE,
    message = FALSE,
    cache = TRUE,
    tidy = TRUE,
    width.cutoff = 70
)
```

# Package overview

This overview provides insight into the available datasets (R package version 
`r utils::packageVersion("curatedPCaData")`) provided via `ExperimentHub` cloud 
services. The main data class is a `MultiAssayExperiment` (MAE) object 
compatible with numerous Bioconductor packages.

***

3 different omics base data types and accompanying clinical/phenotype data are 
currently available: 

1. `gex.*` assays contain gene expression values, with the suffix wildcard 
indicating unit or method for gene expression
2. `cna.*` assays contain copy number values, with the suffix wildcard 
indicating method for copy number alterations
3. `mut` assays contain somatic mutation calls
4. `MultiAssayExperiment::colData(maeobj)` contains the clinical metadata 
curated based on a pre-defined template

Their availability is subject to the study in question, and you will find 
coverage of the omics here-in. Furthermore, derived variables based on these 
base data types are provided in the constructed `MultiAssayExperiment` (MAE) 
class objects.

For a comprehensive guide on how to neatly handle such `MAE` objects, refer to 
the MultiAssayExperiment user guide (or cheat-sheets): [MAE User Guide]
(https://www.bioconductor.org/packages/devel/bioc/vignettes/MultiAssayExperiment/inst/doc/MultiAssayExperiment.html) .

# R-package

The `curatedPCaData` package contains a collection of manually curated datasets 
concerning patients diagnosed with prostate cancer. The datasets within this 
package have followed uniform processing and naming conventions to allow users 
to more easily reproduce similar analyses between datasets and spend less time 
concerned with harmonzing data from different sources. 

# Downloading data from ExperimentHub or loading them from local cache

To get a full list of available datasets see the documentation for `getPCa` 
function, or via querying `ExperimentHub`directly for the components used to 
construct `MultiAssayExperiments` for the studies. However, `getPCa` is aimed 
to comprehensively provide readily usable multi-omics compatible MAE-objects.

## All available datasets

Fetching all datasets available in `curatedPCaData`:

```{r message=FALSE}
library(curatedPCaData)

# Use a function to extract all known study short identifiers
studies <- curatedPCaData::getPCaStudies()
studies

# List apply across studies to extract all MAE objects corresponding to the 
# short identifiers
maes <- lapply(studies, FUN=\(id) { curatedPCaData::getPCa(id) })
names(maes) <- studies
```

### Dataset criteria

The datasets were manually selected based on various criteria, such as:

- Primary data availability (preferably raw data available)
- Data platform types and their overlap (gene expression, copy number 
alteration, mutation data, ...)
- End points (e.g. recurrence, Gleason, ...)
- Clinical metadata availability and reliability
- Design of the study

### Studies

The function `getPCa` utilizes the studies' short name for identifying which 
data to extract. An overview into the main datasets is as follows:

```{r studies1, message = FALSE}
# Create a summary table depicting key features in available studies
studytable <- curatedPCaData::getPCaSummaryStudies(maes)
```

```{r studies2, results = 'asis', echo=FALSE}
knitr::kable(studytable, caption = "Key study characteristics")
```

Please note that the TCGA PCa dataset is a subset of the TCGA pan-cancer 
initiative. For a package focused on TCGA exclusively beyond the PRAD subset, 
see the Bioconductor package [curatedTCGAData]
(https://bioconductor.org/packages/release/data/experiment/html/curatedTCGAData.html).

### Citations

The use of `curatedPCaData` ought to be cited with [@Laajala2023].

For individual datasets there-in, the following citations are suggested:

* Abida et al. : [@Abida2019]
* Baca et al. : [@Baca2013]
* Barbieri et al. : [@Barbieri2012]
* Barwick et al. : [@Barwick2010]
* Chandran et al. : [@Chandran2007]
* Friedrich et al. : [@Friedrich2020] 
* Hieronymus et al. : [@Hieronymus2014]
* ICGC (Canadian) : PRAD-CA in [@Zhang2019]
* IGC : GEO accession GSE2109
* Kim et al. : [@Kim2018]
* Kunderfranco et al. : [@Kunderfranco2010], [@PeraldoNeia2011], [@Longoni2012]
* Ren et al. : [@Ren2018]
* Sun et al. : [@Sun2009]
* Taylor et al. : [@Taylor2010]
* TCGA : [@TCGA2015], [@Goldman2020]
* True et al. : [@True2006]
* Wallace et al. : [@Wallace2008]
* Wang et al. : [@Wang2010], [@Jia2011]
* Weiner et al. : [@Weiner2021]

### Curated clinical variables

The `curatedPCaData`-package has been curated with an emphasis on the following 
primary clinical metadata, which were extracted and cleaned up always when 
available:

```{r template, results='asis', echo=FALSE}
data(template_prad)
template <- template_prad
# Add spaces to |-dividers for linechanges
template <- do.call("cbind", lapply(template, FUN = function(x) {
    gsub("|", " | ", x, fixed = TRUE)
}))
knitr::kable(template, caption = "Template for prostate adenocarcinoma clinical 
metadata")
```

### Clinical end-points

Three primary clinical end-points were utilized and are offered in the clinical 
metadata in colData for the MAE-objects, if available:

* Gleason grade/Grade group(s)
* Biochemical Recurrence (BCR)
* Overall Survival (OS)

Below are summaries for each of these endpoints for each study. Of note, OS had 
very few events, thus survival modelling for this end-point may be considered 
unreliable.

Gleason grades:

```{r gleasons1}
# Create a summary table of Gleason grades
gleasons <- curatedPCaData::getPCaSummaryTable(
    maes, 
    var.name = "gleason_grade", 
    vals=5:10
)
```

```{r gleasons2, results = 'asis', echo=FALSE}
knitr::kable(gleasons, caption = "Gleason grades across datasets in 
curatedPCaData")
```

Grade groups:

```{r groups1}
# Create a summary table of grade groups
gradegroups <- curatedPCaData::getPCaSummaryTable(
    maes, 
    var.name = "grade_group", 
    vals=c("<=6", "3+4", "4+3", "7", ">=8")
)
```

```{r groups2, results = 'asis', echo=FALSE}
knitr::kable(gradegroups, caption = "Grade groups across datasets in 
curatedPCaData")
```

Biochemical recurrences:

```{r recurrences1}
# Create a summary table of biochemical recurrences
recurrences <- curatedPCaData::getPCaSummarySurv(
    maes, 
    event.name = "disease_specific_recurrence_status", 
    time.name = "days_to_disease_specific_recurrence"
)
```

```{r recurrences2, results = 'asis', echo=FALSE}
knitr::kable(recurrences, caption = "Disease recurrence end point across 
datasets in curatedPCaData")
```

Overall survival:

```{r os1}
# Create a summary table of overall survival
survivals <- curatedPCaData::getPCaSummarySurv(
    maes, 
    event.name = "overall_survival_status", 
    time.name = "days_to_overall_survival"
)
```

```{r os2, results = 'asis', echo=FALSE}
knitr::kable(survivals, caption = "Overall survival end point across datasets 
in curatedPCaData")
```

## Querying datasets

The function `getPCa` functions as the primary interface with building 
MAE-objects from either live download from `ExperimentHub` or by loading them 
from local cache, if the datasets have been downloaded previously.

The syntax for the function `getPCa(dataset, assays, timestamp, verbose, ...)` 
consists of the following parameters:

- `dataset`: Primary indicator for which study to query from `ExperimentHub`; 
notice that this may only be one of the allowed values.
- `assays`: This indicates which MAE-assays are fetched from the candidate 
ExperimentList. Two names are always required (and are filled if missing): 
`colData` which contains information on the clinical metadata, and `sampleMap` 
which maps the rownames of the metadata to columns in the fetched assay data. 
- `timestamp`: When data is deposited in the `ExperimentHub` resources, they 
are time stamped to avoid ambiguity. The timestamps provided in this parameter 
are resolved from left to right, and the first deposit stamp is `"20230215"`. 
- `verbose`: Logical indicator whether additional information should be printed 
by `getPCa`.
- `...`: Further custom parameters passed on to `getPCa`.

As an example, let us consider querying the TCGA dataset, but suppose only wish 
to extract the gene expression data, and the immune deconvolution results 
derived by the method xCell. Further, we'll request risk and AR scores slot. 
This subset could be retrieved with:

```{r tcgaex}
tcga_subset <- getPCa(
    dataset = "tcga", 
    assays = c("gex.rsem.log", "xcell", "scores"), 
    timestamp = "20230215"
)

tcga_subset
``` 

The standard way of extracting the latest MAE-object with all available assays 
is done via querying with just the dataset name:

```{r ehquery}
mae_tcga <- getPCa("tcga")
mae_taylor <- getPCa("taylor")
```

### Accessing primary data

The primary assay names in the MAE objects for gene expression and copy number 
alteration will consist of two parts. Mutation data is provided as a 
`RaggedExperiment` object.

- Prefix indicating data type, either "gex." or "cna.".
- Suffix indicating unit and processing for the data; for example, a gene 
expression dataset (gex) may have a suffix of "rma" for RMA-processed data, 
"fpkm" for processed RNA-seq data, "relz" for relative z-score normalized 
expression values for tumor-normal gene expression pairs, or "logq" for 
logarithmic quantile-normalized data. The main suffix for copy number 
alteration is the discretized GISTIC alteration calls with values {-2,-1,0,1,2},
although earlier version also provided log-ratios ("logr")
- Mutation data is provided as `RaggedExperiment` objects as "mut".

The standard way for accessing a data slot in MAE could be done for example via:

```{r access}
mae_taylor[["gex.rma"]][1:5, 1:5]
```
The corresponding clinical variables have an accessor function `colData` 
provided by the `MultiAssayExperiment`-package:

```{r clinical}
MultiAssayExperiment::colData(mae_tcga)[1:2, ]
```

While it is ideal to make sure user is using the correct namespaces, the 
`pckgName::` can be omitted as `curatedPCaData` imports necessary packages such
as `MultiAssayExperiment` and their functions should be available in the 
workspace.

### ExperimentHub data listing

In order to access the latest listing of `curatedPCaData` related resources 
available in `ExperimentHub`, consult the `metadata.csv` file delivered with 
the package:

```{r metadat}
metadata <- read.csv(system.file("extdata", "metadata.csv", package = 
    "curatedPCaData"))
head(metadata)
```

## Omics sample count and overlap

```{r samplecountsoverlap}
# Retrieve samples counts across different unique assay names as well as omics 
# overlap sample counts
samplecounts <- curatedPCaData::getPCaSummarySamples(maes)
```

The sample counts in each 'omics separately is listed below:

```{r samplecounts, results='asis', echo=FALSE}
knitr::kable(samplecounts$Samples, caption = "Sample N counts in each omics for 
every MAE object")
```

However, taking intersections between different omics shows that different 
samples were analyzed on different platforms - therefore the effective N counts 
for analyzing multiple 'omics platforms simultaneously is smaller. The overlaps 
between gene expression (GEX), copy number alteration (CNA), and mutations (MUT)
are shown below:

```{r sampleoverlap, results='asis', echo=FALSE}
knitr::kable(samplecounts$Overlap, caption = "Sample N counts for intersections 
between different omics")
```

# Derived variables

In `curatedPCaData` we refer to derived variables as further downstream 
variables, which have been computed based on primarily data. For most cases, 
this was done by extracting key gene information from the `gex.*` assays and 
pre-computing informative downstream markers as described in their primary 
publications.

## Immune deconvolution

Tumor progression depends on the immune cell composition in the tumor 
microenvironment. The '[immunedeconv](https://github.com/icbi-lab/immunedeconv)'
package consists of different computational methods to computationally estimate 
immune cell content using gene expression data. In addition, CIBERTSORTx is 
provided externally, as this method required registered access. For user 
convenience, it has been run separately and provided as a slot in the MAE 
objects. The other methods have been run using the `immunedeconv` package 
[@Sturm2019] and code for reproducing these derived variables are provided 
alongside the package.

In this package, we provide estimates of immune cell content from the following 
deconvolution methods:

- quanTIseq
- xCell
- EPIC
- MCP counter
- CIBERSORT(x)
- ESTIMATE

The estimates from each of these methods are stored in the MAE object as a 
seperate assay as shown for example in the Taylor dataset
```{r}
mae_taylor
```

To access the quantiseq results for the Taylor et. al dataset, these 
pre-computed values can be obtained from the corresponding slot in the 
MAE-object:
```{r}
head(mae_taylor[["cibersort"]])[1:5, 1:3]
```

Similarly to access results from the other immune deconvolution methods, the 
following assays/experiments are also available:
```{r}
head(mae_taylor[["quantiseq"]])[1:5, 1:3]
head(mae_taylor[["xcell"]])[1:5, 1:3]
head(mae_taylor[["epic"]])[1:5, 1:3]
head(mae_taylor[["mcp"]])[1:5, 1:3]
```

Each row of the deconvolution matrix represents the content of a certain immune 
cell type and the columns represent the patient sample IDs. The variables on 
the rows are specific for each method. Further, it should be noted that not all 
methods could be run on all datasets due to lack of overlap in genes of 
interest.

## Risk scores and other metrics

The slot `scores` is used to provide key risk scores or other informative 
metrics based on the primary data. These scores can be accessed as a matrix as 
if they were variables on an assay with this name:

```{r scores}
mae_tcga[["scores"]][, 1:4]
```

The following PCa risk scores are offered:

- Decipher `(rowname: decipher)` [@Herlemann2019]
- Oncotype DX `(rowname: oncotype)` [@Knezevic2013]
- Prolaris `(rowname: prolaris)` [@NICE2018]

Further, the 20-gene Androgen Receptor (AR) score is calculated as described 
in the TCGA's Cell 2015 paper:

- AR score `(rowname: ar_score)` [@TCGA2015]

# Single study example for Taylor et al.

Here, a brief example on how to download and process a single study is 
provided. The example data is of Taylor et al. [@Taylor2010], also known as 
the MSKCC dataset.

## Downloading the MAE-object

A character vector with the short study ID is used to download the MAE 
object; we will focus only on the primary prostate cancer samples and 
CNA (GISTIC) and GEX:

```{r}
taylor <- getPCa("taylor", 
    assays = c("gex.rma", "cna.gistic"),
    sampletypes = "primary"
)

class(taylor) 

taylor
```

One typical end-point is an object of type `Surv`, exported from the 
`survival`-package. We will create this end-point for biochemical 
recurrence:

```{r}
library(survival)
# BCR events
taylor_bcr <- colData(taylor)$disease_specific_recurrence_status
# BCR events / censoring follow-up time
taylor_fu <- colData(taylor)$days_to_disease_specific_recurrence

taylor_surv <- Surv(event = taylor_bcr, time = taylor_fu)

class(taylor_surv)

head(taylor_surv)

```

With the response vector of type `Surv`, one can plot and analyze multiple 
survival modelling related tasks.

## Kaplan-Meier curves

One of the most common ways to depict survival curves (here BCR events), is 
a Kaplan-Meier (KM) curve. Gleason grade is known to be a good prognostic 
factor for BCR, thus a KM curve in respect to biopsy Gleason grade shows 
differences in prognosis.

For this visualization, package `survminer` offers a variety of functions 
building on top of `ggplot`:

```{r}
library(survminer)

taylor_bcr_gleason <- data.frame(bcr = taylor_surv, 
    gleason = colData(taylor)$gleason_grade)

fit <- survfit(bcr ~ gleason, data = taylor_bcr_gleason)
ggsurvplot(fit, 
    data = taylor_bcr_gleason, 
    ylab = "Biochemical recurrence free proportion",
    risk.table = TRUE,
    size = 1,
    pval = TRUE,
    ggtheme = theme_bw()
    )
```

For more settings for the KM plots, see documentation for 
`survminer::ggsurvplot`.

## Cox regression

Functions `longFormat` and `wideFormat` from `MultiAssayExperiment` are 
essential for extracting multi-omics data and metadata in the right format. 
For the purposes of Cox regression, we will utilize `wideFormat`:

```{r}
taylor_coxdat <- MultiAssayExperiment::wideFormat(taylor["PTEN",,],
    colDataCols = c("age_at_initial_diagnosis", "gleason_grade",
    "disease_specific_recurrence_status", 
    "days_to_disease_specific_recurrence"))

taylor_coxdat <- as.data.frame(taylor_coxdat)

taylor_coxdat$y <- Surv(
    time = taylor_coxdat$days_to_disease_specific_recurrence,
    event = taylor_coxdat$disease_specific_recurrence_status)
    
head(taylor_coxdat)    
```

We'll construct a simple Cox proportional hazards model with few variables; 
PTEN is a known tumor suppressor gene, so changes in its copy number or gene 
expression levels could also play a role in biochemical recurrence.

```{r}
coxmodel <- coxph(y ~ cna.gistic_PTEN + gex.rma_PTEN + gleason_grade, 
    data = taylor_coxdat)
coxmodel
```

In this case we took the GISTIC normalized PTEN amplification as well as 
its RMA-normalized gene expression. We notice that both are statistically 
significant Cox regression coefficients together with Gleason grade. 

# Session info

```{r session}
sessionInfo()
```

# References
