---
title: "General introduction"
author: "Carl Brunius, Vilhelm Suksi"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{General introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
biblio-style: apalike
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "##"
)
```

# Motivation

From the perspective of metabolites as the continuation of the central dogma of 
biology, metabolomics provides the closest link to many phenotypes of interest.
This makes untargeted LC-MS metabolomics data promising in teasing apart the 
complexities of living systems. However, due to experimental reasons, the data
includes non-wanted variation which limits quality and reproducibility, 
especially if the data is obtained from several batches. 

The batchCorr package reduces unwanted variation by way of between-batch 
alignment, within-batch drift correction and between-batch normalization using 
batch-specific quality control (QC) samples and long-term reference QC samples. 
Please see the associated article [@brunius2016large] for more thorough 
descriptions of algorithms.

# Installation

To install ```batchCorr```, install BiocManager first, if it is not installed. 
Afterwards use the install function from BiocManager.

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("batchCorr")
```

# Data

The example data allows for demonstration of all core ```batchCorr``` 
functionality: between-batch alignment, within-batch drift correction and 
between-batch normalization. The example data consisting of three batches from 
a single analytical mode consists of three objects: PTnofill 
(non-imputed/filled abundances, matrix), PTfilled (imputed/filled abundances) 
and meta (sample and feature metadata, data.frame). 

```{r, message=FALSE, warning=FALSE}
library(batchCorr)
data("ThreeBatchData")
```

# How it works

`batchCorr` was originally designed to work with basic data structures but the 
three main methods, `alignBatches`, `correctDrift` and 
`normalizeBatches` also support SummarizedExperiment. This improves 
interoperability with other Bioconductor packages. This includes xcms for 
preprocessing, the qmtools, phenomis and/or pmp packages to complement 
normalization and quality control as well as statistical tests, machine 
learning and annotation. 

Abundances are included in a matrix, while sample and feature data is included 
in a data.frame. `batchCorr` works best as per the chronology presented below, 
where both the original functionality and SummarizedExperiment-functionality is 
demonstrated.

Utility functions for the original functionality include:

- `peakInfo` to extract m/z and rt from peak table based on a separator in 
rownames
- `getBatch` to extract specific batch from a list with a peak table and 
metadata
- `mergeBatches` to merge batches after drift correction

Important analytical background includes batch-specific QC samples and long-term
reference QC samples, which are regularly interspersed in the injection 
sequence. Batch-specific QC samples are typically pooled aliquots of study 
samples, and are used for within-batch drift correction. Long-term reference 
QC samples are not of the same biological origin as the batch-specific 
QC samples, and are therefore not directly representative of the sample 
population. Long-term reference QC samples are used for between-batch alignment,
within-batch drift correction and between-batch normalization.

Let's create a SummarizedExperiment object in order to demonstrate the original 
functionality and the new SummarizedExperiment methods in parallel. 
SummarizedExperiment may also be output from xcms-based preprocessing using 
`xcms::quantify()`. 

```{r}
peaks <- SimpleList(t(PTnofill), t(PTfill))
sampleData <- meta
featureData <- peakInfo(PT = PTnofill, sep = "@", start = 3)
rownames(featureData) <- rownames(peaks[[1]])

se <- SummarizedExperiment(assays = peaks, colData = sampleData,
    rowData = featureData)
names(assays(se)) <- c("nofill", "fill")
```

Below, we focus on basic usage. The list output for basic data structures 
includes processing metadata which can be used for troubleshooting.
The SummarizedExperiment methods return objects with modified peak tables. 
Please see the documentation for more information (for example, ?alignBatches).


## Between-batch alignment

Shifts in retention time (RT) and mass-to-charge ratio (m/z) across batches 
results in some metabolites being redundantly represented in the dataset. 
To rectify this, between-batch alignment of features is performed using 
`alignBatches`, which encompasses the following steps:

1. Aggregation of feature presence/missingness on batch level
- batch-wise flagging of low-quality features with proportion of `NA`s to 
all samples > 80% based on long-term reference QC samples
- 0 < total batch presence of candidates features < number of batches to be an 
alignment candidate

2. Identification of features with missingness within "the box", i.e. 
sufficiently similar in RT and m/z
- potential alignment candidates have similar RT and m/z across batches
- orthogonal batch presence: two or more alignment candidates cannot be 
present in the same batch
- if there are multiple combinations of candidates across batches, the 
features are recursively subclustered before clustering across batches

3. Alignment of feature clusters resulting in new peak table

```{r}
# Extract peakinfo (i.e. m/z and rt of features),
# These column names have 2 leading characters describing LC-MS mode
# -> start at 3
peakIn <- peakInfo(PT = PTnofill, sep = "@", start = 3)
# Perform multi-batch alignment
alignBat <- alignBatches(
    peakInfo = peakIn, PeakTabNoFill = PTnofill,
    PeakTabFilled = PTfill, batches = meta$batch,
    sampleGroups = meta$grp, selectGroup = "QC",
    report = FALSE
)
# Extract new peak table
PT <- alignBat$PTalign
```
Below, we use SummarizedExperiment with multiple peak tables. When using 
SummarizedExperiment sequentially such that a single peak table is replaced by 
a new one, one doesn't need to specify the `assay.type` or `name` parameters. 
The assay is added to the SummarizedExperiment supplied to `PeakTabFilled`.

```{r}
se <- alignBatches(PeakTabNoFill = se, PeakTabFilled = se, batches = "batch",
    sampleGroups = "grp", report = FALSE, assay.type1 = "nofill",
    assay.type2 = "fill", name = "aligned", rt_col = "rt", mz_col = "mz")
```

## Within-batch drift correction
Drift in abundance within a batch gives rise to unwanted variation which can be
modelled in terms of injection order. Many methods fail to take into account 
different drift patterns in features or are prone to overfitting. Herein, 
within-batch drift correction is performed using the wrapper `correctDrift`,
which involves:

1. Clustering of features in observation space
- scaling by standard deviation
- clustering serves to identify features with similar drift patterns, which are 
corrected in aggregate. As such, different drift patterns are accounted for 
while mitigating overfitting to unwanted variation in a single feature.

2. Fitting a cubic spline and calculation of correction factor

3. Correction of the abundances using correction factor
- corrects to reference level at the first injection after scaling
- corrected values were retained only if the root mean square deviation of 
long-term reference QC samples was reduced after drift correction for the 
cluster at large

The mixture models used to cluster the drift patterns in `correctDrift()` can 
fail to converge for some combinations of geometry and cluster number. This 
results in missing Bayesian Information Criterion (BIC) measures for some 
models. You can check from which converged models the final model was selected 
using the "BIC" element in the output for basic data structures. To learn more 
about the model-based clustering used herein, refer to the mclust (e)book
[@mclust].

```{r, message = FALSE}
# Batch B
batchB <- getBatch(
    peakTable = PT, meta = meta,
    batch = meta$batch, select = "B"
)
BCorr <- correctDrift(
    peakTable = batchB$peakTable,
    injections = batchB$meta$inj,
    sampleGroups = batchB$meta$grp, QCID = "QC",
    G = seq(5, 35, by = 3), modelNames = c("VVE", "VEE"),
    report = FALSE
)
# Batch F
batchF <- getBatch(
    peakTable = PT, meta = meta,
    batch = meta$batch, select = "F"
)
FCorr <- correctDrift(
    peakTable = batchF$peakTable,
    injections = batchF$meta$inj,
    sampleGroups = batchF$meta$grp,
    QCID = "QC", G = seq(5, 35, by = 3),
    modelNames = c("VVE", "VEE"),
    report = FALSE
)
# Batch H
batchH <- getBatch(
    peakTable = PT, meta = meta,
    batch = meta$batch, select = "H"
)
HCorr <- correctDrift(
    peakTable = batchH$peakTable,
    injections = batchH$meta$inj,
    sampleGroups = batchH$meta$grp,
    QCID = "QC", G = seq(5, 35, by = 3),
    modelNames = c("VVE", "VEE"),
    report = FALSE
)

HCorr$BIC
```

Similarly, we subset the SummarizedExperiment by batch and perform drift 
correction, but for this example using long-term reference samples for quality 
indicators.

```{r, message = FALSE}
batch_labels <- unique(colData(se)$batch)
batches <- lapply(batch_labels, function(batch_label) {
    se[, colData(se)$batch == batch_label]
})

batches <- lapply(batches, correctDrift, injections = "inj", 
    sampleGroups = "grp", RefID = "Ref", G = seq(5, 35, by = 3), 
    modelNames = c("VVE", "VEE"), report = FALSE, 
    assay.type = "aligned", name = "corrected")
```

## Between-batch normalization

`normalizeBatches` performs between-batch normalization either based on 
long-term reference QC samples or median batch intensity depending on the 
following dual criterion:

1. long-term reference QC sample CV < 30%
2. fold-change < 5 for the ratio of the average feature intensity of a specific
feature between batches to the ratio of the all-feature average intensity 
between batches

If the long-term QC samples are not considered reliable according to the above 
dual criterion for a specific feature, batches were normalized by sample 
population median, where a sample population can be specified explicitly to 
the `population` argument. Features not present in all batches are also 
excluded from the dataset.

```{r, warning = FALSE}
mergedData <- mergeBatches(list(BCorr, FCorr, HCorr), qualRatio = 0.5)
normData <- normalizeBatches(
    peakTableCorr = mergedData$peakTableCorr,
    batches = meta$batch, sampleGroup = meta$grp,
    refGroup = "Ref", population = "all"
)
PTnorm <- normData$peakTable
```

Merging with `mergeBatches`, as above, includes a quality control step, where 
features with CV > the limit (supplied to correctDrift()) in a specified 
proportion of batches (default = 0.5) are excluded. For SummarizedExperiment, 
we join the batches, keeping features shared across all batches but without 
filtering features by proportion of quality batches. 

```{r, results = FALSE}
se <- do.call(cbind, batches)
se <- se[which(apply(assay(se), 1, function(x) any(is.na(x)))), ]

se <- normalizeBatches(peakTableCorr = se, 
    batches = "batch", sampleGroup = "grp", refGroup = "Ref",
    population = "all", assay.type = "corrected", name = "normalized")
```

# Authors & Acknowledgements

The first version of `batchCorr` was written by Carl Brunius. 
`batchCorr` was developed for Bioconductor by Carl Brunius, 
Anton Ribbenstedt and Vilhelm Suksi. If you find any bugs or other things to 
fix, please submit an issue on GitHub! All contributions to the package are 
always welcome!

# Session information

```{r, echo = FALSE}
sessionInfo()
```

# References