---
title: "HDCytoData package"
author: 
  - name: Lukas M. Weber
    affiliation: 
      - &id1 "Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland"
      - &id2 "SIB Swiss Institute of Bioinformatics, Zurich, Switzerland"
  - name: Charlotte Soneson
    affiliation: 
      - &id3 "Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland"
      - &id4 "SIB Swiss Institute of Bioinformatics, Basel, Switzerland"
package: HDCytoData
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{1. HDCytoData package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```



# Overview

The `HDCytoData` package is an extensible resource containing a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) benchmark datasets, which have been formatted into `SummarizedExperiment` and `flowSet` Bioconductor object formats. The data objects are hosted on Bioconductor's `ExperimentHub` platform.

The objects each contain one or more tables of cell-level expression values, as well as all required metadata. Row metadata includes sample IDs, group IDs, patient IDs, reference cell population or cluster labels (where available), and labels identifying 'spiked in' cells (where available). Column metadata includes channel names, protein marker names, and protein marker classes (cell type, cell state, as well as non protein marker columns).

Note that raw expression values should be transformed prior to any downstream analyses (see below).

Currently, the package includes benchmark datasets used in our previous work to evaluate methods for clustering and differential analyses. The datasets are provided here in `SummarizedExperiment` and `flowSet` formats in order to make them easier to access and integrate into R/Bioconductor workflows.

For more details, see our paper describing the `HDCytoData` package:

- [Weber L.M. and Soneson C. (2019), *HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats*, F1000Research, 8:1459, v2.](https://f1000research.com/articles/8-1459)



# Datasets

The package contains the following datasets, which can be grouped into datasets useful for benchmarking methods for (i) clustering, and (ii) differential analyses.

- Clustering:
    - Levine_32dim
    - Levine_13dim
    - Samusik_01
    - Samusik_all
    - Nilsson_rare
    - Mosmann_rare

- Differential analyses:
    - Krieg_Anti_PD_1
    - Bodenmiller_BCR_XL
    - Weber_AML_sim (multiple datasets from several simulation scenarios)
    - Weber_BCR_XL_sim (multiple datasets from several simulation scenarios)

Extensive documentation is available in the help files for the objects. For each dataset, this includes a description of the dataset (e.g. biological context, number of samples and conditions, number of cells, number of reference cell populations, number and classes of protein markers, etc.), as well as an explanation of the object structures, details on accessor functions required to access the expression tables and metadata, and references to original data sources.

File sizes are listed in the help files for the datasets. The `removeCache` function from the `ExperimentHub` package can be used to clear the local download cache (see `ExperimentHub` documentation).

The help files can be accessed by the dataset names, e.g. `?Bodenmiller_BCR_XL` or `help(Bodenmiller_BCR_XL)`.


## Programmatic access to list of datasets

An updated list of all available datasets can also be obtained programmatically using the `ExperimentHub` accessor functions, as follows. This retrieves a table of metadata from the `ExperimentHub` database, which includes information such as the ExperimentHub ID, title, and description for each dataset.

```{r}
suppressPackageStartupMessages(library(ExperimentHub))

# Create ExperimentHub instance
ehub <- ExperimentHub()

# Find HDCytoData datasets
ehub <- query(ehub, "HDCytoData")
ehub

# Retrieve metadata table
md <- as.data.frame(mcols(ehub))

head(md, 2)
```



# How to load data

This section shows how to load the datasets, using one of the datasets (`Bodenmiller_BCR_XL`) as an example.

The datasets can be loaded by either (i) referring to named functions for each dataset, or (ii) creating an `ExperimentHub` instance and referring to the dataset IDs. Both methods are demonstrated below.

See the help files (e.g. `?Bodenmiller_BCR_XL`) for details about the structure of the `SummarizedExperiment` or `flowSet` objects.

Load the datasets using named functions:

```{r}
suppressPackageStartupMessages(library(HDCytoData))

# Load 'SummarizedExperiment' object using named function
Bodenmiller_BCR_XL_SE()

# Load 'flowSet' object using named function
Bodenmiller_BCR_XL_flowSet()
```


Alternatively, load the datasets by creating an `ExperimentHub` instance:

```{r}
# Create ExperimentHub instance
ehub <- ExperimentHub()

# Find HDCytoData datasets
query(ehub, "HDCytoData")

# Load 'SummarizedExperiment' object using dataset ID
ehub[["EH2254"]]

# Load 'flowSet' object using dataset ID
ehub[["EH2255"]]
```


## Using the data

Once the datasets have been loaded from `ExperimentHub`, they can be used as normal within an R session. For example, using the `SummarizedExperiment` form of the dataset loaded above:

```{r}
# Load dataset in 'SummarizedExperiment' format
d_SE <- Bodenmiller_BCR_XL_SE()

# Inspect object
d_SE
length(assays(d_SE))
assay(d_SE)[1:6, 1:6]
rowData(d_SE)
colData(d_SE)
metadata(d_SE)
```


## Transformation of raw data

Note that flow and mass cytometry data should be transformed prior to performing any downstream analyses, such as clustering. Standard transformations include the `asinh` with `cofactor` parameter equal to 5 for mass cytometry (CyTOF) data, or 150 for flow cytometry data (see Bendall et al. 2011, Supplementary Figure S2).


## Exploring the data

Interactive visualizations to explore the datasets can be generated from the `SummarizedExperiment` objects using the [iSEE](http://bioconductor.org/packages/iSEE) ("Interactive SummarizedExperiment Explorer") package, available from Bioconductor (Soneson, Lun, Marini, and Rue-Albrecht, 2018), which provides a Shiny-based graphical user interface to explore single-cell datasets stored in the `SummarizedExperiment` format. For more details, see the `iSEE` package vignettes.



# Contribution guidelines

We welcome contributions or suggestions for new datasets to include in the `HDCytoData` package. Contribution guidelines are provided in the [Contribution guidelines](http://bioconductor.org/packages/release/data/experiment/vignettes/HDCytoData/inst/doc/Contribution_ _guidelines.html) vignette, available from [Bioconductor](http://bioconductor.org/packages/HDCytoData).



# Citation

If the `HDCytoData` package is useful in your work, please cite the following paper:

- [Weber L.M. and Soneson C. (2019), *HDCytoData: Collection of high-dimensional cytometry benchmark datasets in Bioconductor object formats*, F1000Research, 8:1459, v2.](https://f1000research.com/articles/8-1459)



