---
title: "MicrobiomeBenchmarkData"
author:
  - name: "Samuel Gamboa"
    email: "Samuel.Gamboa.Tuz@gmail.com"
  - name: "Levi Waldron"
output: 
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{MicrobiomeBenchmarkData}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction

The `MicrobiomeBenchmarkData` package provides access to a collection of
datasets with biological ground truth for benchmarking differential 
abundance methods. The datasets are deposited on Zenodo:
https://doi.org/10.5281/zenodo.6911026

# Installation

```{r installation, eval=FALSE}
## Install BiocManager if not installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

## Release version (not yet in Bioc, so it doesn't work yet)
BiocManager::install("MicrobiomeBenchmarkData")

## Development version
BiocManager::install("waldronlab/MicrobiomeBenchmarkData") 
```

```{r, message=FALSE}
library(MicrobiomeBenchmarkData)
library(purrr)
```

# Sample metadata

All sample metadata is merged into a single data frame and provided as a data
object:

```{r}
data('sampleMetadata', package = 'MicrobiomeBenchmarkData')
## Get columns present in all samples
sample_metadata <- sampleMetadata |> 
    discard(~any(is.na(.x))) |> 
    head()
knitr::kable(sample_metadata)
```
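
The column filter above relies on a data frame being a list of columns, so `purrr::discard()` drops every column for which the predicate is `TRUE`. A minimal sketch on toy data follows; the data frame `toy_metadata` and its values are hypothetical and are not taken from `sampleMetadata`:

```r
library(purrr)

## Toy metadata (hypothetical values): `ph` is missing for one sample
toy_metadata <- data.frame(
    sample_id = c("s1", "s2", "s3"),
    body_site = c("oral", "oral", "stool"),
    ph = c(6.8, NA, 7.1)
)

## Drop every column that contains at least one missing value
complete_cols <- toy_metadata |>
    discard(~ any(is.na(.x)))

names(complete_cols)
```

Only `sample_id` and `body_site` remain, because `ph` has a missing value.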

# Accessing datasets

Currently, there are `r nrow(MicrobiomeBenchmarkData::getBenchmarkData())`
datasets available through the MicrobiomeBenchmarkData package. These datasets
are accessed through the `getBenchmarkData` function.

## Print available datasets

If no arguments are provided, the list of available datasets is printed on
screen and a data.frame is returned with the description of the datasets:

```{r}
dats <- getBenchmarkData()
```
```{r}
dats
```

## Access a single dataset

To import a dataset, call the `getBenchmarkData` function with the name of the
dataset as the first argument (`x`) and the `dryrun` argument set to `FALSE`.
The output is a list containing the dataset as a TreeSummarizedExperiment
object.

```{r}
tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]
tse
```
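
Once imported, the object can be inspected with the standard accessors from the SummarizedExperiment package. The sketch below assumes only that the dataset has at least one assay; the variable names `abundance` and `col_data` are illustrative:

```r
library(MicrobiomeBenchmarkData)
library(SummarizedExperiment)

tse <- getBenchmarkData(
    'HMP_2012_16S_gingival_V35_subset', dryrun = FALSE
)[[1]]

## Names of the assays stored in the object
assayNames(tse)

## First assay: features in rows, samples in columns
abundance <- assay(tse)
dim(abundance)

## Sample-level metadata for this dataset
col_data <- colData(tse)
head(colnames(col_data))
```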

## Access a few datasets

Several datasets can be imported simultaneously by giving the names of the 
different datasets in a character vector:

```{r}
list_tse <- getBenchmarkData(dats$Dataset[2:4], dryrun = FALSE)
str(list_tse, max.level = 1)
```
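
Because the result is a list, it can be summarized with the usual list tools. As a small sketch, the dimensions of each imported dataset can be collected with `purrr::map()`; the name `dims` is illustrative:

```r
library(MicrobiomeBenchmarkData)
library(purrr)

dats <- getBenchmarkData()
list_tse <- getBenchmarkData(dats$Dataset[2:4], dryrun = FALSE)

## One element per dataset: c(number of features, number of samples)
dims <- map(list_tse, dim)
dims
```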

## Access all of the datasets

To import all of the datasets, provide the `dryrun = FALSE` argument alone.

```{r}
mbd <- getBenchmarkData(dryrun = FALSE)
str(mbd, max.level = 1)
```

# Annotations for each taxon are included in rowData

The biological annotations of each taxon are provided as a column in the
`rowData` slot of the TreeSummarizedExperiment.

```{r}
## In this case, the column is named taxon_annotation
tse <- mbd$HMP_2012_16S_gingival_V35_subset
rowData(tse)
```
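
To see how the taxa are distributed across annotation categories, the annotation column can be tabulated. This is a minimal sketch that assumes the column is named `taxon_annotation`, as in the dataset above; other datasets may use a different column name:

```r
library(MicrobiomeBenchmarkData)
library(SummarizedExperiment)

tse <- getBenchmarkData(
    'HMP_2012_16S_gingival_V35_subset', dryrun = FALSE
)[[1]]

## Number of taxa per annotation category; useNA also counts
## taxa without an annotation, if there are any
annotation_counts <- table(rowData(tse)$taxon_annotation, useNA = "ifany")
annotation_counts
```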

# Cache 

The datasets are cached so they're only downloaded once. The cache and all of
the files contained in it can be removed with the `removeCache` function.

```{r, eval=FALSE}
removeCache()
```


# Session information

```{r}
sessionInfo()
```