---
title: "Pre-Processing for the Zebrafish RNA-Seq Gene-Level Counts"
author: "Davide Risso"
date: "Last modified: June 7, 2024; Compiled: `r format(Sys.time(), '%B %d, %Y')`"
bibliography: biblio.bib
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteEncoding{UTF-8}
---

<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Pre-Processing for the Zebrafish RNA-Seq Gene-Level Counts}
-->

# Introduction

This vignette describes the pre-processing steps that were followed for
the generation of the gene-level read counts contained in the Bioconductor package
_zebrafishRNASeq_.

# Sample preparation and sequencing

Olfactory sensory neurons were isolated from three pairs of
gallein-treated and control embryonic zebrafish pools and purified by
fluorescence activated cell sorting (FACS)
[@ferreira2014silencing].  Each RNA sample was enriched in
poly(A)+ RNA from 10--30 ng total RNA and 1 $\mu$L (1:1000 dilution) of Ambion ERCC ExFold RNA Spike-in Control Mix 1 was added to 30 ng of total RNA before mRNA isolation. cDNA libraries were prepared
according to manufacturer's protocol.
The six libraries were sequenced in two multiplex runs on an Illumina HiSeq2000 sequencer,
yielding approximately 50 million 100bp paired-end reads per
library. 

# Read alignment and expression quantitation

We made use of a custom reference sequence,
defined as the union of the zebrafish reference genome (Zv9,
downloaded from Ensembl [@flicek2012ensembl], v. 67) and the [ERCC
spike-in sequences](https://www.thermofisher.com/order/catalog/product/4456739).  Reads were
mapped with TopHat [@trapnell2009tophat] (v. 2.0.4), with the
following parameters,
```
--library-type=fr-unstranded -G ensembl.gtf --transcriptome-index=transcript --no-novel-juncs 
```
where _ensembl.gtf_ is a GTF file containing Ensembl gene
annotation.

Gene-level read counts were obtained using the htseq-count python script [@htseq] in the "union" mode and Ensembl (v. 67) gene annotation.

After verifying that there were no run-specific biases, we used the sums of the counts of the two runs as the expression measures for each library.

# Loading the zebrafish data into R

To load the gene-level read counts into R, simply type

```{r loadData}
library(zebrafishRNASeq)
data(zfGenes)

head(zfGenes)
``` 

The ERCC spike-in read counts are in the last rows of the same matrix
and can be retrieved in the following way.

```{r ercc, eval=TRUE, results="markup"}
spikes <- zfGenes[grep("^ERCC", rownames(zfGenes)),]
head(spikes)
``` 

The typical use of this dataset is the indentification of differentially expressed genes between control (Ctl) and treated (Trt) samples. For additional details, exploratory analysis, and normalization of the zebrafish data see @risso2014ruv and @risso2014role. The data are used as a case study for the Bioconductor package
_RUVSeq_.

# Session info

```{r sessionInfo}
sessionInfo()
``` 

# References