<!--
%\VignetteEngine{knitr::knitr}
%\VignetteEncoding{UTF-8}
%\VignetteIndexEntry{Sample output files for tximport}
-->

# Sample output files for tximport

This data package provides a set of output files from running a number
of various transcript abundance quantifiers on 6 samples from the
[GEUVADIS Project](http://www.geuvadis.org/web/geuvadis). The
files are contained in the `inst/extdata` directory.

A citation for the GEUVADIS Project is:

> Lappalainen, et al., "Transcriptome and genome sequencing uncovers
> functional variation in humans", Nature 501, 506-511 (26 September
> 2013) [doi:10.1038/nature12531](http://dx.doi.org/10.1038/nature12531).

The purpose of this vignette is to detail which versions of software
were run, and exactly what calls were made.

# Sample information and quantification files

A small file, `samples.txt` is included in the `inst/extdata` directory:

```{r}
dir <- system.file("extdata", package="tximportData")
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples
```

Further details can be found in a more extended table:

```{r}
samples.ext <- read.delim(file.path(dir,"samples_extended.txt"), header=TRUE)
colnames(samples.ext)
```

The quantification outputs themselves can be found in sub-directories:

```{r}
list.files(dir)
list.files(file.path(dir,"cufflinks"))
list.files(file.path(dir,"rsem","ERR188021"))
list.files(file.path(dir,"kallisto","ERR188021"))
list.files(file.path(dir,"salmon","ERR188021"))
list.files(file.path(dir,"sailfish","ERR188021"))
list.files(file.path(dir,"alevin"))
```

# Genome and gene annotation file

* For Cufflinks and Sailfish, the Illumina iGenomes was used as the index, see details below. 
* For RSEM, Salmon and kallisto (without inference replicates), 
  the Gencode v27 CHR transcripts were used (`gencode.v27.transcripts.fa`).
* For the `salmon_gibbs` and `kallisto_boot` directories, 
  the Ensembl v87 cDNA transcripts were used (`Homo_sapiens.GRCh38.cdna.all.fa`).
* For the `salmon_dm` directory, the Ensembl Drosophila melanogaster
  v92 transcripts were used (either just cDNA or combining cDNA with
  non-coding transcripts).

Illumina iGenomes: The human genome and annotations were downloaded from
[Illumina iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html)
for the UCSC hg19 version. The human genome FASTA file used was in the
`Sequence/WholeGenomeFasta` directory and the gene annotation GTF file used
was the `genes.gtf` file in the `Annotation/Genes` directory. This GTF
file contains RefSeq transcript IDs and UCSC gene names. The
`Annotation` directory contained a `README.txt` file with the text:

> The contents of the annotation directories were downloaded from UCSC
> on: June 02, 2014.

The `genes.gtf` file was filtered to include only chromosomes
1-22, X, Y, and M.

# Cufflinks

Tophat2 version 2.0.11 was run with the call:

```
tophat -p 20 -o tophat_out/$f genome fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz;
```

Cufflinks version 2.2.1 was run with the call:

```
cuffquant -p 40 -b $GENO -o cufflinks/$f genes.gtf tophat_out/$f/accepted_hits.bam;
```

Cuffnorm was run with the call:

```
cuffnorm genes.gtf -o cufflinks/ \
cufflinks/ERR188297/abundances.cxb \
cufflinks/ERR188088/abundances.cxb \
cufflinks/ERR188329/abundances.cxb \
cufflinks/ERR188288/abundances.cxb \
cufflinks/ERR188021/abundances.cxb \
cufflinks/ERR188356/abundances.cxb 
```

# RSEM

RSEM version 1.2.31 was run with the call:

```
rsem-calculate-expression --num-threads 6 --bowtie2 --paired-end <(zcat fastq/$f\_1.fastq.gz) <(zcat fastq/$f\_2.fastq.gz) index rsem/$f/$f
```

# kallisto

kallisto version 0.43.1 was run with the call:

```
kallisto quant --bias -i index -t 6 -o kallisto/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz
```

For the files in `kallisto_boot` directory, kallisto version 0.43.0
was run, quantifying against the Ensembl transcripts (v87) in
`Homo_sapiens.GRCh38.cdna.all.fa`, using the call:

```
kallisto quant -i index -t 6 -b 5 -o kallisto_0.43.0/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz
```

# Salmon

Salmon version 0.8.2 was run with the call:

```
salmon quant -p 6 --gcBias -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon/$f
```

For the files in the `salmon_gibbs` directory, Salmon version 0.8.1
was run, quantifying against the Ensembl transcripts (v87) in 
`Homo_sapiens.GRCh38.cdna.all.fa`, using the call:

```
salmon quant -p 6 --numGibbsSamples 5 -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon_gibbs/$f
```

For the files in the `salmon_dm` directory (Drosophila melanogaster),
Salmon version 0.10.2 was run (once with only cDNA, once combining
cDNA with non-coding transcripts):

```
salmon quant -l A --gcBias --seqBias --posBias -i Drosophila_melanogaster.BDGP6.cdna.v92_salmon_0.10.2 -o SRR1197474 -1 SRR1197474_1.fastq.gz -2 SRR1197474_2.fastq.gz
```

For the files in the `salmon_ec` directory, 
Salmon version 1.1.0 was run with `--dumpEq` on the files from
Tasic, B., Yao, Z., Graybuck, L.T. *et al.* 
"Shared and distinct transcriptomic cell types across neocortical
areas" (2018)
[doi: 10.1038/s41586-018-0654-5](https://doi.org/10.1038/s41586-018-0654-5)
These files were generated by Jeroen Gilis. The raw data is from:
<https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA476008&o=acc_s%3Aa>

# alevin

Two small examples of `alevin` output (50 cells each) were generated by
Jeroen Gilis. The dataset is a subset from the paper, 
Hagai *et al.* "Gene expression variability across cells and species
shapes innate immunity" (2018) 
[doi: 10.1038/s41586-018-0657-2](https://doi.org/10.1038/s41586-018-0657-2)
Salmon/alevin version 1.6.0 was run, and using the tx2gene data that
is included in the package under `tx2gene_alevin.tsv`.

# oarfish

Three samples were processed with `oarfish` v0.9.0, downloaded from the 
[fastq download page](http://sg-nex-data.s3-website-ap-southeast-1.amazonaws.com/#data/sequencing_data_ont/fastq/) of the [SG-NEx Project](https://registry.opendata.aws/sgnex/). See also this [GitHub page](https://github.com/GoekeLab/sg-nex-data).

The samples were `SGNex_H9_cDNA` replicates 2-4, run 4.

These three samples were quantified with `oarfish` against human GENCODE v48 supplemented with "novel" sequences of length 1,000 bp across chr1-22 (500 per chromosome = 11,000 new transcripts named `novel1`, `novel2`, ...).

Code for the quantification steps are posted to a 
[GitHub repo](https://github.com/mikelove/oarfish4tximportData).

Briefly the following lines were used:

```
oarfish --only-index --annotated gencode.v48.transcripts.fa.gz --novel novel.fa.gz --seq-tech ont-cdna --threads 32 --index-out gencode_plus_novel
oarfish --reads reads/SGNex_H9_cDNA_replicateXYZ_run4.fastq.gz --index gencode_plus_novel --output quants/sgnex_h9_rep2 --seq-tech ont-cdna --filter-group no-filters --threads 32
```

Citation for the SG-NEx Project:

> Chen, Y., Davidson, N.M., Wan, Y.K. et al. A systematic benchmark of Nanopore long-read RNA sequencing for transcript-level analysis in human cell lines. Nat Methods 22, 801–812 (2025). <https://doi.org/10.1038/s41592-025-02623-4>

# Sailfish

Sailfish version 0.9.0 was run with the call:

```
sailfish quant -p 10 --biasCorrect -i sailfish_0.9.0/index -l IU -1 <(zcat fastq/$f\_1.fastq.gz) -2 <(zcat fastq/$f\_2.fastq.gz) -o sailfish_0.9.0/$f
```

# Session info

```{r}
sessionInfo()
```

