%\VignetteIndexEntry{An introduction to SimFFPE}
%\VignetteDepends{}
%\VignetteKeywords{Read Simulation, FFPE tissue, NGS}
%\VignettePackage{SimFFPE}
\documentclass{article}

<<style, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\title{An Introduction to \Biocpkg{SimFFPE}}
\author{Lanying Wei}
\date{Modified: Nov 11, 2020. Compiled: \today}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\tableofcontents

<<options, echo=FALSE>>=
options(width=60)
@
\newcommand{\bam}{\texttt{BAM}}
\newcommand{\fasta}{\texttt{FASTA}}
\newcommand{\phred}{\texttt{Phred score profile}}

\section{Introduction}

The NGS (Next-Generation Sequencing) reads from FFPE (Formalin-Fixed 
Paraffin-Embedded) samples contain numerous artificial chimeric reads, which can 
lead to false positive structural variant calls.  These  ACRs are 
derived from the combination of two single-stranded DNA (ss-DNA) fragments with 
short reverse complementary regions (SRCR). The combined ss-DNA may come from 
adjacent or distant genomic regions. The \Biocpkg{SimFFPE} package simulates 
these artificial reads as well as normal reads for FFPE samples. The simulation 
can cover whole genome, or several chromosomes, or large regions, or whole 
exome, or targeted regions. It also supports enzymatic / random fragmentation 
and paired-end / single-end sequencing simulations.Fine-tuning can be achieved 
by adjusting the parameters, and multi-threading is surported. 

\section{Input}

The essential inputs for the simulation include a \fasta{} file of the reference
genome, a \phred{} matrix to simulate Phred scores based on the position on the 
reads, and a DataFrame or GenomicRanges object representing the target regions 
(optional). The \phred{} can be estimated from existing \bam{} files using 
\Rfunction{calcPhredScoreProfile} function (two available examples for \phred{} 
are stored in the 'extdata' directory of \Biocpkg{SimFFPE} package). 

<<calcPhredScoreProfile>>=
library(SimFFPE)
bamFilePath <- system.file("extdata", "example.bam", package = "SimFFPE")
regionPath <- system.file("extdata", "regionsBam.txt", package = "SimFFPE")
regions <- read.table(regionPath)
PhredScoreProfile <- calcPhredScoreProfile(bamFilePath, targetRegions = regions)

## Example Phred score profile with 100 read length
PhredScoreProfilePath <- system.file("extdata", "PhredScoreProfile1.txt",
                                      package = "SimFFPE")
PhredScoreProfile <- as.matrix(read.table(PhredScoreProfilePath, skip = 1))
colnames(PhredScoreProfile)  <- 
    strsplit(readLines(PhredScoreProfilePath)[1], "\t")[[1]]
#
## Example Phred score profile with 150 read length

PhredScoreProfilePath2 <- system.file("extdata", "PhredScoreProfile2.txt",
                                      package = "SimFFPE")
PhredScoreProfile2 <- as.matrix(read.table(PhredScoreProfilePath2, skip = 1))
colnames(PhredScoreProfile2)  <- 
    strsplit(readLines(PhredScoreProfilePath2)[1], "\t")[[1]]

@

The \fasta{} file of reference genome can be read in as \Rclass{DNAStringSet}
with \Rfunction{readDNAStringSet} function from \Biocpkg{Biostrings} package.
The reference genome example file consists of small regions of 24 chromosomes 
from human hg19 reference genome.
<<readDNAStringSet>>=
referencePath <- system.file("extdata", "example.fasta", package = "SimFFPE")
reference <- readDNAStringSet(referencePath)
reference

@
%%
To simulate reads of certain regions, a DataFrame or GenomicRanges object 
representing the target regions is required (not required when simulating reads 
on the whole genome / several chromosomes / large regions). The DataFrame 
representing the target regions should have three columns, which indicate 
chromosomes, start positions and end positions respectively (one-based 
coordinate).
<<target region table>>=
regionPath <- system.file("extdata", "regionsSim.txt", package = "SimFFPE")
targetRegions <- read.table(regionPath)
@
%%

\section{Simulation}

The simulation includes three steps: 1) Simulate artificial chimeric reads
derived from the combination of two ss-DNA segments from adjacent regions on the
chromosome. 2) Simulate artificial chimeric reads derived from the combination 
of two ss-DNA from distant regions (distant regions on the same chromosome, or 
any regions on different chromosomes). 3) Simulate reads derived from normal 
sequences. You can also skip any of these steps in the simulation. There are two 
functions which can be used for the simulation: \Rfunction{readSimFFPE} and 
\Rfunction{targetReadSimFFPE}.

\subsection{Read simulation functions}

To simulate reads on whole genome, or several chromosomes, or large regions,
please use the \Rfunction{readSimFFPE} function:
<<readSimFFPE>>=
## Simulate reads of the first three sequences of the reference genome

sourceSeq <- reference[1:3]
outFile1 <- paste0(tempdir(), "/sim1")
readSimFFPE(sourceSeq, referencePath, PhredScoreProfile2, outFile1,
            overWrite = TRUE, coverage = 80, readLen = 150,
            enzymeCut = TRUE, threads = 2)
#
## Simulate reads of defined regions on the first two sequences of
## the reference genome

sourceSeq2 <- DNAStringSet(lapply(reference[1:2], function(x) x[1:10000]))
outFile2 <- paste0(tempdir(), "/sim2")
readSimFFPE(sourceSeq2, referencePath, PhredScoreProfile2, outFile2,
            overWrite = TRUE, coverage = 80, readLen = 150, enzymeCut = TRUE)

@
%%

To simulate reads on whole exome or targeted regions please use the
\Rfunction{targetReadSimFFPE} function:
<<targetReadSimFFPE>>=

outFile3 <- paste0(tempdir(), "/sim3")
targetReadSimFFPE(referencePath, PhredScoreProfile, targetRegions, outFile3,
                  coverage = 120, readLen = 100, meanInsertLen = 180, 
                  sdInsertLen = 50, enzymeCut = FALSE)

@
%%
Additional information can be found on the help pages for the
\Rfunction{readSimFFPE} function and the \Rfunction{targetReadSimFFPE} function.

\subsection{Fine-tuning of the simulation}
Fine-tuning of the simulation is achievable by the adjustments of some
parameters of the function \Rfunction{readSimFFPE} and
\Rfunction{targetReadSimFFPE}. You can simulate reads in smaller regions during
fine-tuning to save the runtime. To illustrate the impact of some of these
parameters, screenshots from IGV tools are used (see Figure \ref{fig:1},
\ref{fig:2} and \ref{fig:3}). These parameters include:

1) enzymeCut: Simulate enzymatic fragmentation. With this fragmentation method,
chimeric read pairs with improper pair orientations might be mapped to exactly 
the same location on the reference genome (see Figure \ref{fig:1}).

2) chimericProp: Proportion of artificial chimeric fragments. The higher the 
value, the greater the proportion of improper paired reads as shown in Figure 
\ref{fig:1} and \ref{fig:2}, the smaller the proportion of proper paired reads 
as shown in Figure \ref{fig:3}, and the larger the proportion of proper paired 
reads with soft-clips as shown in Figure \ref{fig:3}.

3) sameChrProp: Proportion of artifact chimeric fragments that are derived from 
the combination of two ss-DNA coming from the same chromosome. The higher the 
value, the greater the proportion of reads with improper pair orientations as
shown in Figure \ref{fig:1}, and the smaller the proportion of reads with their 
mates mapped to different chromosomes as shown in Figure \ref{fig:2}.

4) adjChimProp: Proportion of adjacent ss-DNA combinations among same 
chromosomal ss-DNA combinations. sameChrProp * adjChimProp determine the
proportion of simulated adjcent ss-DNA combinations.

5) sameStrandProb: Proportion of same-strand ss-DNA combinations among adjacent 
ss-DNA combinations. The higher the value, the greater the proportion of reads 
with RR and LL orientations in all reads with improper pair orientations as 
shown in Figure \ref{fig:1}.

6) spikeWidth: The width of chimeric read spike used in the simulation of 
distant ss-DNA combinations. As shown in Figure \ref{fig:3}, some regions 
are enriched in reads with paired reads mapped to other chromosomes, and some 
others are scarce. The lengths of these regions are of similar scale, and the 
parameter "spikeWidth" is used to simulate this length.

7) highNoiseRate and highNoiseProb: The noise rate for each base in noisy reads 
and the proportion of these noisy reads. These very noisy reads as well as less 
noisy  reads are shown in Figure \ref{fig:3}.

\begin{figure*}
  \includegraphics[width=1\textwidth]{./ex1.png}
  \caption{\label{fig:1} Simulated reads with improper pair orientations.
  These read pairs are mapped in RR, LL, or RL orientations. Black-framed reads
  are paired reads that are mapped to the same position (enzymatic
  fragmentation).}
\end{figure*}

\begin{figure*}
  \includegraphics[width=1\textwidth]{./ex2.png}
  \caption{\label{fig:2} Simulated reads with mate reads mapped to different
  chromosomes. The different read colors indicate the different chromosomes that 
  their mate reads mapped to.}
\end{figure*}

\begin{figure*}
  \includegraphics[width=1\textwidth]{./ex3.png}
  \caption{\label{fig:3} Simulated reads mapped in proper pair. Mismatches and
  soft-clips are shown as colored vertical lines in reads.}
\end{figure*}

\appendix
\section{Session info}
<<sessionInfo>>=
packageDescription("SimFFPE")
sessionInfo()
@

\end{document}
