% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vcf2dist.R
\name{vcf2dist}
\alias{vcf2dist}
\title{Calculate distances between samples of a VCF file}
\usage{
vcf2dist(
  inputFile,
  outputFile = NULL,
  threads = 2,
  compress = FALSE,
  verbose = FALSE
)
}
\arguments{
\item{inputFile}{Path to the input VCF file (uncompressed or gzip-compressed).}

\item{outputFile}{Optional path to the output distances file.}

\item{threads}{Number of Java threads to use.}

\item{compress}{Logical. If TRUE, compresses the output file (adds a \code{.gz} extension).}

\item{verbose}{Logical. If TRUE, enables verbose output from the Java backend.}
}
\value{
A \code{\link[stats]{dist}} object holding the calculated distances.
}
\description{
This function calculates a cosine-type dissimilarity measure between the
\code{n} samples of a VCF file.
}
\details{
Biallelic or multiallelic (maximum 7 alternate alleles) SNP and/or INDEL
variants are considered, whether phased or not. Some examples of VCF genotype
encodings are:

    \itemize{
        \item heterozygous variants: \code{1/0} or \code{0/1} or \code{0/2}
        or \code{1|0} or \code{0|1} or \code{0|2}
        \item variants homozygous for the reference allele: \code{0/0}
        or \code{0|0}
        \item variants homozygous for the first alternate allele: \code{1/1}
        or \code{1|1}
    }

If there are \code{n} samples and \code{m} variants, an \code{n x n}
symmetric distance matrix with a zero diagonal is calculated.
The cosine-type distance \code{(1 - cosine_similarity) / 2} lies in the range
[0,1]: a value of 0 means completely identical samples (cosine is 1),
0.5 means perpendicular samples (cosine is 0),
and 1 means completely opposite samples (cosine is -1).
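
As a sketch of the formula above, assuming each sample is represented by a
numeric vector over the \code{m} variants (the exact genotype encoding is
determined by the Java backend and is not specified here), the distance between
two samples \code{x} and \code{y} can be written as:
\deqn{d(x, y) = \frac{1}{2}\left(1 - \frac{x \cdot y}{\|x\| \, \|y\|}\right)}{d(x, y) = (1 - (x . y) / (|x| |y|)) / 2}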

The calculation is performed by a Java backend
that supports multi-core CPU utilization
and can be demanding in terms of memory.
By default, the JVM is launched with a maximum memory allocation of 512 MB.
When this amount is not sufficient,
additional memory must be reserved
before loading the package,
by setting the \code{java.parameters} option.
For example, to allocate 4 GB of RAM,
issue \code{options(java.parameters = "-Xmx4g")}
before \code{library(fastreeR)}.

The output file, if provided, contains \code{n+1} lines.
The first line contains the number of samples \code{n}
and the number of variants \code{m}, separated by a space.
Each of the subsequent \code{n} lines contains \code{n+1} values
separated by spaces.
The first value on each line is a sample name
and the remaining \code{n} values
are the calculated distances from this sample to all samples.
Example output file for the distances between 3 samples
calculated from 1000 variants:
\tabular{llll}{
    3 \tab 1000 \tab \tab \cr
    Sample1 \tab 0.0 \tab 0.5 \tab 0.2\cr
    Sample2 \tab 0.5 \tab 0.0 \tab 0.9\cr
    Sample3 \tab 0.2 \tab 0.9 \tab 0.0\cr
}
}
\examples{
my.dist <- vcf2dist(
    inputFile = system.file("extdata", "samples.vcf.gz",
        package = "fastreeR"
    )
)
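
# A minimal sketch (not part of the fastreeR API): read a distances file in the
# layout described in Details back into a dist object, using base R only.
# The file written below is the illustrative 3-sample example from Details,
# and read_dist_file is a hypothetical helper name chosen for this example.
read_dist_file <- function(path) {
    # First line: number of samples n and number of variants m.
    header <- scan(path, what = integer(), nlines = 1, quiet = TRUE)
    n <- header[1]
    # Next n lines: a sample name followed by n distances.
    tab <- read.table(path,
        skip = 1, nrows = n, header = FALSE,
        stringsAsFactors = FALSE
    )
    mat <- as.matrix(tab[, -1])
    dimnames(mat) <- list(tab[[1]], tab[[1]])
    as.dist(mat)
}

example.file <- tempfile(fileext = ".dist")
writeLines(
    c(
        "3 1000",
        "Sample1 0.0 0.5 0.2",
        "Sample2 0.5 0.0 0.9",
        "Sample3 0.2 0.9 0.0"
    ),
    example.file
)
example.dist <- read_dist_file(example.file)
unlink(example.file)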
}
\references{
Java implementation:
\url{https://github.com/gkanogiannis/BioInfoJava-Utils}
}
\author{
Anestis Gkanogiannis, \email{anestis@gkanogiannis.com}
}
