% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/permutation.R
\name{runCOCOA}
\alias{runCOCOA}
\title{Run COCOA: quantify inter-sample variation, score region sets}
\usage{
runCOCOA(
  genomicSignal,
  signalCoord,
  GRList,
  signalCol,
  targetVar,
  sampleOrder = 1:nrow(targetVar),
  variationMetric = "cor",
  scoringMetric = "default",
  verbose = TRUE,
  absVal = TRUE,
  olList = NULL,
  pOlapList = NULL,
  centerGenomicSignal = TRUE,
  centerTargetVar = TRUE,
  returnCovInfo = TRUE
)
}
\arguments{
\item{genomicSignal}{Matrix/data.frame. 
The genomic signal (e.g. DNA methylation levels)
Columns of genomicSignal should be samples/patients. 
Rows should be individual signal/features
(each row corresponds to one genomic coordinate/range)}

\item{signalCoord}{A GRanges object or data frame with coordinates 
for the genomic signal/original epigenetic data. 
Coordinates should be in the 
same order as the original data and the feature contribution scores 
(each item/row in signalCoord
corresponds to a row in genomicSignal). If a data.frame, 
must have chr and start columns (optionally can have end column, 
depending on the epigenetic data type).}

\item{GRList}{GRangesList object. Each list item is 
a distinct region set to test (region set: regions that correspond to 
the same biological annotation). The region set database
must be from the same reference genome
as the coordinates for the actual data/samples (signalCoord).}

\item{signalCol}{A character vector with the names of the sample variables
of interest/target variables (e.g. PCs or sample phenotypes). 

These are the columns in `targetVar` for which to calculate
the variation related to the epigenetic data
(e.g. correlation) and then to run COCOA on.}

\item{targetVar}{Matrix or data.frame. Rows should be samples. 
Columns should be the target variables 
(whatever variable you want to test for association with
the epigenetic signal: e.g. PC scores).}

\item{sampleOrder}{numeric. A vector of sample indices whose length is
the number of samples. If sampleOrder is 1:(number of samples), 
the default, then this function will return the real COCOA scores.
To generate random COCOA scores in order to make 
null distributions, shuffle the samples in a random order.
E.g. sampleOrder = sample(1:ncol(genomicSignal), ncol(genomicSignal))
where ncol(genomicSignal) is the number of samples. 
Set the seed with set.seed() before making sampleOrder to ensure reproducibility.}

\item{variationMetric}{Character. The metric to use to quantify the
association between each feature in genomicSignal and each target
variable in targetVar.
Either "cor" (Pearson correlation), 
"cov" (covariance), or "spearmanCor" (Spearman correlation).}

\item{scoringMetric}{A character object with the scoring metric.
There are different methods available for 
signalCoordType="singleBase" vs  signalCoordType="multiBase".
For "singleBase", the available methods are "regionMean", 
"regionMedian", "simpleMean", and "simpleMedian". 
The default method is "regionMean".
For "multiBase", the methods are "proportionWeightedMean", 
"simpleMean", and "simpleMedian". The default is "proportionWeightedMean".
"regionMean" is a weighted
average of the signal, weighted by region (absolute value of signal 
if absVal=TRUE). First the signal is
averaged within each regionSet region, 
then the region averages are averaged across all regions. With the
"regionMean" method, interpret scores cautiously for
region sets with few regions that overlap signalCoord. The
"regionMedian" method is the same as "regionMean" but the median is taken
at each step instead of the mean.
The "simpleMean"
method is just the unweighted average of all (absolute) signal values that
overlap the given region set. For multiBase data, this includes
signal regions that overlap a regionSet region at all (1 base
overlap or more) and the signal for each overlapping region is
given the same weight for the average regardless of how much it overlaps.
The "simpleMedian" method is the same as "simpleMean" but takes the median 
instead of the mean. 
"proportionWeightedMean" is a weighted average of all signalCoord 
regions that overlap with regionSet regions. For each signalCoord region
that overlaps with a regionSet region, we calculate what proportion
of the regionSet region is covered. Then this proportion is used to
weight the signal value when calculating the mean. 
The denominator of the mean
is the sum of all the proportion overlaps.}

\item{verbose}{Logical. Whether to show the progress 
of the function. One
bar indicates that one region set has been completed.}

\item{absVal}{Logical. If TRUE, take the absolute value of values in
signal. Choose TRUE if you think some 
genomic loci in a region set may increase while others
decrease (if there may be anticorrelation between
regions in a region set). Choose FALSE if you expect regions in a 
given region set to all change in the same direction (all be positively
correlated with each other).}

\item{olList}{list. Each list item should be a "SortedByQueryHits" object 
(output of findOverlaps function). Each hits object should have the overlap
information between signalCoord and one item of GRList (one unique region set).
The region sets from GRList must be the "subject" in findOverlaps 
and signalCoord must be the "query". E.g. findOverlaps(subject=regionSet,
query=signalCoord).
Providing this information can greatly improve permutation speed since the 
overlaps will not have to be calculated for each permutation. 
The "runCOCOAPerm" function calculates this information only once, internally,
so this does not have to be provided when using that function. When using 
this parameter, signalCoord, 
genomicSignal, and each region set must be in the same order as they were
when olList was created. Otherwise, the wrong genomic loci will be referenced
(e.g. if epigenetic features were filtered out of genomicSignal after olList
was created).}

\item{pOlapList}{list. This parameter is only used if the scoring metric is
"proportionWeightedMean" and olList is also provided as an argument. Each
item of the list should be a vector that contains the proportion overlap 
between signalCoord and regions from one region set (one item of GRList).
Specifically, each value should be the proportion of the region set region 
that is overlapped
by a signalCoord region.
The proportion overlap values should be in the same order as the overlaps
given by olList for the corresponding region set.}

\item{centerGenomicSignal}{Logical. Should rows in genomicSignal
be centered based on
their means? (subtracting row mean from each row)}

\item{centerTargetVar}{Logical. Should columns in targetVar be 
centered based
on their means? (subtract column mean from each column)}

\item{returnCovInfo}{logical. If TRUE, the following coverage and 
region set info will
be calculated and included in function output: regionSetCoverage, 
signalCoverage, totalRegionNumber, and meanRegionSize. For the
proportionWeightedMean scoring method, 
sumProportionOverlap will also be calculated.}
}
\value{
data.frame. The output of aggregateSignalGRList for one permutation.
}
\description{
This is a convenience function that does the two steps of COCOA: 
quantifying the epigenetic variation and scoring the region sets. 
This function will return the real COCOA scores when the default
`sampleOrder` value is used. This
function also makes it easy to generate null distributions in order to
evaluate the statistical significance of the real COCOA results.
You can use the sampleOrder parameter to shuffle the samples,
then run COCOA to get random scores for each region set. By doing 
this many times, you can build a null distribution for each 
region set composed of the region set's random scores from each
permutation. There are multiple options for quantifying the
epigenetic variation, specified by the `variationMetric` parameter.
Quantifying the variation for the real/non-permuted COCOA 
scores should be done with the same 
variation metric as is used for the random permutations. For an
unsupervised analysis using dimensionality reduction, first, the
dimensionality reduction is done outside `runCOCOA`, then the
latent factors/principal components are input to `runCOCOA` as the
sample labels (targetVar parameter) when calculating both the real and 
the permuted region set scores. For a supervised analysis, 
the target variables/phenotypes are the targetVar.
See the vignettes for examples.
}
\examples{
data("esr1_chr1")
data("nrf1_chr1")
data("brcaMethylData1")
data("brcaMCoord1")
pcScores <- prcomp(t(brcaMethylData1))$x
targetVarCols <- c("PC1", "PC2")
targetVar <- pcScores[, targetVarCols]

# give the actual order of samples to `runCOCOA` to get the real scores
correctSampleOrder <- 1:nrow(targetVar)
realRSScores <- runCOCOA(genomicSignal=brcaMethylData1,
                        signalCoord=brcaMCoord1,
                        GRList=GRangesList(esr1_chr1, nrf1_chr1),
                        signalCol=c("PC1", "PC2"),
                        targetVar=targetVar,
                        sampleOrder=correctSampleOrder,
                        variationMetric="cor")
realRSScores
        
# give random order of samples to get random COCOA scores 
# so you can start building a null distribution for each region set 
# (see vignette for example of building a null distribution with `runCOCOA`)
set.seed(100) # set the seed before shuffling so the order is reproducible
randomOrder <- sample(1:nrow(targetVar), 
                      size=nrow(targetVar),
                      replace=FALSE)
randomRSScores <- runCOCOA(genomicSignal=brcaMethylData1,
                          signalCoord=brcaMCoord1,
                          GRList=GRangesList(esr1_chr1, nrf1_chr1),
                          signalCol=c("PC1", "PC2"),
                          targetVar=targetVar,
                          sampleOrder=randomOrder,
                          variationMetric="cor")
randomRSScores
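
# A minimal sketch (not a definitive procedure) of repeating the shuffled run
# to collect several random scores per region set, as described above.
# `nPerm`, `permScores` and `nullPC1` are hypothetical names used only here,
# and 5 permutations is far too few for a real null distribution.
set.seed(100)
nPerm <- 5
permScores <- lapply(seq_len(nPerm), function(i) {
    runCOCOA(genomicSignal=brcaMethylData1,
             signalCoord=brcaMCoord1,
             GRList=GRangesList(esr1_chr1, nrf1_chr1),
             signalCol=c("PC1", "PC2"),
             targetVar=targetVar,
             sampleOrder=sample(seq_len(nrow(targetVar))),
             variationMetric="cor")
})
# Assumption: each result has one score column per name in signalCol.
# Rows of nullPC1 are region sets, columns are permutations (PC1 scores).
nullPC1 <- sapply(permScores, function(x) x[["PC1"]])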
}
