---
title: "Likelihood Calculations for vsn"
package: vsn
output: 
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{Likelihood Calculations for vsn}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: vsn.bib
nocite: |
  @Huber:2003:SAGMB
---

## Introduction

This vignette contains the computations that underlie the
numerical code of `r BiocStyle::Biocpkg("vsn")`.  If you are a new user and looking for an
introduction on how to **use** `r BiocStyle::Biocpkg("vsn")`, please refer to the
vignette **Robust calibration and variance stabilization with
`r BiocStyle::Biocpkg("vsn")`**, which is provided separately.

## Setup and Notation

Consider the model:

\begin{equation}
\text{arsinh}\left(f(b_i)\cdot y_{ki}+a_i\right) = \mu_k + \varepsilon_{ki}
(\#eq:model)
\end{equation}

where $\mu_k$, for $k=1,\ldots,n$, and $a_i$, $b_i$, for $i=1,\ldots,d$
are real-valued parameters, $f$ is a function $\mathbb{R}\to\mathbb{R}$ 
(see below), 
and $\varepsilon_{ki}$ are i.i.d.\ Normal with mean 0 and variance $\sigma^2$. 
$y_{ki}$ are the data. In applications to $\mu$array data, $k$ indexes the 
features and $i$ the arrays and/or colour channels.

Examples for $f$ are $f(b)=b$ and $f(b)=e^b$. The former is the most
obvious choice; in that case we will usually need to require $b_i>0$.
The choice $f(b)=e^b$ assures that the factor in front of $y_{ki}$ is
positive for all $b\in\mathbb{R}$, and as it turns out, simplifies
some of the computations.

In the following calculations, I will also use the notation

\begin{align}
Y \equiv Y(y,a,b) &=              f(b)\cdot y+a\\
h \equiv h(y,a,b) &= \text{arsinh}\left(f(b)\cdot y+a\right).
\end{align}

The probability of the data $(y_{ki})_{k=1\ldots n,\;i=1\ldots d}$ 
lying in a certain volume element of $y$-space
(hyperrectangle with sides $[y_{ki}^\alpha,y_{ki}^\beta]$) is 

\begin{equation}
P=\prod_{k=1}^n\prod_{i=1}^d
\int\limits_{y_{ki}^\alpha}^{y_{ki}^\beta} dy_{ki}\;\;
p_{\text{Normal}}(h(y_{ki}),\mu_k,\sigma^2)\;\;
\frac{dh}{dy}(y_{ki}),
\end{equation}
where $\mu_k$ is the expectation value for feature $k$ and
$\sigma^2$ the variance.

With

\begin{equation}
p_{\text{Normal}}(x,\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}
\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\end{equation}

the likelihood is

\begin{equation}
L=\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{nd}
\prod_{k=1}^n \prod_{i=1}^d
\exp\left(-\frac{(h(y_{ki})-\mu_k)^2}{2\sigma^2}\right)
\cdot\frac{dh}{dy}(y_{ki})\,.
(\#eq:likelihood)
\end{equation}

For the following, I will need the derivatives

\begin{align}
\frac{\partial Y}{\partial a}&=1\\
\frac{\partial Y}{\partial b}&=y\cdot f'(b)\\
\frac{dh}{dy}&=
\frac{f(b)}{\sqrt{1+(f(b)y+a)^2}}=
\frac{f(b)}{\sqrt{1+Y^2}},\\
\frac{\partial h}{\partial a}&=\frac{1}{\sqrt{1+Y^2}},\\
\frac{\partial h}{\partial b}&=\frac{y}{\sqrt{1+Y^2}}\cdot f'(b).
\end{align}

Note that for $f(b)=b$, we have $f'(b)=1$,
and for $f(b)=e^b$,  $f'(b)=f(b)=e^b$.


## Likelihood for Incremental Normalization{#sec:inc}

Here, *incremental normalization* means that the model
parameters $\mu_1,\ldots,\mu_n$ and $\sigma^2$ are already known from a
fit to a previous set of $\mu$arrays, i.\,e.\ a set of reference
arrays. See Section \@ref(sec:prof) for the profile likelihood approach
that is used if $\mu_1,\ldots,\mu_n$ and $\sigma^2$ are not known and
need to be estimated from the same data. 
Versions $\ge2.0$ of the `r BiocStyle::Biocpkg("vsn")` package implement both of 
these approaches; 
in versions $1.X$ only the profile likelihood approach was implemented, 
and it was described in the initial publication [@Huber:2002:Bioinformatics].

First, let us note that the likelihood \@ref(eq:likelihood)
is simply a product of independent terms for different $i$. 
We can optimize the parameters $(a_i,b_i)$ 
separately for each $i=1,\ldots,d$.
From the likelihood \@ref(eq:likelihood) we get the 
$i$-th negative log-likelihood

\begin{align}
-\log(L) &=\sum_{i=1}^d -LL_i\\
-LL_i&=\frac{n}{2}\log\left(2\pi\sigma^2\right)+
\sum_{k=1}^n \left(\frac{(h(y_{ki})-\mu_k)^2}{2\sigma^2}
+\log\frac{\sqrt{1+Y_{ki}^2}}{f(b_i)}\right)\\
&=\frac{n}{2}\log\left(2\pi\sigma^2\right)
-n\log f(b_i)
+\sum_{k=1}^n\left(\frac{(h(y_{ki})-\mu_k)^2}{2\sigma^2}
+\frac{1}{2}\log\left(1+Y_{ki}^2\right)\right)
(\#eq:nll)
\end{align}

This is what we want to optimize as a function of $a_i$ and $b_i$. 
The optimizer benefits from the derivatives. 
The  derivative with respect to $a_i$ is

\begin{align}
\frac{\partial}{\partial a_i}(-LL_i) &=
\sum_{k=1}^n \left( \frac{h(y_{ki})-\mu_k}{\sigma^2}
+\frac{Y_{ki}}{\sqrt{1+Y_{ki}^2}} \right)
\cdot\frac{1}{\sqrt{1+Y_{ki}^2}}
\nonumber\\
&= \sum_{k=1}^n 
\left(\frac{r_{ki}}{\sigma^2}+A_{ki}Y_{ki}\right)A_{ki}
(\#eq:ddanll)
\end{align}

and with respect to $b_i$

\begin{align}
\frac{\partial}{\partial b_i}(-LL_i) &=
-n\frac{f'(b_i)}{f(b_i)} 
+\sum_{k=1}^n \left( \frac{h(y_{ki})-\mu_k}{\sigma^2}
+\frac{Y_{ki}}{\sqrt{1+Y_{ki}^2}}\right)
\cdot\frac{y_{ki}}{\sqrt{1+Y_{ki}^2}}\cdot f'(b_i)
\nonumber\\
&=
-n\frac{f'(b_i)}{f(b_i)} 
+f'(b_i)\sum_{k=1}^n \left(\frac{r_{ki}}{\sigma^2}+A_{ki}Y_{ki}\right)
A_{ki}y_{ki}
(\#eq:ddbnll)
\end{align}

Here, I have introduced the following shorthand notation for the 
"intermediate results" terms

\begin{align}
r_{ki}&= h(y_{ki})-\mu_k\\
A_{ki}&=\frac{1}{\sqrt{1+Y_{ki}^2}}.
\end{align}

Variables for these intermediate values are also used 
in the C code to organise
the computations of the gradient.
 

## Profile Likelihood{#sec:prof}

If $\mu_1,\ldots,\mu_n$ and $\sigma^2$ are not already known,
we can plug in their maximum likelihood estimates, obtained 
from optimizing $LL$ for $\mu_1,\ldots,\mu_n$ and $\sigma^2$:

\begin{align}
\hat{\mu}_k &= \frac{1}{d}\sum_{j=1}^d h(y_{kj})
(\#eq:muhat)\\
\hat{\sigma}^2 &= \frac{1}{nd}\sum_{k=1}^n\sum_{j=1}^d 
(h(y_{kj})-\hat{\mu}_k)^2
(\#eq:sigmahat)
\end{align}

into the negative log-likelihood. 
The result is called the negative profile log-likelihood

\begin{equation}
-PLL=
 \frac{nd}{2}\log\left(2\pi\hat{\sigma}^2\right)
+\frac{nd}{2}
-n\sum_{j=1}^d\log f(b_j)
+\frac{1}{2}\sum_{k=1}^n\sum_{j=1}^d \log\sqrt{1+Y_{kj}^2}.
(\#eq:npll)
\end{equation}

Note that this no longer decomposes into a sum of terms for each $j$
that are independent of each other -- the terms for different $j$
are coupled through Equations \@ref(eq:muhat) and \@ref(eq:sigmahat).
We need the following derivatives.

\begin{align}
\frac{\partial \hat{\sigma}^2}{\partial a_i} &=
\frac{2}{nd}\sum_{k=1}^n
r_{ki}\frac{\partial h(y_{ki})}{\partial a_i}\nonumber\\
&=
\frac{2}{nd}
\sum_{k=1}^n r_{ki}A_{ki}\\
\frac{\partial \hat{\sigma}^2}{\partial b_i} &=
\frac{2}{nd}\cdot f'(b_i)
\sum_{k=1}^n r_{ki}A_{ki}y_{ki}
\end{align}

So, finally

\begin{align}
\frac{\partial}{\partial a_i}(-PLL) &=
\frac{nd}{2\hat{\sigma}^2}\cdot 
\frac{\partial \hat{\sigma}^2}{\partial a_i} 
+\sum_{k=1}^n A_{ki}^2Y_{ki}\nonumber\\
&=\sum_{k=1}^n 
\left(\frac{r_{ki}}{\hat{\sigma}^2}+A_{ki}Y_{ki}\right)A_{ki}
(\#eq:ddanpll)\\
\frac{\partial}{\partial b_i}(-PLL) &=
-n\frac{f'(b_i)}{f(b_i)} 
+ f'(b_i) \sum_{k=1}^n 
\left(\frac{r_{ki}}{\hat{\sigma}^2}+A_{ki}Y_{ki}\right)A_{ki}y_{ki}
(\#eq:ddbnpll)
\end{align}

## Summary{#sec:ggu}

Likelihoods, from Equations \@ref(eq:nll) and \@ref(eq:npll):

\begin{align}
-LL_i&=
\underbrace{%
  \frac{n}{2}\log\left(2\pi\sigma^2\right)
}_{\mbox{scale}} +
\underbrace{%
  \sum_{k=1}^n \frac{(h(y_{ki})-\mu_k)^2}{2\sigma^2}
}_{\mbox{residuals}} 
\underbrace{%
  -n\log f(b_i) + 
  \frac{1}{2}\sum_{k=1}^n \log(1+Y_{ki}^2)
}_{\mbox{jacobian}}\\
-PLL&=
\underbrace{%
\frac{nd}{2}\log\left(2\pi\hat{\sigma}^2\right)
}_{\mbox{scale}}+
\underbrace{%
\frac{nd}{2}
}_{\mbox{residuals}} +
\underbrace{%  
  \sum_{i=1}^d\left(
  -n\log f(b_i) + 
  \frac{1}{2}\sum_{k=1}^n \log(1+Y_{ki}^2)\right)
}_{\mbox{jacobian}}
\end{align}

The computations in the C code are organised into steps for computing the
terms "scale", "residuals" and "jacobian".

Partial derivatives with respect to $a_i$,
from Equations \@ref(eq:ddanll) and \@ref(eq:ddanpll):

\begin{align}
\frac{\partial}{\partial a_i}(-LL_i) &=
\sum_{k=1}^n 
\left(\frac{r_{ki}}{\sigma^2}+A_{ki}Y_{ki}\right)A_{ki}\\
%
\frac{\partial}{\partial a_i}(-PLL) &=
\sum_{k=1}^n 
\left(\frac{r_{ki}}{\hat{\sigma}^2}+A_{ki}Y_{ki}\right)A_{ki}
\end{align}

Partial derivatives with respect to $b_i$,
from Equations \@ref(eq:ddbnll) and \@ref(eq:ddbnpll):

\begin{align}
\frac{\partial}{\partial b_i}(-LL_i) &=
-n\frac{f'(b_i)}{f(b_i)}
+f'(b_i)\sum_{k=1}^n 
\left(\frac{r_{ki}}{\sigma^2}+A_{ki}Y_{ki}\right)A_{ki}y_{ki}\\
%
\frac{\partial}{\partial b_i}(-PLL) &=
-n\frac{f'(b_i)}{f(b_i)}
+f'(b_i)\sum_{k=1}^n 
\left(\frac{r_{ki}}{\hat{\sigma}^2}+A_{ki}Y_{ki}\right)A_{ki}y_{ki}.
\end{align}

Note that the terms have many similarities -- this
is used in the implementation in the C code.

## References
