---
title: "Functional Data Analysis of Spatial Metrics"
author: 
  - name: "Martin Emons"
    affiliation:
      - &DMLS Department of Molecular Life Sciences, University of Zurich, Switzerland
      - &SIB SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
    email: "martin.emons@uzh.ch"
  - name: Fabian Scheipl
    affiliation:
      - &LMU Department of Statistics, Ludwig-Maximilians-Universität München, Germany
  - name: Mark D. Robinson
    affiliation:
      - *DMLS
      - *SIB
package: "`r BiocStyle::Biocpkg('spatialFDA')`"
output:
  BiocStyle::html_document
abstract: >
  This vignette shows how to perform a custom spatial analysis with the package `spatialFDA`. 
vignette: >
  %\VignetteIndexEntry{Functional Data Analysis of Spatial Metrics}
  %\VignetteEncoding{UTF-8}  
  %\VignetteEngine{knitr::rmarkdown}
bibliography: spatialFDA.bib
editor_options: 
  chunk_output_type: console
---

```{r v1, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    eval = TRUE
)
```

# Introduction

This vignette demonstrates how to use `r BiocStyle::Biocpkg('spatialFDA')` to perform functional data analysis on spatial statistics metrics. The main aim of this package is to detect differential spatial arrangements within and between cell types given several samples/conditions. It does so by calculating spatial statistics metrics via the `r BiocStyle::CRANpkg('spatstat')` package and comparing differences using functional additive mixed models as implemented in the `r BiocStyle::CRANpkg('refund')` package [@spatstat2005; @refund2024].

The use case is a dataset from @damondMapHumanType2019 which contains images from 12 human donors. The raw data is published under a `CC-BY-4.0` License on [Mendeley](https://data.mendeley.com/datasets/cydmwsfztj/2). 

This package is similar to other packages in `python` and `R`. The following table shows the main differences in terms of functionality [@ali2024graphcompass; @canete2022spicyr; @wrobel2024mxfda]. 

|Package name                          | Foundation | Testing procedure   |
| ------------------------------------ | --------------- | ------------------- |
| `r BiocStyle::Biocpkg('spicyR')`     | $L$-function    | Scalar comparison   |
| `GraphCompass`                       | Graph-based     | Graph and  scalar comparison | 
| `r BiocStyle::CRANpkg('mxfda')`      | $K$-, $G$- and $L$- function | Function as input for survival modelling |
| `r BiocStyle::Biocpkg('spatialFDA')` | most `r BiocStyle::CRANpkg('spatstat')` functions   | Functional comparison over domain |

# Installation

`r BiocStyle::Biocpkg('spatialFDA')` can be installed and loaded from Bioconductor as follows

```{r installation, include = TRUE, eval = FALSE}
if (!requireNamespace("BiocManager")) {
    install.packages("BiocManager")
}
BiocManager::install("spatialFDA")
```

```{r setup, warning = FALSE, message = FALSE}
library("spatialFDA")
library("dplyr")
library("ggplot2")
library("tidyr")
library("stringr")
library("dplyr")
library("patchwork")
library("SpatialExperiment")

set.seed(1234)
```

# Getting started

In this vignette we will analyse a diabetes dataset acquired by imaging mass cytometry (IMC) as acquired by @damondMapHumanType2019. The dataset contains images from 12 human donors, 4 healthy and 8 with type 1 diabetes (T1D). With IMC, 35 markers were measured at single cell resolution [@damondMapHumanType2019].

## Loading the data

The @damondMapHumanType2019 dataset is easily loaded from `ExperimentHub` via a small reader function `.loadExample()`. The entire dataset can be loaded by setting `full = TRUE`. For computational reasons, one can reduce to three patients as well by setting this flag to `FALSE`. We will subset the entire dataset to two samples per condition in order to have a multi-condition/multi-sample setting. The package offers multiple datatypes, we will use the `r BiocStyle::Biocpkg('SpatialExperiment')` (SPE) object [@righelli2022spatialexperiment].

```{r loading, warning = FALSE, message = FALSE}
# retrieve example data from Damond et al. (2019)
spe <- .loadExample(full = TRUE)

spe <- subset(spe, ,patient_id %in% c(6089,6180,6126,6134,6228,6414))
# set cell types as factors
colData(spe)$cell_type <- as.factor(colData(spe)$cell_type) 
```

## Visualising the raw data

We can look at the fields of view (FOVs) of the diabetes dataset. To do so we extract the spatial coordinates, store them as a dataframe and add the colData from the SPE to this. We will look only at the first four FOVs of the healthy sample. We plot both the cell categories of all cells and then the cell types of secretory cells ($\alpha, \beta$ and $\delta$ cells) and T-cells (CD8+ and CD4+ T-cells).

```{r plotting fovs, warning = FALSE, fig.width=8, fig.height=15}
df <- data.frame(spatialCoords(spe), colData(spe))

dfSub <- df %>%
    subset(image_name %in% c("E02", "E03", "E04", "E05"))

p <- ggplot(dfSub, aes(x = cell_x, y = cell_y, color = cell_category)) +
    geom_point(size= 0.5) +
    facet_wrap(~image_name) +
    theme(legend.title.size = 20, legend.text.size = 20) +
    xlab("x") +
    ylab("y") +
    labs(color = "cell category")+
    coord_equal() +
    theme_light()

dfSub <- dfSub %>%
    subset(cell_type %in% c("alpha", "beta", "delta", "Th", "Tc"))

q <- ggplot(dfSub, aes(x = cell_x, y = cell_y, color = cell_type)) +
    geom_point(size= 0.5) +
    facet_wrap(~image_name) +
    theme(legend.title.size = 20, legend.text.size = 20) +
    xlab("x") +
    ylab("y") +
    labs(color = "cell type") +
    coord_equal() +
    theme_light()
wrap_plots(list(p,q), widths = c(1,1), heights = c(1,1), nrow = 2, ncol = 1)
```

# Calculating Spatial Statistics Metrics

In a first step, we calculate a spatial statistic curve as implemented by `r BiocStyle::CRANpkg('spatstat')`. One can choose from a range of metrics for discrete marks and calculate these within a mark or between two marks. Common metrics are:

- Ripley's $K$ function and its variance stabilised form, Besag's $L$

- Pair correlation function $g$

- Nearest-neighbour function $G$

- Empty space function $F$

Note that all of these functions have different implementations to correct for inhomogeneity and for comparison between two marks (cross functions) [@baddeleySpatialPointPatterns].

All spatial statistics metrics are a function of radius $r$. As such, we have to determine the maximum radius sequence to consider. [@baddeleySpatialPointPatterns] recommend to take a maximum radius $r$ that is not larger than $1/4$ of the window length. As the window can be non-convex, we calculate the half of the minimum bounding radius of the observation window as heuristic.

```{r}
p <- rMaxHeuristic(spe,
subsetby = "image_number", marks = "cell_type"
)
p
```

We will consider a rMax of $50$ for analysing shorter-range interactions.

## Correlation

With correlation metrics, we assess the *distances* of all points to one another while normalising for density effects and of the window size $|W|$. Furthermore, spatial metrics are corrected for edge effects, due to the fact that points at the border of a FOV do not have a fully-observed neighborhood [@baddeleySpatialPointPatterns, pp. 203 ff.].  

A well-known metric is Ripley's $K$ function or its variance-stabilised transformation, the $L$ function. We can calculate a variant of the $L$ function with the function `calcMetricPerFov` between e.g $\alpha$ and cytotoxic T cells. The output is a dataframe with the following most important columns:

- `r`: the radius at which the spatial metric is evaluated

- `theo`: the theoretical value of a homogeneous (Poisson) realisation of a point process

- `iso`: an isotropic edge corrected value of the $L$ function


```{r Lfunction, warning = FALSE, message = FALSE}
metricRes <- calcMetricPerFov(spe = spe, selection = c("alpha", "Tc"),
                              subsetby = "image_number", fun = "Lcross", 
                              marks = "cell_type",
                              rSeq = seq(0, 50, length.out = 50), 
                              by = c("patient_stage", "patient_id",
                                     "image_number"),
                              ncores = 1)
metricRes %>% head(3)
```

We can visualise this metric with `plotMetricPerFov` function. Here, we need to specify which border correction we want to plot and what the $x$-axis is. Both can vary from function to function.

```{r plotLfunction, warning = FALSE, fig.width=8, fig.height=8}
# create a unique plotting ID
metricRes$ID <- paste0(
    metricRes$patient_stage, "|", metricRes$patient_id
)
# change levels for plotting
metricRes$ID <- factor(metricRes$ID, levels = c("Non-diabetic|6126",
                                                "Non-diabetic|6134",
                                                "Onset|6228","Onset|6414",
                                                "Long-duration|6089",
                                                "Long-duration|6180"))
# plot metrics
plotMetricPerFov(metricRes, correction = "iso", x = "r",
                 imageId = "image_number", ID = "ID", ncol = 2)
```

By eye, we see no visible difference between the conditions in terms of correlation 
of $\alpha$ and cytotoxic T cells.

## Spacing

Another important aspect of spatial analysis is spacing. Here, the shortest distances or empty space to the next neighbor is calculated. This quantifies a different aspect of a point pattern than correlation or intensity of points. Two well-known functions are [@baddeleySpatialPointPatterns, pp. 255-266]:

- nearest-neighbor distance distribution $G$

- empty space function $F$

For spacing metrics, we get different border corrections but otherwise the output stays the same:

```{r Gfunction, warning = FALSE, message = FALSE}
metricRes <- calcMetricPerFov(spe = spe, selection = c("alpha", "Tc"),
                              subsetby = "image_number", fun = "Gcross", 
                              marks = "cell_type",
                              rSeq = seq(0, 50, length.out = 50), 
                              by = c("patient_stage", "patient_id",
                                     "image_number"),
                              ncores = 1)
metricRes %>% head(3)
```

```{r plotGfunction, warning = FALSE, fig.width=8, fig.height=8}
# create a unique plotting ID
metricRes$ID <- paste0(
    metricRes$patient_stage, "|", metricRes$patient_id
)
# change levels for plotting
metricRes$ID <- factor(metricRes$ID, levels = c("Non-diabetic|6126",
                                                "Non-diabetic|6134",
                                                "Onset|6228","Onset|6414",
                                                "Long-duration|6089",
                                                "Long-duration|6180"))
# plot metrics
plotMetricPerFov(metricRes, correction = "rs", x = "r",
                 imageId = "image_number", ID = "ID", ncol = 2)
```

In the nearest-neighbor distance function, we see a strong difference between onset T1D, long-duration T1D and non-diabetic controls in terms of spacing of $\alpha$ and cytotoxic T cells. 

# Functional boxplot

Looking at raw spatial statistics curves can be challenging. In order to summarise this information, we can plot functional boxplots by aggregating the curves into boxplots via a user-defined variable `aggregate_by`. We use the `fbplot` function from the `r BiocStyle::CRANpkg('fda')` package [@sun2011functional; @ramsay2024fda].

```{r, funcBoxPlot, warning = FALSE, results='hide'}
# create a unique ID per row in the dataframe
metricRes$ID <- paste0(
    metricRes$patient_stage, "x", metricRes$patient_id,
    "x", metricRes$image_number
)
#removing field of views that have as a curve only zeros - these are cases where
#there is no cells of one type
metricRes <- metricRes %>% dplyr::group_by(ID) %>% dplyr::filter(sum(rs) >= 1)

collector <- plotFbPlot(metricRes, "r", "rs", "patient_stage")
```

The functional boxplot shows that onset $G$-curves are more variable than the corresponding long-duration and non-diabetic curves. We note as well, that the variability is heteroscedastic along the domain (i.e., variance increases with radius), which is undesirable for our statistical modelling. Therefore, we can e.g. apply a variance stabilising transformation to our data or model this variance in the statistical model.

```{r, variancetransform, warning = FALSE}
# can determine with a boxcox transformation what is the ideal parameter
# for the transformation use Fisher's variance-stabilising transformation
metricRes$rs <- asin(sqrt(metricRes$rs))

collector <- plotFbPlot(metricRes, 'r', 'rs', 'patient_stage')
```

# Functional principal component analysis

Another analysis that can be performed is functional principal componentent analysis (fPCA). This is a method to capture the main modes of variation in functional data [@ramsayPrincipalComponentsAnalysis2005]. We use the `r BiocStyle::CRANpkg('refund')` implementation of fPCA. 

```{r fPCA, warning = FALSE}
# filter out all rows that have a constant zero part - all r=<10
metricRes <- metricRes %>% filter(r >= 10)

# prepare dataframe from calcMetricRes to be in the correct format for pffr
dat <- prepData(metricRes, "r", "rs", sample_id = "patient_id",
 image_id = "image_number", condition = "patient_stage")

dat$condition <- gsub("-","_", dat$patient_stage) %>% 
  as.factor() %>%
  relevel("Non_diabetic")

# drop rows with NA
dat <- dat |> drop_na()
# calculate the fPCA
pca <- functionalPCA(dat = dat, r = metricRes$r |> unique(), pve = 0.995)
evalues <- pca$evalues
efunctions <- pca$efunctions
# plot the mean curve and the two first eigenfunctions
p_mu <- ggplot(data.frame(r = unique(metricRes$r), mu = pca$mu), 
               aes(x = r, y = mu)) +
    geom_line() +
    theme_light() +
    xlab("r [µm]")

p_efunction1 <- ggplot(data.frame(r = unique(metricRes$r), 
                                  phi1 = pca$efunctions[,1]), 
                       aes(x = r, y = phi1)) +
    geom_line() +
    theme_light() +
    ylim(-0.3,0.3) +
    xlab("r [µm]")

p_efunction2 <- ggplot(data.frame(r = unique(metricRes$r),
                                  phi2 = pca$efunctions[,2]),
                       aes(x = r, y = phi2)) +
    geom_line() +
    theme_light() +
    ylim(-0.3,0.3) +
    xlab("r [µm]")

wrap_plots(list(p_mu, p_efunction1, p_efunction2), ncol = 3)
# plot the biplot of the first two PCs
plotFpca(dat = dat, res = pca, colourby = "condition")
# print the eigenvalues
evalues
```

In the biplot above we get a very basic differentiation of the $G$ curves. Onset T1D shows most variability along the first fPC. The second fPC describes less variation. 

# Functional additive mixed models

The $L$ function above showed no clear difference between the three conditions whereas the $G$ function showed a strong difference between onset T1D and the two other conditions. In order to test these differences we will use generalised functional additive mixed models. These are generalisations of standard additive mixed models to compare functions over their entire domain. The package that we use is the `r BiocStyle::CRANpkg('refund')` package [@scheiplFunctionalAdditiveMixed2015; @scheiplGeneralizedFunctionalAdditive2016; @refund2024].

The model implemented here is of the form:

$$
\mathbb{E}[y_i(r)] = g(\alpha(r) + \beta_{0,g(i)}(r) + \sum_{j=1}^J f_j(X_{ji},r))
$$

With the following terms:

- $y_i(r)$: functional response, here the `r BiocStyle::CRANpkg('spatstat')` curves

- $g$: optional link function 

- $\alpha(r)$: global functional intercept varying over the domain $r$

- $\beta_{0,g(i)}(r)$: random functional intercept varying over the domain $r$ per grouping variable $g(i)$.

- $f_j(X_{ji},r)$: additive predictors

For the family we will use a scaled $t$-distribution with a logarithmic link function. This distribution has a positive support, thereby modelling the strictly positive response of the spatial statistics functions. 

In this context we need to specify a design matrix and contrasts. For the functional random intercepts we define a smooth function for each patient ID as implemented in `refund` [@refund2024].

```{r funcGamG, fig.height=10, warning = FALSE}
library('refund')
# create a design matrix
mm <- model.matrix(~condition, data = dat)
mm %>% head()

r <- metricRes$r |> unique()
# fit the model
mdl <- functionalGam(
    data = dat, x = r,
    designmat = mm, weights = dat$npoints,
    formula = formula(Y ~ 1 + conditionLong_duration +
                          conditionOnset + s(patient_id, bs = "re")),
    family = mgcv::scat(link = "log"),
    algorithm = "bam"
)
summary(mdl)

plotLs <- lapply(colnames(mm), plotMdl, mdl = mdl,
                 shift = mdl$coefficients[["(Intercept)"]])
wrap_plots(plotLs, nrow = 3, axes = 'collect')
```


We note that there is a small non-significant difference in the $G$ function between non-diabetic and long-duration T1D samples, but a strong difference between non-diabetic and onset T1D according to the model summary. The point wise confidence bands are a limitation of this method and could be improved with either bootstrapping or continuous confidence bands [@liebl2023fast]. Thus, we see not only that a spatial difference in co-localisation of $\alpha$ and cytotoxic T cells is statistically significant but also at which spatial scale this difference occurs.

## Model evaluation

One open problem is the implementation of confidence bands that reflect the non-independently and non-identically distributed residuals. To visualise how much of a problem this is, we can plot the contours of the correlation/covariance and look at model diagnostics.

```{r contour, warning = FALSE}
resid(mdl) |> cor() |> filled.contour(levels = seq(-1, 1, l = 40))
resid(mdl) |> cov() |> filled.contour()

qqnorm(resid(mdl), pch = 16)
qqline(resid(mdl))
```

In these model diagnostics, we note that there is still some variability in the residuals that is not considered by the model. The Q-Q plot indicates a good but not perfect model fit. The residuals show a considerable structure that is in line with the structure in the auto-covariance / correlation plots.

In the functional additive mixed model, we have specified global intercept varying over the domain $r$ as well as functional random intercepts varying over the domain $r$ per grouping variable `patient_id`. We can plot these smooth estimates of the random intercepts.

```{r intercept, warning = FALSE, eval = TRUE}
# look at the smooth random intercepts per patient
data <- coef(mdl)$smterms$`s(patient_id)`$coef

data <- data %>% left_join(dat %>% 
                             select(patient_id, condition) %>% unique)

p <- ggplot(data, aes(x = x.vec, y = value, colour = condition, group = patient_id)) +
  geom_line() +
  theme_light() + 
  geom_smooth(aes(group = 1), col = 'black') +
  xlab("r [µm]")

p
```

We note that these random errors are not constrained to sum-to-zero over the domain $r$. This can lead to problems of identifiability between the global intercept and the functional random intercepts. 

```{r sessionInfo}
sessionInfo()
```
