---
title: "Advanced Exploration of the SomaScan Menu"
package: "`r BiocStyle::pkg_ver('SomaScan.db')`"
output: 
  BiocStyle::html_document:
      toc_float: true
vignette: >
  %\VignetteIndexEntry{Advanced Exploration of the SomaScan Menu}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(BiocStyle)
library(GO.db)
library(KEGGREST)
library(org.Hs.eg.db)
library(SomaScan.db)
library(withr)
```


# Introduction

This vignette is a follow up to the "Introduction to SomaScan.db" vignette,
and will introduce more advanced capabilities of the `SomaScan.db` 
package. Below we illustrate how `SomaScan.db` can be used to deeply 
explore the SomaScan menu and execute complex annotation functions, outside of 
the basic use of `select` outlined in the introductory vignette. Knowledge of
SQL is not required, but a familiarity with R and SomaScan data is highly 
suggested. For an introduction to the `SomaScan.db` package and its 
methods, please see `vignette("SomaScan-db", "SomaScan.db")`.

Please note that this vignette will require the installation and usage of 
*three* additional Bioconductor R packages: `r Biocpkg("GO.db")`, 
`r Biocpkg("EnsDb.Hsapiens.v75")`, and `r Biocpkg("KEGGREST")`. 
Please see the linked pages to find installation instructions for these 
packages.

```{r load-pkgs, warning = FALSE, message = FALSE}
library(GO.db)
library(KEGGREST)
library(org.Hs.eg.db)
library(SomaScan.db)
library(withr)
```


# Incorporating Additional Annotations

## Gene Ontology

The `SomaScan.db` package allows a user to retrieve Gene Ontology (GO) 
identifiers associated with a particular SomaScan `SeqId` (or set of 
`SeqIds`). However, the available GO annotations in `SomaScan.db` are 
limited; only the GO ID, evidence code, and ontology category are currently 
available. This helps prevent the package from accumulating an overwhelming 
number of annotation elements, but limits the ability to extract detailed GO 
information. 

To illustrate this limitation, below we will display the GO terms associated 
with the gene "IL31":

```{r il31-select-df}
il31_go <- select(SomaScan.db, keys = "IL31", keytype = "SYMBOL", 
                  columns = c("PROBEID", "GO"))

il31_go
```

In this data frame, IL31 maps to one single `SeqId` ("10455-196"), indicated 
by the "PROBEID" column. This `SeqId` and gene are associated with seven 
unique GO IDs (in the "GO" column). The GO knowledgebase is vast, however, 
and these identifiers are not particularly informative for anyone who 
hasn't memorized their more descriptive term names. Additional details for 
each ID would make this table more informative and interpretable. Luckily, 
there are two options for retrieving such data:

  1. Using an additional set of GO-specific methods to retrieve additional 
     information for each GO ID (`Term`, `Ontology`, `Definition`, and 
     `Synonym`)
  2. Linking results from `SomaScan.db` with another Bioconductor tool, like 
     the `r Biocpkg("GO.db")` annotation package

Each of these techniques have their own special utility. Below, we will work 
through examples of how the techniques described above can be used to link
GO information with the annotations from `SomaScan.db`.

---------------------


### GO Methods

The `Term`, `Ontology`, `Definition`, and `Synonym` methods are GO-specific 
methods imported from the `AnnotationDbi` package. They are designed to 
retrieve a single piece of information, indicated by the method name, that 
corresponds to a set of GO identifiers (note: we will skip `Ontology` in this 
vignette, as the GO Ontology is already retrievable with `SomaScan.db`).

The `Term` method retrieves a character string defining the role of the gene 
product that corresponds to provided GO ID(s). In the example below, we will 
retrieve the GO terms for each of the GO IDs in the `select` results generated 
previously:

```{r go-term}
Term(il31_go$GO)
```

The `Definition` method retrieves a more detailed and extended definition of 
the ontology for the input GO IDs:

```{r go-definition}
Definition(il31_go$GO)
```

And finally, the `Synonym` method can be used to retrieve other ontology terms 
that are considered to be synonymous to the primary term attached to the GO 
ID. For example, "type I programmed cell death" is considered synonymous with 
"apoptosis". It's worth noting that `Synonym` can return a large set
of results, so we caution against providing a large set of GO IDs to `Synonym`:

```{r go-synonym}
Synonym(il31_go$GO)
```

A GO synonym was not found for the first identifier in the provided vector,
so an `NA` was returned.

These functions are useful for quickly retrieving information for a given GO 
ID, but you'll notice that the results are returned as a vector or list, 
rather than a data frame. Depending on the application, this may be 
useful - for example, these methods are handy for on-the-fly GO 
term or definition lookups, but their format can be cumbersome to incorporate 
into a data frame created by `select`. 

Let's return to the `il31_go` data frame we generated previously. How can we 
incorporate the additional information obtained by `Term`, `Definition`, 
and `Synonym` into this object? Assuming the output is the same length as 
the number of rows in `il31_go`, the character vector obtained by `Term`,
`Definition`, or `Synonym`, can be easily appended as a new column in the 
`il31_go` data frame:

```{r append-terms}
trms <- Term(il31_go$GO)
class(trms)
length(trms) == length(il31_go$GO)

il31_go$TERM <- trms
il31_go
```

The same can be done with the output of `Definition`:

```{r append-definitions}
defs <- Definition(il31_go$GO)
class(defs)
length(defs) == length(il31_go$GO)

il31_go$DEFINITION <- defs
il31_go[ ,c("SYMBOL", "PROBEID", "GO", "TERM", "DEFINITION")]
```

However, this only works cleanly when the output is a character 
vector with the same order and number of elements as the input vector. 
With the list output of `Synonym`, the process is a little less 
straightforward. In addition, it takes multiple steps to generate these
additional annotations and combine them with a `select` data frame. 
Instead of performing so many steps, we can utilize another 
Bionconductor annotation resource called `r Biocpkg("GO.db")` to retrieve GO 
annotation elements in a convenient data frame format. 

---------------------


### GO.db

The `r Biocpkg("GO.db")` R package contains annotations describing the entire 
Gene Ontology knowledgebase, assembled using data directly from the 
[GO website](http://geneontology.org/). `r Biocpkg("GO.db")` provides a method 
to easily retrieve the latest version of the Gene Ontology knowledgebase into
an R session. Like `SomaScan.db`, `r Biocpkg("GO.db")` is an annotation 
package that can be queried using the same five methods (`select`, `keys`,
`keytypes`, `columns`, and `mapIds`). By utilizing both `SomaScan.db` and
`r Biocpkg("GO.db")`, it is possible to connect `SeqIds` to GO IDs, then add 
additional GO annotations that are not available within `SomaScan.db`. 

Let's walk through an example. First, select a key (and corresponding 
GO ID) to use as a starting point:

```{r example-go-ids}
go_ids <- select(SomaScan.db, "IL3RA", keytype = "SYMBOL",
                 columns = c("GO", "SYMBOL"))

go_ids
```

As shown previously, the GO ID, EVIDENCE code, and ONTOLOGY comprise the 
extent of GO information contained in `SomaScan.db`. However, we can use the 
GO ID (in the `GO` column) to connect these values to the annotations in 
`r Biocpkg("GO.db")`: 

```{r go-columns}
columns(GO.db)

go_defs <- select(GO.db, keys = go_ids$GO,
                  columns = c("GOID", "TERM", "DEFINITION"))

go_defs
```

```{r go-merge}
merge(go_ids, go_defs, by.x = "GO", by.y = "GOID")
```

Using this workflow, in just two steps we can link annotation information
*between* annotation package resources (i.e `SomaScan.db` <--> `GO.db`).

---------------------


## KEGG Pathways

Note that the same workflow _cannot_ be performed for KEGG pathways, 
due to KEGG's data sharing policy. Instead, the package 
`r Biocpkg("KEGGREST")` must be used. Rather than an annotation database-style 
package (like `SomaScan.db` and `GO.db`), `r Biocpkg("KEGGREST")` is a package 
that provides a client interface in R to the KEGG REST
(REpresentational State Transfer) server. For reference, 
REST is an interface that two computer 
systems can use to securely exchange information over the internet. Queries 
made with the `r Biocpkg("KEGGREST")` package retrieve information 
directly from the online KEGG database.

Let's take the same `select` query as we used for GO, but modify it to obtain
KEGG pathway identifiers instead:

```{r select-kegg}
kegg_sel <- select(SomaScan.db, keys = "CD86", keytype = "SYMBOL", 
                   columns = c("PROBEID", "PATH"))

kegg_sel
```

We can use the identifiers in the "PATH" column to query the KEGG database 
using `KEGGREST::keggGet()`:

```{r get-path-names}
# Add prefix indicating species (hsa = Homo sapiens)
hsa_names <- paste0("hsa", kegg_sel$PATH)

kegg_res <- keggGet(dbentries = hsa_names) |>
    setNames(hsa_names[1:10L]) # Setting names for results list
```

Because so much information is returned by `keggGet()`, a maximum number of 10 
entries are allowed. Input exceeding 10 entries will be truncated, and only
the first 10 results will be returned (as indicated in the warning message 
above). Let's take a look at what was returned for each KEGG pathway:

```{r str-kegg-path}
str(kegg_res$hsa04514)
```

Some additional data manipulation will be required to extract the desired 
information from the results of `keggGet()`. Let's just extract the pathway
name (`NAME`):

```{r path-names-vector}
kegg_names <- vapply(kegg_res, `[[`, i = "NAME", "", USE.NAMES = FALSE)

kegg_names
```

Now we can append this vector to our original results from `select`:

```{r append-names}
kegg_sel$PATHNAME <- kegg_names

kegg_sel
```

Other pieces of information can be extracted to the list and reduced to a 
character vector or used to build a data frame, which can then be appended to
or merged similar to the pathway name in the code chunks above. For more 
details about what can be done with the package, see `r Biocpkg("KEGGREST")`.


# Positional Annotation

Similar to the extended GO annotation in the previous section, positional 
annotation cannot currently be performed within `SomaScan.db`. 
`SomaScan.db` is a platform-centric annotation package, built around the 
probes of the SomaScan protein assay, and positional annotation is not
within its scope. However, it _is_ possible to retrieve positional 
annotations by linking to other Bioconductor annotation resources, which can
then be combined with `SomaScan.db` in a two-step process (similar to above).
The first step uses `SomaScan.db` to retrieve gene-level information
corresponding to SomaScan analytes; the second requires a human transcriptome 
or organism-centric annotation package to retrieve the desired chromosomal 
locations.

We will provide a brief example of this using the popular organism-centric 
package, `r Biocpkg("EnsDb.Hsapiens.v75")`, which contains a database of human 
annotations derived from `Ensembl release 75`. However, this procedure can 
also be performed using transcriptome-centric annotation packages like 
`r Biocpkg("TxDb.Hsapiens.UCSC.hg19.knownGene")`.

Let's say we are interested in collecting position information 
associated with the protein target corresponding to `SeqId = 11138-16`.
First, we must determine which gene this `SeqId` maps to:

```{r seqid-gene}
pos_sel <- select(SomaScan.db, "11138-16", columns = c("SYMBOL", "GENENAME", 
                                                       "ENTREZID", "ENSEMBL"))

pos_sel
```

We now know this probe targets protein encoded by the `r pos_sel$SYMBOL` 
gene. We can use `r Biocpkg("EnsDb.Hsapiens.v75")` to retrieve positional 
information about `r pos_sel$SYMBOL`, like which chromosome the 
`r pos_sel$SYMBOL` is on, its start and stop position, and how many exons it 
has (at the time of Ensembl's `v75` release):

```{r keys-ens75, eval = FALSE}
# Install package from Bioconductor, if not already installed
if (!require("EnsDb.Hsapiens.v75", quietly = TRUE)) {
    BiocManager::install("EnsDb.Hsapiens.v75")
}

# The central keys of the organism-level database are the Ensembl gene ID
keys(EnsDb.Hsapiens.v75)[1:10L]

# Also contains the Ensembl gene ID, so this column can be used for merging
grep("ENSEMBL", columns(SomaScan.db), value = TRUE)

# These columns will inform us as to what positional information we can 
# retrieve from the organism-level database
columns(EnsDb.Hsapiens.v75)

# Build a query to retrieve the prot IDs and start/stop pos of protein domains
pos_res <- select(EnsDb.Hsapiens.v75, keys = "ENSG00000020633", 
                  columns = c("GENEBIOTYPE", "SEQCOORDSYSTEM", "GENEID", 
                              "PROTEINID", "PROTDOMSTART", "PROTDOMEND"))

# Merge back into `pos_sel` using the "GENEID" column
merge(pos_sel, pos_res, by.x = "ENSEMBL", by.y = "GENEID")
```


# Functional Group Representation

As mentioned in the Introductory Vignette (`vignette("SomaScan-db", package = "SomaScan.db")`), 
the `SomaScan.db` annotation database can be queried using values other than
the central database key, the `SeqId` (i.e. the "PROBEID" column). This 
section will describe additional methods of retrieving information from the 
database without using the `SeqId`.

---------------------


## GO Term Coverage

The annotations in `SomaScan.db` can be used to answer general questions about 
SomaScan, without the need for a SomaScan dataset/ADAT file as a starting 
point. For example, if one were interested in proteins involved in cancer 
progression and metastasis (and therefore cell adhesion), is the SomaScan 
menu capable of measuring proteins involved in cell adhesion? If so, how 
many of these proteins can be measured with SomaScan? 

We can answer this by examining the coverage of the GO term 
"cell adhesion" in both the 5k and 7k SomaScan menus. We don't need the
GO identifier to get started, as that information can be retrieved from
`GO.db` using the name of the term as the key:

```{r kin-act-go}
select(GO.db, keys = "cell adhesion", keytype = "TERM", 
       columns = c("GOID", "TERM"))
```

Now that we have the GO ID, we can search in `SomaScan.db` to determine 
how many `SeqIds` are associated with cell adhesion.

```{r}
cellAd_ids <- select(SomaScan.db, keys = "GO:0007155", keytype = "GO",
                     columns = "PROBEID", "UNIPROTID")

head(cellAd_ids, n = 10L)

# Total number of SeqIds associated with cell adhesion
unique(cellAd_ids$PROBEID) |> length()
```

There are `r length(unique(cellAd_ids$PROBEID))` unique `SeqIds` associated 
with the "cell adhesion" GO term (_unique_ is important here because the data 
frame above may contain multiple entries per `SeqId`, due to the "EVIDENCE" 
column). There are `r length(keys(SomaScan.db))` total `SeqIds` in the 
`SomaScan.db` database, so 
`r round(length(unique(cellAd_ids$PROBEID))/length(keys(SomaScan.db))*100, 2)`% 
of keys in the database are associated with cell adhesion.

How many of the _total_ proteins in the cell adhesion GO term are covered by 
the SomaScan menu? To answer this question, we first must use another 
annotation package, `org.Hs.eg.db`, to retrieve a list of all human 
UniProt IDs associated with the "cell adhesion" GO term.

```{r}
cellAd_prots <- select(org.Hs.eg.db, 
                       keys = "GO:0007155", 
                       keytype = "GO", 
                       columns = "UNIPROT")

# Again, we take the unique set of proteins
length(unique(cellAd_prots$UNIPROT))
```

The GO term `GO:0007155` (cell adhesion) contains a total of 
`r length(unique(cellAd_prots$UNIPROT))` unique human UniProt IDs. Now we 
can check to see how many of these are covered by the SomaScan menu by 
searching for the proteins in `SomaScan.db` with `select`:

```{r}
cellAd_covProts <- select(SomaScan.db, keys = unique(cellAd_prots$UNIPROT),
                          keytype = "UNIPROT", columns = "PROBEID")

head(cellAd_covProts, n = 20L)
```

`select` will return an `NA` value if a key is not found in the database. As 
seen above, some proteins in `GO:0007155` do not map to a `SeqId` in 
`SomaScan.db`. To get an accurate count of the proteins that _do_ map to a 
`SeqId`, we must remove the unmapped proteins by filtering out rows with `NA` 
values:

```{r}
cellAd_covProts <- cellAd_covProts[!is.na(cellAd_covProts$PROBEID),]

cellAd_covIDs <- unique(cellAd_covProts$UNIPROT)

length(cellAd_covIDs)
```

We removed duplicates from the list of proteins provided as keys, to get a 
final count of `r length(cellAd_covIDs)` proteins 
(`r round(length(cellAd_covIDs)/length(unique(cellAd_prots$UNIPROT))*100, 2)`%) 
from the "cell adhesion" GO term that are covered by the SomaScan menu. 

Does this number differ between versions of the SomaScan Menu? Remember that 
the 7k menu contains _all_ of the `SeqIds` in the 5k menu, so what this really 
tells us is: were analytes targeting cell adhesion-related proteins added in 
the 7k menu?

```{r kin-act-menu-diff}
cellAd_menu <- lapply(c("5k", "7k"), function(x) {
    df <- select(SomaScan.db, keys = unique(cellAd_prots$UNIPROT),
                 keytype = "UNIPROT", columns = "PROBEID",
                 menu = x)
    
    # Again, removing probes that do not map to a cell adhesion protein
    df <- df[!is.na(df$PROBEID),]
}) |> setNames(c("somascan_5k", "somascan_7k"))

identical(cellAd_menu$somascan_5k, cellAd_menu$somascan_7k)
```

In this example, the number of `SeqIds` associated with cell adhesion does 
_not_ differ between SomaScan menu versions (the list of `SeqIds` is 
identical). The differences between menu versions can be explored with the 
`menu` argument of `select`, or via the `somascan_menu` data object (this is 
explained in the Introductory Vignette).

---------------------


## Gene Families

A number of gene families are targeted by reagents in the SomaScan assay. How 
can these be interrogated using `SomaScan.db`? Is the package capable of 
searching for/within specific gene families? The answer is yes, but  
a specific function does not exist for analyzing gene families as a whole. 
Instead, by using features of `select` and `keys`, `SomaScan.db` can 
be queried for common features connecting gene families of interest - more 
specifically, the `match=` argument of `select` and the `pattern=` argument of 
`keys` can be used to retrieve gene family members that contain a common
pattern in their name.

The `keys` method is capable of using regular expressions ("regex") to search 
for keys in the database that contain a specific pattern of characters. This 
feature is especially useful when looking for annotations for a gene family. 
For example, a regex pattern can be used to retrieve a list of all IL17 
receptor family genes in the database:

```{r keys-il17}
il17_family <- keys(SomaScan.db, keytype = "SYMBOL", pattern = "IL17")
```

Those keys can then be used to query the database with `select`:

```{r select-il17}
select(SomaScan.db, keys = il17_family, keytype = "SYMBOL",
       columns = c("PROBEID", "UNIPROT", "GENENAME"))
```

If multiple gene families are of interest, the `keys` argument of `select` 
(in combination with `match=TRUE`) can support a regex pattern, and will 
accomplish both of the previous steps in a single call:

```{r combine-keys-select}
select(SomaScan.db, keys = "NOTCH|ZF", keytype = "SYMBOL", 
       columns = c("PROBEID", "SYMBOL", "GENENAME"), match = TRUE)
```

The `GENENAME` column can also support a regex pattern, and can be used to 
search for keywords that are associated with specific gene families (and
not just the gene symbols themselves). Examples include "homeobox", 
"zinc finger", "notch", etc.

```{r homeobox}
select(SomaScan.db, keys = "homeobox", keytype = "GENENAME", 
       columns = c("PROBEID", "SYMBOL"), match = TRUE)
```


# Session info

```{r session-info}
sessionInfo()
```