---
title: "SynaptomeDB: database for Synaptic proteome"
subtitle: "Manual for querying SynaptomeDB"
author: "Oksana Sorokina, Anatoly Sorokin, J. Douglas Armstrong"
date: '`r format(Sys.time(), "%d.%m.%Y")`'
output: rmarkdown::html_vignette
vignette: >
    %\VignetteIndexEntry{synaptome_db_query}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      collapse = TRUE,
                      comment = "#>"
)
```


```{r setup.show, include=FALSE}
suppressMessages(library(synaptome.db))
suppressMessages(library(dplyr))
library(pander)
library(ggplot2)
```

# Introduction

58 published synaptic proteomic datasets (2000-2022 years) that describe 
over 8,000 proteins were  integrated and combined with direct protein-protein 
interactions and functional metadata to build a network resource. 

The set includes 29 post synaptic proteome (PSP) studies (2000 to 2019) 
contributing a total of 5,560 mouse and human unique gene identifiers; 
19 presynaptic studies (2004 to 2022) describe 2,772 unique human and 
mouse gene IDs, and 11 studies that span the whole synaptosome and 
report 7,198 unique genes. 

NB: With the latest update of the database we have added 6 new studies: 1 
presynaptic and 5 postsynaptic, new compsrtment (Synaptic Vesicle), and 
coding mutations for Autistic Spectral Disorder (ASD) and epilepsy (Epi).

To reconstruct protein-protein interaction (PPI) networks for the pre- and 
post-synaptic proteomes we used human PPI data filtered for the highest 
confidence direct and physical interactions from BioGRID, Intact and DIP. 
The resulting postsynaptic proteome (PSP) network contains 4,817 nodes 
and 27,788 edges in the Largest Connected Component (LCC). The presynaptic 
network is significantly smaller and comprises 2,221 nodes and 8,678 edges 
in the LCC.

The database includes: proteomic and interactomic data with supporting 
information on compartment, specie and brain region, GO function 
information for three species: mouse, rat and human, disease annotation 
for human (based on Human Disease Ontology (HDO) and GeneToModel table, 
which links certain synaptic proteins to existing computational models of 
synaptic plasticity and synaptic signal transduction

The original files are maintained at 
Eidnburgh Datashare https://doi.org/10.7488/ds/3017. 
Updated database file could be found here:
https://doi.org/10.7488/ds/3771

The dataset was described in the paper:

1. Sorokina, O., Mclean, C., Croning, M.D.R. et al.  
*A unified resource and configurable model of the  synapse proteome and 
its role in disease*.  Sci Rep 11, 9967 (2021).
<https://doi.org/10.1038/s41598-021-88945-7>.

## Database content overview
Following diagrams illustrate content of the database available for analysis
by the package functions. Source code for construction of figures ar available
in the Appendix [statistics code](#statistics) section.

### Presynaptic datasets
```{r pre_histogram,fig.width=8,fig.height=8,fig.cap='Number of identified proteins in Presynaptic datasets',eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE,fig.show='hold'}
gp<-findGeneByCompartmentPaperCnt(1)

# presynaptic stats
presgp <- gp[gp$Localisation == "Presynaptic",]
syngp <- gp[gp$Localisation == "Synaptosome",]
presg <- getGeneInfoByIDs(presgp$GeneID)
mpres <- merge(presgp, presg, by = c("GeneID","Localisation"))
mmpres <- mpres[, c('GeneID',
                    'HumanEntrez.x',
                    'HumanName.x',
                    'Npmid',
                    'PaperPMID',
                    'Paper',
                    'Year')]
papers <- getPapers()
prespap <- papers[papers$Localisation == "Presynaptic",]
mmmpres <- mmpres[mmpres$PaperPMID %in% prespap$PaperPMID,]
mmmpres$found <- 0
for(i in 1:dim(mmmpres)[1]) {
    if (mmmpres$Npmid[i] == 1) {
        mmmpres$found[i] <- '1'
    } else if (mmmpres$Npmid[i] > 1 & mmmpres$Npmid[i] < 4) {
        mmmpres$found[i] <- '2-3'
    } else if (mmmpres$Npmid[i] >= 4 & mmmpres$Npmid[i] < 10) {
        mmmpres$found[i] <- '4-9'
    } else if (mmmpres$Npmid[i] >= 10) {
        mmmpres$found[i] <- '>10'
    }
}

mmmpres$found<- factor(mmmpres$found,
                        levels = c('1','2-3','4-9','>10'),
                        ordered=TRUE)
tp<-unique(mmmpres$Paper)
mmmpres$Paper<- factor(mmmpres$Paper,
                        levels =tp[order(as.numeric(sub('^[^0-9]+_([0-9]+)',
                                                        '\\1',tp)))],
                        ordered=TRUE)

ummpres<-unique(mmmpres[,c('GeneID','Paper','found')])
ggplot(ummpres) + geom_bar(aes(y = Paper, fill = found)) 
```

### Postsynaptic datasets
```{r post_histogram,fig.width=8,fig.height=8,fig.cap='Number of identified proteins in Postsynaptic datasets',eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE,fig.show='hold'}
#postsynaptic stats
pstgp <- gp[gp$Localisation == "Postsynaptic",]
postg <- getGeneInfoByIDs(pstgp$GeneID)
mpost <- merge(pstgp, postg, by = c("GeneID","Localisation"))
mmpost <- mpost[, c('GeneID',
                    'HumanEntrez.x',
                    'HumanName.x','Npmid',
                    'PaperPMID','Paper','Year')]
postspap <- papers[papers$Localisation == "Postsynaptic",]
mmmpost <- mmpost[mmpost$PaperPMID %in% postspap$PaperPMID,]
mmmpost$found <- 0
for(i in 1:dim(mmmpost)[1]) {
    if (mmmpost$Npmid[i] == 1) {
        mmmpost$found[i] <- '1'
    } else if (mmmpost$Npmid[i] > 1 & mmmpost$Npmid[i] < 4) {
        mmmpost$found[i] <- '2-3'
    } else if (mmmpost$Npmid[i] >= 4 & mmmpost$Npmid[i] < 10) {
        mmmpost$found[i] <- '4-9'
    } else if (mmmpost$Npmid[i] >= 10) {
        mmmpost$found[i] <- '>10'
    }
}
mmmpost$found<- factor(mmmpost$found,levels = c('1','2-3','4-9','>10'),
                        ordered=TRUE)
tp<-unique(mmmpost$Paper)
mmmpost$Paper<- factor(mmmpost$Paper,
                        levels =tp[order(as.numeric(sub('^[^0-9]+_([0-9]+)',
                                                        '\\1',tp)))],
                        ordered=TRUE)
ummpos<-unique(mmmpost[,c('GeneID','Paper','found')])
ggplot(ummpos) + geom_bar(aes(y = Paper, fill = found)) 
```

### Synaptic Vesicle

```{r sv_histogram,fig.width=8,fig.height=8,fig.cap='Number of identified proteins in Synaptic Vesicle datasets (note the difference in the color scale)',eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE,fig.show='hold'}
svgp <- gp[gp$Localisation == "Synaptic_Vesicle",]
svg <- getGeneInfoByIDs(svgp$GeneID)
mpost <- merge(svgp, svg, by = c("GeneID","Localisation"))
mpost$Paper<-paste0(mpost$Paper,ifelse('FULL'==mpost$Dataset,'','_SVR'))
mmpost <- mpost[, c('GeneID','HumanEntrez.x','HumanName.x','Npmid',
                    'PaperPMID','Paper','Year')]
postspap <- papers[papers$Localisation == "Synaptic_Vesicle",]
mmmpost <- mmpost[mmpost$PaperPMID %in% postspap$PaperPMID,]
mmmpost$found <- 0
for(i in 1:dim(mmmpost)[1]) {
    if (mmmpost$Npmid[i] == 1) {
        mmmpost$found[i] <- '1'
    } else if (mmmpost$Npmid[i] > 1 & mmmpost$Npmid[i] < 4) {
        mmmpost$found[i] <- '2-3'
    } else if (mmmpost$Npmid[i] >= 4 & mmmpost$Npmid[i] < 6) {
        mmmpost$found[i] <- '4-5'
    } else if (mmmpost$Npmid[i] >= 6) {
        mmmpost$found[i] <- '>6'
    }
}

mmmpost$found<- factor(mmmpost$found,levels = c('1','2-3','4-5','>6'),
                        ordered=TRUE)
tp<-unique(mmmpost$Paper)
mmmpost$Paper<- factor(mmmpost$Paper,
                        levels =tp[order(as.numeric(sub('^[^0-9]+_([0-9]+)_?.*',
                                                        '\\1',tp)))],
                        ordered=TRUE)

ummpos<-unique(mmmpost[,c('GeneID','Paper','found')])
g<-ggplot(ummpos) + geom_bar(aes(y = Paper, fill = found)) 
g
```

### Brain region distribution
```{r utot_histogram,fig.width=8,fig.height=8,fig.cap='Number of identified proteins in different Brain Regions(stacked)',eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE,fig.show='hold'}
#brain region statistics
totg <- getGeneInfoByIDs(gp$GeneID)
mtot <- merge(gp, totg, by = c("GeneID","Localisation"))
mmptot <- mtot[, c('GeneID',
                    'HumanEntrez.x',
                    'HumanName.x',
                    'Localisation',
                    'Npmid',
                    'Paper',
                    'BrainRegion')]
untot<-unique(mmptot[,c('GeneID','BrainRegion','Localisation')])
loccolors<-c("#3D5B59","#B5E5CF","#FCB5AC", "#B99095")
loccolors<-loccolors[1:length(unique(untot$Localisation))]
ggplot(untot) + geom_bar(aes(y = BrainRegion, fill = Localisation)) + 
    scale_fill_manual(values = loccolors)
```

```{r grouped_histogram,fig.width=8,fig.height=4,fig.cap='Number of identified proteins in different Brain Regions(grouped)',eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE,fig.show='hold'}
table(untot$Localisation,untot$BrainRegion)-> m
as.data.frame(m)->udf
names(udf)<-c('Localisation','BrainRegion','value')
ggplot(udf, aes(fill=Localisation, y=value, x=BrainRegion)) + 
    geom_bar(position="dodge", stat="identity")+ 
    scale_fill_manual(values = loccolors) + 
    theme(axis.text.x = element_text(face="plain", 
                                        color="#993333", 
                                        angle=45,vjust = 1,
                                        hjust=1,size = rel(1.5)))
```

# Overview of capabilities
## 1.Get information for a specific gene or gene set.
The dataset can be used to answer frequent questions such as “What is known 
about my favourite gene? Is it pre- or postsynaptic? Which brain region was 
it identified in?”, "Which publication it was reported in?"
Information could be obtained by submitting gene EntrezID or Gene name

```{r gene_info}
t <- getGeneInfoByEntrez(1742) 
pander(head(t))

t <- getGeneInfoByName("CASK")
pander(head(t))

t <- getGeneInfoByName(c("CASK", "DLG2"))
pander(head(t))                      
```

## 2.Get internal GeneIDs for the list of genes.
Obtaining Internal database GeneIDs is a useful intermediate step for more 
complex queries including those for building protein-protein interaction (PPI) 
networks for compartments and brain regions. Internal GeneID is specie-neutral 
and unique, which allows exact identification of the object of interest in 
case of redundancy (e.g. one Human genes matches on a few mouse ones, etc.)

```{r  findIDs}
t <- findGenesByEntrez(c(1742, 1741, 1739, 1740))
pander(head(t))

t <- findGenesByName(c("SRC", "SRCIN1", "FYN"))
pander(head(t))
```

## 3.Get disease information for the gene set
Synaptic genes are annotated with disease information from Human Disease 
Ontology, where available. To get disease information one can submit the 
list of Human Entrez Is or Human genes names, it could be also the list of 
Internal GeneIDs if using `getGeneDiseaseByIDs` function

```{r disease }
t <- getGeneDiseaseByName (c("CASK", "DLG2", "DLG1"))
pander(head(t))

t <- getGeneDiseaseByEntres (c(8573, 1742, 1739))
pander(head(t))
```

## 5.Get information about the studies, combined into dataset
One can obtain the overview of synaptic proteome papers combined into the 
database, which includes paper PMID, specie Tax ID, year of publication, 
subcellular localisation, brain region and number of proteins identified in 
the paper. This information may help to choose the specific study(ies) for 
further work.
```{r paper}
p <- getPapers()
pander(head(p))
```

## 6.Get the table of frequently identified proteins
It is also possible to obtain the list of proteins found in more than one 
study, for the whole synaptic proteome. For that, the user needs to provide 
a "count" value as a desired minimal number of identifications (e.g. 2 or 
more). The command returns the table with gene identifiers and "Npmid" column, 
which contains the number of studies where this gene was identified.
```{r gene_count}
#find all proteins in synaptic proteome identified 2 times or more
gp <- findGeneByPaperCnt(cnt = 2)
pander(head(gp))
```

## 7.Get the table of proteins identified in specific studies
Following section 5, when the information for all considered proteomic studies 
was obtained, user can select specific study(ies) by PMID and get the proteins 
identified in those studies. By providing `count` value user can extract either 
all proteins from specified studies (count = 1), or just frequently found ones 
(count >=2). As above, the command returns the table with gene identifiers with 
`Npmid` column, which contains the number of studies where this protein was 
identified

```{r gene_papCount}
spg <- findGeneByPapers(p$PaperPMID[1:5], cnt = 1)
pander(head(spg))
```

## 8.Get the table of proteins frequently identified in specific compartment
Most of the times, user is interested in the specific compartment rather then 
in total synaptic proteome. To help identify the genes most probably 
residenting in the specific compartment and exclude possible contaminants,  
findGeneByCompartmentPaperCnt function provides the table of proteins found 
`cnt` or more times in different compartment-paper pairs.

```{r gene_comp}
gcp <- findGeneByCompartmentPaperCnt(cnt = 2)
pander(head(gcp))
# Now user can select the specific compartment and proceed working with 
# obtained list of frequently found proteins
presgp <- gcp[gcp$Localisation == "Presynaptic",]
dim(presgp)
pander(head(presgp))
```

## 9.Get PPI interactions for my list of genes
Custom Protein-protein interactions based on bespoke subsets of molecules 
could be extracted in two general ways: "induced" and "limited". In the
first case, the command will return all possible interactors for the genes 
within the whole interactome. In the second case it will return only 
interactions between the genes of interest. PPIs could be obtained by 
submitting list of EntrezIDs or gene names, or Internal IDs - in all 
cases the interactions will be returned as a list of interacting pairs 
of Intenal GeneIDs.

```{r PPI}
t <- getPPIbyName(
    c("CASK", "DLG4", "GRIN2A", "GRIN2B","GRIN1"), 
    type = "limited")
pander(head(t))

t <- getPPIbyEntrez(c(1739, 1740, 1742, 1741), type='induced')
pander(head(t))
 #obtain PPIs for the list of frequently found genes in presynaptc compartment
t <- getPPIbyEntrez(presgp$HumanEntrez, type='induced')
pander(head(t))
```

## 10.Get the molecular structure of synaptic compartment
Three main synaptic compartments considered in the database are "presynaptic", 
"postsynaptic" and "synaptosome". Genes are classified to compartments based 
on respective publications, so that each gene can belong to one or two, or 
even three compartments. The full list of genes for specific compartment 
could be obtained with command `getAllGenes4Compartment`, which returns 
the table with main gene identifiers, like internal GeneIDs, MGI ID, 
Human Entrez ID, Human Gene Name, Mouse Entrez ID, Mouse Gene Name, 
Rat Entrez ID, Rat Gene Name.

If you need to check which genes of your list belong to specific compartment, 
you can use `getGenes4Compartment` command, which will select from your list 
only genes associated with specific compartment.
To obtain the PPI network for compartment one has to submit the list of 
Internal GeneIDs obtained with previous commands.

```{r compPPI}
#getting the list of compartment
comp <- getCompartments()
pander(comp)

#getting all genes for postsynaptic compartment
gns <- getAllGenes4Compartment(compartmentID = 1) 
pander(head(gns))

#getting full PPI network for postsynaptic compartment
ppi <- getPPIbyIDs4Compartment(gns$GeneID,compartmentID =1, type = "induced")
pander(head(ppi))

```

## 11.Get the molecular structure of the brain region.
Three are 12 brain regions considered in the database based on respective 
publications, so that each gene can belong to the single or to the several 
brain regions. Brain regions differ between species, and specie brain region 
information is not 100% covered in the database(e.g. we don't have yet studies 
for Human Striatum, but do have for Mouse and Rat), that's why when querying 
the database for brain region information you will need to specify the specie. 
The full list of genes for specific region could be obtained wuth command 
`getAllGenes4BrainRegion`, which returns the table with main gene identifiers, 
like internal Gene IDs, MGI ID, Human Entrez ID, Human Gene Name, 
Mouse Entrez ID, Mouse Gene Name, Rat Entrez ID, Rat Gene Name.

If you need to check which genes of your list were identified in 
specific region, you can use `getGenes4BrainRegion` command, which will 
select only genes associated with specific region from your list.

To obtain the PPI network for brain region you need to submit the list of 
Internal GeneIDs obtained with previous commands.

```{r regPPI}
#getting the full list of brain regions
reg <- getBrainRegions()
pander(reg)

#getting all genes for mouse Striatum
gns <- getAllGenes4BrainRegion(brainRegion = "Striatum",taxID = 10090)
pander(head(gns))

#getting full PPI network for postsynaptic compartment
ppi <- getPPIbyIDs4BrainRegion(
    gns$GeneID, brainRegion = "Striatum", 
    taxID = 10090, type = "limited")
pander(head(ppi))
```

## 12.Checking third-party list against Synaptic Proteome db
```{r check_list}
#check which genes from 250 random EntrezIds are in the database
listG<-findGenesByEntrez(1:250) 
dim(listG)
head(listG)

#check which genes from subset identified as synaptic are presynaptic
getCompartments()
presG <- getGenes4Compartment(listG$GeneID, 2) 
dim(presG)
head(presG)

#check which genes from subset identified as synaptic are found in 
#human cerebellum
getBrainRegions()
listR <- getGenes4BrainRegion(listG$GeneID, 
                              brainRegion = "Cerebellum", taxID = 10090) 
dim(listR)
head(listR)

```

## 13.Visualisatiion of PPI network with Igraph.

Combine information from PPI data.frame obtained with functions like 
`getPPIbyName`, `getPPIbyEntrez`, `getPPIbyIDs4Compartment` or 
`getPPIbyIDs4BrainRegion` with information about genes obtained 
from `getGenesByID` to make interpretable undirected PPI graph in 
igraph format. In this format network could be further analysed and 
visualized by algorithms from igraph package.

```{r PPI_igraph,fig.width=7,fig.height=7,out.width="70%",fig.align = "center"}
library(igraph)
g<-getIGraphFromPPI(
    getPPIbyIDs(c(48, 129,  975,  4422, 5715, 5835), type='lim'))
plot(g,vertex.label=V(g)$RatName,vertex.size=25)
```

## 1.Export of PPI network as a table.
If Igraph is not an option, the PPI network could be exported as an 
interpretible table to be processed with other tools, e.g. Cytoscape,etc.

```{r PPI_table}
tbl<-getTableFromPPI(getPPIbyIDs(c(48, 585, 710), type='limited'))
tbl
```

# References

1. Sorokina, O., Mclean, C., Croning, M.D.R. et al.  
*A unified resource and configurable model of the  synapse proteome and 
its role in disease*.  Sci Rep 11, 9967 (2021).
<https://doi.org/10.1038/s41598-021-88945-7>.

# Appendix {.tabset}
## Statistics graph code {#statistics .unnumbered}
```{r pre_histogram, eval=FALSE, include=TRUE}
```
```{r post_histogram, eval=FALSE, include=TRUE}
```
```{r sv_histogram, eval=FALSE, include=TRUE}
```
```{r utot_histogram, eval=FALSE, include=TRUE}
```
```{r grouped_histogram, eval=FALSE, include=TRUE}
```

### Session Info
```{r sessionInfo, echo=FALSE, results='asis', class='text', warning=FALSE}
c<-devtools::session_info()
pander::pander(t(data.frame(c(c$platform))))
pander::pander(as.data.frame(c$packages)[,-c(4,5,10,11)])
```