---
title: "Keywords from the UniProt database"
author: "Zuguang Gu (z.gu@dkfz.de)"
date: '`r Sys.Date()`'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Keywords from the UniProt database}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


UniProt database provides a list of controlled vocabulary represented as keywords
for genes or proteins (https://www.uniprot.org/keywords/). This is useful for summarizing gene functions in a compact way. This package
provides data of keywords hierarchy and gene-keyword relations.


First load the package:

```{r}
library(UniProtKeywords)
```

First the release and source information of the data:

```{r}
UniProtKeywords
```

The package has five data objects. The first contains basic information of every keyword term:

```{r}
data(kw_terms)
length(kw_terms)
kw_terms[["Cell cycle"]]
```


The other four contain hierarchical structures of the keyword terms:

```{r}
data(kw_parents)
kw_parents[1:2]

data(kw_children)
kw_children[1:2]

data(kw_ancestors)
kw_ancestors[1:2]

data(kw_offspring)
kw_offspring[1:2]
```

The **UniProtKeywords** package has also compiled genesets of keywords for some species, which can get by the function `load_keyword_genesets()`.
The argument is the taxon ID of a species. The full set of supported organisms can be found in the document of `load_keyword_genesets()`.

```{r}
gl = load_keyword_genesets(9606)
gl[3:4]  # because gl[1:2] has a very long output, here we print gl[3:4]
```

Argument `as_table` can be set to `TRUE`, then `load_keyword_genesets()` returns a two-column data frame.

```{r}
tb = load_keyword_genesets(9606, as_table = TRUE)
head(tb)
```

To make it more flexible, you can also provide the Latin name or the common name of the species.

```{r}
gl2 = load_keyword_genesets("mouse")
gl2[3:4]
```

We can simply check some statistics.

1. Sizes of keyword genesets:


```{r, fig.width = 7, fig.height = 5}
plot(table(sapply(gl, length)), log = "x", 
    xlab = "Size of keyword genesets",
    ylab = "Number of keywords"
)
```

2. Numbers of words in keywords:

```{r, fig.width = 7, fig.height = 5}
plot(table(sapply(gregexpr(" |-|/", names(gl)), length)), 
    xlab = "Number of words in keywords",
    ylab = "Number of keywords"
)
```

3. Numbers of characters in keywords:

```{r, fig.width = 7, fig.height = 5}
plot(table(nchar(names(gl))), 
    xlab = "Number of characters in keywords",
    ylab = "Number of keywords"
)
```

Session info:

```{r}
sessionInfo()
```
