# Standard setup chunk
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
# Load libraries required for the vignette to build
library(PMScanR)
library(ggseqlogo)
library(seqinr)
library(plotly)## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The PMScanR package provides large-scale identification,
analysis, and visualization of protein motifs. The package integrates
various methods to facilitate motif identification, characterization,
and visualization. It includes functions for running PS-Scan, a PROSITE
database tool. Additionally, PMScanR supports format
conversion to GFF, enhancing downstream analyses such as graphical
representation and database integration. The library offers multiple
visualization tools, including heatmaps, sequence logos, and pie charts,
enabling a deeper understanding of motif distribution and conservation.
Through its integration with PROSITE, PMScanR provides
access to up-to-date motif data.
Proteins play a crucial role in biological processes, with their
functions closely related to structure. Protein functions are often
associated with the presence of specific motifs, which are short,
sometimes repetitive amino acid sequences essential for distinctive
molecular interactions or modifications. Most of the existing
bioinformatics tools focus mainly on the identification of known motifs
and often do not provide interactive analysis and visualization tools
during motif extraction. Moreover, they do not take into account the
effect of single variations on an entire domain or protein motif. These
limitations highlight the need for a tool that can automate and scale
the analysis. To address this, we have developed PMScanR,
an R-based package designed to facilitate and automate the prediction
and evaluation of the effect of single amino acid substitutions on the
occurrence of protein motifs on a large scale. However, existing tools
lacked the capability to perform comparative analysis of multiple motifs
across multiple sequences, a gap that PMScanR was
particularly developed to fill.
Here we show the most basic pipeline for protein motif analysis: scanning a sequence file, processing the results, and generating a occurrence plot visualization. This code chunk assumes you have a FASTA file ready for analysis.
If the user prefers to perform the analysis using a graphical user
interface (GUI), they can simply run the function
runPMScanRShiny(). This will launch a Shiny app that opens
an interactive window. The window can be used both within R and in a web
browser, providing a clickable, user-friendly interface that allows the
entire analysis, including visualizations, to be carried out without
needing to write code.
Alternatively, if the user wishes to work directly with the code, the library provides a set of functions to perform the full analysis, including protein motif identification and visualization. This can be done through an R script, where users can execute and customize the analysis programmatically. Each function included in the package is described below, along with an explanation of its purpose and functionality.
The first step of the analyses is to establish the working environment.
For the purpose of this vignette, we will use sample files included
with the PMScanR package. These files represent the input
(FASTA) and various outputs (GFF, PSA, TXT) generated by the PROSITE
analysis.
# 1. Load example FASTA file (Input for runPsScan)
fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")
# 2. Load example GFF output
gff_file <- system.file("extdata", "out_Hb_gff.txt", package = "PMScanR")
# 3. Load example PSA output
psa_file <- system.file("extdata", "out_Hb_psa.txt", package = "PMScanR")
# 4. Load example PROSITE text output (Scan format)
prosite_txt_file <- system.file("extdata", "out_Hb_PROSITE.txt", package = "PMScanR")Once the data is accessible, you can move on to the analysis.
runPsScan()This function runs the ps_scan tool to identify protein
motifs. It automatically detects the operating system and handles the
downloading and caching of required PROSITE databases (using
BiocFileCache) upon the first run.
The runPsScan function allows you to specify the output
format via the out_format argument. Available formats
include: scan (default), fasta,
psa, msa, gff, pff,
epff, sequence, matchlist,
ipro.
Regardless of the chosen output format (gff,
psa, or scan), PMScanR allows you
to import the data into R and perform the complete analysis.
The PMScanR package is designed to work flexibly with
different PROSITE output formats. Whether you have a GFF, PSA, or a
standard text file, you can load it into a standardized data frame in
R.
Option A: Loading GFF files If you generated a GFF
file, use rtracklayer::import.gff and convert it to a data
frame.
gff_data <- as.data.frame(rtracklayer::import.gff(gff_file))
# The data frame now contains all necessary columns (including Sequence)
head(gff_data)
## seqnames start end width strand source type score phase
## 1 NP_000508.1 39 41 3 * ps_scan|v1.89 PS00005 NA NA
## 2 NP_000508.1 138 140 3 * ps_scan|v1.89 PS00005 NA NA
## 3 NP_000508.1 4 7 4 * ps_scan|v1.89 PS00006 NA NA
## 4 NP_000508.1 19 24 6 * ps_scan|v1.89 PS00008 NA NA
## 5 NP_000508.1 59 62 4 * ps_scan|v1.89 PS00009 NA NA
## 6 NP_000508.1 2 142 141 * ps_scan|v1.89 PS01033 42.058 NA
## Name
## 1 PKC_PHOSPHO_SITE
## 2 PKC_PHOSPHO_SITE
## 3 CK2_PHOSPHO_SITE
## 4 MYRISTYL
## 5 AMIDATION
## 6 GLOBIN
## Sequence
## 1 TtK
## 2 TsK
## 3 SpaD
## 4 GAhaGE
## 5 hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGHGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
## SequenceDescription SkipFlag Level RawScore
## 1 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 2 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 3 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 4 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 5 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 6 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] <NA> 0 4843
## FeatureFrom FeatureTo
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 1 -1Option B: Loading PSA files If you generated a PSA
file, use the readPsa() function.
psa_data <- readPsa(psa_file)
head(psa_data)
## seqnames start end width strand source type score phase
## 1 NP_000508.1 39 41 3 * PSA PS00005 NA NA
## 2 NP_000508.1 138 140 3 * PSA PS00005 NA NA
## 3 NP_000508.1 4 7 4 * PSA PS00006 NA NA
## 4 NP_000508.1 19 24 6 * PSA PS00008 NA NA
## 5 NP_000508.1 59 62 4 * PSA PS00009 NA NA
## 6 NP_000508.1 2 142 141 * PSA PS01033 NA NA
## Name
## 1 PKC_PHOSPHO_SITE
## 2 PKC_PHOSPHO_SITE
## 3 CK2_PHOSPHO_SITE
## 4 MYRISTYL
## 5 AMIDATION
## 6 GLOBIN
## Sequence
## 1 TtK
## 2 TsK
## 3 SpaD
## 4 GAhaGE
## 5 hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGHGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
## SequenceDescription SkipFlag Level RawScore FeatureFrom FeatureTo
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> 0 <NA> <NA> <NA>Option C: Loading standard PROSITE text files If you
have a standard output file (format ‘scan’), use the
readProsite() function.
prosite_data <- readProsite(prosite_txt_file)
head(prosite_data)
## seqnames start end width strand source type score phase
## 1 NP_000508.1 39 41 3 * PROSITE PS00005 NA NA
## 2 NP_000508.1 138 140 3 * PROSITE PS00005 NA NA
## 3 NP_000508.1 4 7 4 * PROSITE PS00006 NA NA
## 4 NP_000508.1 19 24 6 * PROSITE PS00008 NA NA
## 5 NP_000508.1 59 62 4 * PROSITE PS00009 NA NA
## 6 NP_000508.1 2 142 141 * PROSITE PS01033 NA NA
## Name
## 1 PKC_PHOSPHO_SITE
## 2 PKC_PHOSPHO_SITE
## 3 CK2_PHOSPHO_SITE
## 4 MYRISTYL
## 5 AMIDATION
## 6 GLOBIN
## Sequence
## 1 TtK
## 2 TsK
## 3 SpaD
## 4 GAhaGE
## 5 hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGL=0HGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
## SequenceDescription SkipFlag Level RawScore FeatureFrom
## 1 Protein kinase C phosphorylation site. <NA> <NA> <NA> <NA>
## 2 Protein kinase C phosphorylation site. <NA> <NA> <NA> <NA>
## 3 Casein kinase II phosphorylation site. <NA> <NA> <NA> <NA>
## 4 N-myristoylation site. <NA> <NA> <NA> <NA>
## 5 Amidation site. <NA> <NA> <NA> <NA>
## 6 Globin domain profile. <NA> <NA> <NA> <NA>
## FeatureTo
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>Once the data is loaded into a data frame (using any of the options above), the downstream analysis is identical. You can generate heatmaps, pie charts, and sequence logos from the same object.
gff2matrix()This function converts the data frame into a binary motif-occurrence
matrix. This matrix is the required input for the heatmap visualization
functions. Each row represents a unique motif, each column represents a
sequence, and a 1 indicates the presence of that motif.
Note: This function works with data loaded from GFF, PSA, or TXT files, as long as they are parsed into the standard format shown above.
# We can use the data loaded from Option A, B or C.
# Here we use 'gff_data' as an example.
motif_matrix <- gff2matrix(gff_data)
# Display the first few rows of the resulting matrix
head(motif_matrix)
## NP_000508.1 NP_000549.1 NP_000550.2 NP_000175.1 NP_001305150.1
## PS00001:125-128 0 0 0 0 0
## PS00001:148-151 0 0 0 0 0
## PS00001:152-155 0 0 0 0 0
## PS00001:177-180 0 0 0 0 0
## PS00001:182-185 0 0 0 0 0
## PS00001:189-192 0 0 0 0 0
## NP_001350615.1 NP_001138679.1 NP_599030.1 NP_071441.1
## PS00001:125-128 0 0 0 0
## PS00001:148-151 0 0 0 0
## PS00001:152-155 0 0 0 0
## PS00001:177-180 0 0 1 0
## PS00001:182-185 0 0 0 0
## PS00001:189-192 0 1 0 0
## NP_001305067.1
## PS00001:125-128 1
## PS00001:148-151 1
## PS00001:152-155 1
## PS00001:177-180 0
## PS00001:182-185 1
## PS00001:189-192 0After using this function a occurrence plot can be generated by using:
matrix2OP() and matrixOP()These functions generate interactive occurrence plots from the
motif-occurrence matrix. matrix2OP creates a standard
rectangular occurrence plot, while matrix2SquareOP creates
one with a square aspect ratio.
extractProteinMotifs()To generate sequence logos, you can use the
extractProteinMotifs() function. This function parses the
output file generated by runPsScan() (supporting GFF, PSA,
and TXT formats) and extracts all instances of each identified motif
into a list, where the keys are PROSITE IDs.
# This reads the PROSITE analysis output file from disk and extracts motifs.
# The format is detected automatically, but can also be specified explicitly
# (e.g., format = "gff").
protein_motifs <- extractProteinMotifs(psa_file)
# Check the PROSITE IDs (keys) found in the file
head(names(protein_motifs))
## [1] "PS00001" "PS00004" "PS00005" "PS00006" "PS00007" "PS00008"
# Generate sequence logo for the first motif found in the list
ggseqlogo::ggseqlogo(protein_motifs[[1]], seq_type='aa')
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
## Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.extractSegments()If you want to analyze raw sequences instead of identified motifs
(e.g., to look at a specific region across all proteins), use
extractSegments().
# Read the FASTA file into a list of sequences
sequences <- seqinr::read.fasta(file = fasta_file, seqtype = "AA")
# Extract segments from position 10 to 20 from all sequences
segments <- extractSegments(sequences, from = 10, to = 20)
# Generate the sequence logo from the extracted segments
ggseqlogo::ggseqlogo(unlist(segments), seq_type = "aa")Sigrist C.J.A., de Castro E., Cerutti L., Cuche B.A., Hulo N., Bridge A., Bougueleret L., Xenarios I. (2012). New and continuing developments at PROSITE. Nucleic Acids Res.
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.12.0 ggplot2_4.0.1 seqinr_4.2-36 ggseqlogo_0.2.2
## [5] PMScanR_1.0.1 BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9
## [3] httr2_1.2.2 rlang_1.1.7
## [5] magrittr_2.0.4 ade4_1.7-23
## [7] otel_0.2.0 matrixStats_1.5.0
## [9] compiler_4.5.2 RSQLite_2.4.5
## [11] vctrs_0.7.1 reshape2_1.4.5
## [13] stringr_1.6.0 pkgconfig_2.0.3
## [15] crayon_1.5.3 fastmap_1.2.0
## [17] dbplyr_2.5.1 XVector_0.50.0
## [19] labeling_0.4.3 Rsamtools_2.26.0
## [21] promises_1.5.0 rmarkdown_2.30
## [23] purrr_1.2.1 bit_4.6.0
## [25] xfun_0.56 cachem_1.1.0
## [27] cigarillo_1.0.0 jsonlite_2.0.0
## [29] shinyFiles_0.9.3 blob_1.3.0
## [31] later_1.4.5 DelayedArray_0.36.0
## [33] BiocParallel_1.44.0 parallel_4.5.2
## [35] R6_2.6.1 bslib_0.10.0
## [37] stringi_1.8.7 RColorBrewer_1.1-3
## [39] rtracklayer_1.70.1 GenomicRanges_1.62.1
## [41] jquerylib_0.1.4 Rcpp_1.1.1
## [43] Seqinfo_1.0.0 SummarizedExperiment_1.40.0
## [45] knitr_1.51 IRanges_2.44.0
## [47] httpuv_1.6.16 Matrix_1.7-4
## [49] tidyselect_1.2.1 abind_1.4-8
## [51] yaml_2.3.12 codetools_0.2-20
## [53] curl_7.0.0 lattice_0.22-7
## [55] tibble_3.3.1 plyr_1.8.9
## [57] withr_3.0.2 Biobase_2.70.0
## [59] shiny_1.12.1 S7_0.2.1
## [61] evaluate_1.0.5 BiocFileCache_3.0.0
## [63] Biostrings_2.78.0 pillar_1.11.1
## [65] BiocManager_1.30.27 filelock_1.0.3
## [67] MatrixGenerics_1.22.0 stats4_4.5.2
## [69] generics_0.1.4 RCurl_1.98-1.17
## [71] S4Vectors_0.48.0 scales_1.4.0
## [73] xtable_1.8-4 glue_1.8.0
## [75] lazyeval_0.2.2 maketools_1.3.2
## [77] tools_4.5.2 BiocIO_1.20.0
## [79] sys_3.4.3 data.table_1.18.0
## [81] GenomicAlignments_1.46.0 buildtools_1.0.0
## [83] fs_1.6.6 XML_3.99-0.20
## [85] grid_4.5.2 tidyr_1.3.2
## [87] crosstalk_1.2.2 restfulr_0.0.16
## [89] cli_3.6.5 rappdirs_0.3.4
## [91] S4Arrays_1.10.1 viridisLite_0.4.2
## [93] dplyr_1.1.4 gtable_0.3.6
## [95] sass_0.4.10 digest_0.6.39
## [97] BiocGenerics_0.56.0 SparseArray_1.10.8
## [99] rjson_0.2.23 htmlwidgets_1.6.4
## [101] farver_2.1.2 memoise_2.0.1
## [103] htmltools_0.5.9 lifecycle_1.0.5
## [105] httr_1.4.7 mime_0.13
## [107] bit64_4.6.0-1 MASS_7.3-65