---
title: "Improving iscream performance"
name: James Eapen
output:
  BiocStyle::html_document:
    toc_float: true
abstract: >
  A guide to improving iscream's runtime and memory efficiency
vignette: >
  %\VignetteIndexEntry{Improving iscream performance}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: refs.bib
link-citations: yes
date: 10 July 2025
---



## Setup

### Input BED files

Running this vignette requires downloading 2GB of single-cell whole genome
bisulfite sequencing (WGBS) BED files and tabix indices from this Zenodo
record: <https://zenodo.org/records/14733834>.


``` r
library("BiocFileCache")
cachedir <- BiocFileCache()
snmc_zip_path <- bfcrpath(cachedir, "https://zenodo.org/records/14733834/files/sc_beds.zip")
snmc_unzip <- file.path(tempdir(), "sc_beds")
unzip(snmc_zip_path, exdir = snmc_unzip)
genes_file <- bfcrpath(cachedir, "https://zenodo.org/records/14733834/files/genes.bed")
```

Select 100 human cell WGBS data from the snmC-seq2 [@luo2018a] dataset:


``` r
snmc_dir <- file.path(snmc_unzip, "sc_biscuit")
bedfiles <- list.files(
  snmc_dir,
  pattern = "*.bed.gz$",
  full.names = TRUE
)[1:100]
```

### Regions

Here we'll be using 5000 gene body regions as the input:


``` r
library(data.table)
regions <- fread(
  genes_file,
  col.names = c("chr", "start", "end")
)[1:5000]
head(regions)
#>       chr    start      end
#>    <char>    <int>    <int>
#> 1:   chr1  1471764  1497848
#> 2:   chr1  3069167  3438621
#> 3:   chr1  2403963  2413797
#> 4:   chr1 10472287 10630758
#> 5:   chr1  2425979  2505532
#> 6:   chr1  9292893  9369532
```

## Multithreading

iscream uses multithreading to reduce querying runtime. The number of threads
can be set before or after loading the library. iscream will read the option
while loading and so can be set in `.Rprofile`


``` r
options("iscream.threads" = 8)
library(iscream)
#> iscream using 8 threads based on 'options(iscream.threads)' but parallelly::availableCores() detects 16 possibly available threads. See `?set_threads` for information on multithreading before trying to use more.
```

Queries are parallelized across the input BED files. While this reduces
runtime, it increases memory usage as data from multiple files are loaded into
memory at the same time. For more information on multithreading see
`?set_threads()`.

### `tabix()`

`tabix()` queries regions from BED files much like the *tabix* shell command,
but supports multiple files and can query them in parallel. Although fast,
`tabix()` may seem unresponsive on large queries as `parallel::mclapply()`
doesn't show progression, but a progress bar is in development.

iscream has two `tabix()` implementations, one using the command line `tabix`
executable and the other using the htslib API. The command line tool is faster
as it can stream to a file which can be read from R while the htslib API stores
the strings in memory and is slower. iscream will look for the tabix executable
on startup and will only fall back to the htslip API if the executable is not
found. See `?tabix` details for more information.


``` r
qt <- system.time(
  tbx_query <- tabix(bedfiles, regions, col.names = c("beta", "coverage"))
)
tbx_query
#>              chr     start       end  beta coverage            file
#>           <char>     <int>     <int> <num>    <int>          <char>
#>        1:   chr1    923949    923950 0.000        1 bisc_SRR6911624
#>        2:   chr1    923953    923954 0.000        1 bisc_SRR6911624
#>        3:   chr1    923959    923960 0.000        1 bisc_SRR6911624
#>        4:   chr1    923971    923972 0.000        1 bisc_SRR6911624
#>        5:   chr1    923973    923974 0.000        1 bisc_SRR6911624
#>       ---                                                          
#> 45733375:   chr4 190179369 190179370 0.000        2 bisc_SRR6911723
#> 45733376:   chr4 190179686 190179687 1.000        1 bisc_SRR6911723
#> 45733377:   chr4 190179687 190179688 0.500        2 bisc_SRR6911723
#> 45733378:   chr4 190179753 190179754 1.000        1 bisc_SRR6911723
#> 45733379:   chr4 190179754 190179755 0.333        3 bisc_SRR6911723
```

On 8 threads, retrieving 45,733,379 records
across 100 files took 11.814 seconds.

### `summarize_regions`

To get a summary of the information of the gene bodies use `summarize_regions`,
providing the gene name column as the feature column:


``` r
qt <- system.time(
  summary_query <- summarize_regions(
    bedfiles,
    regions,
    columns = 4,
    col_names = "beta",
    feature_col = "gene"
  )
)
#> [10:51:04.214132] [iscream::summarize_regions] [info] Summarizing 5000 regions from 100 bedfiles
#> [10:51:04.214257] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, min, max, range, count
#> [10:51:04.214269] [iscream::summarize_regions] [info] with columns 4 as beta
head(summary_query)
#>   feature            file beta.sum beta.mean beta.median beta.stddev
#> 1  ATAD3B bisc_SRR6911624   85.000 0.8585859           1   0.3502215
#> 2  PRDM16 bisc_SRR6911624  723.500 0.5288743           1   0.4976973
#> 3   PEX10 bisc_SRR6911624   15.000 0.2419355           0   0.4317514
#> 4   PEX14 bisc_SRR6911624  198.000 0.7764706           1   0.4174294
#> 5   PLCH2 bisc_SRR6911624  184.333 0.7228745           1   0.4474832
#> 6   SPSB1 bisc_SRR6911624   64.000 0.8648649           1   0.3442015
#>   beta.variance beta.min beta.max beta.range count
#> 1     0.1226551        0        1          1    99
#> 2     0.2477026        0        1          1  1368
#> 3     0.1864093        0        1          1    62
#> 4     0.1742473        0        1          1   255
#> 5     0.2002412        0        1          1   255
#> 6     0.1184746        0        1          1    74
```

On 8 threads, summarizing the 45,733,379
records across all 100 files took 4.948 seconds.
As with `tabix()`, runtime reduces and memory usage increases with increasing
thread counts.

### `make_mat`

The `make_mat()` function queries and stores every genomic location within the
input regions across input files. Unlike `summarize_regions()` the output matrix
dimensions are unknown at runtime. Although usually quite fast, if the number of
records falling into the input regions are large and there are few overlaps
between files, this can take a long time. Here, gene bodies are large and the
final matrix can contain millions of CpG loci. Further, with sparse data,
chances are new loci are found in every file.

Preallocating the number of rows, however, can drastically reduce runtime.
Since we got 45 million loci/CpGs from all the BED files with the tabix query
above, we can expect approximately between 5 and 10 million unique loci as
single-cell WGBS data has lower coverage than bulk. We already have the tabix
query so we can get the unique CpG count here and use it to preallocate the
matrix, reducing the number of matrix resizes. We'll add 100,000 extra rows on
top of the existing count to be safe since every avoided resize cuts off at
least a couple seconds from the runtime. Making tabix queries can be a
relatively quick way to approximate the CpG count of a dataset. If you haven't
done a tabix query of the full dataset, you can approximate how many CpGs to
expect based on CpG counts in one file and the coverage of your WGBS method.
Here we make a matrix of the beta-values in the 4th column:


``` r
cpg.count <- tbx_query$start |> unique() |> length()
qt <- system.time(meth_mat <- make_mat_se(
  bedfiles,
  regions,
  column = 4,
  sparse = TRUE,
  prealloc = cpg.count + 1e5
))
#> [10:51:11.215841] [iscream::query_all] [info] Querying 5000 regions from 100 bedfiles
#> 
#> [10:51:51.998902] [iscream::query_all] [info] Creating metadata vectors
#> [10:51:52.457885] [iscream::query_all] [info] 7276107 loci found - 16250 extra rows allocated with 0 resizes
#> [10:51:57.284487] [iscream::query_all] [info] Creating sparse matrix
meth_mat
#> class: RangedSummarizedExperiment 
#> dim: 7276107 100 
#> metadata(0):
#> assays(1): value
#> rownames: NULL
#> rowData names(0):
#> colnames(100): bisc_SRR6911624 bisc_SRR6911625 ... bisc_SRR6911722
#>   bisc_SRR6911723
#> colData names(0):
```

Making this 7,276,107 x 100
matrix took 50.429 seconds.

## Session info


``` r
sessionInfo()
#> R version 4.5.0 (2025-04-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: AlmaLinux 9.6 (Sage Margay)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.utf8        LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.utf8      LC_MONETARY=C.UTF-8    LC_MESSAGES=C.utf8    
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] iscream_0.99.0    data.table_1.17.6
#> 
#> loaded via a namespace (and not attached):
#>  [1] crayon_1.5.3                httr_1.4.7                 
#>  [3] knitr_1.50                  xfun_0.52                  
#>  [5] UCSC.utils_1.4.0            generics_0.1.4             
#>  [7] DelayedArray_0.34.1         jsonlite_2.0.0             
#>  [9] RcppParallel_5.1.10         SummarizedExperiment_1.38.1
#> [11] S4Vectors_0.46.0            stringfish_0.16.0          
#> [13] stats4_4.5.0                MatrixGenerics_1.20.0      
#> [15] Biobase_2.68.0              grid_4.5.0                 
#> [17] abind_1.4-8                 evaluate_1.0.4             
#> [19] IRanges_2.42.0              GenomeInfoDb_1.44.0        
#> [21] compiler_4.5.0              Rcpp_1.1.0                 
#> [23] XVector_0.48.0              lattice_0.22-7             
#> [25] R6_2.6.1                    SparseArray_1.8.0          
#> [27] parallelly_1.45.0           parallel_4.5.0             
#> [29] GenomeInfoDbData_1.2.14     GenomicRanges_1.60.0       
#> [31] Matrix_1.7-3                tools_4.5.0                
#> [33] matrixStats_1.5.0           S4Arrays_1.8.1             
#> [35] BiocGenerics_0.54.0
```

## References

<!-- vim: set filetype=rmd: -->
