RBedMethyl provides disk-backed access to nanoporetech modkit bedMethyl files generated from ONT data, enabling ONT-scale workflows without loading full tables into memory. This vignette demonstrates a typical workflow: ingest, subset by region, compute per-region summaries, and coerce to downstream Bioconductor classes.
modkit, the bioinformatics tool produced by Oxford Nanopore Technologies, is one of the easiest and most direct ways to extract site-specific methylation information from a long-read BAM file produced by Dorado. As explained in UCSC, the bedMethyl format is an extension of the standard BED 9 format, with the following columns:
| column | name | description | type |
|---|---|---|---|
| 1 | chrom | name of reference sequence from BAM header | str |
| 2 | start position | 0-based start position | int |
| 3 | end position | 0-based exclusive end position | int |
| 4 | modified base code | single letter code for modified base | str |
| 5 | score | Equal to Nvalid_cov. | int |
| 6 | strand | ‘+’ for positive strand ‘-’ for negative strand, ‘.’ when strands are combined | str |
| 7 | start position | included for compatibility | int |
| 8 | end position | included for compatibility | int |
| 9 | color | included for compatibility, always 255,0,0 | str |
| 10 | Nvalid_cov | Number of reads with valid modification call | int |
| 11 | fraction modified | Percent of valid calls that are modified | float |
| 12 | Nmod | Number of calls with a modified base | int |
| 13 | Ncanonical | Number of calls with a canonical base | int |
| 14 | Nother_mod | Number of calls with a modified base, other modifications | int |
| 15 | Ndelete | Number of reads with a deletion at this reference position | int |
| 16 | Nfail | Number of calls where the probability of the call was below the threshold | int |
| 17 | Ndiff | Number of reads with a base other than the canonical base for this modification | int |
| 18 | Nnocall | Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call | int |
RBedMethyl package provides a minimalistic interface, which allows the users to process bedMethyl files. They can:
readBedMethylsubsetByChromosomes)subsetByRegion or using the [] bracketing index system)filterByCoverage)subsetBy)summarizeByRegion)as(x, name) coercion: RangedSummarizedExperiment and BSseqThe RBedMethyl object is an HDF5-backed structure. It is designed to keep a single modification (i.e, “h” or “m”). The user can select which fields to keep from the bedMethyl file during the reading process.
The correspondence is:
| bedMethyl name | RBedMethyl assay |
|---|---|
| Nvalid_cov | coverage |
| fraction modified | pct |
| Nmod | mod_reads |
| Ncanonical | unmod_reads |
| Nother_mod | other_reads |
| Ndelete | del_reads |
| Nfail | fail_reads |
| Ndiff | diff_reads |
| Nnocall | nocall_reads |
By default the readBedMethyl only loads “coverage”, “pct” and “mod_reads” assays.
# Example bedMethyl-like content, header is dropped
df <- data.frame(
r1 = c("chr1", 0, 1, "m", 0, "+", 0, 1, 0, 10, 0.5, 5, 5, 0, 0, 0, 0, 0),
r2 = c("chr1", 10, 11, "m", 0, "+", 10, 11, 0, 20, 0.25, 5, 15, 0, 0, 0, 0, 0),
r3 = c("chr2", 0, 1, "m", 0, "-", 0, 1, 0, 8, 0.375, 3, 5, 0, 0, 0, 0, 0),
r4 = c("chr2", 0, 1, "h", 0, "-", 0, 1, 0, 8, 0.375, 3, 5, 0, 0, 0, 0, 0)
)
df <- t(df)
colnames(df) <- c(
"chrom", "chromStart", "chromEnd", "mod", "score", "strand",
"thickStart", "thickEnd", "itemRgb", "coverage", "pct",
"mod_reads", "unmod_reads", "other_reads",
"del_reads", "fail_reads", "diff_reads", "nocall_reads"
)
df
#> chrom chromStart chromEnd mod score strand thickStart thickEnd itemRgb
#> r1 "chr1" "0" "1" "m" "0" "+" "0" "1" "0"
#> r2 "chr1" "10" "11" "m" "0" "+" "10" "11" "0"
#> r3 "chr2" "0" "1" "m" "0" "-" "0" "1" "0"
#> r4 "chr2" "0" "1" "h" "0" "-" "0" "1" "0"
#> coverage pct mod_reads unmod_reads other_reads del_reads fail_reads
#> r1 "10" "0.5" "5" "5" "0" "0" "0"
#> r2 "20" "0.25" "5" "15" "0" "0" "0"
#> r3 "8" "0.375" "3" "5" "0" "0" "0"
#> r4 "8" "0.375" "3" "5" "0" "0" "0"
#> diff_reads nocall_reads
#> r1 "0" "0"
#> r2 "0" "0"
#> r3 "0" "0"
#> r4 "0" "0"
bed <- tempfile(fileext = ".bed")
write.table(df, bed, quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
# Read only "m" modification rows, and only coverage,pct and mod_reads assays
bm <- readBedMethyl(bed, mod = "m", fields = c("coverage", "pct", "mod_reads"))
bm
#> RBedMethyl object
#> Modification: m
#> Rows (active/total): 3 / 3
#> Chromosomes: 2 ( chr1, chr2 )
#> Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
# By one or more chromosomes
subsetByChromosomes(bm, "chr1")
#> RBedMethyl object
#> Modification: m
#> Rows (active/total): 2 / 3
#> Chromosomes: 2 ( chr1, chr2 )
#> Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
subsetByChromosomes(bm, c("chr1", "chr2"))
#> RBedMethyl object
#> Modification: m
#> Rows (active/total): 3 / 3
#> Chromosomes: 2 ( chr1, chr2 )
#> Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
# By region
## Can supply the chromosome, start, end
subsetByRegion(bm, "chr1", 1, 12)
#> RBedMethyl object
#> Modification: m
#> Rows (active/total): 1 / 3
#> Chromosomes: 2 ( chr1, chr2 )
#> Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
## Or a GRanges object
regions <- GenomicRanges::GRanges(
seqnames = c("chr1", "chr2"),
ranges = IRanges::IRanges(start = c(1, 1), end = c(12, 2))
)
subsetByRegion(bm, regions)
#> RBedMethyl object
#> Modification: m
#> Rows (active/total): 3 / 3
#> Chromosomes: 2 ( chr1, chr2 )
#> Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
## The [] bracketing index system also works
bm[regions]
#> RBedMethyl object
#> Modification: m
#> Rows (active/total): 3 / 3
#> Chromosomes: 2 ( chr1, chr2 )
#> Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
# By minimum coverage
filterByCoverage(bm, 15)
#> RBedMethyl object
#> Modification: m
#> Rows (active/total): 1 / 3
#> Chromosomes: 2 ( chr1, chr2 )
#> Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
# By column using a function
## The following keeps only lines where the methylation percentage is higher than 0.3
subsetBy(bm, "pct", function(x) x > 0.3)
#> RBedMethyl object
#> Modification: m
#> Rows (active/total): 2 / 3
#> Chromosomes: 2 ( chr1, chr2 )
#> Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
regions <- GenomicRanges::GRanges(
seqnames = c("chr1", "chr2"),
ranges = IRanges::IRanges(start = c(1, 1), end = c(12, 2))
)
# For each region, generate summary statistics
## `mod_reads` is the total number of reads, `beta` stands for average methylation, `n_sites` is the number of sites
summarizeByRegion(bm, regions)
#> DataFrame with 2 rows and 4 columns
#> coverage mod_reads beta n_sites
#> <numeric> <numeric> <numeric> <integer>
#> 1 30 10 0.333333 2
#> 2 8 3 0.375000 1
# RangedSummarizedExperiment
rse <- as(bm, "RangedSummarizedExperiment")
rse
#> class: RangedSummarizedExperiment
#> dim: 3 1
#> metadata(0):
#> assays(3): coverage mod_reads pct
#> rownames: NULL
#> rowData names(0):
#> colnames: NULL
#> colData names(0):
# bsseq (if installed)
if (requireNamespace("bsseq", quietly = TRUE) &&
methods::hasMethod("coerce", c("RBedMethyl", "BSseq"))) {
bs <- as(bm, "BSseq")
bs
} else {
message("bsseq is not installed or BSseq coercion is unavailable; skipping BSseq coercion.")
}
#> An object of type 'BSseq' with
#> 3 methylation loci
#> 1 samples
#> has not been smoothed
#> All assays are in-memory
sessionInfo()
#> R Under development (unstable) (2026-03-05 r89546)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] RBedMethyl_0.99.1 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.41.1 rjson_0.2.23
#> [3] xfun_0.57 bslib_0.10.0
#> [5] rhdf5_2.55.16 Biobase_2.71.0
#> [7] lattice_0.22-9 bitops_1.0-9
#> [9] rhdf5filters_1.23.3 tools_4.6.0
#> [11] generics_0.1.4 curl_7.0.0
#> [13] stats4_4.6.0 parallel_4.6.0
#> [15] R.oo_1.27.1 Matrix_1.7-5
#> [17] data.table_1.18.2.1 BSgenome_1.79.1
#> [19] RColorBrewer_1.1-3 cigarillo_1.1.0
#> [21] S4Vectors_0.49.0 sparseMatrixStats_1.23.0
#> [23] lifecycle_1.0.5 compiler_4.6.0
#> [25] farver_2.1.2 Rsamtools_2.27.1
#> [27] Biostrings_2.79.5 statmod_1.5.1
#> [29] Seqinfo_1.1.0 codetools_0.2-20
#> [31] permute_0.9-10 htmltools_0.5.9
#> [33] sass_0.4.10 RCurl_1.98-1.18
#> [35] yaml_2.3.12 crayon_1.5.3
#> [37] jquerylib_0.1.4 R.utils_2.13.0
#> [39] BiocParallel_1.45.0 DelayedArray_0.37.0
#> [41] cachem_1.1.0 limma_3.67.0
#> [43] abind_1.4-8 gtools_3.9.5
#> [45] locfit_1.5-9.12 digest_0.6.39
#> [47] restfulr_0.0.16 bookdown_0.46
#> [49] fastmap_1.2.0 grid_4.6.0
#> [51] cli_3.6.5 SparseArray_1.11.11
#> [53] S4Arrays_1.11.1 XML_3.99-0.23
#> [55] dichromat_2.0-0.1 h5mread_1.3.2
#> [57] DelayedMatrixStats_1.33.0 scales_1.4.0
#> [59] httr_1.4.8 rmarkdown_2.30
#> [61] XVector_0.51.0 matrixStats_1.5.0
#> [63] otel_0.2.0 R.methodsS3_1.8.2
#> [65] beachmat_2.27.3 HDF5Array_1.39.0
#> [67] evaluate_1.0.5 knitr_1.51
#> [69] GenomicRanges_1.63.1 IRanges_2.45.0
#> [71] BiocIO_1.21.0 rtracklayer_1.71.3
#> [73] rlang_1.1.7 Rcpp_1.1.1
#> [75] glue_1.8.0 BiocManager_1.30.27
#> [77] BiocGenerics_0.57.0 bsseq_1.47.1
#> [79] jsonlite_2.0.0 R6_2.6.1
#> [81] Rhdf5lib_1.33.6 GenomicAlignments_1.47.0
#> [83] MatrixGenerics_1.23.0