RnBeads 2.0: comprehensive analysis of DNA methylation data
DNA methylation is a widely investigated epigenetic mark with important roles in development and disease. High-throughput assays enable genome-scale DNA methylation analysis in large numbers of samples. Here, we describe a new version of our RnBeads software - an R/Bioconductor package that implements start-to-finish analysis workflows for Infinium microarrays and various types of bisulfite sequencing. RnBeads 2.0 (https://rnbeads.org/) provides additional data types and analysis methods, new functionality for interpreting DNA methylation differences, improved usability with a novel graphical user interface, and better use of computational resources. We demonstrate RnBeads 2.0 in four re-runnable use cases focusing on cell differentiation and cancer.
Keywords: DNA methylation, Computational epigenetics, Epigenome-wide association studies, Bisulfite sequencing, Epigenotyping microarrays, Integrative data analysis, Bioinformatics software tool, R/Bioconductor package
27k: Illumina Infinium HumanMethylation27 BeadChip
450k: Illumina Infinium HumanMethylation450 BeadChip
DMC: Differentially methylated cytosine
DMR: Differentially methylated region
DVC: Differentially variable cytosine
DVR: Differentially variable region
EPIC: Illumina Infinium MethylationEPIC BeadChip
GEO: Gene Expression Omnibus
RRBS: Reduced representation bisulfite sequencing
SRA: Sequence Read Archive
WGBS: Whole-genome bisulfite sequencing
DNA methylation at CpG dinucleotides is a widely studied epigenetic mark that is involved in the regulation of cell state and relevant for a broad range of diseases. Changes in DNA methylation at promoters and enhancers have been associated with cell differentiation, developmental processes, cancer development, and regulation of the immune system. The vast majority of current assays for DNA methylation profiling use bisulfite treatment to selectively convert unmethylated cytosines (including 5-formylcytosine and 5-carboxylcytosine) into uracil (which is subsequently replaced by thymine), while methylated cytosines (including 5-hydroxymethylcytosine) remain unconverted. Bisulfite conversion thus transforms DNA methylation information into DNA sequence information that can be read by next-generation sequencing or DNA microarrays [1, 2].
Whole-genome bisulfite sequencing (WGBS) constitutes the current gold standard for DNA methylation profiling, given its genome-wide coverage and single-basepair resolution. However, WGBS requires deep sequencing of the entire genome (which is a significant cost factor), while shallow sequencing leads to poor sensitivity for detecting small differences in DNA methylation. Reduced representation bisulfite sequencing (RRBS) offers a cost-effective alternative for profiling large sets of patient samples, by focusing the sequencing on a subset of the genome enriched using restriction enzymes. RRBS is particularly useful for studying DNA methylation heterogeneity, which profits from deep sequencing coverage and from analyzing many samples using a sequencing-based assay [5, 6, 7, 8]. Target-capture bisulfite sequencing enables the analysis of a defined set of genomic regions in large numbers of samples at low cost per sample, but with high setup cost [9, 10]. Finally, the microarray-based Infinium DNA methylation assays, including the MethylationEPIC BeadChip (EPIC) and its predecessor, the HumanMethylation450 BeadChip (450k), facilitate standardized, high-throughput DNA methylation profiling of a pre-defined subset of CpGs in large sample cohorts [11, 12].
These assays have enabled DNA methylation mapping for a large number of cell types [13, 14] and, following the concept of epigenome-wide association studies (EWAS), for various diseases with a suspected role of epigenetic regulation [15, 16]. The resulting datasets typically comprise DNA methylation profiles as well as sample annotations such as tissue or cell type, phenotypic data (donor age, sex, etc.), and sample grouping (case vs. control, treated vs. untreated, etc.). The primary goal for the bioinformatic analysis of such datasets is to identify characteristic and reliable DNA methylation patterns, and to associate them with relevant annotation data. While various software packages exist that support individual steps of the DNA methylation analysis (reviewed in [17, 18, 19, 20] and summarized as a feature table in Additional file 1: Table S1), many users would benefit from an integrative analysis tool that provides extensive, easy-to-understand analysis reports and that requires minimal configuration and no detailed bioinformatic knowledge of the various analysis steps.
We have previously developed the RnBeads software package as a start-to-finish pipeline for DNA methylation analysis in accordance with established standards and practices [16, 17, 18]. Initially released in 2012 and published in 2014, RnBeads has become a popular and widely used tool for DNA methylation analysis (200–300 downloads per month from Bioconductor). Here, we present a new, substantially extended version of RnBeads, providing an up-to-date, user-friendly, feature-rich, and readily scalable workflow for the bioinformatic analysis of DNA methylation datasets. The new RnBeads 2.0 software package addresses feedback and feature requests from the tool’s active user community, implements new analysis methods, introduces a graphical user interface, and improves computational efficiency. With these advances, RnBeads provides state-of-the-art support for DNA methylation data analysis in an easy-to-use way, with high flexibility and performance.
Results and discussion
RnBeads overview and new features
New data types and cross-platform analysis: RnBeads now supports EPIC microarrays and enables seamless data integration across different DNA methylation assays (e.g., EPIC, 450k, and 27k microarrays as well as WGBS and RRBS), which facilitates DNA methylation meta-analyses that combine several data sources into a single set of results.
Extended analysis and inference methods: We added new functionality for handling incomplete data and missing values, for detecting genetic evidence of sample contamination or low data quality, for quantifying DNA methylation heterogeneity, and for DNA methylation-based inference of phenotypic information. We incorporated the LUMP algorithm, which estimates immune cell content of tumors and other heterogeneous tissue samples, and epigenetic age prediction for both Infinium microarray and bisulfite sequencing data. These predictions are useful not only for inferring missing donor annotations, but also for detecting deviations indicative of accelerated aging or evidence of sample mix-ups. Additional new features include the identification of genomic regions characterized by differential DNA methylation variability [25, 26] and genomic region enrichment analysis using the LOLA tool.
New user-friendly interface: We provide a graphical user interface for RnBeads that facilitates the configuration and execution of DNA methylation analyses. Together with RnBeads’ interactive and self-explanatory HTML reports, this new interface makes RnBeads analyses more readily accessible for users with limited R/Bioconductor knowledge.
Improved computational efficiency: Using parallelization and automatic distribution of RnBeads analyses across a high-performance computing (HPC) cluster, we were able to process datasets comprising hundreds of RRBS/WGBS profiles and thousands of microarray-based profiles in a single analysis run.
To illustrate the practical utility of these new RnBeads features, we present four use cases: (i) DNA methylation in human peripheral blood samples, (ii) cell type-specific DNA methylation in human hematopoiesis, (iii) DNA methylation heterogeneity in cancer samples, and (iv) cross-platform DNA methylation analysis. Detailed re-runnable versions of these analyses, including configurations and results, are available for visualization and download from the RnBeads website (https://rnbeads.org/methylomes.html). These pre-configured analyses and pre-calculated reports also provide a good starting point for learning about the use of RnBeads, thereby complementing the tutorials provided on the RnBeads website (https://rnbeads.org/tutorial.html), and for configuring custom analyses that integrate newly generated datasets with publicly available reference data.
Use case 1: Analyzing DNA methylation in a large cohort of peripheral blood samples
Use case 2: Dissecting the DNA methylation landscape of human hematopoiesis
Use case 3: Quantifying DNA methylation heterogeneity in a childhood cancer cohort
Use case 4: Analyzing DNA methylation data across different assay platforms
Comparison to other software tools for DNA methylation analysis
To assess the computational efficiency of RnBeads, we compared its performance to that of other software packages for DNA methylation analysis [37, 38, 39, 40], separately for DNA methylation microarray data, RRBS data, and WGBS data (see the “Methods” section for details and Additional file 2: Table S2 for tool configurations). Given that the different tools provide vastly different feature sets, we considered three scenarios: (i) data import only, (ii) core modules, and (iii) comprehensive analysis with most features activated (Additional file 3: Figure S1). RnBeads was the only tool that supported both microarray-based and bisulfite sequencing-based analysis. For microarray-based analysis, the low-level data processing packages minfi, methylumi, and wateRmelon were faster than ChAMP and RnBeads (which need to prepare the dataset for their more extensive downstream analyses). Compared to ChAMP, RnBeads was more memory-efficient and faster in the comprehensive setting. For bisulfite sequencing-based analysis, RnBeads showed better performance than methylKit on the WGBS dataset in the core module setting, but somewhat longer runtime and higher memory usage on the RRBS dataset. These differences can be attributed to the reformatting into memory-efficient data structures that RnBeads performs during data import. In summary, the runtime performance of RnBeads was similar to that of other tools with more limited functionality, suggesting that the choice of the most suitable tool for DNA methylation analysis depends mainly on the desired features and analysis modes. To assist with an informed selection, we thus surveyed a broad range of tools for DNA methylation analysis and assembled a detailed feature table based on the tool documentations (Additional file 1: Table S1). 
RnBeads emerged from this comparison as the software that implements the most comprehensive workflow for analyzing DNA methylation data, while also providing a user-friendly interface and extensive options for reporting and reproducibility.
RnBeads is an integrated software package for the analysis and interpretation of DNA methylation data. Due to its modular design and optional graphical user interface, the software is well suited for both beginners and experts in the field of DNA methylation analysis. Interactive reports provide a comprehensive overview of DNA methylation datasets while also fostering reproducibility (by documenting parameter settings) and robustness (by making it easy to evaluate different parameter settings). RnBeads 2.0 implements various new features that arose from technological advances and valuable user feedback. For example, support for the Infinium EPIC assay and for cross-platform data integration broaden the scope of RnBeads analyses; new and improved methods for the DNA methylation-based prediction of age and sex are useful for large cohort studies; a graphical user interface makes many RnBeads features more easily accessible; the analysis of DNA methylation variability adds a new dimension to epigenome-wide association studies; and LOLA-based region set enrichment analysis facilitates the biological interpretation of DNA methylation differences. RnBeads also establishes a convenient way of interfacing with reference epigenome datasets and integrating them into custom analyses, as demonstrated by the integration of sorted blood cell profiles in our first use case. We processed several such datasets and provide re-runnable analyses on the website (https://rnbeads.org/methylomes.html). The presented use cases provide concrete examples for RnBeads analysis of DNA methylation and illustrate some of the tool’s key features. Typical applications of RnBeads include epigenome-wide association studies, reference epigenome analysis, investigation of cancer heterogeneity, and epigenetic biomarker development.
RnBeads is implemented in R/Bioconductor and can be installed in the usual way for Bioconductor packages. Furthermore, the “Installation” section on the RnBeads website (https://rnbeads.org/installation.html) provides an RnBeads installation script, which validates that all package dependencies are installed and up to date. The FAQ page on the RnBeads website (https://rnbeads.org/faq.html) addresses typical problems and answers common questions. Finally, users can contact the RnBeads developers (email@example.com) for further information and assistance.
RnBeads supports any genome-wide or genome-scale assay that provides quantitative DNA methylation data at single-CpG resolution. This includes Infinium DNA methylation microarrays (27k, 450k, EPIC) and bisulfite sequencing protocols (WGBS, RRBS, etc.). Data obtained using enrichment-based assays (e.g., MeDIP-seq, MBD-seq, MRE-seq) can also be processed after their read-count output has been converted to single-CpG methylation levels using bioinformatic inference tools [42, 43, 44, 45]. We currently provide RnBeads annotation packages for the human, mouse, and rat genomes. In addition, users can prepare customized annotation packages for their species of interest, using the RnBeadsAnnotationCreator package (https://rnbeads.org/tutorial.html). To start an RnBeads analysis, the user provides a sample annotation table as well as the DNA methylation data. For microarray-based analyses, RnBeads accepts raw signal intensity files (IDAT) as well as preprocessed DNA methylation data in tabular form. Bisulfite sequencing data are imported directly from the output of DNA methylation calling software such as Bis-SNP or Bismark [46, 47], or as tabular text files providing genomic coordinates of individual CpGs together with their DNA methylation levels and read coverage. Finally, RnBeads can import DNA methylation data directly from the Gene Expression Omnibus (GEO) data portal.
Graphical user interface
A key feature of RnBeads is its self-configuring workflow, which supports launching of a full DNA methylation analysis with a single command (rnb.run.analysis(…)) on the R command line. However, some types of analysis may require more detailed configuration, for which RnBeads includes an extensive set of customizable parameters (which can be specified in R or imported from XML configuration files). Moreover, RnBeads provides various R functions for operating directly on the RnBSet R data objects used to store and process DNA methylation data and associated sample annotations in RnBeads. To make customized RnBeads analysis easier to configure, we have developed RnBeadsDJ, a graphical user interface for RnBeads that is based on the R Shiny toolkit (http://shiny.rstudio.com/). This interface allows the user to launch RnBeads through a web browser and interactively specify the input data and analysis options. The user can either launch the complete analysis pipeline or execute modules individually. A detailed tutorial for RnBeadsDJ is available on the RnBeads website (https://rnbeads.org/tutorial.html).
DNA methylation patterns have been linked to human aging and can be used to infer the chronological age of healthy individuals [23, 48]. Furthermore, the difference between (predicted) epigenetic age and (known) chronological age appears to reflect the speed of biological aging in a way that is predictive of various health issues. We have incorporated the DNA methylation-based prediction of epigenetic age into RnBeads, using elastic net regression at the single-CpG level (https://rnbeads.org/ageprediction.html). RnBeads includes pre-trained models for age prediction as well as the option for users to provide their own training data. It supports age inference based on both DNA methylation microarrays and bisulfite sequencing datasets.
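To illustrate the idea, applying a trained elastic net model amounts to evaluating a sparse linear predictor on the methylation levels of the selected CpGs. The following Python sketch is not the RnBeads implementation (which is written in R); the CpG identifiers, coefficients, and intercept are hypothetical values chosen for illustration only.

```python
def predict_age(betas, coefficients, intercept):
    """Evaluate a pre-trained linear model on one sample.

    betas: dict mapping CpG identifier -> methylation level in [0, 1]
    coefficients: dict mapping CpG identifier -> model weight
                  (elastic net training leaves only a sparse subset non-zero)
    """
    age = intercept
    for cpg, weight in coefficients.items():
        age += weight * betas[cpg]
    return age


# Hypothetical model and sample (illustrative numbers, not a published clock).
coefficients = {"cpg_a": 40.0, "cpg_b": -25.0, "cpg_c": 10.0}
intercept = 30.0
sample = {"cpg_a": 0.5, "cpg_b": 0.2, "cpg_c": 0.8, "cpg_d": 0.9}

predicted_age = predict_age(sample, coefficients, intercept)

# Difference between epigenetic and (known) chronological age.
chronological_age = 50.0
age_difference = predicted_age - chronological_age
```

CpGs that are not part of the model (here cpg_d) are simply ignored, which is why a model trained on one platform can in principle be applied to any assay that covers the selected CpGs.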
RnBeads uses differences in copy number of genomic regions located on the sex chromosomes to predict the sex of sample donors, in order to infer missing annotation information or to detect sample mix-ups. For microarray data, sex prediction is based on the comparison of the average signal intensities for the sex chromosomes with those on the autosomes, calculating a predicted sex probability by logistic regression. For bisulfite sequencing data, RnBeads compares the sequencing coverage for the sex chromosomes with those for the autosomes, followed by logistic regression trained on datasets with known sex information.
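A minimal Python sketch of this approach for microarray-like signals is given below. It assumes that the model features are the differences between the mean sex-chromosome signals and the mean autosomal signal; the feature definition, the weights, and the signal values are hypothetical and would in practice be learned from samples with known sex.

```python
import math
from statistics import mean


def male_probability(x_signal, y_signal, autosome_signal, weights):
    """Logistic model on sex-chromosome signal relative to the autosomes."""
    autosomal = mean(autosome_signal)
    x_diff = mean(x_signal) - autosomal
    y_diff = mean(y_signal) - autosomal
    z = weights["intercept"] + weights["x"] * x_diff + weights["y"] * y_diff
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights: lower X signal and higher Y signal favor "male".
weights = {"intercept": 0.0, "x": -1.0, "y": 2.0}

# Arbitrary signal units; one X copy and one Y copy in the first sample.
p_sample_1 = male_probability([0.5, 0.5], [1.0, 1.0], [1.0, 1.0], weights)
# Two X copies and essentially no Y signal in the second sample.
p_sample_2 = male_probability([1.0, 1.0], [0.1, 0.1], [1.0, 1.0], weights)
```

For bisulfite sequencing data, the same structure would apply with read coverage taking the place of signal intensity.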
In addition to analysis based on individual CpGs, RnBeads aggregates and compares DNA methylation levels across genomic regions of interest, which can enhance statistical power and interpretability. Extending the default region sets available in RnBeads (genes and gene promoters, CpG islands, and genomic tiling regions), the RnBeads website now provides a collection of additional region sets for automatic import (https://rnbeads.org/regions.html). This collection includes region sets defined based on consensus epigenome profiles such as putative regulatory regions in the Ensembl Regulatory Build and regions associated with DNA methylation variability.
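Region-level aggregation can be sketched as follows, assuming that a region's methylation level is summarized as the mean over all CpGs that fall within its coordinates. This Python example is a simplified illustration rather than the RnBeads code; the region names, coordinates, and values are hypothetical.

```python
from statistics import mean


def aggregate_regions(cpgs, regions):
    """Summarize CpG-level methylation per region.

    cpgs: list of (chromosome, position, methylation level)
    regions: dict mapping region name -> (chromosome, start, end), inclusive
    Returns a dict mapping region name -> mean methylation (None if no CpG).
    """
    summary = {}
    for name, (chrom, start, end) in regions.items():
        values = [beta for c, pos, beta in cpgs
                  if c == chrom and start <= pos <= end]
        summary[name] = mean(values) if values else None
    return summary


cpgs = [
    ("chr1", 100, 0.2),
    ("chr1", 150, 0.4),
    ("chr1", 900, 0.9),
    ("chr2", 120, 0.5),
]
regions = {
    "promoter_A": ("chr1", 50, 200),
    "island_B": ("chr1", 800, 1000),
    "tile_C": ("chr3", 1, 5000),
}

region_methylation = aggregate_regions(cpgs, regions)
```

Regions without any covered CpG (tile_C above) are reported as missing rather than as zero, so that they can be excluded from downstream comparisons.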
The presence of missing values in DNA methylation datasets constitutes an important analytical challenge, for which RnBeads implements several alternative solutions, namely sample-wise means and medians, CpG-wise means and medians, random replacement from other samples in the dataset, and k-nearest neighbor (KNN) imputation. KNN imputation tends to provide adequate estimates of missing values when enough nearby data points are available. It has been used extensively for gene expression microarray data, and it has also been applied to DNA methylation data. For those cases in which the model assumptions of KNN imputation are not met due to disproportionally high numbers of missing values (which is not uncommon for bisulfite sequencing datasets), we implemented the mean and median imputation approaches.
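The difference between the simple and the neighbor-based strategies can be seen in a small Python sketch. It assumes a matrix with CpGs as rows and samples as columns, with None marking missing values, and uses a deliberately simplified KNN variant (neighbors are other CpGs, compared over jointly observed samples); it is not the imputation code used by RnBeads.

```python
import math
from statistics import mean, median


def impute_cpg_wise(matrix, summary=mean):
    """Replace missing values by the mean (or median) of the same CpG."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = summary(observed) if observed else None
        imputed.append([fill if v is None else v for v in row])
    return imputed


def row_distance(a, b):
    """Root mean squared difference over samples observed in both rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))


def impute_knn(matrix, k=2):
    """Replace missing values by the mean of the k most similar CpGs."""
    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = sorted(
                (row_distance(row, other), other[j])
                for h, other in enumerate(matrix)
                if h != i and other[j] is not None
            )
            neighbors = [v for d, v in candidates[:k] if d < math.inf]
            if neighbors:
                imputed[i][j] = mean(neighbors)
    return imputed


matrix = [
    [0.10, 0.20, None],  # CpG with one missing sample
    [0.10, 0.20, 0.30],
    [0.12, 0.22, 0.34],
    [0.90, 0.80, 0.95],  # dissimilar CpG, should not influence KNN
]

by_mean = impute_cpg_wise(matrix, summary=mean)
by_median = impute_cpg_wise(matrix, summary=median)
by_knn = impute_knn(matrix, k=2)
```

In this toy example, KNN borrows information from the two CpGs with similar profiles, whereas the CpG-wise mean only uses the two observed values of the affected CpG. When too few values are observed, the neighbor search becomes unreliable, which motivates the simpler fallbacks.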
For the Infinium microarrays, RnBeads implements a new metric of sample quality that we call “genetic noise”. This metric quantifies the deviation of the signals of autosomal single nucleotide polymorphism probes on the microarray from the expected values of 0 and 1 (homozygosity) as well as 0.5 (heterozygosity) of a diploid cell. Such deviations can indicate technical problems of the microarray-based analysis, contamination with DNA samples from other individuals, or deviations from the diploid case (e.g., aneuploid cancer samples).
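One way to express such a metric is the average distance of each SNP probe signal from the nearest expected genotype value. The Python sketch below uses this definition as an assumption for illustration; the exact formula implemented in RnBeads is not specified here and may differ.

```python
def genetic_noise(snp_signals, expected=(0.0, 0.5, 1.0)):
    """Mean absolute deviation of SNP probe signals from the nearest
    expected value for a diploid genome (assumed definition)."""
    deviations = [min(abs(s - e) for e in expected) for s in snp_signals]
    return sum(deviations) / len(deviations)


# Hypothetical SNP probe signals for two samples.
clean_sample = [0.02, 0.49, 0.98, 0.51]
mixed_sample = [0.15, 0.35, 0.80, 0.65]

noise_clean = genetic_noise(clean_sample)
noise_mixed = genetic_noise(mixed_sample)
```

Under this definition, a sample whose SNP signals cluster tightly around 0, 0.5, and 1 receives a value close to zero, while intermediate signals (as expected for DNA mixtures or aneuploid genomes) increase the score.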
Cell type heterogeneity
RnBeads implements reference-based and reference-free methods for estimating intra-sample heterogeneity [21, 30, 52, 53]. This includes reference-based estimation of immune cell content based on the DNA methylation profiles of purified blood cell populations as well as the LUMP algorithm for estimating immune cell invasion in bulk tumor samples based on a preselected set of CpGs that are exclusively unmethylated in blood cells. While this algorithm was developed specifically for the Infinium 450k assay, its implementation in RnBeads supports both microarray-based and bisulfite sequencing-based assays.
CpGs and genomic regions can differ between cases and controls not only in terms of their average DNA methylation levels, but also in terms of the variability of DNA methylation levels; for example, epigenetic variability may be higher or lower in tumors than in healthy tissue. In RnBeads, users can choose between two algorithms to quantify differential variability: diffVar and iEVORA. DiffVar uses an empirical Bayes framework, while iEVORA is based on the Bartlett test, which tests for differences in variance (heteroscedasticity) across samples. Striking the right balance between reporting too few and too many differentially variable cytosines (DVCs) and differentially variable regions (DVRs) represents an unsolved statistical challenge, especially when the data do not follow a normal distribution. Therefore, we implemented a strategy analogous to the identification of differentially methylated cytosines (DMCs) and differentially methylated regions (DMRs) between sample groups in RnBeads: DVCs and DVRs are ranked by the worst (highest) rank of the following criteria: (i) the adjusted p value of the statistical test (either diffVar or iEVORA), (ii) the difference in variance between the groups, and (iii) the log-ratio of the two group-wise variances. RnBeads produces summary plots comparing group-wise variances, p values, and ranks, while also exporting detailed tables of DVCs and DVRs.
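The worst-rank strategy can be sketched in a few lines of Python. The example assumes that small adjusted p values and large absolute effect sizes are ranked best and that ties are broken by input order; the input values are hypothetical, and the sketch is not taken from the RnBeads source code.

```python
def rank_positions(values, descending=False):
    """Return 1-based ranks (rank 1 = best); ties broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=descending)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def combined_rank(p_adjusted, variance_diff, log_variance_ratio):
    """Worst (highest) rank of each site across the three criteria."""
    rank_p = rank_positions(p_adjusted)  # small p value is best
    rank_diff = rank_positions([abs(v) for v in variance_diff],
                               descending=True)  # large difference is best
    rank_ratio = rank_positions([abs(v) for v in log_variance_ratio],
                                descending=True)  # large log-ratio is best
    return [max(r) for r in zip(rank_p, rank_diff, rank_ratio)]


# Hypothetical statistics for three CpGs.
p_adjusted = [0.001, 0.04, 0.5]
variance_diff = [0.02, 0.05, 0.001]
log_variance_ratio = [1.5, 2.0, 0.1]

ranks = combined_rank(p_adjusted, variance_diff, log_variance_ratio)
```

A site obtains a good combined rank only if it performs well on all three criteria, which penalizes sites that are statistically significant but show a negligible effect size, and vice versa.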
To investigate the biological processes relevant to observed DNA methylation differences, RnBeads implements region set enrichment analysis using LOLA, in addition to gene set analysis based on Gene Ontology terms. The LOLA tool compares a set of genomic regions of interest (i.e., DMRs and/or DVRs) to a potentially large reference catalogue of region sets using Fisher’s exact test and derives a ranked list of significantly enriched region sets. By default, RnBeads uses the LOLA Core database as a reference, which includes transcription factor binding sites, tissue-specific enhancer elements, and genome annotations such as CpG islands and repetitive elements. Moreover, other LOLA databases such as the LOLA Extended database (http://databio.org/regiondb) or user-created databases can be included via RnBeads option settings. Plots showing enrichment p values and log-odds ratios visualize the most enriched region sets in the RnBeads report.
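The statistical core of such an enrichment test is a one-sided Fisher’s exact test on a 2×2 contingency table. The Python sketch below assumes that the table is built by counting regions of interest and background regions that do or do not overlap a given reference region set; the counts are hypothetical, and the overlap computation and database handling performed by LOLA are omitted.

```python
from math import comb, log


def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test for enrichment.

    a: regions of interest overlapping the reference set
    b: background regions overlapping the reference set
    c: regions of interest without overlap
    d: background regions without overlap
    Returns (p value, odds ratio).
    """
    total = a + b + c + d
    n_query = a + c
    n_overlap = a + b
    # Hypergeometric tail: probability of observing at least `a` overlaps.
    tail = sum(
        comb(n_overlap, k) * comb(total - n_overlap, n_query - k)
        for k in range(a, min(n_query, n_overlap) + 1)
    )
    p_value = tail / comb(total, n_query)
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return p_value, odds_ratio


# Hypothetical counts: 8 of 10 regions of interest overlap the reference
# set, compared with 2 of 10 background regions.
p_value, odds_ratio = fisher_enrichment(8, 2, 2, 8)
log_odds = log(odds_ratio)
```

Repeating this test for every region set in a reference database and sorting by p value and odds ratio yields a ranked list of candidate annotations.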
We have successfully processed datasets comprising hundreds of RRBS and WGBS samples and thousands of Infinium microarrays with RnBeads. To handle large memory requirements, RnBeads uses disk-based matrices implemented in the ff R package (https://CRAN.R-project.org/package=ff). Tasks are parallelized using the foreach (https://CRAN.R-project.org/package=foreach) and doParallel R packages (https://CRAN.R-project.org/package=doParallel). We have also developed an interface that facilitates the automatic distribution of RnBeads analysis runs across an HPC cluster (e.g., managed through a “grid engine” or “slurm” job scheduler). Finally, to facilitate DNA methylation analysis on small computers including personal laptops, RnBeads provides options that disable the most resource-intensive steps; these configurations are available as pre-defined option profiles for low-, medium-, and high-resource settings.
We compared RnBeads in terms of its runtime performance and peak memory consumption with other software packages for DNA methylation microarray analysis highlighted in a recent review paper, namely minfi, methylumi (http://bioconductor.org/packages/release/bioc/html/methylumi.html), wateRmelon, and ChAMP, and with the methylKit package for analyzing bisulfite sequencing data. The microarray-based tools were benchmarked on the first use case (732 blood samples), while the bisulfite sequencing tools were benchmarked on a mouse RRBS dataset (GSE45361, 6 adrenal gland and 11 liver samples) and on a human WGBS dataset (12 hepatocyte samples) from the DEEP project (http://www.deutsches-epigenom-programm.de/), thus covering a broad range of different scenarios for DNA methylation analysis. The benchmarking was performed on a Debian Wheezy machine with 32 cores (1.2 GHz) and 126 GB RAM using R-3.5.0. Three different tool configurations with different depths of analysis were evaluated (Additional file 2: Table S2): (i) data import only, (ii) core modules enabled, and (iii) comprehensive analysis with most features enabled. Furthermore, to complement the performance-oriented benchmarking with a feature-oriented comparison, we conducted a comprehensive survey of popular software tools for DNA methylation analysis in comparison to RnBeads (Additional file 1: Table S1). To that end, we manually reviewed the documentation of all Bioconductor packages for DNA methylation analysis that had a popularity ranking of 900 or better (https://www.bioconductor.org/packages/release/BiocViews.html#___DNAMethylation). We considered only packages that support DNA methylation microarrays and/or bisulfite sequencing data, while discarding data broker packages and packages for single, specialized tasks. We also considered selected DNA methylation analysis tools outside Bioconductor based on a literature review.
We would like to thank many users of RnBeads for discussions, feature requests, and helpful feedback; the BLUEPRINT project for providing epigenome data; and G. Friedrich, J. Büch, and the Information Services and Technology team at the Max Planck Institute for Informatics for excellent technical support.
This work was funded in part by the German Epigenome Project (DEEP, German Science Ministry grant no. 01KU1216A), the BLUEPRINT project (European Union’s Seventh Framework Programme grant 282510), and de.NBI-epi (German Science Ministry grant no. 031L0101A). Y.A. was supported by the DZHK (German Centre for Cardiovascular Research). C.B. is supported by a New Frontiers Group award of the Austrian Academy of Sciences and by an ERC Starting Grant (European Union’s Horizon 2020 research and innovation programme, grant agreement no. 679146).
Availability of data and materials
RnBeads is an open source R package licensed under GPL-3. It is freely and openly available from the RnBeads website (https://rnbeads.org/), and it is part of Bioconductor. The RnBeads website provides comprehensive documentation including installation instructions, user tutorials, DNA methylome reference datasets, and frequently asked questions. The datasets and analysis reports for the presented use cases are available from the RnBeads website. The first use case was based on Infinium 450k data for blood samples, which were obtained from GEO accession numbers GSE87571 and GSE35069. The second use case was based on WGBS data for human hematopoietic cell types, which were obtained from the August 2016 release of the BLUEPRINT project (http://www.blueprint-epigenome.eu/). The third use case was based on RRBS data from a Ewing sarcoma study, downloaded from GEO/SRA accession numbers GSE88826 and SRP091715. The fourth use case was based on DNA methylation profiles from a benchmarking study, which were obtained from GEO accession number GSE86831. For the benchmarking of RnBeads against other software packages, mouse RRBS data were downloaded from GEO accession number GSE45361.
FM, MS, PL, and YA developed the software, carried out the analyses, and prepared the use cases. FM, MS, and CB wrote the manuscript with input from PL and YA. JW, TL, and CB supervised the research. All authors read and approved the final manuscript.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
- 3. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009;462:315–22.
- 35. Tomazou EM, Sheffield NC, Schmidl C, Schuster M, Schönegger A, Datlinger P, et al. Epigenome mapping reveals distinct modes of gene regulation and widespread enhancer reprogramming by the oncogenic fusion protein EWS-FLI1. Cell Reports. 2015;10:1082–95.
- 37. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–9.
- 41. Bioconductor RnBeads software package. https://bioconductor.org/packages/release/bioc/html/RnBeads.html. Accessed 6 Feb 2019. doi: https://doi.org/10.18129/B9.bioc.RnBeads.
- 55. Schillebeeckx M, Schrade A, Löbs A-K, Pihlajoki M, Wilson DB, Mitra RD. Laser capture microdissection-reduced representation bisulfite sequencing (LCM-RRBS) maps changes in DNA methylation associated with gonadectomy-induced adrenocortical neoplasia in the mouse. Nucleic Acids Res. 2013;41:e116.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.