Background

The combination of next-generation sequencing (NGS) with complex biochemistry techniques enables profiling of distinct epigenetic and regulatory features of cells in a genome-wide manner. Two examples are chromatin immunoprecipitation followed by sequencing (ChIP-seq) for protein–DNA interaction [1] and assay for transposase-accessible chromatin using sequencing (ATAC-seq) for open chromatin [2]. These techniques allow the studying of epigenetic dynamics in cellular processes such as cell differentiation [3, 4] and the characterization of the regulatory landscape of diseases such as human cancers [5]. Analysis of such data typically requires multi-step computational pipelines that usually include:

  • low-level methods (read alignment, quality control),

  • medium-level methods for detection of genomic regions with relevant epigenetic signals (processing of genomic profiles, peak calling, differential peak calling, computational footprinting), and

  • high-level methods for visual representation and integrative analysis with further genomic data (association with gene expression and further epigenetic data, detection of transcription factor binding sites, and functional enrichment analysis).

Fig. 1
figure 1

Example of a typical pipeline for the analysis of a transcription factor ChIP-seq experiment. First, the reads are aligned to the genome (step 1, low-level analysis). A peak caller receives these aligned reads as input and typically creates an intermediary representation called genomic signal. Based on this genomic signal, the peak caller then detects regions with a higher value than the background. These candidate peaks represent the regions with DNA–protein interaction sites (steps 2 and 3, medium level). Several downstream analyses are then performed, such as the detection of motif-predicted binding sites inside the peaks (step 4, high-level analysis) or line plots displaying average genomic signals of other ChIP-seq experiments around the predicted peaks or binding sites (step 5, high-level analysis)

Figure 1 gives an example of a common ChIP-seq data analysis pipeline. It includes on the low level the use of a read aligner, such as BWA [6]; on the medium level a peak calling method, such as MACS2 [7], for the detection of regions with the presence of potential protein–DNA interactions; and on the high level a motif match procedure, such as FIMO [8], to find transcription factor binding sites inside peaks as well as R functions for the visualization of genomic signals, such as Genomics Ranges [9]. A similar pipeline for ATAC-seq data analysis is described in Additional file 1: Figure S1.

The definition of analysis pipelines depends on the biological study as well as on the used NGS technique. Its complexity, which includes the use of several bioinformatics tools that may require command-line usage and/or scripting skills, makes the analysis of epigenomics data so far less reproducible and inaccessible to non-experts. Moreover, the development of bioinformatics tools for medium-level analysis needs to take into account specific characteristics of the used NGS protocols [10, 11]. For example, ChIP-seq experiments require the computational estimation of the read extension sizes [11]. It also requires a signal correction with control experiments, as the local chromatin structure may influence the ChIP-seq signal [12]. In contrast, footprint analysis of ATAC-seq data does not require the estimation of read extension sizes, as the start of the read corresponds to the cleavage position. However, ATAC-seq analysis demands the correction of Tn5 cleavage bias [13]. Moreover, some aspects, such as PCR amplification artifacts, are shared by ChIP-seq and ATAC-seq experiments [11]. Clearly, the development of tools for the analysis of epigenetic data is greatly facilitated by a flexible and easy-to-handle computational library. This library should support genomic data I/O as well as usual pre-processing methods, such as fragment size estimation and the correction of sequence bias. Regarding high-level tasks, the library should provide structure to allow sequence analysis (i.e. motif matching), interval algebra (i.e. measuring overlap between peaks), or associating signals with regions (i.e. line plots showing signal strength around peaks).

Fig. 2
figure 2

Overview of RGT core classes and tools. RGT provides three core classes to handle the genomic regions and signals. Each genomic region is represented by GenomicRegion class and multiple regions are represented by GenomicRegionSet class. The genomic signals are represented CoverageSet class. These classes serve as the core data structures of RGT for handling genomic regions and signals. Based on these classes, we developed several tools for analyzing regulatory genomics data as represented by different colors, namely, HINT for footprinting analysis of ATAC/DNase-seq data; RGT-viz for finding associations between chromatin experiments; TDF for DNA/RNA triplex domain finder; THOR for differential peak calling of ChIP-seq data; Motif analysis for transcription factor binding sites matching and enrichment

Implementation

We developed RGT in Python by following the object-oriented approach. The core classes provide functionalities for handling data structures that are related to questions about regulatory genomics. Based on the cores, we implemented several computational tools to perform various downstream analyses (Fig. 2). These include previously described HINT tools for ATAC-seq/DNase-seq footprinting [13,14,15], the differential peak caller THOR [16] and a library to characterize triple helix mediated RNA-DNA interactions [17]. RGT also includes some functionalities such as motif binding sites prediction and enrichment analysis (Motif Analysis), as well as methods for association and visualization of genomic signals (RGT-Viz). We describe below the basic structures and the novel Motif Analysis and the RGT-Viz frameworks.

Core classes

Analysis of high-throughput regulatory genomics data is mostly based on the manipulation of two common data structures: genomic signals which represent the abundance of sequencing reads on the genome and genomic regions which represents candidate regions. In RGT, we implemented three classes, i.e., GenomicRegion, GenomicRegionSet, and CoverageSet, to represent a single region, multiple regions, and genomic signals, respectively. In each of the classes, we implemented several functions to perform basic data processing. For example, CoverageSet provides functions for fragment extension estimation, signal smoothing, GC-content bias correction, and input DNA normalization. These procedures are crucial for the particular downstream analysis of chromatin sequencing data, such as peak calling and footprinting. For computational efficiency, functions related to GenomicRegionsSet and interval-related algebra have been implemented in C. Moreover, RGT contains I/O functions of common genomic file formats such as Binary Alignment Map (BAM) files for alignments of reads, (big)wig files for genomic profiles, and bed files for genomic regions by exploring pysam [18, 19] related functions.

These core classes provide a powerful infrastructure for the development of methods dealing with regulatory genomics data. As an example of the simplicity, versatility, and power of RGT, we include a tutorial on how to build a simple peak caller with less than 50 lines of codes: https://reg-gen.readthedocs.io/en/latest/rgt/tutorial-peak-calling.html.

Finding associations between chromatin experiments with RGT-viz

A typical problem in regulatory genomics is to associate results of distinct experiments, i.e. overlap between distinct histone marks or a given histone mark in distinct cells. RGT-viz provides a collection of statistical tests and tools for the association and visualization of genomic data such as genomic regions and genomic signals (Fig. 3a).

In the tests of regions versus regions, a set of reference and query regions, both in BED format, are required as inputs. The aim is to evaluate the association between the reference and the query. For this, RGT-viz provides the following tests:

  • Projection test This test compares a query set, i.e. ChIP-seq of transcription factors with a larger reference set, i.e. ChIP-seq peaks of a regulatory region (H3K4me3 or H3K4me1 marks). It estimates the overlap of the query to the reference and contrasts with the coverage of the reference in the complete genome. A binomial test is then used to indicate if the coverage of the query in the reference is higher than the reference of the reference to the genome [20] (Additional file 1: Fig. S2a).

  • Intersection Test This test is based on measuring the intersection between a pair of genomic regions and comparing it to the expected intersection on random region sets. Random regions are obtained by evaluating permutations (with size equal to the input regions) of the union of regions in the pair of queries [21] (Additional file 1: Fig. S2b). The statistical test is based on empirical p-values.

  • Combinatorial Test The combinatorial test is appropriate for two-way comparisons. For example, you want to check the proportion of peaks of two (or more) transcription factors on two (or more) cell types. For this, it creates a background distribution per reference sets (cells) by considering the union of all query sets (TFs) in that cell. It then creates count statistics per cell and compares if the number of binding sites in a cell for a given TF is higher than in another cell by using a Chi-squared test (Additional file 1: Fig. S2c).

  • Jaccard Measure This measures the amount of overlap between the reference and the query using the Jaccard index (also called Jaccard similarity coefficient) [22]. Given two region sets A and B, it measures the ratio of intersecting base pairs in relation to the regions associated with the union of A and B. Through this Jaccard index, the amount of intersection can be expressed by a value between zero to one (Additional file 1: Fig. S2d). This test explores a randomization approach, i.e. random selection of genomic region sets with the same number/size regions, to estimate empirical p-values.

Another important functionality is the visualization of distinct genomic signals, as described below. To visualize the signals in different regions, the following tools are provided:

  • Boxplot It compares the number of fragments from different ChIP-seq experiments on the given region set. This can be used for example to contrast the signal of distinct ChIP-seq TFs over promoter regions (H3K4me3 peaks). Conceptually, the generation of a boxplot is simply counting the number of reads within the region set and then plotting these counts in boxplot (Additional file 1: Fig. S3a). RGT-Viz provides functionalities to normalize the individual libraries regarding library sizes.

  • Lineplot and heatmap Line plot and heatmap are used to display the distribution of reads within a given region set. Specifically, each region is first extended with the given window size which defines the boundaries for plotting. Next, the coverage of reads on the given regions is calculated based on the given bin size and step size. Finally, the line plot or heatmap is generated. The line plot shows average signals over all regions in the region set while the heatmap displays the signals of all regions (Additional file 1: Fig. S3b-c).

Fig. 3
figure 3

Overview of RGT-viz and motif analysis. a RGT-viz provides several tests for regions versus regions and visualization tools for regions versus signals by taking BED and BAM files as input. b Motif matching detects binding sites for a set of TFs against multiple genomic regions. The motifs were collected from public repositories such as UniPROBE, JASPAR, and HOCOMOCO. The position weight matrix (PWM) for each TF is used to calculate a binding affinity score per position. The genomic regions are usually obtained by peak calling based on ChIP-seq or ATAC-seq data

Transcription factor motif matching and enrichment with motif analysis

Motif analysis is a framework to perform transcription factor motif matching and motif enrichment. Motif matching aims to find transcription factor binding sites (TFBSs) for a set of TFs in a set of genomic regions of interest (Fig. 3b). For this, RGT has its own class, i.e., MotifSet for storing TF motifs from known repositories, such as UniPROBE [23], JASPAR [24] and HOCOMOCO [25]. In addition, users are also allowed to add new motif repositories. RGT uses an efficient Motif Occurrence Detection Suite (MOODS) algorithm to find binding site locations and bit-scores [26]. Note that MOODS was originally implemented in C++ and we have adapted it to a Python package (https://pypi.org/project/MOODS-python/). Next, RGT uses a dynamic programming algorithm [27] to determine a bit-score cut-off threshold based on the false positive rate of \(10^{-4}\). The predicted binding sites can be obtained with p values between \(10^{-5}\) and \(10^{-3}\).

Fig. 4
figure 4

Case study of RGT-viz and motif analysis for DC development. a Dendritic cell development. DC develop from multipotent progenitors (MPPs), which commit into DC-restricted common dendritic cell progenitors (CDP). CDP differentiate into classical DC (cDC) and plasmacytoid DC (pDC). b Intersection test shows that the IRF8 binding sites in cDC and pDC are associated with the PU.1 binding sites in MPP, CDP, cDC, and pDC. c Line plots showing genomic signals of different histone modifications on the PU.1/IRF8 peaks in cDC. d Screenshot showing the top 5 TFs identified by motif enrichment analysis from the overlapping peaks between PU.1 and IRF8 in cDC cells

The motif enrichment module evaluates which transcription factors are more likely to occur in certain genomic regions than in ”background regions” based on the motif-predicted binding sites (MPBS) from motif matching. To determine the significance, we performed Fisher’s exact test for each transcription factor and corrected the p values with the Benjamini–Hochberg procedure. More specifically, we provided three types of tests:

  • Input regions versus Background regions In this test, all input regions are verified against background regions that are either user-provided or randomly generated with the same average length distribution as the original input regions.

  • Gene-associated regions versus Non-gene-associated regions In this test, we would like to check whether a group of regions that are associated with genes of interest (e.g. up-regulated genes) is enriched for some transcription factors versus regions that are not associated with those genes. The input regions are divided into two groups by performing gene-region association that considers promoter-proximal regions, gene body, and distal regions. After the association, we perform a Fisher’s exact test followed by multiple testing corrections as mentioned in the previous analysis type.

  • Promoter regions of input genes versus Background regions In this test, we take all provided genes, find their promoter regions in the target organism, and create a “target regions” BED file from those. A background file is created by using the promoter regions of all genes not included in the provided gene list. Next, motif matching is performed on the target and background regions and a Fisher’s exact test is executed.

Finally, the enrichment regions are provided in an HTML interface.

Additional tools based on RGT

Several additional tools that explored and extended classes from RGT to tackle specific regulatory genomics problems are available. HINT is a framework that uses open chromatin data to identify the active transcription factor binding sites (TFBS). We originally developed this method for DNase-seq data [14, 15] and later extended it to ATAC-seq data by taking the protocol-specific artifacts into account [13]. Footprint analysis requires base pair resolution signals in contrast to peak calling problems, which are based on signals on windows with more than 50 bps. Therefore, HINT has a GenomicSignal class, which deals with ATAC-seq, and DNA-seq signals such as cleavage bias correction, base pair counting, and signal smoothing. Moreover, HINT makes use of the previously described motif-matching functionality provided by RGT to characterize motifs related to ATAC-seq footprints. These can be explored in differential footprinting analysis to detect relevant TFs associated with different biological conditions. This method has been widely used to study, among others, cell differentiation [13, 28] and diseases [29,30,31,32].

THOR is a Hidden Markov Model-based approach to detect and analyze differential peaks in two sets of ChIP-seq data from distinct biological conditions with replicates [16]. As a first step, THOR needs to create and normalize ChIP-seq signals from distinct experiments. Among others, THOR extended functionalities of the base class CoverageSet to a MultipleCoverageSet class to deal with multiple signals at a time and to provide global normalization methods, such as trimmed means of M-values (TMM). Finally, Triplex Domain Finder (TDF) characterizes the triplex-forming potential between RNA and DNA regions [17]. TDF explores functionality provided by RGT/RGT-viz to build statistical tests for characterizing DNA binding domains in lncRNAs.

Results

Investigating dendritic cell development with RGT-viz

We here provided a case study using RGT-viz to investigate dendritic cell (DC) development (Fig. 4a). We collected ChIP-seq data of the transcription factors PU.1 and IRF8, and five histone modifications (i.e., H3K4me1, H3K4me3, H3K9me3, H3K27me3, and H3K27ac) for each of the cell types [4, 16, 33, 34] (Additional file 1: Table S1). PU.1 is one of the master regulators of hematopoiesis and is expressed by all hematopoietic cells [35] and IRF8 is believed to co-bind with PU.1 to control the differentiation of DC progenitors towards specific DC sub-types [4, 36]. We mapped the sequencing reads to mm9 using BWA [6] and called the peaks with MACS2 [7].

We performed an intersection test between PU.1 and IRF8 peaks from different cell types to check for PU.1 and IRF8 co-binding during DC differentiation. Of note, IRF8 ChIP-seq only detected peaks in classical and plasmacytoid DC (cDC and pDC, respectively), as this TF is not expressed in multipotent progenitor (MPP) and expressed only at low levels in common DC progenitors (CDP).

This test reveals that PU.1 and IRF8 are significantly associated in all cell types, while the co-binding was two times higher as measured by \(\chi ^{2}\) statistics in cDC than pDC. Moreover, we observed that overlap of binding sites of cDC IRF8 peaks is already quite high with CDP PU.1 peak. This indicates that PU.1 binding prepares the chromatin for IRF8 binding already in CDP, showing DC priming in CDP (Fig. 4b and Additional file 1: Table S2).

We next asked if the co-binding regions are associated with different regulatory regions (enhancers vs. promoters). For this, we defined the set of peaks with both PU.1. and IRF8 binding, or only with PU.1. or only IRF8 binding in cDC and pDC cells by using intersect and subtracting functions from the core class GenomicRegionSet of RGT. We then generated line plots of PU.1, IRF8, H3K4me1, and H3K4me3 on these three sets of regions in cDC (Fig. 4c). We observed that peaks with PU.1-IRF8 co-binding have higher ChIP-seq peaks for either factor indicating that co-binding strengthens the binding affinity of both TFs. Moreover, H3K4me1 signals are strong for PU.1 and IRF8 co-binding, while IRF8 only has stronger H3K4me3 marks. This suggests an association of PU.1 and IRF8 co-binding with enhancers, while IRF8 exclusive binding is more associated with promoters. These examples demonstrate how RGT-Viz can be used to explore associations and interpretation of genomic data.

We next performed motif matching and enrichment analysis on the PU.1 and IRF8 co-binding peaks in cDC (Fig. 4d). We observed that PU.1 (and ETS family) motifs were ranked at the top and an IRF family motif at fifth (IRF1; MA0050.2.IRF1). This demonstrates how motif analysis can recover expected regulatory players from regulatory sequences.

Discussion

We presented the regulatory genomics toolbox (RGT), a versatile toolbox for analyzing high-throughput regulatory genomics data. RGT was programmed in an oriented-object fashion and its core classes provided functionalities to handle typical regulatory genomics data: regions and signals. Based on these core classes, RGT built distinct regulatory genomics tools, i.e., HINT for footprinting analysis, TDF for finding DNA–RNA triplex, THOR for ChIP-seq differential peak calling, motif analysis for TFBS matching and enrichment, and RGT-viz for regions association tests and data visualization. These tools have been used in several epigenomics and regulatory genomics works to study cell differentiation and regulation [28, 31, 37,38,39,40,41,42].

There are several methods providing functionality similar to RGT but they mostly focus on a subset of problems tackled by RGT (Additional file 1: Table S3). Bedtools is a well-known and efficient C tool for interval algebra. However, it provides no functionalities related to statistical tests, motif analysis, and visualization. Visualization and genomic signal processing are provided by the python-based Deeptools [43]. However, it lacks functionality related to interval algebra or motif analysis. pyDnase is a for genomic signal processing but with a focus on problems related to genomic footprinting [44]. Also, previous tools focus on providing command-line interfaces, while RGT provides both programming and command-line interfaces. Regarding R language, GenomicRanges is a library for interval algebra [45], while motif matching can be performed with motifmatchr [46]. We are not aware of any framework for genomic signal processing in R. Altogether, RGT is the most complete framework for chromatin sequencing data manipulation, which we are aware of.

We envision that RGT can facilitate the development of computational methods for the analysis of high-throughput regulatory genomics data as a powerful and flexible framework in the future.