Introduction

A rapidly growing body of research is continuing to reveal numerous gene-regulatory effects of covalent DNA modifications, such as 5‑methylcytosine (5mC). We now recognize 5mC as a stable epigenetic mark and as having diverse functions beyond transcriptional repression [12]. An increasing number of studies demonstrate the importance of other cytosine modifications, such as 5‑hydroxymethylcytosine (5hmC), 5‑formylcytosine (5fC), and 5‑carboxylcytosine (5caC) [3, 9, 26, 43, 46]. More recently, three analogous modifications of thymine were found to occur in mammals [38, 53] and can now largely be sequenced [19]. N6‑methyladenine, previously thought to mainly occur as an RNA modification in eukaryotes, has now been found in the DNA of multiple eukaryotes [24]. Bacteria, archaea, and especially bacteriophages have long been known to harbor a diverse array of modified bases [18, 51]. Their genomes can also have hypermodified bases—modified DNA bases that substitute for the unmodified base in many positions genome-wide [17, 51].

Multiple databases profile RNA modifications [4, 8, 54] and human histone modifications [56], but no database catalogues DNA modifications systematically. Some databases include particular classes of DNA modifications [44]. These include restriction endonucleases and DNA methyltransferases in REBASE [41]; methylation databases, like MethDB [1]; databases including DNA metabolic pathways, such as KEGG [27]; and those focused on DNA damage and repair, like REPAIRtoire [31].

Since DNA modifications are a key aspect of epigenetic regulation, there is a pressing need to organize them in a single location. We have accordingly created DNAmod: the DNA modification database (https://dnamod.hoffmanlab.org). DNAmod is the first database to comprehensively catalogue DNA modifications and provides a single resource to launch an investigation of their properties.

Database construction and visualization

DNAmod consists of two components: a relational database back-end and a web interface front-end. We used the Chemical Entities of Biological Interest (ChEBI) database [13, 22] to seed the DNAmod database. We imported a nucleobase-related subset of ChEBI, consisting of chemical entities and related annotations. We performed queries against the entities to construct a set of candidate DNA modifications for DNAmod, retaining most of these as a separate unverified set. Then, we filtered candidate entities into a manually curated set of verified DNA modifications, augmenting them with modification-specific annotations.

The web interface front-end allows users to either search or browse through the catalogue of DNA modifications, integrating ChEBI’s information with our own.

Identifying candidate DNA modifications from ChEBI

DNAmod leverages ChEBI [22] to define a set of modified DNA candidates for inclusion and to add preliminary information for each candidate. ChEBI is a database of small biologically relevant molecules, which affect living organisms. We queried ChEBI via ChEBI Web Services [22]. We used Biopython [10] and the Python Simple Object Access Protocol (SOAP) client, suds [35], to query ChEBI and construct the DNAmod database.

ChEBI provides an ontology which encodes the relationships between its compounds. We used this ontology to precisely define the notion of parents and children, which we used to hierarchically retrieve and display modifications. We used two kinds of relationships for this purpose, both of which have associated symbols, defined by ChEBI [13]: \(\mathcal {F}\) has functional parent and \(\triangle \) is a. We used these relationships to find candidate DNA modifications, by identifying entities related to the core nucleobases, which we represent by their symbols: {A,C,G,T,U}. We included uracil, since many of its descendants in the ontology are modifications of thymine (CHEBI:17821, which is equivalent to 5-methyluracil), and are not annotated as descendants of thymine itself. For each of these bases, we imported all entities that are annotated in the ontology as a child of one of these bases, via the \(\mathcal {F}\) has functional parent relationship. ChEBI ranks entities based on their degree of curation. We only imported entities with the highest rating—three stars—indicating manual curation by ChEBI. Whenever possible, we only included entities as nitrogenous bases (nucleobases). If ChEBI did not have the nucleobase, we then selected the nucleoside form and finally, if necessary, the nucleotide. These imported bases formed the candidate set of modifications (the unverified set), from which we created a curated set of DNA modifications (the verified set).

The ChEBI ontology does not generally encode \(\mathcal {F}\) has functional parent relationships for nucleobases beyond the children of the unmodified nucleobases. It instead encodes modified nucleobases with an \(\triangle \) is a relationship to their parent base. This is because descendant entities of specific modifications are generally subtypes of the class of modifications from which they originate. For example, 3-methyladenine \(\triangle \) is a methyladenine. Methyladenine, however, \(\mathcal {F}\) has functional parent adenine, since it is conceived of as possessing adenine as a characteristic group and as being derived via functional modification [13]. We therefore need to use both of these relationships, within the ChEBI ontology, to accurately capture the full nucleobase hierarchy.

ChEBI also provides selected citations, associated with some of its entities. We retrieved the citations from ChEBI as PubMed IDs [32]. We used the Biopython [10] package Bio.Entrez to query the PubMed citation database, using NCBI’s Entrez Programming Utilities [32]. We retrieved the details of each citation, and use them to construct a formatted citation. We currently support only publications indexed in PubMed.

Manual curation and annotation

We manually created and defined a whitelist, which contains our curated (or verified) set of candidates that we deem DNA modifications. For each of the bases enumerated in our whitelist, we also imported all descendants with an eventual \(\mathcal {F}\) has functional parent or \(\triangle \) is a relationship with any of the members of the verified set. We expanded the verified set to include any bases recursively imported in this manner, since they were children of verified DNA nucleobases. We also manually created and defined a distinct blacklist, which contains compounds that we deem to not be DNA modifications, also excluding any of their descendant compounds. Therefore, our above verification rule has the exception that it excludes any bases with an ancestor in our blacklist.

We can formalize the above description of bases imported from the ChEBI ontology [13] and subsequent filtering as follows. Let \(a\mathbin{\mathcal {F}}\,b\) specify that a has the \(\mathcal {F}\) has functional parent relationship with b. The definition of \(\mathcal {F}\) is transitive: for all n entities, \(l_{i}\), for \(i = 0\) to \(n - 1\), between a and b,

$$\begin{aligned} a\mathbin{\mathcal {F}}\,b \iff \bigl ( a\mathbin{\mathcal {F}}\,l_{n - 1} \bigr ) \wedge \bigl ( l_{i}\mathbin{ \mathcal {F}}\,l_{i - 1} \mathord {\forall } i \in \left( 0, n\right) \bigr ) \wedge \bigl ( l_{0}\mathbin{\mathcal {F}}\,b \bigr ). \end{aligned}$$

The analogous definitions hold for \(\triangle \).

We call each \(l_{i}\) a child of \(l_{i - 1}\) and call each \(l_{i - 1}\) a parent of \(l_{i}\). We refer to a as a descendant of b and refer to b as an ancestor of a. Let \(\mathcal {C}\) represent the first level of children of the unmodified nucleobases, such that \(\mathcal {C} = \left\{ x \mid x\mathbin{{\mathcal {F}}}\,y, y \in \{\tt{A, C, G, T, U}\} \right\} \). Let \(\mathcal {V} \subset \mathcal {C}\) represent the manually-annotated, verified proper subset of \(\mathcal {C}\).

We manually curated a blacklist of excluded entities, \(\mathcal {B}\), satisfying: \(\mathcal {B} \subseteq \left\{ b\mid\left( b\mathbin{{\mathcal {F}}}\,p \vee b \mathbin{\triangle} p \right) , p \in \mathcal {V} \right\} \). We imported the set of verified DNA modifications, \(\mathcal {M}\), defined in set-builder notation with predicates, as:

$$\begin{aligned} \mathcal {M}=\, {} \mathcal {V}\, \cup\, &\left\{ z \mid \left( \exists v\, {\in } \mathcal {V} \right) \left( \forall\, b\, {\in }\, \mathcal {B} \right) \right. \\&\left. \left[ \left( z\mathbin{{\mathcal {F}}}\,v \vee z \mathbin{\triangle} v \right) \wedge \lnot \left( z\mathbin{{\mathcal {F}}}\,b \vee z \mathbin{\triangle} b \right) \right] \right\} . \end{aligned}$$

Finally, we added a small number of bases manually, that do not have any of the DNA bases or uracil as a parent in their ontology, but are nonetheless notable modified bases, such as 2′-deoxyinosine.

We additionally provided two kinds of manual annotations: sequencing techniques and occurrence in nature, for each modified DNA base. We surveyed the literature of sequencing methods for covalent DNA modifications [6, 29, 37, 39, 45], and annotated the available methods for each base, providing curated citations. These annotations include the method’s name, our categorizations of the basis for the method (such as chemical conversion), its resolution, and any further qualifier (Table 1A). Qualifiers include limitations (such as applicability to only some genomic regions), enrichment methods, and advantages (such as optimization for single-cell sequencing). We considered any method which involves affinity-based recognition of targets to be of “low” resolution [5]. These methods can also suffer from low specificity or antibody cross-reactivity [6]. Conversely, we annotated any methods based principally upon the detection of a chemically converted modification as “high” resolution. This generally reflects the resulting resolution of the method’s output data and often corresponds to the necessity to bin genomic regions during downstream analyses of the detected analyte.

For each modified base, we investigated if it had been previously reported to occur in vivo. This included any endogenous occurrences, as well as those stimulated exogenously, such as from exposure to an environmental toxin. We annotated any modification observed in vivo as “natural”. We additionally provided non-exhaustive examples of some organisms in which the modifications have been reported. We based these annotations on our ability to find evidence of in vivo occurrence, as opposed to publications describing only the synthesis or physicochemical properties of a nucleobase. For each of these annotations, we also briefly annotated a primary biological function, if known (Table 1B). For any modification not observed in vivo, we annotated it as “synthetic” and listed a reference pertaining to its synthesis or in which the synthetic base was used.

We entered these annotations in two annotation source files (Table 1), which we later imported into our database. This decoupled them from the rest of our pipeline and allows outside experts to submit additions without requiring knowledge of our pipeline or programming workflow.

Table 1 Possible annotations within DNAmod’s curated (A) sequencing method data and (B) natural occurrence information

DNAmod integrates manually-curated nomenclature, including the name and abbreviation deemed most consistent and in common use [9, 11, 28]. We additionally provide recommendations for one-letter symbols of selected modified bases, and in some instances for their base-pairing complements, as previously described [49]. The DNAmod web interface displays recommended notation in an organized table (Fig. 1).

Fig. 1
figure 1

Manually-curated recommended notation, mapping techniques, and natural occurrence data for 5-formylcytosine (5fC). See Table 1 for an explanation of the mapping and natural occurrence table headers

We store all data, either imported from ChEBI or from our manual annotations, within a SQLite [25] database, used via the Python sqlite3 package [16].

Website generation

We created a static website to display and provide navigation for the information contained within the database. We generated it by formatting the database content using the templating engine Jinja2 [42]. Two templates were sufficient to generate all HTML files. We used a single template for all modification pages and another for the homepage. We also record the date of the most recent update to the database. The main footer contains this date, along with the current ChEBI and DNAmod versions. All web pages use the Bootstrap [36] framework, which provides a standardized, portable, and mobile-compatible viewing format. We visualized the chemical structure of each compound from its Simplified Molecular-Input Line-Entry System (SMILES) [52] data, if available from ChEBI, as a vector graphic. We did this using the cheminformatics toolkit Open Babel [34], via its Python wrapper Pybel [33].

Searching and navigation

DNAmod makes modifications accessible via three main navigation options, each provided on a tab of the DNAmod homepage. First, users may search for modifications by several fields. Second, users may find curated DNA modifications via a pie menu [7]. Third, users may find candidate entities as a list, categorized by their parent unmodified nucleobases.

Client-side search functionality provides a means of rapidly finding bases with differing nomenclature (Fig. 2a), while maintaining a static web page. This functionality relies on the elasticlunr.js JavaScript module [47]. Searches match to multiple fields: common or International Union of Pure and Applied Chemistry (IUPAC) names, all synonyms, any assigned abbreviation, and recommended notation symbol, when available. DNAmod displays curated DNA modifications in green, and others in magenta. The search results provide the field matched by the query, such as “abbreviation”, along with the common name of the associated hit.

Alternatively, users may browse the modifications in DNAmod through a pie menu [7] interface (Fig. 2b). This interface hierarchically arranges the bases according to their structure within the ChEBI ontology. The innermost ring consists of the four unmodified DNA bases, with an additional “other” category. This category encapsulates modified bases found in DNA, but which are not modifications of one of the four DNA bases. Consecutive outer rings represent children of the previous base or category. We demarcated natural versus synthetic bases by colouring natural bases in teal and synthetic bases in grey.

Fig. 2
figure 2

Finding 6-methyladenine by a searching for its abbreviation “6mA” or b via the pie menu

DNAmod structure and content

Individual modification pages visually represent the data contained within the backing database. We standardize and display all modifications in an identical format. DNAmod may omit some information, however, depending upon the extent of ChEBI’s annotations and whether the page describes a verified DNA modification or merely a candidate entry.

Modification pages begin with a header displaying the DNA modification’s ChEBI name. The top-right corner of the page lists the unmodified ancestor of the modification. For example, 5-hydroxymethyluracil is a modification of thymine (Fig. 3), whereas 6-dimethyladenine is a modification of adenine.

Each modification begins with a short textual description of its chemistry, followed by a table containing its chemical properties. We import these from ChEBI, which provides their chemical formula, net charge, and average mass.

We annotate entities with all names available from ChEBI, including: their IUPAC name, SMILES [52] string, International Chemical Identifier (InChI) and hashed InChIKey [23] strings, and common synonyms. We also provide a recommended abbreviation and in some instances a suggested single-letter symbol for bioinformatic purposes, from our proposed expanded alphabet [49] (Fig. 3).

We provide literature annotations for many DNA modifications, focusing upon those observed in vivo. We provide a list of methods that have been used to map the genomic locations of a modification (“Manual curation and annotation”). We additionally provide information on a modification’s occurrence, either naturally or only synthetically, where applicable, including some organisms in which it has been observed in vivo (“Manual curation and annotation”). Finally, each page ends with the ChEBI database reference and a ChEBI-derived list of related literature citations (Fig. 3). Our website has semantic web support, making use of the Resource Description Framework in Attributes (RDFa) [40] technique, augmented by Chemical Information Ontology (CHEMINF) [20] and PubChemRDF [15] Semanticscience Integrated Ontology (SIO) [14] annotations—providing machine-readable descriptions of key website features.

Fig. 3
figure 3

The full modification page for 5-hydroxymethyluracil (5hmU)

Discussion

DNAmod enables researchers to rapidly obtain information on covalently modified DNA nucleobases and assist those interested in profiling a modification. It additionally provides a reference toward standardization of modified base nomenclature and offers the potential to track recent developments within the field. We have kept DNAmod up to date for 3 yr and expect to continue to maintain it, particularly as new discoveries about DNA modifications are made. We also hope that DNAmod will serve to highlight underappreciated modifications that may have substantial biological importance.

The nomenclature used to describe a particular DNA modification is often inconsistent, with some early efforts toward standardization of particular classes [11, 28]. The ChEBI name, for instance, often corresponds to the common chemical name of the compound, which is occasionally distinct from its common name within the biological literature, in the context of a DNA modification. We address this and attempt to encourage standardization by endeavouring to ensure that other names are annotated, while providing specific nomenclature recommendations. In particular, the suggested name of verified DNA modifications, as displayed on the homepage and within the recommended notation section, is always manually-curated and sometimes differs from the name assigned by ChEBI.

Our database, like many others, relies upon the ChEBI ontology. Like any large and complex endeavour, curating ChEBI is a substantial undertaking, requiring protracted deployment of expertise and effort. While ChEBI has a dedicated team of expert curators, who assiduously and continually improve ChEBI, their resources are naturally limited. Accordingly, while ChEBI has an issue tracker where we and others can suggest changes, revisions to ChEBI are highly dependent on user reports and the team’s available bandwidth. ChEBI contains a non-negligible fraction of errors and omissions, across most entity categories [30, 55]. These works highlight the substantial effort and difficulty involved in maintaining high-quality annotations. Such errors naturally propagate to its downstream databases, including our own. While we have made efforts to further curate data and report relevant issues back upstream, we do inherit some errors and limitations. As in any project of this nature, we surely have our own errors and omissions. We lack a dedicated curator; accordingly, we curate this data on a best-effort basis. DNAmod has its own issue tracker, and we would appreciate if users could report any of our own errors or omissions, so that we can address them or facilitate reporting them upstream.

The inclusion of assays available to sequence different DNA modifications provides a means of assessing and selecting a sequencing method. It additionally attempts to track sequencing methods over time, as resolution improves, and especially to highlight recent developments, like direct-detection of various modifications via nanopore sequencing [50]. The sequencing annotations we provide annotate nucleobases which are directly elucidated by the method and only for the base or set of bases which the method independently maps. This includes those that are obtained in addition to another nucleobase. For instance, confounded mixtures are often obtained. For example, 5mC and 5hmC cannot be distinguished with only conventional bisulfite sequencing. Alternatively, some methods have the capacity to independently resolve between modifications, such as various nanopore-based methods. Therefore, while many use oxidative bisulfite sequencing (oxBS-seq) in combination with conventional bisulfite sequencing to elucidate 5hmC via subtraction, we only annotated it as a sequencing method for 5mC, which it directly elucidates [6]. Conversely, we only annotate TET-assisted bisulfite sequencing (TAB-seq) under 5hmC, which it directly elucidates [6], although many use it to also detect 5mC.

We demarcated bases found to occur in vivo, providing examples of organisms in which a modification has been found, along with associated citations. This merely substantiates its in vivo presence, however. We did not attempt to comprehensively list the organisms which contain any particular modification. Finally, we expect our brief annotations of the biological roles of various DNA modifications to change as further research is conducted.

Future work

We plan to keep DNAmod updated continuously, manually reviewing newly added ChEBI compounds, requesting appropriate additions to ChEBI, and curating any improvements. We also endeavour to annotate recently developed sequencing methods as we come across them.

Integrating additional external databases will further increase DNAmod’s utility. In particular, we envision potential integration with domain-specific DNA modification databases, such as those cataloguing compounds formed from the operation of particular biological pathways. For instance, modifications involved in DNA damage and repair could be linked to REPAIRtoire [31] data. We could also improve functional characterization using Gene Ontology (GO) [2] or Kyoto Encyclopedia of Genes and Genomes (KEGG) [27], but this would require extensive manual curation.

We used ChEBI Web Services [22] to obtain information from their database. ChEBI has, however, recently released a Python application programming interface (API), permitting us to directly access their data [48]. Switching from our current web-based queries to use of their API would likely result in a more robust system and expedite the database-building process.