Introduction

Tautomerism is defined as the isomerization of a chemical compound according to the general scheme shown in Fig. 1 [1].

Fig. 1
figure 1

General isomerization scheme for tautomers

During this rearrangement, G is typically a single atom or group being transferred from X to Y. G acts as a leaving group during isomerization, X and Y serve as a donor and acceptor for G, respectively. Simultaneously with the transfer of G, pi electrons migrate in the opposite direction of G. Each of the atoms X, Y or Z can be of any of the atom types C, N, O, or S; G can be H, methyl, CH2R, Br, NO, SR, or COR [2]. If it is a conjugated pi system, Y can be a larger group of atoms that allows the transfer of the pi electrons.

If the transferred group G is a proton the isomerization is called “prototropic tautomerism”. This will be the only type of tautomerism discussed in this paper.

Irrespective of the type of rearrangement, any two different tautomers of the same chemical compound differ in their location of atoms and distribution of pi electrons. Therefore, different tautomers of a chemical compound may differ in their pattern of functional groups, double bonds, stereo centers, conformation, shape, surface or hydrogen-bonding pattern.

This difference can have an effect in any area of chemistry in which the static, classical connectivity between atoms is of importance, including computer-aided molecular design, property predictions, and chemoinformatics. Specifically, it may affect:

  • calculation of physicochemical properties (pK a, lipophilicity, solubility etc.)

  • structure clustering and similarity searching (different fingerprints)

  • database registration (identification of different tautomers of the same compound)

  • virtual screening methods (different hydrogen-bonding patterns and H-donor or H-acceptor patterns)

  • assignment to substance classes, e.g. may change the property “aromatic”

  • predicted (or queried) reaction patterns which may be different for different tautomers [36].

A significant and often overlooked effect of tautomerism is that it can change the stereochemistry of a compound. There are two cases: The location of the double bond changes as illustrated in the scheme shown in Fig. 1, which might add or eliminate the presence of an E/Z stereo bond as shown in Fig. 2 (top). Likewise, migration of a double bond to a heretofore sp3 hybridized atom that was chiral removes this chirality, which, in a further tautomeric isomerization step, can be re-established, but with the opposite chirality—which effectively may result in the racemization of this stereo center (Fig. 2, bottom).

Fig. 2
figure 2

Tautomerism can change stereochemistry. Top: change of E/Z geometry. Bottom: change of chirality

It is well-established that tautomerism (as a difference in representation vs. truly separable isomers) depends on conditions including pH, temperature and solvent [7]; it gains crucial importance in the above areas when it is likely to lead to interconversion under “normal” conditions, in which case different tautomeric representation would lead to an erroneous differentiation between isomers that are in reality the same compound (“stuff in the bottle”). It is typically not possible to probe the tautomeric situation experimentally for databases of any significant size. Likewise, while an experienced chemist is likely to be able to recognize cases of tautomerism when visually inspecting small sets of compounds, this approach is obviously not applicable to databases of thousands or even millions of structures. Computational tools are needed instead, and those need to be made tautomerism-aware. This has not always been the case in the past.

Specifically, when registering a new compound in a database, be it a vendor catalog, a commercial aggregation such as the iResearch Library (iRL) from ChemNavigator (Sigma–Aldrich) [8], or a public database such as PubChem [9], it is typically necessary to know if this compound is truly new, or may already be present but may have been previously entered as a different tautomer. This issue likewise comes up in the context of drug design, e.g. as the question: Are some of the members of my contemplated, in silico-designed, library perhaps present in a commercial catalog, but shown as a different tautomer? It has even been reported that tautomeric pairs were found in vendor catalogs for the same compound sold (unwittingly) as different products at different unit pricing [4, 10]. A less common but more drastic case may occur if all structures in an existing database are recalculated, possibly from a different raw source; which is what happened around 2000 with the NCI Database [11, 12] with the effect that, to our great surprise, about 100,000(!) out of ~250,000 structures appeared to have changed. Subsequent analysis revealed that in most of these cases, the tautomer represented had changed. It is thus not only satisfactory from a theoretical chemoinformatics point of view to correctly handle tautomerism issues, but also of significant practical importance.

The arguably next-best approach to experimentation, sufficiently high-level quantum-chemical computations, are not possible for large numbers of structures due to the enormous amount of computer resources required, not to speak of the difficulties of faithfully representing all environmental conditions in such runs. A rule-based chemoinformatics treatment is therefore usually the only practical approach. This implies that the outcome of such tautomer calculations is dependent on the exact rules used, whether they are implemented in a fixed way in the chemoinformatics tool used, modified by the user starting from a predefined rule set, or completely created “de novo” by the user. It is therefore to be expected that the results of our analyses would look quantitatively somewhat different if they had been conducted with different tools and thus tautomerism definitions. It would make little sense to say, “Your analysis must be wrong since we do find a different degree of duplication by tautomer overlap in our database than you report in your study.”

It would be fascinating to try to compare computational tautomerism rules in an “experimental chemoinformatics” way: identify tautomer pairs (or n-tuples) among commercially available samples, based on different sets of tautomerism rules; purchase a number of such sample pairs; and test them by analytical chemistry methods such as NMR and mass spectrometry, possibly under systematically varied conditions (pH, temperature, solvent, etc.), to determine, at least statistically and based on that sample set, which rule sets better reproduce measured sample identity vs. sample difference. This can obviously not be done as part of this study since it is a large-scale project in its own right.

Our collection of small-molecule structures aggregated from numerous databases of very different origin, purpose, and size, has recently breached the 100 million record limit (see below). Its nature as one of the currently largest databases of existing small molecules, vs. very large databases of computer-generated structures [13], offers the unique opportunity to conduct all kinds of studies on a structure set that is not only highly statistically relevant by its sheer size but simply represents a good part of the real chemistry “out there.”

We attempt to give some quantitative answers in this paper on how prevalent tautomer overlap (according to our tautomerism definitions) is in specific databases that make up our aggregated collection. One primary approach to this is to find a canonical representation independent on which tautomer was originally submitted. Depending on how such an approach is implemented, it does not preclude the possibility of keeping different original tautomeric forms for the registration in databases. We present an approach that achieves this based on our specifically crafted identifiers.

Methods

Data set

The data set used for this study was the aggregated database of structures collected by the Computer-Aided Drug Design (CADD) Group of the National Cancer Institute (NCI). This collection has been put together from a diverse set of small-molecule databases, and is referred to by us as the Chemical Structure DataBase (CSDB). It serves as the central small-molecule repository at the NCI CADD Group. It is a source of both commercially and otherwise available screening samples as well as of structural ideas in general for our internal CADD-type work, the basis for many of our public web services, and convenient fount of structures for chemoinformatics studies such as this. The current main sources for chemical structure records in CSDB are the ChemNavigator iResearch Library of commercially available screening samples [14] and PubChem [9]. Additionally, a few small-molecule database coming from other sources such government agencies and academic groups, as well as some vendor catalogs have been included. In its current version (Jan 2010), the CSDB indexes approximately 103.5 million original structure records, which represent about 70.6 million unique chemical structures [15].

It should be noted that CSDB comprises essentially all well-defined (external) databases used in the CADD Group’s work, including proprietary and non-public ones. It is therefore a superset of the structure sets that are offered to the public in our various web-based tools and downloadable data sets [16]. The CSDB data set was used “as is” for this study (it is, after all, large enough); i.e. we did not specifically try to update all original databases in it, some of which are present in somewhat older versions. Our results can therefore not necessarily be taken as an assessment of the tautomeric situation of databases that are continuously being updated and/or curated and may be different in their current versions.

The PubChem data set was downloaded from the PubChem FTP site [17]. Since we generally need for our CADD work PubChem’s assay results, too, which are only available in PubChem’s Substance set (containing the original structures), we typically download this set. This gave us the structures from all of PubChem’s sub-databases in their “rawest” form, i.e. the form least processed by PubChem relative to the representation submitted by the original provider. We did not additionally download the so-called Compound set, which contains the de-duplicated structures based on the normalization applied by PubChem.

Since we register structure records in CSDB that come from various sources and different chemical structure databases, a crucial step during the registration process is the normalization of the chemical structures. Normalization is needed because what is actually the same chemical may be encoded in different ways in different input databases if not the same database, be it due to certain chemical features of the structure that can lead to variable representation, for instance different tautomers or different resonance structures, or be it caused by ill-defined parts of the structure such as misdrawn functional groups, missing hydrogen atoms, missing charges or incorrect valences.

Structure normalization

Figure 3 illustrates the registration process for a new structure record to be entered in CSDB. This process has been entirely implemented on the basis of the chemical data management system CACTVS [18, 19].

Fig. 3
figure 3

Calculation of the NCI/CADD Chemical Structure Identifiers (FICTS, FICuS, uuuuu)

CACTVS is able to read chemical structures from an extensive list of different file formats, which therefore could in principle be all used in the registration of new compounds in CSDB. However, it has so far only been necessary to process databases using the SD file format for the addition of new records into CSDB.

As illustrated in Fig. 3, at the end of the registration process different parent structures are produced from the original input structure. The first step in this process are a few types of structure correction, which are meant to remove mostly error-based differences in the representation of what is really the same chemical. In contrast hereto, the several variants of our parent structures are obtained by removal of different subsets of chemical features that, if correctly assigned, represent truly different chemicals such as different stereoisomers, different salt forms, differently isotopically labeled compounds etc. This is described in more detail below.

For all these parent structures the corresponding NCI/CADD Chemical Structure Identifiers are generated [20]. For this, the E_HASHISY hash code function [21] in CACTVS is used, which calculates a 16-digit hexadecimal (64-bit unsigned) number from an arbitrary chemical structure. E_HASHISY represents a chemical structure very exactly as drawn, i.e. the hash code value changes as soon as connectivity, bond orders, atom types (including isotopes) or stereochemistry changes in the structure.

The parent structures obtained by the structure normalization process represent the original input structure with different levels of sensitivity to chemical features in the original structure. The sensitivity to specific chemical features is adjusted by switching on or off different algorithmic modules during the structure normalization process. Although we have implemented in total a set of eight variants of our identifiers, the most important ones are the “FICTS,” “FICuS” and “uuuuu” parent structure and identifier, which are calculated for each structure registered in CSDB. The naming scheme behind these identifier designations has been explained elsewhere [20]. Briefly, the five letters “F,” “I,” “C,” “T,” and “S” stand for sensitivity to fragments, isotopic labeling, charges, tautomerism, and stereochemistry information, respectively, that may be present in the input structure. If, for a given identifier, sensitivity to any of these chemical features is switched off, the corresponding upper-case letter is replaced by a lower-case “u” (standing for “un-sensitive”).

The FICTS parent structure and its identifier are thus a very close representation of the original input structure. The normalization procedure for both the parent structure and the identifier consists here mainly of a few corrections that fix and unify some typical drawing deficiencies and variations of how certain chemical features in chemical structures coming from different sources are specified (e.g. different drawing variants of functional groups). The FICTS representation of a structure is sensitive to fragments (such as counterions), isotopes, charges, and stereochemistry in the input structure as well as to the specific tautomer drawn.

The FICuS normalization procedure starts with the same modules as used for the FICTS normalization. The additional step is that a canonical tautomer form is determined (see below). This structure is defined as the FICuS parent structure, whose hash code becomes the FICuS identifier, thus yielding a tautomer-invariant representation of the input structure. Since, e.g., different salt forms, differently isotopically labeled variants, and different stereoisomers of a compound are usually seen by chemists as different chemicals, whereas different tautomers drawn for the same compound are not, FICuS is probably the closest chemoinformatics representation among our identifiers of how a chemist perceives a chemical.

The uuuuu parent structure and identifier are a much generalized representation of the input structure. During the normalization of the uuuuu parent structure only the largest organic fragment is kept, i.e. in the case of (organic) salts and coordination compounds any counterions or the metal complex center, respectively, are removed. The input structure is “un-charged” to its most reasonable state. Finally any information about stereochemistry and isotope labels is deleted from the structure. The uuuuu parent structure and identifier are therefore useful to link together closely related forms of the same chemical compound.

All three variants of parent structure and identifier are calculated when a structure is registered in CSDB and stored in the database. This gives one quite fine-grained control over how each chemical compound present in CSDB can be represented as well as searched with different degrees of sensitivity to different chemical features.

Figure 4 illustrates the relationships between the different parent structures and identifiers after structure normalization.

Fig. 4
figure 4

Relationship between the different NCI/CADD parent structures and identifiers after structure normalization

CSDB currently stores 70.6 million parent structures that are unique by their tautomer-invariant FICuS identifier. Each FICuS parent structure is linked to one or more tautomer-sensitive FICTS parent structures, each of which is in turn linked to one or more original structure records. The current count of FICTS parent structures and original structure records in CSDB is 72.0 million and 103.5 million, respectively. The generic uuuuu parent structure links together different FICuS parent structures that are highly related to each other in the way described above. A total of 65.3 million uuuuu parent structures are currently stored in CSDB.

Enumeration of tautomers

If a structure normalization procedure is to include handling of possible tautomerism of small molecules, several components need to be in place: (1) a set of rules for the possible molecular transforms that define the scope of what is meant by “tautomerism” in the context of this approach; (2) a practical implementation of the generation of tautomers, such as exhaustive enumeration of all unique tautomers within reasonable limits, e.g. achieved by setting certain program parameters; (3) definition of a canonical tautomer, e.g. based on a scoring scheme for various chemical features present in one tautomer vs. another. This latter point is essential if a tautomer-invariant connectivity-based identifier is to be calculated for each input structure, such as is done with the NCI/CADD Chemical Structure Identifiers.

Tautomer rules (SMIRKS transforms)

For the enumeration of tautomers, CACTVS uses a set of 21 tautomer rules that cover a wide range of typical 1,2-, 1,3-, 1,5-, 1,7-, 1,9- and 1,11 hydrogen atom shifts. The transforms encoded in these tautomer rules are based on the SMIRKS line notation originally developed by Daylight Chemical Information Systems, Inc., for the description of reaction substructures and the transformation of atoms and bonds during reactions [22]. Table 1 lists the 21 rules and their SMIRKS transforms used by CACTVS for the tautomer generation.

Table 1 SMIRKS transforms for the enumeration of tautomers. CACTVS provides an extended set of attributes for the definition of SMIRKS that have no counterpart in the original SMIRKS syntax, e.g. the attribute zn indicates the number n of heteroatoms substituted to the corresponding carbon atom. Another additional attribute in CACTVS is the en attribute used in rule 5 (e6 on atom 2) which indicates that the corresponding carbon atom has to be member of a ring with at least n pi atoms

The SMIRKS in rule 1 and 2 address 1,3- and 1,5-keto-enol tautomerism of ketones and enols. Both rules are not restricted to keto and hydroxy groups but also include their sulfur, selenium end tellurium counterparts.

Rule 3 in Table 1 describes 1,3 hydrogen atom shifts of aliphatic imines. Rule 4 handles the special case of imines where a pyridine-type aromatic ring system is created or undone by an aliphatic hydrogen acceptor or donor carbon atom adjacent to the ring (atom 1 of rule 4).

The next seven rules cover hydrogen atom shifts on aromatic heterosystems or aliphatic heteroatoms. The first rule of this group, rule 5, addresses a special case of a short-range 1,3 hydrogen atom shift operation. The rule creates or undoes a heteroaromatic system if the central carbon atom (atom 2 in rule 5) is member of a ring system with six pi electrons. This constraint avoids the generation of unlikely high-energy tautomer forms in other ring systems. For rule 5, atom 1 must be a nitrogen atom, atom 2 has to be a carbon atom, and atom 3 has to be a nitrogen or oxygen atom.

Rule 6 handles 1,3 hydrogen migrations on aromatic hetero systems and aliphatic heteroatoms, however with fewer restrictions than the previous rule. In contrast to rule 5, here the central atom 2 is allowed to be a nitrogen or phosphorus atom, and for the two heteroatom positions 1 and 3 sulfur, oxygen, selenium, and tellurium atoms are additionally accepted.

The next five tautomer rules (rule 7–11) deal with long-range 1.5 hydrogen shifts (rule 7 and 8), or very long range hydrogen atom migrations across 7, 9, or 11 atoms (rule 9–11). While rule 8 is restricted to aromatic systems with nitrogen, oxygen, or sulfur atoms, rule 7 addresses specific aliphatic structures with selenium and tellurium as additional atom types. Rule 9–11 are quite similar to each other in what they do and vary only in the number of intermediate carbon atoms and the specification of element types at the terminal heteroatom positions.

The remaining rules 12–21 handle the tautomerism of very specific compound classes, functional groups and molecules. Rule 12 addresses the tautomerism occurring for furanones. This includes furanone-like molecules with a nitrogen or sulfur atom at terminal atom position 2. The interconversion between a keten and ynol group is governed by rule 13. This rule additionally accepts a sulfur, selenium or tellurium atom in place of the oxygen atom. The tautomerism of nitro groups defined in ionic form or with a pentavalent nitrogen atom is handled by rules 14 and 15. Rule 16 manages the tautomerism of simple oxim-nitroso groups; rule 17 handles the special case of oxim-nitroso tautomerism via a phenol system. The final group of rules (rule 18–21) addresses the tautomerism of cyanic/isocyanic acids, formamidinesulfinic acids, and phosphonic acids.

One type of tautomerism that was not included in this study is ring-chain tautomerism. To the best of our knowledge, no chemoinformatics tool in its standard implementation currently handles general ring-chain tautomerism, presumably because ring-chain tautomerism possesses—at least in the general case—more of a 3D nature than most other forms of tautomerism. I.e. while this type of tautomerism may be well-defined for, e.g., the standard carbohydrates, it is much less clear where, and whether at all, it can occur for any type of molecule for which, e.g., steric hindrance may prevent ring closing.

For the generation of all tautomers of a chemical compound, the SMIRKS rules in Table 1 have to be applied systematically to the structure, i.e. each side of each transform scheme has to be tested for a possible match to the structure and, if the match is successful, transformed to the other side. This has to be repeated systematically in case a new tautomeric center has been created by the previous step and the repeated application of the same transform or the application of another transform would generate yet another tautomer of the structure. If several SMIKRS transforms match the structure all possible combinations of tautomer transformations have to be executed during each step. This process has to be continued until no additional new tautomers can be found, a previously specified maximum number of tautomers has been generated, a specified maximum of transform operations has been performed, or a specified timeout is reached (though the latter was not used in this study).

Implementation of the tautomer generation process in CACTVS

The CACTVS command ens transform generates all tautomers of a structure when applied to a so-called molecular ensemble (in which a structure is stored in CACTVS). The command returns the full set of possible tautomers for this structure as a list of CACTVS molecular ensemble objects, each of which holding a single tautomer. The generation of all possible tautomers is accomplished by a systematic application of all SMIRKS transforms listed in Table 1. The way the transforms are applied is controlled by several parameters of the ens transform command, which can actually be used to perform any formal reaction that can be described as SMIRKS.

The specific parameters used for the tautomer generation during the calculation of our NCI/CADD Structure Identifiers are direction, reactionmode, selectionmode, and maxstructures or maxtransforms, which will be described in some detail because they can have significant influence on the generated results especially for larger and tautomerically more complicated molecules.

The first parameter, direction, is set to the value bidirectional, which means CACTVS attempts to match and execute each of the SMIRKS transforms in Table 1 in both possible directions of the formal reaction they describe. The parameter reactionmode, determining how multiple occurrences of the transform substructures in the original structure are handled, is used with the value multistep. This value specifies that a systematic application of all transforms is performed. Therefore, each new tautomer that has been generated by the application of one of the SMIRKS transforms is resubmitted again until no further new tautomers can be found.

The parameter selectionmode is set to the value all. This mode specifies that all SMIRKS transform in Table 1 are applied to any of the structures generated by any previous step of the tautomer generation, not just to the molecular ensembles obtained by the previous step and in the strict order the transforms have been specified (as would be the case with the mode value sequence).

The parameter maxstructures specifies the maximum number of tautomers that should be returned by the ens transform command. For some structures, the enumeration of tautomers runs into a combinatorial explosion of generated tautomer structures. For the calculation of our NCI/CADD Structure Identifiers, we set maxstructures to an upper limit of 1,000.

Because of the exhaustive application of the SMIRKS rules, in most cases at least a subset of tautomers resulting from a specific rule is identical to already generated tautomers. For the de-duplication of generated tautomer structures, the algorithm behind the ens transform command filters any tautomer duplicates by calculating one of the hash code variants available in CACTVS (E_HASHISY [21]) for each tautomer, thus confining the final set of generated tautomers to a unique set of structures.

If the limit of 1,000 generated tautomers has been reached before exhaustive application of the transform rules, the tautomer generation process is terminated and the corresponding identifier is then flagged as (possibly) unreliable. Such cases of a very high number of generated tautomers are mostly the result of long, complex sequences of transforms that result in tautomer structures of only minor practical interest. Analyses we performed, however, showed that for the majority of structures registered in the database the canonical form was reached within these limits. As mentioned above, for the calculation of our NCI/CADD Structure Identifiers it is not the entire set of tautomers that is of actual interest but instead to obtain one canonical tautomer. While determined by definition (since true, energy-based stability calculations can not be performed, as discussed above), such a canonical tautomer should obviously strive to be a very plausible structure by all accepted measures. Therefore, there is a high likelihood for the canonical tautomer to be found among the first 1,000 generated structures; i.e. even if a structure identifier is nominally flagged as unreliable after tautomer generation there is a high probability that it represents the correct canonical tautomer.

In addition to the program parameters described so far, several other ens transform command flags that have an influence on the enumerated tautomers are used. The flags checkaro is set globally (i.e. for all transforms), which undoes the special CACTVS modification of the original SMIRKS definition, i.e. to consider uppercase elements (in the parlance of the SMILES/SMIRKS syntax) as undefined with respect to aromaticity in a substructure definition, and reverts to the original Daylight implementation insofar as uppercase elements can only match aliphatic atoms, while lowercase elements can only match aromatic atoms. The second flag, preservecharges, controls whether a matched atom is changed to the charge of the matching atom in the specified SMIRKS transform. By default, CACTVS performs this change of charges as long as the corresponding atom has sufficient electrons. If the preservecharges flag is set, charges are not modified. This flag affects rules 14 and 20. For rule 14, the preservecharges flag is un-set since charges should be modified by this transform. For both rules 14 and 20 an additional flag, checkcharges, is set, which specifies that the number of formal charges on the matching side of the transform must be identical to the number of charges on the matched structure.

Definition of the canonical tautomer

After all tautomers of a given input structure have been enumerated, a canonical tautomer has to be defined among this set of generated tautomers. As mentioned, defining the truly chemically preferred tautomer is difficult since it requires treatment of effects such as dipole–dipole repulsion, electronic, and thermodynamic effects and is even quite likely dependent, for the same set of tautomers, on factors such as solvent, temperature, basic vs. acidic environment, etc. The influence of all these effects cannot be calculated easily and quickly. Therefore, we implemented a fast, empirical, rule-based rating algorithm in CACTVS instead. This rating system was established by analyzing several different sets of tautomers and the known preferred tautomer members included in these data sets. Table 2 shows the scoring rules obtained from this analysis.

Table 2 Scoring of structure fragments used for the definition of a canonical tautomer

Each scoring rule is based on the occurrence of certain structure fragment. The general scoring of a tautomer is increased or decreased by the number of scoring points of the corresponding fragment multiplied by its number of occurrences.

The tautomer that has garnered the best scoring of all tautomers in the set is defined as the canonical tautomer of the given structure. If more than one tautomer gets the maximum scoring, the tautomer with the largest hash code value is, quite arbitrarily from a structural point of view, selected as the canonical tautomer form. Generally, this approach does not guarantee that the tautomer defined as the canonical one is the chemically most reasonable or the lowest in energy in absolute terms; however, this is not needed here. The more important aspect is to always find the same tautomer form as the endpoint of the described enumeration process regardless of which tautomer form was given as the starting point to the algorithms. This is guaranteed if the list of SMIRKS transforms was applied exhaustively during the enumeration of tautomers. Even if the limit of 1,000 generated tautomers was hit, the algorithm displayed a still very high reliability of generating the true canonical tautomer (had exhaustive enumeration been done) for compounds of the sizes typically found in small molecule databases.

Normalization of stereochemistry in the canonical tautomer

As shown above (Fig. 2), stereochemistry of a chemical compound can be affected by tautomerism. In order to tackle this problem, we expanded the existing algorithm in CACTVS for defining the canonical tautomer by adding a step for the correction of stereochemistry. Figure 5 illustrates how this works for the example of methyl propenyl ketone.

Fig. 5
figure 5

Normalization of stereochemistry for the canonical tautomer, involving double bonds whose stereochemistry is disregarded in the final step producing the canonical tautomer when the original stereo bond does not have a fixed location during tautomer generation

Methyl propenyl ketone can be drawn as an E or Z stereoisomer, and both stereoisomers can be separated spectroscopically and have different CAS Registry numbers [23, 24]. Notwithstanding this, CACTVS creates a common set of formal tautomers in which the location of the original double bond—including its stereochemistry—has changed.

As mentioned above, whether structures such as these two stereoisomers of methyl propenyl ketone actually interconvert depends on conditions including pH, temperature, and solvent, and in general on structural effects such as steric hindrance [25], conformer energy differences, and barriers to internal rotation [26]. However, all these effects are way beyond the scope of chemical structure identifiers or a database registration process that should be usable for millions of compounds. Another aspect is that, arguably for aesthetic reasons, it is quite common for chemist to draw double bonds with unspecified or unknown stereochemistry in the E form. Therefore, for our tautomer-invariant FICuS parent structure and identifier, we decided to disregard stereochemistry on double bonds that do not have a fixed location during tautomer generation. In contrast to this, the tautomer-sensitive FICTS parent structure and identifier preserve both the specific tautomer and any stereochemistry on double bonds even if it could change position because of tautomerism.

We also developed an extension of the algorithm that removes stereochemistry assignment in a similar way for sp3 hybridized atoms that have assigned R/S stereochemistry but changed to an sp2 hybridized atom at least once during tautomer generation (see the thalidomide example in Fig. 2). For atoms of this type, racemization may occur because of tautomerism. However, general application of this module would be problematic. For instance, the R and S forms of amino acids would not be distinguishable by our tautomer-invariant FICuS identifier anymore since one formal tautomeric form contains an sp2 hybridized alpha carbon atom. We therefore currently do not use this module in our structure normalization algorithm.

An illustration of our exhaustive tautomer enumeration is shown in Fig. 6 for the example of 2-hydroxy-3,4-dimethoxy-6-methylbenzaldehyde (1).

Fig. 6
figure 6

Enumeration of all tautomers of 2-hydroxy-3,4-dimethoxy-6-methylbenzaldehyde (1)

CACTVS generates 12 additional tautomers (2-13). They are displayed in Fig. 6, which also shows the transform rule (from the set given in Table 1) that led to each interconversion listed, plus each tautomer’s scoring calculated on the basis of the scoring scheme elaborated in Table 2. The tautomer that received the highest scoring among the 13 tautomers was structure 1. It is therefore regarded as the canonical tautomer, which is used as the tautomer representing the FICuS parent structure. One should note that a number of double bonds are drawn as crossed bonds. This indicates that these bonds have been explicitly assigned “no stereochemistry” because, though generated in the specific tautomer, they are mobile thus do not have a fixed stereochemistry throughout the entire enumeration process.

Results

Tautomeric analysis of CSDB

Tautomeric overlap within each database (“local” overlap)

Table 3 lists all releases of original databases that are currently contained in CSDB, with various counts and percentages including the results of the analyses based on tautomeric overlap found within each individual database in CSDB.

Table 3 List of all databases present in CSDB. With: the original publisher and the source from where the database was obtained; the number of original structure records and the unique structure counts after de-duplication by the FICTS and FICuS identifiers, and the percentage of duplicate structures by both identifiers; the difference between unique structure counts between the FICTS and FICuS parent structure sets

As can be seen, the majority of individual databases were downloaded from PubChem. As source we used the full database dump provided as Substance SD files at PubChem’s FTP server [17]. This download was performed on 26-Jun-2007, and a second time on 10-Jun-2008 to update CSDB with substance records that had not been part of the first download. This also added a handful of entirely new databases that were only present in this second download.

For ChemNavigator’s iResearch Library, we followed the CADD Group’s quarterly update of this database. The latest update that was included for this paper is the July 2009 release of the iRL. Structure records coming from the iRL are registered on the basis of ChemNavigator’s Structure ID, which is their unique structure identifier. It should be noted that even for Structure IDs that ChemNavigator has marked as inactive or not available any more, we keep them registered in CSDB. Our main interest of having these structures (that were declared as being available at least at some point in time) available for in silico screening experiments overrides the consideration whether this structures are currently available or not (which needs to be individually ascertained before actual sample orders anyway).

ChemNavigator’s structure records currently represent approximately 56% of all structure records registered in CSDB, the percentage of PubChem records is approximately 38%, the remaining original database combine to approximately 6%. Some of these databases, especially some US Government databases such as the Open NCI database, the NIST WebBook, or the NLM ChemIDplus set had been in our collection for a long time (obtained directly as SD files from the original sources) and were not jettisoned for this study, therefore their substantial overlap with the same database’s release from PubChem is not surprising. The large difference in record counts between “our” NLM ChemIDplus version and the ChemIDplus set coming from PubChem stems from the fact that this database contains a lot of records that have no structure. PubChem registered all these records with their own Substance ID, whereas we added only those records that contained a structure in the original ChemIDplus SD file.

Table 3 lists the unique structure counts obtained after FICTS structure normalization (tautomer-sensitive) and FICuS structure normalization (tautomer-invariant), respectively, for each database release in CSDB, as well as the percentage of duplicates with regard to the number of original structure records calculated from these counts. As Table 3 shows, the major part of de-duplication is already achieved by the FICTS structure normalization, which does not include any tautomer normalization. The average percentage of duplicates found across all databases during this normalization step is approximately 6.8% (average of all values in column “% Duplicates by FICTS” in Table 3) when compared to the number of original structure records. For the tautomer-invariant FICuS identifier, the average percentage of duplicates found is at about 7.0% (average of all values in column “% Duplicates by FICuS”), i.e. the average difference between the de-duplication steps by FICTS structure normalization and FICuS normalization is surprisingly small for each release. Especially ChemNavigator seems to use a very strong algorithm for the normalization of structures in general, which also appears to include a very strong tautomer de-duplication step. For the numbers for all database releases obtained from PubChem, it is important to remember that we used the “raw” substance files which had not undergone any normalization by PubChem, thus these numbers represent the quality of the original database releases.

If the tautomer-invariant FICuS identifier hash code value for a chemical compound has a different value than the tautomer-sensitive FICTS identifier, this means that neither the original structure record nor the FICTS parent structure are identical to the canonical tautomer as represented by the FICuS parent structure. In the entire CSDB database, this occurred for 8.9% (9,224,751 records) of the 103,497,350 original structure records. Based on the number of unique FICTS parent structures (72,034,119 records), 8.6% of the FICTS parent structures (6,198,011 records) changed to a different tautomer during the FICuS normalization procedure.

From the perspective of unique FICuS parent structures, about 98.5% of them (69,561,639 records) had a one-to-one relationship to a FICTS parent structure, i.e. even though the tautomer-invariant FICuS parent structure may represent a different tautomer than the tautomer-sensitive FICTS parent structure, there were no conflicts in the sense that any other original tautomer structures (FICTS parent structures) were found assigned to this same canonical tautomer (FICuS parent structure). The group of FICuS parent structures with this one-to-one relationship to a FICTS parent structure represents about 96.6% of the FICTS parent structures and 95.2% of the 103,497,350 original structure records, respectively.

The remaining 1,078,853 FICuS parent structures (1.5%) had multiple FICTS parent structures assigned to them, which is an indication of a tautomer conflict. The frequency of such conflicts is not simply a function of the individual database size: The numbers in the last column of Table 3 range from exactly 0 to nearly 2%.This finding argues against the possible objection that our tautomerism definition is too “aggressive” and will therefore hit a certain percentage of structures in any database no matter how carefully that database was processed or curated.

These tautomer conflicts can be grouped into three classes: (a) the number of tautomer-sensitive FICTS parent structures assigned to one tautomer-invariant FICuS parent structure exceeded the number of original databases in which the FICuS parent structure occurred, (b) the number of FICTS parent structures was the same as the number of databases in which the corresponding FICuS parent structure occurred, or (c) the number of FICTS parent structure was smaller than the number of database. The explanations for these cases are: (a) tautomer conflicts occurred for a specific chemical compound across all databases, plus some tautomer conflicts occurred even within a single database, (b) the same chemical compound was represented as different tautomers in all databases but there were no conflicts within a particular database, and (c), there were several database groups such that each database in its group consistently shared one tautomer representation for a specific chemical compound with all other group members, but different database groups used different tautomer representations. Table 4 shows the different parent structure counts for these cases, listing the number of unique structures by FICuS structure normalization (column “FICuS parent structure count “) and the number of FICTS parent structures that are linked to structures regarded as unique by FICuS structure normalization (column “FICTS parent structure count”). Analogous numbers are shown for the count of structure records in the original databases (column “original structure record count” in Table 4). The percentage values are calculated with respect to the corresponding unique structure or record counts in Fig. 4.

Table 4 Number of tautomer conflicts. As conflict is regarded a case when a tautomer-invariant parent structure (FICuS) is assigned to more than one tautomer-sensitive parent structures (FICTS)

In Fig. 7 we show the tautomeric situation we found for the compound 1-phenyl-3-methyl-4-benzoyl-pyrazolone-5 (HPMBP) as a “real-life” example of a case in class (a). HPMBP is used in liquid membranes for the selective removal of metal ions or molecules from dilute solution [27]. The selectivity and efficiency for the extraction of metal ions seems to depend specifically on the tautomeric form of HPMBP, which in turn is dependent on solvent and the concentration of HPMBP in the liquid membrane. The structures 14-20 shown in Fig. 7 represent all formal tautomers enumerated by CACTVS. The interconversion between the tautomers in subset 14-17 seems to be energetically unhindered [27].

Fig. 7
figure 7

Example of a tautomer conflict found for 1-phenyl-3-methyl-4-benzoyl-pyrazolone-5 (HPMBP)

For five of the seven tautomers, a pair of stereoisomers can be drawn; the remaining two are achiral structures. Three of the tautomers have a CAS Registry Number (CASRN) assigned. CASRN 4551-69-3 has 859 references assigned in SciFinder, CASRN 33064-14-1 has 49 references and CASRN 12711-31-1 occurs with 3 references, which seems to be an indication that the first two structures are the most important tautomeric forms. Our algorithm in CACTVS defines 15 as the canonical tautomer.

Figure 7 also shows the occurrence counts of the HPMBP tautomers in CSDB. One can see that the majority of the seven formally possible tautomeric representations were actually found in one or more of the constituent databases of CSDB. Tautomer 14 was found in six databases (Ambinter, ChemDB, ChemSpider, DiscoveryGate, iResearch Library, Thomson Pharma), tautomer 15 in 12 databases (ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, iResearch Library, MLSMR, NIAID HIV/OI, Scripps Research Institute Molecular Screening Center, Thomson Pharma, ZINC), tautomer 16 was found without indication of stereo configuration in 16 databases (ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma–Aldrich and Thomson Pharma), as R stereoisomer in three databases (ChemSpider, ECOTOX, and ZINC) and as S stereoisomer in two databases (ChemSpider and ZINC). Finally, tautomer 18 was found in ChemDB with no stereo information present.

Tautomeric overlap across databases in CSDB (“global” overlap)

Table 5 provides the results of a more “global” analysis of tautomerism in CSDB in the sense that we look at tautomeric multiplicity of each structure across all of the databases in CSDB and not just within each database (release) as it was done for Table 3.

Table 5 Analysis of “global” tautomerism in CSDB

The column “FICuS structures with formal tautomerism” lists the numbers and percentages of canonical tautomer structures (FICuS structure set) for which CACTVS generates at least one additional formal tautomer. The average percentage value of structures showing tautomerism in CSDB according to our (admittedly “aggressive”) definition is 68.3%. The next column, “occurrences of FICuS structures with multiple FICTS assignment” gives the numbers of FICuS parent structures that occurred with a global conflict, i.e. had more than one FICTS parent structure assigned somewhere in CSDB (see Table 4). The average percentage of FICuS structures for which this occurs in each database release of the CSDB is 9.5%. The last column in Table 5 lists the numbers and percentages of FICuS parent structures which occurred exclusively in that one database release. By definition, this is the fraction of structures for which it is not possible to have a tautomer conflict with other databases in CSDB.

Systematic enumeration of tautomers

Another aspect we were interested in was how large the “chemical space of formal tautomers” is that can be enumerated from the structures in CSDB according to complete set of tautomeric transform rules available in Table 1. By applying these rules to the set of 70,640,491 canonical tautomers (FICuS parent structure set), all possible tautomers were enumerated.

We used essentially the same setup as for the calculation of our NCI/CADD Structure Identifiers, for which we set a maximum of 1,000 tautomers to be generated per input structure. However, since CACTVS attempts to systematically generate tautomers until 1,000 structurally different tautomers are found, it can occur that CACTVS has to perform a large fraction of all possible combinations of SMIRKS transform until this limit is reached (or all possible combinations have been exhausted). For molecules with many tautomeric centers, this can take in the range of minutes. To limit CPU time to a manageable level for this experiment, we therefore set a limit of 1,000 transforms for each structure. This limit is typically reached earlier than the 1,000 tautomer limit, and thus leads to a more linear scaling of CPU time as a function of the number of tautomeric centers.

This procedure created a set of 680,556,829 chemical structures including the original FICuS parent structure set. Table 6 shows how often each CACTVS transform rule from Table 1 was used in the creation of this tautomer set. This may provide a useful statistics about the prevalence—and thus importance in algorithmic approaches—of the various tautomeric transforms encountered for a real database, not just assessed on theoretical grounds. As one can see, the distribution varies widely, and ranges from an order of 100 to more than 100 million.

Table 6 Frequency of application of CACTVS transforms in the systematic generation of all tautomers for the FICuS parent structure (canonical tautomer) set

The number of FICuS parent structures for which the generation of tautomers was not exhaustive (because the limit of 1,000 transforms had been reached) was approximately 1,2 million (~1.7%).

Table 7 shows the distribution of tautomers generated for all canonical tautomers (FICuS parent structures). For only 13.8% of the FICuS parent structures did the application of our rules not generate any tautomers. The maximum number of tautomers generated for one structure was 832. The majority of structures (62.7%) had between one and ten tautomers.

Table 7 : Distribution of the number of tautomers generated per FICuS parent structure

It bears repeating at this point that our definition of tautomerism is a quite “aggressive” one, i.e. quite a few of the tautomers generated by the SMIRKS rules described earlier would be regarded by a chemist as a minor or even entirely unlikely form. For instance, even if a molecule contains as the sole functional group only one single amide group, its imidic acid form is generated as a possible tautomer. Therefore, the number of molecules not showing any appreciable tautomerism—could the experiment be conducted for 70+ million substances—is in all likelihood quite a bit underestimated in Table 7. It must be emphasized, however, that our approach is not meant to most faithfully represent the experimental situation. Its main purpose is not to avoid any energetically unfavorable forms but to tie together as many of the conceivable tautomeric representations of a compound as possible. Our experience has shown that one has to expect to encounter any formally possible tautomer when working with many different databases and tens of millions of structures, and handling this correctly requires a systematic and comprehensive enumeration of tautomers. For this purpose, it is of no negative consequence if we equate tautomeric forms with each other among which there are high-energy forms that are not likely to exist under normal conditions.

This is in contrast to program packages that attempt accurate enumeration of tautomers and ligand protonation states under biological conditions. One such program is Schrödinger’s pK a prediction tool Epik [29, 30]. Epik purposely avoids energetically unfavorable forms (because they would create results for docking experiments that are undesired anyway), hence a much smaller number of generated tautomers can be expected. This is exactly what we found when we applied Epik to a small subset of 700 structures chosen to represent the tautomeric diversity in CSDB and compared the number of generated tautomers to the results of our approach. Such comparisons are therefore of limited relevance for the questions we tried to address in this study.

The price for our comprehensive approach is that we may, in some cases, tautomerically equate structures with each other that have such a high energy barrier for interconversion that they are in reality separate, stable compounds that do not interconvert even long-term. As already mentioned, to answer these questions quantitatively for individual compounds is beyond the means of current chemoinformatics since it requires careful analysis at the orbital level with consideration of the molecule’s environment.

Another important aspect of tautomerism is that different tautomers of the same chemical compound vary in their pattern of double bonds, the form of specific functional groups taking part in the tautomerism, and the position of hydrogen atoms. This has an important consequence for bit-vector representations of molecular structures based on absence or presence of specific fragments and paths in the structure as commonly used for the calculation of Tanimoto-type similarity indices, which thus can be strongly affected by these kinds of structural changes. Therefore, any database searches based on Tanimoto similarities can be affected by tautomerism. Our extremely large tautomer set offered the ideal opportunity to analyze the magnitude of this effect quantitatively.

As part of the generation of the 680 million tautomer structure set, we calculated the Tanimoto similarity between each tautomer and the corresponding canonical tautomer (FICuS parent structure). Table 8 lists the distribution of calculated Tanimoto indices. The Tanimoto similarities were calculated using the PubChem fingerprints [28], which are a fragment-based bit-vector type representation of a chemical structure based on the CACTVS E_SCREEN bit vectors.

Table 8 Distribution of Tanimoto similarities in the entire set of tautomers (680 million structures) between the FICuS parent structure (canonical tautomer) and all derived tautomers]

It stands to reason that calculating Tanimoto indices using fingerprints based on a different selection of fragments and paths, especially if these can be affected to a different degree by moving protons and double bonds in the context of tautomer enumeration, can be expected to yield different results. It was not possible to repeat this analysis with several different fingerprint types for the entire set of structures in CSDB. We did, however, conduct one, admittedly anectodal, comparison with one single molecule (Fig. 8) for the following seven other fingerprinting methods: the functional class fingerprints (FCFP), the extended connectivity fingerprints (ECFP), each calculated for “lengths” (size of evaluated atom spheres around each heavy atom) 2, 4, and 6, and the MDL Public Keys as implemented by Pipeline Pilot [3133].

Fig. 8
figure 8

Low Tanimoto similarity between different tautomers. Structure 21 is regarded as the canonical tautomer by CACTVS, structure 2224 are formal tautomers generated from 21. The italic numbers are the Tanimoto similarity indices between the canonical tautomer and respective tautomer structure calculated by a PubChem/CACTVS E_SCREEN fingerprints (881 bit fragment set), b extended connectivity fingerprints (FCFPs) with length of 2, 4, and 6 as implemented in Pipeline Pilot (1024 bit hashed), c functional class fingerprints (ECFPs) with lengths of 2, 4, and 6 as implemented in Pipeline Pilot (1024 bit hashed), and d MDL Public Keys as implemented in Pipeline Pilot

Figure 8 shows the Tanimoto similarities between the canonical tautomer (21) and the alternative tautomeric forms (2224) of the same chemical compound (NSC 68797). When compared by CACTVS/PubChem fingerprints (values (a)), Tanimoto similarity values all the way down to the 0.4 range were found. The corresponding values for the three (different-lengths) FCFP fingerprints and the three ECFP fingerprints, shown as sets of values (b) and (c), respectively, indicate that both the FCFP and ECFP fingerprints appear to be quite sensitive to tautomerism. The maximum similarity value found by FCFPs and ECFPs for tautomers 2224 compared to the canonical tautomer was 0.62, while the minimum similarity value of 0.10 (tautomer 22) was even lower than for the PubChem fingerprints. For the calculation of ECFPs, the number of connections, the number of bonds to non-hydrogen atoms, the atomic number, the atomic mass, the atomic charge, and the number of attached hydrogens are taken into account [34]. For the FCFPs, structural features like whether an atom is a hydrogen-bond donor, is positively ionizable, is negatively ionizable, is aromatic, or is a halogen are evaluated [34]. Most of these features changes if a different tautomer is used for the calculation.

On the other end, the MDL Public Keys turned out to be the fingerprints least sensitive to tautomerism. The lowest of the Tanimoto similarities for the structures in Fig. 8 (shown as values (c)) was 0.76. For the handful of structures we analyzed besides NSC 68797 (21) for this question the MDL Public Keys were usually less tautomerism-sensitive than the PubChem fingerprints. Both the PubChem fingerprints and the MDL Public Key fingerprints are calculated on the basis of predefined fragment sets. Without going to great lengths, our quick qualitative analysis of the fragments covered by the MDL Public Keys seemed to indicate fewer fragments with explicitly defined hydrogen atoms than for the PubChem fingerprints, which would explain their smaller sensitivity towards tautomers.

All these numbers are noteworthy, and perhaps even worrisome, if one ponders the following question: If we already may miss up to ~23% of matches by Tanimoto similarity at a cutoff of 0.8 due to tautomerism of one and the same compound (for PubChem fingerprints), how many more matches may be missed if we try to find truly just similar, i.e. not identical, compounds, using the same Tanimoto cut-off? To explore this question quantitatively is, however, beyond the scope of this paper. This finding begs the question if a tautomer-invariant form of the Tanimoto similarity (or, more accurately, of the bit vector representations used to calculate it) may be something that may be worthwhile developing. The MDL Public Keys seem to come closest to this among the fingerprints tested, though they certainly are not completely tautomer-invariant.

Analysis of stereochemistry

During the calculation of the FICuS parent structure set and the enumeration of the 680 million tautomers, we also analysed how stereochemistry is affected. We did this separately for E/Z stereochemistry of double bonds and chirality of atomic centers. In 72,556 cases among the 70.6 million canonical tautomers (FICuS parent structure set), explicitly defined stereochemistry on a double bond had been deleted in the way shown in Fig. 5. In 2,049,150 cases (2.9% of the FICuS parent structure set), we registered problematic stereochemistry on atomic centers. However, we did not apply the analogous treatment of removing the stereochemistry from these atomic centers, for the reasons discussed before. These numbers should be placed in the context of 8,079,330 of the FICuS parent structures (11.4%) being classified as possessing fully specified stereochemistry. During the generation of the 680 million tautomer set, we also tallied for how many canonical tautomer structures tautomers with potential stereocenters on atoms or bonds were generated. Both events occurred quite frequently: In 43,381,751 cases (61.4%), at least one tautomer was generated that had one or more double bonds being a potential E/Z stereocenter; potential atomic stereo centers were created for 30,818,806 of the canonical tautomers (43.6%).

Conclusions

According to the tautomerism definition used for the work described in this paper, tautomerism is not a rare occurrence in databases of truly existing compounds. Tautomerism was found to be possible for more than 2/3 of the unique structures in the CSDB. For nearly 5% of the 103.5 million original structure records did we find a case of either local or global tautomerism overlap. Projected onto the set of unique structures (by FICuS identifier), this still occurred in about 1.5% of the cases. Tautomeric overlap within each individual database in CSDB occurred on average for 0.3% of each database’s entries, with values found as high as nearly 2% for some databases. When this analysis was extended to tautomeric overlap across all constituent databases in CSDB, the apparently more frequent occurrence of “tautomerism-critical” molecules across the 150+ individual databases caused the rate of overlap to jump to nearly 10%. In other words, unless one uses a tautomer-invariant approach, one has a nearly one-in-ten chance of missing a match when trying to match any one entry in CSDB with every other structure in our aggregated collection.

As discussed, the tautomerism definition (i.e. the ensemble of tautomeric transform rules) used in our tautomer-related work is rather comprehensive, certainly more so than in many other approaches and software used. Apart from very costly large-scale quantum-chemical calculations, it may take some ingenious and also not cheap experimental work to come to a verdict on which tautomerism definition best represents, at least in a statistical way, the practical situation encountered with databases of existing samples. In general, we believe it is important, from a chemoinformatics point of view, to have a tool for finding structures tautomerically linked to each other even if these tautomers may exist as separable compounds under certain conditions. As we have shown, a tautomerism analysis done right always should allow one to go back to the individual original structure (connectivity). The distribution of number of intra-database tautomer conflicts across the individual CSDB databases also argues against our tautomer definition being unreasonably broad.

We believe that these numbers also indicate that a more careful de-duplication of tautomeric multiples appears to be warranted for many existing databases, whereas some other small-molecule collections appear to be quite “clean” in this regard. Our analyses also seem to point to the necessity of considering the need for tautomer-invariant bit-vector structure representations and ensuing Tanimoto (and related) similarity calculations. All in all, tautomerism appears to be a topic that will be with the chemoinformatics and small-molecule database community for a while.