So you think you understand tautomerism?
- Cite this article as:
- Sayle, R.A. J Comput Aided Mol Des (2010) 24: 485. doi:10.1007/s10822-010-9329-5
It appears so simple at first glance, “tautomers are isomers of organic compounds that readily interconvert, usually by the migration of hydrogen from one atom to another”. If a chemist can describe the problem so succinctly, one might question why the complication of tautomerism remains a considerable challenge to cheminformatics and computer-assisted drug design. With a half-century of experience with representing molecules in computers, and almost limitless modern computational power, the problem should have been solved by now. The unfortunate answer is that the frustration and inconvenience of a database search failing to find matches due to differences in the tautomeric forms of the query and registered compounds is but the tip of an iceberg. Prototropic tautomerism, the movement of hydrogens around a molecule, is just one aspect of an interconnected web of complications. These include mesomerism, aromaticity, protonation state, stereochemistry, conformation, polymerization, photostability, hydrolysis, metabolism and EOCWR (explodes on contact with reality). The common theme is that valence theory, which underlies all modern chemical informatics systems, is an approximate theoretical model for representing molecules mathematically, and, as with all models, it has limitations and domains of applicability. In the physical environments that chemists care about, small organic molecules are often dynamic, existing in multiple equivalent or interconvertible forms. A single connection table can at best represent a snapshot or sample from these populations. Although partial algorithmic solutions exist for handling the most common cases of tautomerism, this perspective hopes to argue that the underlying problems perhaps make tautomerism more complex than it might first appear.
Keywords: Tautomer, Tautomerism, Mesomerism, Protonation state, Enumeration, Resonance, Aromaticity
The field of molecular chemistry owes a debt of gratitude to the work of August Kekulé (1829–1896) and Gilbert N. Lewis (1875–1946) for establishing the structural formula as a mathematical model of chemical structure. The representation of a chemical as an undirected graph, with atoms represented by vertices labeled by elements of a given valence, and bonds represented by edges annotated by bond order, revolutionized chemistry. The enormous influence of this view of chemistry cannot be overstated. Indeed, the connection tables and line notations of modern chemical information systems owe their origins to inventors who could not have foreseen the development of today’s computers.
Particularly striking is that the Lewis/Kekulé structure is the foundation of almost all modern chemical informatics systems, even though more advanced and more accurate theoretical models of chemistry, such as quantum mechanics, have been available for some time. The clue to the longevity of valence theory is its computational tractability. Although quantum mechanical representations, such as molecular orbital theory and valence bond theory, are universally acknowledged to be more faithful models of chemistry, they are considerably harder models to encode and reason about computationally. Perhaps the single greatest attribute of valence theory is the concept of “identity”, followed closely by the less well-defined concepts of substructure, superstructure and similarity. Given two graphs representing molecules, it is a well-defined (and well studied) task to ask whether the two graphs are the same (termed isomorphic in graph theory). Alas, quantum mechanics does not have a comparable notion of equality; one can confirm that two “systems” have the same number of nuclei, with equivalent numbers of protons and neutrons, and that they have the same number of electrons with the same total spin, but each 3D configuration of nuclei gives rise to a different set of wave functions. In most quantum mechanical formulations, there’s no distinction between conformers, tautomers and structural isomers. Every arrangement in space is no more or less unique than any other. The development of next generation cheminformatics systems based on quantum mechanics rather than valence theory would mark a major advance for the field.
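The notion of identity described above can be illustrated with a small labeled-graph isomorphism test. The following is only a sketch under stated assumptions: the data layout (a list of element symbols plus a dictionary of bond orders) and the function name `same_molecule` are hypothetical, hydrogens are left implicit, and the brute-force search over permutations is practical only for a handful of atoms. Production toolkits rely on canonical labeling instead.

```python
from itertools import permutations

def same_molecule(atoms_a, bonds_a, atoms_b, bonds_b):
    """Brute-force isomorphism test for small labeled molecular graphs.

    atoms_*: list of element symbols, indexed by atom number.
    bonds_*: dict mapping frozenset({i, j}) -> bond order.
    """
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(atoms_a)
    for perm in permutations(range(n)):
        # perm[i] is the atom of B onto which atom i of A is mapped
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue
        mapped = {frozenset(perm[i] for i in pair): order
                  for pair, order in bonds_a.items()}
        if mapped == bonds_b:
            return True
    return False

# Heavy atoms of acetaldehyde (C-C=O), numbered in two different orders
a = (["C", "C", "O"], {frozenset({0, 1}): 1, frozenset({1, 2}): 2})
b = (["O", "C", "C"], {frozenset({0, 1}): 2, frozenset({1, 2}): 1})
# Heavy atoms of vinyl alcohol (C=C-O): same atoms, different bond orders
c = (["C", "C", "O"], {frozenset({0, 1}): 2, frozenset({1, 2}): 1})

print(same_molecule(*a, *b))  # renumbering does not change identity
print(same_molecule(*a, *c))  # the keto and enol graphs are distinct
```

The second comparison shows the difficulty in miniature: a pure valence-graph identity test treats the keto and enol forms as different molecules, even though a chemist may regard them as one compound.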
In Fig. 2, I’ve intentionally drawn both Kekulé forms of pyrimidin-4-ol to highlight the relationship between tautomerism and resonance. The aromatic nature of pyrimidine means that there exist two distinct Lewis structures for 4-hydroxypyrimidine, such that a pure valence theoretical representation is not unique or canonical. This is a well known and understood complication in cheminformatics that is readily solved by the introduction of aromatic bond types or canonical Kekulé forms. But the principle is the same for tautomerism, where all of the Lewis structures in Fig. 2 denote a single chemical entity such as a compound dissolved in water or a drug circulating inside a patient. A classic historical perspective on this “two sides of the same coin” is given by Linus Pauling, in the section “The relation between resonance and tautomerism” in his book “The Nature of the Chemical Bond” (appropriately, as mentioned at the start of this article, dedicated to Gilbert Newton Lewis).
This example also highlights a second important point, that the simplification of a system to its major species is an approximation. The two major tautomeric forms considered in Fig. 1 are a reduction of the three tautomeric forms (four resonance forms) in Fig. 2, which in turn are a reduction of a much larger number of significant forms or states available to 4-pyrimidinone in dilute aqueous solution. It is unfortunately not uncommon for schemes given in papers on tautomerism to fail to mention, let alone identify, the predominant species for the conditions under discussion.
Tautomerism, mesomerism and ionization
All four Lewis structures in Fig. 3 denote the same compound, thioacetic acid or thioacetate, and would be expected to be represented by a single entry in a chemical registration system, for example identifier 10-08 in Chemical Abstracts’ registry. The top two neutral forms are clearly tautomers of each other. However, the pKa of thioacetic acid is about 3.33 (at 25 °C), so in aqueous solution at room temperature it would predominantly exist as the bottom two forms, which are resonance forms of each other. Unlike carboxylates with two symmetric oxygen atoms, thiocarboxylates have two distinct Lewis structures, even though quantum mechanically (physically) the electrons are not associated with one particular atom or another.
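To make the claim about the predominant form concrete, the ionized fraction can be estimated with the Henderson–Hasselbalch relation. This is a minimal sketch assuming an ideal dilute solution, a single acidic site, and a pH of 7 (the pH is an assumption, not a value given in the text); the function name is hypothetical.

```python
def fraction_deprotonated(pka, ph):
    """Henderson-Hasselbalch fraction of an acid present as its conjugate base."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))

# pKa 3.33 is the value quoted in the text; pH 7.0 is an assumed neutral solution
f = fraction_deprotonated(3.33, 7.0)
print(f"{f:.4f}")
```

With these inputs more than 99.9% of the compound is present as thioacetate, which is why the anionic resonance forms, rather than either neutral tautomer, describe the species actually in solution.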
Typically, software for registration in a database system will attempt to neutralize a molecule, stripping protons from charged amines and adding protons to “olates” in order to give a molecule no net charge, preferably without zwitterions. For virtual screening, however, software may prefer to normalize molecules to a plausible ionization state around pH 7. Correspondingly, the preferred valence representations of functional groups (such as hypervalent or charge separated forms of nitro groups) are identified by pattern matching alternate forms, and replacing them with their canonical preferred representation. In both processes, there is often no right answer. Mesomeric forms are artifacts of valence theory, so the choice of “preferred” form is purely a matter of convention or style, with different pharmaceutical companies or vendor catalogues often adopting different (competing) aesthetics. Likewise, for protonation state and tautomeric forms, compounds frequently exist in two (or more) forms, with neither being significantly more populated than the other. Provided that the mechanism used to select the representative form is consistent, duplicates can be easily identified.
A convenient property of most examples of tautomerism is that atomic hybridization of each non-terminal atom is preserved. This conservation of local geometry at each atom (trigonal planar for sp2, tetrahedral for sp3, etc.) means that tautomeric forms of molecules have highly similar conformations, and low RMS values for heavy atom superpositions. This general property provides an algorithmic advantage for the 3D analysis of tautomers, where the co-ordinates of one can be used as an approximation or starting point for the co-ordinates of another. The application of this technique in structure-based drug design (docking) is discussed by Sayle and Nicholls. The principle is that a single representative tautomer may be posed in a protein active site, and its family of tautomers rapidly scored using the same pose, rather than the more conventional approach of enumerating all tautomeric forms first, and then performing conformation generation and docking on each independently.
The sp2 example in Fig. 16 also demonstrates the potential interaction between tautomerism (or mesomerism) and stereochemistry [13, 14]. Based upon the Lewis structure, the central phosphorus atom may potentially be incorrectly perceived as a chiral center, even when X or Y are hydroxyl, hydroxylate, thiol or thiolate groups. Likewise, tautomers that conjugate through acyclic double bonds may have problems preserving/annotating cis vs. trans configurations without the ability to represent IUPAC’s s-cis and s-trans forms of stereochemistry (such as of buta-1,3-diene).
Local versus global approaches
Algorithmic solutions to handling tautomerism may be loosely categorized into two broad classes: local methods and global methods. The local class of techniques is based upon pattern matching, encoding rules that apply to small substructures. These methods are easy to implement and catch the majority of important cases. Examples of this sort of approach include the Intervet CACTVS rule set, which contains 21 patterns, AstraZeneca’s Leatherface rule set, which contains 140, the TauThor system of Milletti et al., which iterates a single rule repeatedly, and the database search technique of Trepalin et al., which only handles 1,3-tautomerism and annular tautomerism in 5-membered rings and aromatic carbocycles. Additional examples of local methods can be found in the literature [20, 21].
Global tautomerism approaches tackle the problem in a more holistic manner. In much the same way that computational quantum chemistry codes determine the locations of electrons around a configuration of nuclei, global tautomer algorithms place a specified number of protons on a topological scaffold of heavy atoms. And in much the same way as electrons populate the more energetically favorable orbitals first, the protons associate with the most favorable heavy atoms. One of the simplifying principles of quantum chemistry is the Born–Oppenheimer approximation, which states that electrons reorganize around atomic nuclei fast enough to consider the nuclei fixed and the electronic reorganization instantaneous. The author proposes a variation of this principle for tautomerism, that protons and electrons reorganize around heavy atom nuclei fast enough to consider the heavy atoms fixed and the proton/electronic reorganization instantaneous.
Global approaches to tautomerism necessarily identify a superset of tautomers to those found by local approaches, potentially allowing hydrogens and formal charges to migrate large distances. Example implementations of global tautomerism algorithms include OpenEye Scientific Software’s tautomers [22, 23] and Accelrys’ Pipeline Pilot enumerate tautomers component.
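The core idea of a global method, placing a fixed number of protons on a fixed heavy-atom scaffold, can be sketched as a small combinatorial enumeration. This toy is not the algorithm of any of the products named above: the site labels, the per-site capacities, and the function name `place_protons` are hypothetical, and the sketch deliberately omits the step a real implementation needs, namely checking that each placement admits a valid assignment of bond orders and charges.

```python
from itertools import product

def place_protons(capacities, n_protons):
    """Yield every way to put n_protons on the sites, respecting per-site capacity.

    capacities: dict mapping a site label to the maximum number of mobile
    hydrogens that site can hold.
    """
    sites = sorted(capacities)
    for counts in product(*(range(capacities[s] + 1) for s in sites)):
        if sum(counts) == n_protons:
            yield dict(zip(sites, counts))

# Hypothetical scaffold for thioacetic acid: two heteroatom sites, one mobile H
neutral = list(place_protons({"O": 1, "S": 1}, 1))
# The same scaffold with no mobile hydrogen corresponds to the thioacetate anion
anion = list(place_protons({"O": 1, "S": 1}, 0))

print(neutral)
print(anion)
```

Because every site on the scaffold is considered at once, hydrogens can end up arbitrarily far from where they started, which is the behavior that distinguishes global methods from local pattern matching.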
Comparing the results of “systematic” global algorithms to “pattern matching” local algorithms reveals a number of strengths and weaknesses to each approach. The use of specific (often hand-coded) patterns and rules in local methods produces few surprises: the tautomers and mesomers that are found are those that are looked for. Global techniques, on the other hand, are capable of identifying many obscure and delightful equivalences, never considered by most pattern libraries. These surprises, whilst a benefit when searching for compounds, are often less appreciated during enumeration or canonicalization.
The five computations
1. Comparison. Given two molecules, can we determine that one is a tautomer of the other?
2. Canonicalization. Given a molecule, can we generate a unique encoding of its set of tautomers, such that the encodings of two molecules are identical if and only if those two molecules are tautomers of each other?
3. Enumeration. Given a molecule, can we list all of the molecules that are tautomers of it?
4. Selection. Given a molecule, can we list a subset of its energetically most likely tautomers, given a particular environment (solvent, temperature, pH, binding site, etc.)?
5. Prediction. Given a molecule, can we predict its energetically most likely set of tautomers with their ratios, given a particular environment (solvent, temperature, pH, binding site)?
The first three are conceptually discrete cheminformatics problems that may have a well defined solution, but may differ based upon the operational definition of tautomer. The last two are computational chemistry problems that require some form of floating point energy calculation. The first three are problems in computer science; the last two are problems in chemical physics and physical chemistry.
Canonicalization (#2) is technically a superset of comparison (#1); a solution to either is sufficient for compound registration or duplicate removal, though canonicalization allows for more efficient implementation. An intermediate between these two is the use of tautomeric hash codes, which can be used to speed up comparison but do not have to be unique. With canonicalization, there is also a choice as to whether the canonical representation is a valid molecule, such as a representative member of the set of tautomers, or a more abstract encoding, such as that chosen by IUPAC’s InChI identifier encoding. Of course, schemes that select a representative canonical form do not need to select a physically reasonable tautomer. Any tautomer will do provided it is selected consistently; indeed, some methods choose the alphabetically first SMILES when sorted lexically.
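The “alphabetically first SMILES” convention mentioned above is simple enough to sketch directly. The example assumes two things that a real system must guarantee and that this code does not: that an enumerator returns the same set of tautomers regardless of which member it starts from, and that each SMILES string is itself already canonical. The tautomer sets below are written by hand as stand-ins for enumerator output, and the function names are hypothetical.

```python
def canonical_key(tautomer_smiles):
    """Pick the lexically smallest SMILES as the representative of a tautomer set."""
    return min(tautomer_smiles)

def are_tautomers(set_a, set_b):
    """Comparison (#1) implemented by way of canonicalization (#2)."""
    return canonical_key(set_a) == canonical_key(set_b)

# Hand-written tautomer sets standing in for the output of a real enumerator
pyridone = {"O=c1cccc[nH]1", "Oc1ccccn1"}
hydroxypyridine = {"Oc1ccccn1", "O=c1cccc[nH]1"}
phenol = {"Oc1ccccc1"}

print(canonical_key(pyridone))
print(are_tautomers(pyridone, hydroxypyridine))
print(are_tautomers(pyridone, phenol))
```

The representative chosen here is an accident of character ordering rather than a statement about which form predominates, which is exactly the point made in the text: consistency matters for registration, physical plausibility does not.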
Ultimately, enumerating all possible tautomers is a futile task, impractical for even moderately sized molecules. As a result, evaluating or comparing tautomer software purely by the number of exhaustively enumerated tautomers it can generate is of limited value. Instead, the intelligence of tautomerism software lies not only in the tautomeric forms it generates, but also in the tautomeric forms it discards (task #4).
Selection (#4) and prediction (#5) distinguish themselves from the earlier (perhaps simpler) tasks by relying on a scoring scheme for ranking tautomeric forms by likelihood. These two tasks are distinguished by the quality of the score evaluation. For some applications, a simple triage of plausible versus implausible may be sufficient. For others, a holy grail of the field would be to quantitatively estimate the expected ratios or energy differences between tautomers.
Unfortunately, the very small energy differences between tautomeric forms make them very difficult to calculate accurately. Methods such as quantum mechanics, or atom type based heat of formation/heat of solvation calculations, involve the subtraction of two large numbers to produce a result typically smaller than the expected computational error. Moran et al. present a state-of-the-art high-level quantum mechanical calculation of the 2-pyridinethiol/thione system, describing the significant computational effort required to reproduce the results encoded by even simple rule-based systems.
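The sensitivity of predicted ratios to small energy errors follows from the standard relation between a free energy difference and an equilibrium constant, K = exp(−ΔG/RT). The sketch below assumes a simple two-state equilibrium at 298.15 K; the function name and the sample energy values are illustrative and are not taken from any of the cited studies.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def tautomer_ratio(delta_g_kcal, temp_k=298.15):
    """Equilibrium ratio [B]/[A] for a free energy difference G(B) - G(A)."""
    return math.exp(-delta_g_kcal / (R_KCAL * temp_k))

# How far the predicted ratio shifts for a given error in the energy difference
for error in (0.5, 1.0, 2.0):
    print(error, round(tautomer_ratio(-error), 1))
```

At room temperature an error of roughly 1.4 kcal/mol changes the predicted ratio by a factor of ten, so a method whose uncertainty is of that size cannot reliably say which of two closely balanced tautomers predominates.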
Worse still, the small energy differences that give rise to tautomeric preference can easily be dwarfed by environmental influences, such as choice of solvent or interaction in a protein active site. The influences of local charged groups in a protein active site can completely overwhelm any subtle preferences a molecule may have in solvent or vacuum. Flipping a tautomeric form to produce two complementary hydrogen bonds, where previously there was both a donor-donor clash and an acceptor-acceptor clash, is energetically so favorable that the induced tautomer might never be observed in bulk solvent. The literature is replete with examples: methotrexate binds to dihydrofolate reductase (DHFR) as the protonated (at N1) form, even though it prefers to be neutral in water. The active site histidine residues in zinc binding proteins are typically negatively charged imidazole anions, even though the pKa for that proton loss is about 14.5, or over 7 log units from physiological pH. An excellent example is given by Kenny and Sadowski, who describe the difficulty in reproducing the conformation of indoline in the binding site of cytochrome c peroxidase (PDB 1aek). Although their Leatherface software correctly determines that indoline is expected to be neutral in bulk solvent (pKa of about 4.9), the proximity to ASP235 is sufficient to induce/recognize the protonated form. Indeed, the title of the paper describing the X-ray crystal structure in PDB 1aek, “characterization of an engineered heterocyclic cation-binding site”, would appear to support the fact that indoline is bound as a “heterocyclic cation”. Finally, an abundant source of examples of how environment can influence the preferred tautomeric form is the Cambridge Structural Database.
A study of the polymorphs and crystal packing in such small molecule crystallography databases reveals a significant number of instances where a tautomeric compound is observed in different tautomeric forms within the asymmetric unit of a unit cell. In these cases, the energetic penalty for adopting an alternate tautomer is overcome by the improvement in lattice energy for the resulting crystal packing.
Tautomeric preference and aromaticity
The previous sections have been careful to avoid (or limit) mentioning the principles by which one tautomeric form is preferred over another. Whilst this perspective article strongly argues for the need to develop tautomer “forcefields” or improvements in the methods of calculating Kt, the many terms and forces responsible for the phenomenon of tautomerism are beyond the scope of a single paper. Many greater minds than mine have struggled with Schrödinger’s equation to figure out how to place electrons on a structure, developing breakthroughs like Hartree–Fock self consistent field (HFSCF) theory and density functional theory (DFT). I would not expect the harder task of placing both electrons and protons on a structure to be any easier. However, I’ll take this opportunity to caution against “quick-fix” solutions, by drawing attention to a common misunderstanding (or incorrect assumption) in the field of tautomeric preference.
Researchers are now beginning to tackle the complex task of scoring tautomeric forms. Alas, a recurring feature of the functional forms being proposed, including those of Oellien et al. and Milletti et al., is a strong preference towards the aromatic forms of tautomers. Unlike double bond conjugation, aromaticity has relatively little influence upon the tautomeric preference of small organic molecules. In fact, many of the known examples of aromatic versus aliphatic annular tautomerism are driven more by bond energies, hydrogen bonding and geometrical constraints than by aromatic resonance energy.
The tautomeric form on the left is aromatic, by many aromaticity models including those of OpenEye and Daylight, whilst the form on the right is not. It turns out that whilst the aromatic form is preferred for oxazoles (X=O), the non-aromatic form is preferred for thiazoles (X=S).
To summarize, a review by the author of published experimentally measured tautomeric ratios and tautomeric preferences involving both aromatic and non-aromatic forms reveals that the vast majority of these equilibria prefer the non-aromatic form. This is true, for example, for the tautomer examples given in Katritzky’s “Handbook of Heterocyclic Chemistry”. This observation runs counter to all of the tautomer scoring functions described to date, where biasing for aromaticity potentially breaks more molecules than it fixes. Frustratingly, there remain counter-examples, such as the phenols in Fig. 13, where aromaticity is preferred, so one cannot just always blindly prefer the non-aromatic form instead.
The limited understanding of tautomeric principles by the general population of chemists represents perhaps one of the greatest dangers to the field. Software for handling tautomers may be dismissed by its users unless it returns the type of results that they expect to see. The reason why some tautomer enumeration programs prefer to generate aromatic tautomers, contrary to experimental evidence, is to satisfy customer demand and market forces, and not necessarily to produce the physically observed result.
Once one starts down the slippery slope of considering energetics in tautomerism, things slowly become even more complicated. In addition to the ground-state energy of each tautomeric form, it becomes necessary to consider the energy barriers between them. Given this barrier energy between tautomeric states and the physical conditions (including solvent, temperature, pressure, etc.), one can estimate whether the rate of conversion between tautomers is sufficient to consider them equivalent.
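One common way to turn a barrier height into a rate is the Eyring equation from transition state theory, k = (k_B·T/h)·exp(−ΔG‡/RT). The article does not name a specific rate model, so its use here is an assumption, as are the transmission coefficient of one, the first-order kinetics, and the sample barrier heights; the function names are hypothetical.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
H = 6.62607015e-34   # Planck constant, J*s
R = 8.314462618      # gas constant, J/(mol*K)

def eyring_rate(barrier_kj_mol, temp_k=298.15):
    """First-order rate constant (1/s) for a free energy barrier in kJ/mol.

    Assumes a transmission coefficient of 1.
    """
    return (K_B * temp_k / H) * math.exp(-barrier_kj_mol * 1000.0 / (R * temp_k))

def half_life(barrier_kj_mol, temp_k=298.15):
    """Half-life in seconds of a first-order interconversion over the barrier."""
    return math.log(2) / eyring_rate(barrier_kj_mol, temp_k)

for barrier in (50.0, 100.0, 150.0):
    print(barrier, f"{half_life(barrier):.3g} s")
```

Under these assumptions a 50 kJ/mol barrier gives a sub-millisecond half-life at room temperature, whereas 100 kJ/mol gives a half-life of several hours, so whether two forms count as “the same compound” depends on the timescale of the experiment as much as on the structures themselves.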
By reductio ad absurdum, there must be something suspicious in our definition of tautomerism. Even if we restrict the above definition to “… the movement of a hydrogen atom”, a little thought shows that both of the above pairs can be converted by a sequence of single proton relocations. The critical missing ingredient with the more pedantic definitions of tautomerism is the qualification of “interconvert”. Two molecules are considered tautomers of each other if they readily interconvert by the movement of hydrogen atoms. Of course, the term “readily” then needs its own precise definition.
This view of tautomeric energy landscapes highlights both the potential asymmetry and non-reflexive nature of tautomer enumeration. Potentially, the enumerated set of tautomers of B may contain A, but the enumerated set of tautomers of A might not contain B. Likewise, the set of potential tautomers for a compound might not contain itself.
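These two properties can be checked mechanically for any enumerator. The sketch below uses a deliberately tiny, hypothetical enumerator (a lookup table in which a high-energy form B relaxes to a stable form A) to show what a failure of reflexivity and symmetry looks like; the labels A and B and the function name `relation_report` are illustrative only.

```python
def relation_report(enumerate_tautomers, molecules):
    """Return the molecules that break reflexivity and the pairs that break symmetry."""
    non_reflexive = [m for m in molecules if m not in enumerate_tautomers(m)]
    asymmetric = [(a, b)
                  for a in molecules
                  for b in enumerate_tautomers(a)
                  if a not in enumerate_tautomers(b)]
    return non_reflexive, asymmetric

# Hypothetical enumerator that only reports low-energy forms:
# A is stable, B is a high-energy form, so enumerating either yields A alone.
toy = {"A": {"A"}, "B": {"A"}}

non_reflexive, asymmetric = relation_report(lambda m: toy[m], ["A", "B"])
print(non_reflexive)  # B is missing from its own tautomer set
print(asymmetric)     # B lists A, but A does not list B
```

An enumerator with this behavior can still be useful for selecting plausible forms, but it cannot be used directly for comparison or canonicalization, both of which require the tautomer relation to behave as an equivalence.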
Inside the body or living cells, things are even more complex as various enzymes transform xenobiotic molecules into various metabolites. The entire field of prodrugs and active metabolites concerns the preparation of compounds that become equivalent at their point of therapeutic interaction.
The ray of hope amongst the dark litany of problems and pitfalls is that the pharmaceutical industry hasn’t done too badly at improving human healthcare to date, without fully capturing or rationalizing all of the subtlety of the underlying chemistry. Perhaps mastery of tautomerism and resonance forms is a mostly academic challenge and not a critical path issue for rational drug design. This may be much like the field of aerodynamics, where the lack of understanding of how bees fly (until relatively recently) has not prevented the aerospace industry, and companies like Boeing, from making many billions of dollars and putting a man on the moon.
The author would like to acknowledge the patient mentoring on the complex subject of tautomers from Peter Taylor, Peter Kenny and John Bradshaw. I’d also like to thank Evan Bolton, Andrew Grant, Ben Ellingson, Jack Delany, Geoff Skillman, Jens Sadowski, Hugo Kubinyi and Yvonne Martin for many interesting and enlightening discussions and numerous perplexing tautomeric examples.