Background & Summary

Bio-orthogonal click reactions provide a straightforward method to tag and image certain biomolecules, enabling the study of their function in vivo1,2,3. Besides real-time imaging, bio-orthogonal click reactions have also been considered for advanced applications in medicine/healthcare, e.g., targeted cancer treatments and in situ drug assembly4,5. New candidate reactions–especially mutually orthogonal ones–are highly sought after6,7,8.

A wide variety of criteria need to be fulfilled for a reaction to be considered suitable for bio-orthogonal click chemistry: (a) the reactants should be stable and the reaction selective, i.e., reactant degradation due to biological conditions (pH, solvent, etc.) should be limited and side reactions with biomolecules should not occur at appreciable rates; (b) the selective reaction should be sufficiently fast at room/body temperature so that ligation can take place before clearance of the reactants from the studied organism; and (c) it should be possible to incorporate the reactants into biomolecules via some form of metabolic or protein engineering. Ideally the compounds should also be small enough to not disturb the native behavior of the biological system under investigation9. As such, data regarding thermodynamic stability as well as on- and off-target kinetics could facilitate a principled screening campaign to discover novel candidate bio-orthogonal click reactions.

Here, we present a computational dataset of hypothetical click reactions evaluated for their potential bio-orthogonality constructed via a high-throughput reaction profile evaluation. Due to the enormous success–and computationally accessible mechanism–of [3 + 2] dipolar cycloaddition reactions in bio-orthogonal chemistry so far9, we decided to focus on this reaction class. A diverse chemical space spanning close to 5 M synthetic dipole-dipolarophile pairs was defined (Fig. 1a) and the reaction profiles, i.e., activation and reaction energies, were evaluated for a representative subset of over 3000 individual reactions. We also probe the inherent reactivity of various dipoles towards a variety of “native” dipolarophiles, i.e., unsaturated motifs within biomolecules that are present in varying concentrations in living organisms. To this end, more than 2500 off-target reactions involving 12 distinct “biofragments” (green box in Fig. 1a), selected from the literature, were computed as well.

Fig. 1

(a) Schematic overview of the search space for the considered dipoles and dipolarophiles. In the green box, the biologically relevant motifs considered in this study are presented. These motifs are inspired respectively by fatty acids and prostaglandins26, fumaric and maleic acid27, ubiquinone or co-enzyme Q (a common coenzyme family that is ubiquitous in animals and most bacteria)28, terpineols29, NADH/NADPH (an important and common co-enzyme playing a pivotal role in the metabolism of most species)30, retinal (a chromophore central to visual perception)31, β-carotene32, and the arginine amino-acid and argininosuccinic acid (an important intermediate in the urea cycle)33. (b) Histogram representing the distribution of the computed activation energies (ΔG). (c) Histogram representing the distribution of the computed reaction energies (ΔGr). (d) Correlation plot between activation and reaction energies.

A limited benchmarking, with respect to both computational reference data as well as experimental data, points to B3LYP-D3(BJ)/def2-TZVP//B3LYP-D3(BJ)/def2-SVP10,11,12,13,14,15 as the level of theory striking the optimal balance between performance and cost (vide infra). To mimic biological environments, water was taken as the solvent, which was modelled through the SMD polarizable continuum model, and standard temperature and pressure were assumed16. Histograms depicting the distribution of the activation energies (ΔG) and reaction energies (ΔGr) obtained with these settings for the 5269 successfully computed reaction profiles (out of a total of 5974 considered reaction SMILES) are provided in Fig. 1b–d. A significant spread in both the activation energies and the reaction energies is clearly obtained (the standard deviations amount to 9.8 and 21.7 kcal/mol respectively), indicating that chemically diverse reacting systems were sampled. In Fig. 2, the reactions with respectively the three highest and lowest activation energies found are presented. Even though the dataset contains some diversity in dipole/dipolarophile scaffolds, reaction and activation energies correlate quite well (R² = 0.71; see Fig. 1d) in accordance with the Bell-Evans-Polanyi principle17.

Fig. 2

(a) The three reactions with the highest computed barriers. (b) The three reactions with the lowest computed barriers.

The produced dataset should prove useful for the construction of (surrogate) machine learning (ML) models for the discovery of promising dipole-dipolarophile combinations within the defined search space. Furthermore, the provided data will be useful for the development of methodologies for reactivity prediction in general, particularly due to the inclusion of high-quality 3D geometries for each species (see below).

Methods

To construct the dataset of reaction profiles for [3 + 2] cycloaddition reactions, an automated workflow based on the Python package autodE18, recently developed by the Duarte group, has been set up. autodE is a package that can be coupled to various popular electronic structure programs to automate the otherwise laborious task of computing full reaction profiles; here, we opted for Gaussian1619.

Concise overview of the autodE methodology

The starting point of an autodE computation can be a simple reaction SMILES string, together with a specification of the solvent, temperature and level of theory. In a first step, the provided SMILES string is converted into 3D structures for reactants and products with the help of RDKit20. Subsequently, a conformer search is performed for each of the generated species with the help of the ETKDGv3 algorithm21. Each unique conformer is then optimized at the GFN2-xTB level of theory22. Next, the conformers are ranked based on their relative energy. If the ade.Config.hmethod_conformers keyword is set to False, the final GFN2-xTB energies are used for this ranking; otherwise, single-point low-level DFT–here B3LYP-D3(BJ)/def2-SVP–refinements are used (vide infra). The lowest energy conformer is then further optimized with the same low-level DFT functional and basis set combination. Subsequently, graphs are constructed for both reactants and products based on bond length criteria, and the bond rearrangements throughout the reaction are identified. A guess for the transition state (TS) connecting reactants and products is then generated based on the adaptive path method, after which the TS is optimized (with low-level DFT). In the next step, a conformer search is performed on the TS with the help of the randomize and relax (RR) algorithm, after which the lowest energy conformer is selected and optimized with the low-level DFT method, and the validity of the TS is confirmed through a detailed analysis of the imaginary mode18,23.

Once validated structures for reactants, products and TSs have been generated, the energy of each species is refined at a higher DFT level–here B3LYP-D3(BJ)/def2-TZVP–and frequency calculations are performed for the reactants and products at the lower DFT level.

Definition of the chemical space of dipoles and dipolarophiles

First, a combinatorial dataset of 1,3-dipoles and dipolarophiles was generated. For the dipoles, 15 allyl and propargyl-allenyl scaffolds24, decorated with a variety of substituents (Fig. 1a), were selected. Additionally, (decorated) heterocyclic 5-membered rings based on 1,3-dipoles, e.g. sydnone- and münchnone-derivatives, were also included24. Since dipoles used in bio-orthogonal click applications usually require an attachment site that can be linked to either the target biomolecule or the (fluorescent) probe, only substituent combinations containing at least one substituent ending with an (extendable) methyl or phenyl group were considered.

For the dipolarophiles, 3 strained scaffolds (cyclooctyne, norbornene and norbornadiene)9 as well as regular ethylene and acetylene scaffolds, each with a variety of substituents, were selected (see Fig. 1a). For the ethylene and acetylene scaffolds, the same constraint on substituent combination diversity as for the dipoles was introduced (the remaining dipolarophile scaffolds have several alternative attachment points which can be connected to a target molecule/fluorescent probe). Additional constraints were placed on the substituent combinations for the ethylene and cyclooctyne scaffolds to rein in the combinatorial explosion of possibilities. For ethylene, the total number of distinct substituent types was fixed to two at most; for cyclooctyne, the substituents on at least one of the functionalized carbon sites were forced to be the same.

In total, 3555 dipoles and 1339 dipolarophiles were generated in this manner with the help of RDKit20, which results in a combinatorial chemical space of almost 4.8 M reacting systems and–taking regio-isomerism into account–9.6 M+ unique reactions. It should be noted that not every dipole and dipolarophile in the search space can be expected to be stable. However, the most extreme cases of instability, namely, those species that either do not exist at a local minimum in their respective potential energy surface (PES), or participate in dipole-dipolarophile combinations that give rise to no reaction barrier, will inherently be filtered out downstream in our workflow upon sampling during the electronic structure calculations and the associated isomorphism tests of the molecular graphs (vide infra)18.

From this vast chemical space, a computationally tractable subset was extracted. To ensure representation of each individual scaffold type, both dipoles and dipolarophiles were subdivided into individual buckets and random sub-samples were taken (with replacement) from each. For the dipoles, 1000 species were selected from the allyl-based scaffolds, 200 from the propargyl-allenyl scaffolds, and 300 from the heterocyclic 5-membered rings. For both the ethylene and acetylene dipolarophile scaffolds, 200 species were sampled, whereas 300 were sampled for both the norbornene and oxo-norbornadiene scaffolds. For the cyclooctyne scaffold, 500 species were selected.

Subsequently, the sampled list of dipolarophiles was shuffled, after which the entries of the dipole and dipolarophile lists were matched based on index values. The reactant systems generated in this manner were subsequently passed to the RunReactants function from RDChiral together with a set of SMARTS templates to generate full, atom-mapped reaction SMILES (vide infra)25. The SMARTS templates used are shown in Table 1.
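The sampling and pairing procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' script: the bucket contents are placeholder labels, the function names and the fixed random seed are assumptions, and only the per-scaffold quotas are taken from the text. The generation of atom-mapped reaction SMILES from the resulting pairs is not covered.

```python
import random


def stratified_sample(buckets, quotas, rng):
    """Draw quotas[name] entries (with replacement) from each named bucket."""
    sample = []
    for name, quota in quotas.items():
        sample.extend(rng.choices(buckets[name], k=quota))
    return sample


def pair_by_index(dipoles, dipolarophiles, rng):
    """Shuffle the dipolarophile list, then match both lists entry by entry."""
    shuffled = list(dipolarophiles)
    rng.shuffle(shuffled)
    return list(zip(dipoles, shuffled))


# Quotas as stated in the text; the bucket contents below are placeholders.
DIPOLE_QUOTAS = {"allyl": 1000, "propargyl_allenyl": 200, "cyclic": 300}
DIPOLAROPHILE_QUOTAS = {
    "ethylene": 200,
    "acetylene": 200,
    "norbornene": 300,
    "oxo_norbornadiene": 300,
    "cyclooctyne": 500,
}

rng = random.Random(0)  # seed chosen here only to make the sketch reproducible
dipole_buckets = {
    name: [f"{name}_{i}" for i in range(50)] for name in DIPOLE_QUOTAS
}
dipolarophile_buckets = {
    name: [f"{name}_{i}" for i in range(50)] for name in DIPOLAROPHILE_QUOTAS
}

dipoles = stratified_sample(dipole_buckets, DIPOLE_QUOTAS, rng)
dipolarophiles = stratified_sample(dipolarophile_buckets, DIPOLAROPHILE_QUOTAS, rng)
pairs = pair_by_index(dipoles, dipolarophiles, rng)
```

Because both samples contain 1500 entries, index matching yields exactly one dipolarophile per dipole; shuffling one list first removes the scaffold-by-scaffold ordering that the bucketed sampling would otherwise impose on the pairs.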

Table 1 Overview of the SMARTS templates used.

To probe potential off-target reactivity, 12 unsaturated biologically relevant motifs that are present/are produced as metabolites in vivo were selected26,27,28,29,30,31,32,33. To construct a dataset of reactions involving these biofragments, a new sample of 1500 dipoles was taken in the same manner as before, as well as a sample (with replacement) of 1500 biofragments, see the green box in Fig. 1a for an overview of the motifs considered. Subsequently, the sampled biofragment list was again shuffled and the entries of the two lists were matched based on index value.

Solvent correction and thermal effects

To mimic biological environments, we model the reaction systems with water as a solvent through the SMD polarizable continuum model16. A recent computational study by Yang and co-workers investigated the role of explicit water molecules on 1,3-cycloaddition reactions using DFT-quality neural network potentials and compared to experimental data. They concluded that even in the most extreme cases considered, aqueous cycloaddition reactions tended to retain a (quasi-)concerted mechanism and most qualitative solvation trends could already be successfully recovered through implicit modelling34. Our own benchmarking results also corroborate the adequacy of implicit solvent modelling to recover experimental trends (Fig. 6).

The choice of water as solvent has been informed by its ubiquity in biological settings, though this may not always be an appropriate approximation for the actual environment in which these click reactions will take place (e.g., in the vicinity of cell membranes or in protein condensates). It is, however, well-established that the rates of 1,3-cycloaddition reactions are fairly insensitive to solvent polarity, so that deviations from the idealized solvent environment are not an issue of major concern35.

To retain the generality of our approach, we selected a standard state for the computation of thermal corrections as 1 mol/L. The temperature of each reaction was set to room temperature (298.15 K). It should be evident that in practice, significant deviations from these conditions can be expected, e.g., the body temperatures of most mammals hover around 310 K, which will make the reactions slightly more facile36. Nevertheless, the presented values under standard conditions are suitable reference values that can be transferred to the specific application under consideration.
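To make these two conventions concrete, the sketch below evaluates (i) the textbook free-energy shift per species on going from a 1 atm ideal-gas standard state to 1 mol/L, and (ii) the Eyring rate ratio between 310 K and 298.15 K. Both functions are illustrative assumptions rather than part of the described workflow; in particular, the rate ratio treats the activation free energy as temperature-independent, which neglects the entropic contribution to its temperature dependence.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)
R_L_ATM = 0.082057    # gas constant in L atm/(mol K)


def standard_state_correction(temperature=298.15):
    """Free-energy shift (kcal/mol) per species from 1 atm to 1 mol/L."""
    # R_L_ATM * T is the ideal-gas molar volume (L/mol) at 1 atm.
    return R_KCAL * temperature * math.log(R_L_ATM * temperature)


def eyring_rate_ratio(dg_act, t_low=298.15, t_high=310.0):
    """k(t_high) / k(t_low) from the Eyring equation.

    Assumes the activation free energy dg_act (kcal/mol) does not
    change between the two temperatures.
    """
    prefactor = t_high / t_low
    exponent = dg_act / R_KCAL * (1.0 / t_low - 1.0 / t_high)
    return prefactor * math.exp(exponent)


correction = standard_state_correction()
speed_up = eyring_rate_ratio(22.0)  # 22 kcal/mol is an illustrative barrier
```

Under these assumptions the standard-state shift amounts to a little under 2 kcal/mol per species at room temperature, and a barrier in the low twenties of kcal/mol corresponds to a rate increase of a few-fold at body temperature, in line with the qualitative statement above.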

Selection of DFT functional and basis set and validation of workflow

The DFT level of theory was selected based on a (limited) benchmarking study. In total, 4 different functional and dispersion correction combinations were considered (B3LYP-D3(BJ)10,11,12,13, PBE0-D3(BJ)37, M06-2X38 and ωB97X-D39,40), in combination with the def2-SVP basis set14 for optimizations and frequency/thermal correction calculations, and def2-TZVP15 for single-point calculations.

First, we selected the nine 1,3-dipolar cycloaddition reactions within the (revised)41 BHPERI dataset (see Fig. 3)42,43 and compared the electronic energies to the Wn-F12 values from the literature44. All levels of theory tested resulted in almost perfect correlations (R² ≥ 0.99), but B3LYP-D3(BJ) clearly outperformed the other functionals in terms of mean absolute error (MAE; 1.1 kcal/mol vs > 1.8 kcal/mol) and root-mean-square error (RMSE; 1.5 kcal/mol vs > 2.0 kcal/mol), see Fig. 4. It should be noted that our findings here are also in line with the results presented in the very recent benchmarking study on pericyclic reactions by Bickelhaupt and co-workers: across the broader set of pericyclic reactions investigated, M06-2X was found to outperform B3LYP-D3(BJ), but specifically for the 1,3-dipolar cycloaddition barrier probed, B3LYP-D3(BJ) reproduced the CCSDT(Q)/CBS barrier within 1 kcal/mol, outperforming each of the other functionals by 1–3 kcal/mol45.
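For readers who wish to reproduce this type of comparison, the three statistics can be computed as in the sketch below. The function names are assumptions and the numbers in the example are made up for illustration; they are not the benchmark values. R² is taken here as the squared Pearson correlation coefficient, which is one common convention and may differ from the definition used for the figures.

```python
import math


def mae(reference, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - r) for r, p in zip(reference, predicted)) / len(reference)


def rmse(reference, predicted):
    """Root-mean-square error between two equally long sequences."""
    squared = sum((p - r) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared / len(reference))


def r_squared(reference, predicted):
    """Squared Pearson correlation coefficient."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_r * var_p)


# Hypothetical barriers in kcal/mol, for illustration only.
reference_barriers = [10.0, 20.0, 30.0]
computed_barriers = [11.0, 19.0, 32.0]

example_mae = mae(reference_barriers, computed_barriers)
example_rmse = rmse(reference_barriers, computed_barriers)
example_r2 = r_squared(reference_barriers, computed_barriers)
```

A high R² combined with a sizeable MAE signals a systematic offset rather than scatter, which is why the correlation coefficient and the error metrics are reported side by side.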

Fig. 3

Reference reaction data extracted from the (revised) BHPERI dataset41.

Fig. 4

Benchmarking functional + dispersion correction combinations against the (revised) BHPERI dataset41. The translucent bands around the regression line correspond to the 95% confidence interval determined through bootstrap resampling.

Next, we curated an experimental benchmarking dataset based on rate constants found in a series of papers authored by Huisgen and co-workers on the kinetics (and mechanism) of 1,3-dipolar reactions35,46,47,48. We transformed 11 reported experimental rate constants to Gibbs free energies of activation with the help of the Eyring equation assuming no recrossing (Fig. 5) and compared those to the corresponding values computed with our workflow (Fig. 6)49. Correlations between experimental and computed values were still excellent (R² > 0.92), but the best result was now obtained for ωB97X-D (R² = 0.94). However, M06-2X and ωB97X-D now yielded significant systematic errors (MAEs of 4.1–4.2 kcal/mol), whereas the experimental and computational barriers agreed much better for B3LYP-D3(BJ) and PBE0-D3(BJ) (MAEs of 2.7–2.8 kcal/mol). Putting all of this together, we decided to settle on the B3LYP-D3(BJ)/def2-TZVP//B3LYP-D3(BJ)/def2-SVP level of theory.
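The conversion from a rate constant to a Gibbs free energy of activation follows from inverting the Eyring equation, k = (k_B T / h) exp(−ΔG/RT), with the transmission coefficient set to one (no recrossing). The sketch below is a minimal illustration; the function names are assumptions, and for bimolecular reactions the rate constant is taken to be expressed in M⁻¹ s⁻¹ so that the result refers to a 1 mol/L standard state.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant in J/K
H = 6.62607015e-34     # Planck constant in J s
R_KCAL = 1.987204e-3   # gas constant in kcal/(mol K)


def activation_free_energy(rate_constant, temperature):
    """Gibbs free energy of activation (kcal/mol) from a rate constant.

    Inverts the Eyring equation with a transmission coefficient of 1.
    """
    prefactor = K_B * temperature / H
    return -R_KCAL * temperature * math.log(rate_constant / prefactor)


def eyring_rate_constant(dg_act, temperature):
    """Rate constant from an activation free energy (kcal/mol)."""
    prefactor = K_B * temperature / H
    return prefactor * math.exp(-dg_act / (R_KCAL * temperature))


# A rate constant of 1 (in the units discussed above) at room temperature.
barrier = activation_free_energy(1.0, 298.15)
```

As a sanity check, a rate constant of one at room temperature corresponds to a barrier of roughly 17.5 kcal/mol, and every factor of ten in the rate constant shifts the barrier by about 1.4 kcal/mol.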

Fig. 5

Reference reaction data extracted from the literature35,46,47,48. Temperatures of 0 K indicate that the originally reported values are activation enthalpies ΔH‡(0 K), obtained through extrapolation from measurements at different temperatures.

Fig. 6

Benchmarking functional + dispersion correction combinations against the experimental dataset extracted from the literature. The translucent bands around the regression line correspond to the 95% confidence interval determined through bootstrap resampling.

Generation of reaction profiles with autodE

For every reaction SMILES generated with RDChiral, an autodE18 workflow to compute the associated reaction profile with the help of Gaussian1619 was executed in a fully automated, high-throughput manner. In the first instance, the default settings of autodE were retained, and only the functionals and/or dispersion corrections were adjusted based on the benchmarking results (vide supra). More information about this workflow can be found in the autodE documentation, and will not be discussed here at length.

We directly select the lowest energy conformer for reactant and product from GFN2-xTB22 optimization results for the generated conformer set, rather than from (single-point) DFT computations (by setting the ade.Config.hmethod_conformers keyword to False). While this approximation/simplification may be problematic when dealing with highly flexible compounds, it has a fairly limited effect on the results for our reactions, which involve mostly small and rigid compounds (cf. the excellent accuracy obtained during benchmarking and the tests performed on the azide reaction set; vide infra). At the same time, this simplification resulted in a significant speed up of the reaction profile generation.

We also checked whether reactant and product complexes should be considered in the computed reaction profiles. By default, autodE does not compute complexes when thermal corrections (Gibbs free energies) are requested for the species along the profile, primarily because loose complexes with extremely low frequencies result in a high uncertainty for these quantities. To justify this choice, we computed reaction profiles for the (gas-phase) BHPERI dataset41 with both complexes and thermal corrections at room temperature. For each of the reactions for which the profile calculation terminated successfully, shallow minima for the complexes were found on the electronic energy surface, but thermal corrections erased these minima completely; the Gibbs free energy of the complexes consistently lay several kcal/mol above the isolated reactants (Table 2). Since electrostatic attraction–which constitutes the bulk of the complexation interactions–tends to diminish sharply when going from the gas phase to a (polar) solvent, this result suggests that complex formation will not impact the reactivity for our cycloaddition reactions in a significant manner and can thus be safely neglected.

Table 2 Complexation energies for the cycloaddition reactions of the BHPERI dataset41, where ΔEcomplex stands for the electronic energy (+zero-point correction), and ΔGcomplex stands for the Gibbs free energy (at standard conditions, i.e., 298.15 K and 1 M).

Ensuring stereochemical consistency between reactants, transition states, and products

By default, autodE18 does not exchange stereochemical information between reactants and products in a reaction SMILES; it simply searches for the conformations of each species which have the lowest energy globally in an independent manner. The transition state (TS) therefore exclusively inherits the stereochemistry from one side. In the case of addition reactions, the product side stereochemistry is selected by default. Consequently, stereochemical compatibility between reactants, products and TSs is not guaranteed by default. As part of our workflow, a set of scripts was written to enforce such compatibility along the entire reaction profile.

For the reactive sites, i.e., the atoms undergoing a change in bonding situation throughout the reaction, stemming from the dipolarophiles, it is possible to readily enforce stereo-compatibility by setting the stereotags in the SMILES representations outputted from RDChiral25 before the reaction profile computation is initiated, since doing so places constraints on the respective conformer search spaces in autodE. More specifically, we aimed to ensure that cis substituents in the reactant end up on the same side of the plane defined by the formed ring in the product (and vice versa for the trans substituents), see Fig. 7a.

Fig. 7

(a) Illustration of stereo-retention around the unsaturated dipolarophile bond during a 1,3-dipolar cycloaddition reaction. (b) Schematic overview of the steps taken to ensure selection of a compatible dipole conformer: first, an approximate reactive dipole conformer is determined based on the selected (most stable) product/TS conformer geometry emerging from the reaction profile calculation; subsequently, the rotatability of the dipole bonds is assessed and the original reactant dipole conformer in the reaction profile is replaced by the (optimized) reactive one whenever rotation is determined to be too hindered.

To this end, we first verify that both centers undergoing addition are recognized as stereocenters in RDKit20. Subsequently, the two potential product stereoisomers involving these centers are generated by setting the respective chiral tags, and guess structures are determined through a quick force field (MMFF9450) optimization. A similar optimization is performed for the reactant dipolarophile. Next, the dihedral angles for both potential products are compared to the corresponding dihedral angle in the reactants (0° or 180°). The product SMILES resulting in the lowest deviation in dihedral angle from the reactant geometry is then retained as the stereochemically correct product.
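The geometric comparison at the heart of this selection step can be illustrated with plain coordinate arithmetic. The sketch below is not the authors' script: it assumes that the four atoms defining the dihedral angle have already been identified, that their Cartesian coordinates are available as 3-tuples (e.g., from the force-field guess structures), and that the candidate labels are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bond vectors onto the plane perpendicular to b1.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def angular_deviation(angle_a, angle_b):
    """Smallest absolute difference between two angles (degrees)."""
    return abs((angle_a - angle_b + 180.0) % 360.0 - 180.0)


def pick_stereoisomer(reactant_dihedral, candidate_dihedrals):
    """Return the candidate label whose dihedral is closest to the reactant's."""
    return min(
        candidate_dihedrals,
        key=lambda label: angular_deviation(candidate_dihedrals[label], reactant_dihedral),
    )
```

Handling the deviation modulo 360° matters for trans arrangements, where dihedral angles of +179° and −179° describe nearly identical geometries despite their large numerical difference.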

For the reactive sites stemming from the dipoles, the situation is significantly more complex. First and foremost, it should be noted that for a significant number of scaffold types, there are no stereocenters at all (cf. the propargyl-allenyl ones), or these centers are so rigid that they are pre-set in practice (cf. the cyclic dipoles). For some allyl-type scaffolds however, particularly those involving two terminal C-centers, stereochemical considerations are relevant and compatibility needs to be enforced. Doing so in a similar manner as outlined for the dipolarophiles in the previous paragraph is not possible since the delocalization present in these dipoles causes the bonds connecting the individual centers to have bond orders in between 1 and 2, depending on the specific scaffold and the respective substituents. Consequently, either, both or neither of these bonds may be rotatable at room/body temperature51. To complicate the situation even further, it is impossible to know a priori which conformation around these partial double bonds will result in the lowest addition barrier to the specific dipolarophile considered. If no stereotags are specified for the dipole centers, autodE readily searches for the relative substituent orientation in the product that is most energetically favorable and the resulting arrangement will generally also be the lowest in the TS so that stereochemistry is generally retained in the second half of the reaction profile (vide supra). From the product and TS geometries, the compatible reactant dipole conformer can be determined, but this can only be done after an initial version of the profile has already been generated.

As such, the following pragmatic approach was taken throughout this study. Initially, we did not fix any stereotag associated with the dipole in the reaction SMILES fed to autodE in either reactant or product. Once the full reaction profile was obtained, a script to correct potential stereochemical incompatibilities for the dipoles was executed (Fig. 7b). The coordinates of the subset of atoms in the TS geometry corresponding to the reactant dipoles are first extracted. Subsequently, the obtained geometry is optimized with GFN2-xTB22. Next, a scan around the individual (partially double) dipole bonds is performed in 60 uniformly spaced increments (GFN2-xTB level of theory; associated force constraint of 0.1 Hartree/Bohr²) to obtain a crude estimate of the barrier for rotation. If the spread in energies (or the bias energy due to the constraining potential) along the profile exceeds 20 kcal/mol, then the bond is assumed to be more or less rigid, i.e., non-rotatable, under physiological conditions. Finally, a randomize and relax (RR) conformer search is performed for each dipole within autodE18. 1000 additional conformers (in addition to the one extracted from the TS geometry) are generated while constraining the dihedral angles around non-rotatable bonds. After initial pruning, conformers are optimized at GFN2-xTB level of theory and the most stable one is compared with the originally selected conformer in the reaction profile. If the RR conformer is lower in energy, it is retained as a new reference for the reaction profile: DFT optimization and single-point frequency calculations are performed in a follow-up step, the final species is included in the original output-folder for the reaction and updated activation and reaction energy values are computed. For dipoles with conformational restrictions (i.e., non-rotatable bonds), the lowest energy RR conformer is consistently selected as alternative reference for the reaction profile and the same procedure as described above is followed.
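The rotatability criterion used in this procedure reduces to a simple threshold test on the scan energies. The sketch below is an illustrative reading of that criterion, not the script used in the study: the function names are assumptions, energies are taken to be in Hartree as typically reported by semi-empirical codes, and the bias energies of the constraining potential are treated as an optional second input.

```python
HARTREE_TO_KCAL = 627.509474  # conversion factor, kcal/mol per Hartree
ROTATION_THRESHOLD_KCAL = 20.0  # threshold stated in the text


def energy_spread_kcal(energies_hartree):
    """Difference between the highest and lowest scan energy, in kcal/mol."""
    return (max(energies_hartree) - min(energies_hartree)) * HARTREE_TO_KCAL


def is_rotatable(energies_hartree, bias_energies_hartree=(),
                 threshold_kcal=ROTATION_THRESHOLD_KCAL):
    """Flag a bond as rotatable when neither the energy spread nor the
    largest bias energy along the scan exceeds the threshold."""
    spread = energy_spread_kcal(energies_hartree)
    max_bias = max(bias_energies_hartree, default=0.0) * HARTREE_TO_KCAL
    return max(spread, max_bias) <= threshold_kcal


# Two hypothetical scans: a shallow profile and a strongly hindered one.
shallow_scan = [-10.000, -9.995, -9.990, -9.995]
hindered_scan = [-10.000, -9.980, -9.960, -9.980]
```

With these made-up profiles, the first scan spans roughly 6 kcal/mol and would be treated as rotatable, whereas the second spans roughly 25 kcal/mol and would have its dihedral angle constrained during the subsequent conformer search.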

Data Records

All data files produced as part of this study are accessible through Figshare (https://figshare.com/articles/dataset/dipolar_cycloaddition_dataset/21707888)52. Reaction IDs and SMILES, activation energies (G; in kcal/mol) and reaction energies (Gr; in kcal/mol) for each computed reaction profile are provided in CSV format (full_dataset.csv). XYZ-files and LOG-files for both the final frequency and single-point calculation for each reactant (both the original and stereo-constrained versions), TS and product species, as well as a CSV file containing computed electronic energies and thermal corrections are available in a compressed archive file, full_dataset_profiles.tar.gz.

The files have been organized per reaction profile, identified through the reaction ID. Within each directory, reactant XYZ-files are of the form r#####.xyz, product XYZ-files are of the form p#####.xyz, and transition state XYZ-files are of the form TS_#####.xyz. If the reactant dipole conformer had to be corrected to enforce stereochemical compatibility, the corrected XYZ-files are included in the form of r#####_alt.xyz. The frequency LOG-files can be found in a frequency_logs directory, and the single-point LOG-files can be found in a single_point_logs directory. The energies for all of these species are summarized per directory in energies.csv.
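When processing the archive programmatically, the naming scheme above can be used to sort the XYZ-files of a reaction directory by role. The sketch below is a hedged illustration: it assumes that each "#" in the scheme stands for a digit, the role labels and function names are hypothetical, and the file names in the example are invented rather than taken from the dataset.

```python
import re

# Assumption: every '#' in the naming scheme stands for one digit.
# The '_alt' pattern is listed first so corrected reactants are not
# mistaken for the original ones.
XYZ_PATTERNS = [
    ("reactant_alt", re.compile(r"^r\d+_alt\.xyz$")),
    ("reactant", re.compile(r"^r\d+\.xyz$")),
    ("product", re.compile(r"^p\d+\.xyz$")),
    ("transition_state", re.compile(r"^TS_\d+\.xyz$")),
]


def classify_xyz(filename):
    """Return the role encoded in an XYZ file name, or None if unrecognized."""
    for role, pattern in XYZ_PATTERNS:
        if pattern.match(filename):
            return role
    return None


def group_by_role(filenames):
    """Group file names by role, skipping anything that is not a known XYZ file."""
    groups = {}
    for name in filenames:
        role = classify_xyz(name)
        if role is not None:
            groups.setdefault(role, []).append(name)
    return groups


# Invented file names, for illustration only.
example = group_by_role(
    ["r00001.xyz", "r00002.xyz", "r00001_alt.xyz", "p00003.xyz",
     "TS_00004.xyz", "energies.csv"]
)
```

In practice the list of names would come from a directory listing of an extracted reaction folder, e.g. via pathlib or os.listdir.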

Additionally, all the benchmarking data discussed in the Methods section are made available in the compressed archive benchmarking_data.tar.gz.

Technical Validation

The accuracy of the computed activation energies was assessed as part of the benchmarking study used to select the most appropriate DFT level of theory. As discussed in the Methods Section, errors on the (gas-phase) electronic energies were determined to be relatively small for the selected functional (MAE ~ 1 kcal/mol), and even when thermal and solvent corrections are included, the errors relative to experimental activation energies remain sufficiently low to extract chemically meaningful trends from the data (MAE ~ 2–3 kcal/mol).

The established workflow is also quite robust, with a failure rate of 3.5% during generation of the reaction SMILES from reactant combinations (mainly for the cyclic dipoles) and 12.3% during the ensuing autodE reaction profile computation and postprocessing. Unsurprisingly, the main cause of failure of the workflow is related to an inability to locate appropriate transition states, with Gaussian16 either not finding a saddle point on the PES altogether, finding a saddle point but not the right one, or not finding a converged solution of the TS search. Since it is hard to distinguish between outright TS search failures and genuinely barrierless reactions in an automated manner, we only retained reaction profiles for which an unequivocal saddle point could be determined on the PES. Another common source of failure is spontaneous rearrangement of the reactant(s) or product, indicating that the species encoded by the initial SMILES string does not correspond to a minimum on the PES, or, in other words, that this species is not a stable compound. It should be noted that for one specific biologically inspired motif (the guanidinium moiety), we did encounter a much higher failure rate (~66%). Our analysis suggests that these failures are caused by the disappearance of the cycloaddition reaction mode for the involved species, not an actual failure of our methodology (see below).

To contextualize the numbers above, similar automated high-throughput reaction profile computation workflows achieve comparable or worse failure rates, e.g., a success rate of 85% was achieved by Friederich et al. in their study on dihydrogen activation of Vaska’s complex53 and Von Rudorff et al. were able to compute full reaction profiles for 25% of the E2/SN2 reaction systems considered in their search space54. At the same time, even higher success rates have been reported as well, e.g., Jorner et al. were able to obtain 98% of the profiles in their study of nucleophilic aromatic substitution reactions. In the latter example, only reactions with a precedent in the experimental literature were considered, so “unrealistic” reactions were filtered out a priori, which is not the case for most other workflows developed55.

Recovery of known bio-orthogonal click reactions from the presented workflow

To further validate our workflow, activation energies for two prototypical azide-based 1,3-dipoles (methyl and acyl azide respectively) with a couple of popular strained dipolarophiles, such as cyclooctyne and oxo-norbornadiene, as well as all the selected biofragments were computed (see Fig. 8 for an overview of the selected synthetic dipolarophiles).

Fig. 8

The synthetic (strained) dipolarophiles included in the azide test reactions.

In line with the results for the dataset as a whole, we find that our calculations fail for 3 out of 4 reactions with the tested guanidinium motif. This can be rationalized by considering the extremely unfavorable thermodynamics expected for these reactions: cycloaddition for these species requires the resonance in both the dipole and the guanidinium species to be broken, which imposes a significant delocalization penalty on any reaction mode involving these two species, and can potentially wipe out a barrier along the reaction pathway altogether (see Fig. 9a)56. Support for this reasoning comes from two observations. First, each of these profiles failed because autodE was unable to locate a stable molecule with the bonding pattern corresponding to the product, i.e., the expected product is not a minimum on the potential energy surface. Second, out of the 61 reaction profiles involving guanidinium which we were, in fact, able to compute successfully throughout the entire dataset, close to 95% turned out to be endothermic, and over two-thirds are significantly so (endothermic by 20–65 kcal/mol). As an extra check, we attempted to compute reaction profiles for an alternative stepwise mechanism as well, but here the product tended to spontaneously decompose back into the reactants upon optimization.

Focusing on the successfully computed reaction profiles, we find that most reactions between acyl/methyl azide and strained dipolarophiles exhibit a relatively low activation energy (between 21 and 25 kcal/mol under standard conditions), suggesting that most will readily proceed under physiological conditions (Fig. 9b). Furthermore, all of these reactions are highly exothermic, i.e., they are effectively irreversible. The workflow readily identifies the reactions between (fluorinated) cyclooctyne and methyl/acyl azide as the most rapid transformations (ΔG‡ = 18–22 kcal/mol; see the upper part of Fig. 9c), underscoring the experimentally observed excellent click potential of this popular dipole-dipolarophile combination9. On the other hand, reactions of the same azides with the tested biofragments tend to involve higher barriers on average (25 kcal/mol and up), indicating that these will proceed at much slower, non-competitive rates than the ones involving synthetic/strained dipolarophiles.
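
To give a feel for what these barrier heights mean kinetically, the following minimal sketch converts an activation free energy into a rate constant with the Eyring equation, k = (k_B T / h) exp(−ΔG‡ / RT). It is an illustration only and not part of the presented workflow: it assumes a transmission coefficient of one, a temperature of 310.15 K (body temperature), and that the barriers refer to a 1 M standard state so that the bimolecular rate constants come out in M^-1 s^-1. The function name `eyring_rate_constant` is hypothetical.

```python
import math

# Physical constants
K_B = 1.380649e-23        # Boltzmann constant, J/K
H = 6.62607015e-34        # Planck constant, J*s
R_KCAL = 1.987204259e-3   # gas constant, kcal/(mol*K)


def eyring_rate_constant(dg_act_kcal, temperature=310.15):
    """Eyring rate constant for an activation free energy given in kcal/mol.

    Assumes a transmission coefficient of 1. For a bimolecular reaction with
    a 1 M standard state (an assumption here), the unit is M^-1 s^-1.
    """
    prefactor = K_B * temperature / H  # attempt frequency, s^-1
    return prefactor * math.exp(-dg_act_kcal / (R_KCAL * temperature))


# Barriers spanning the range discussed in the text (kcal/mol)
for barrier in (18.0, 22.0, 25.0):
    k = eyring_rate_constant(barrier)
    print(f"{barrier:.1f} kcal/mol -> k = {k:.2e} M^-1 s^-1")
```

Because the dependence is exponential, a difference of only a few kcal/mol between the strained dipolarophiles and the biofragments translates into rate constants that differ by orders of magnitude.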

Fig. 9

(a) A schematic depiction of the resonance loss associated with the reaction between the guanidinium motif and acyl azide. (b) Histogram depicting the distribution of the activation energies (ΔG‡) for both strained and non-strained dipolarophiles with methyl and acyl azide dipoles. (c) The three lowest reaction barriers computed for the strained dipolarophiles (top) and for the non-strained dipolarophiles (bottom).

We found a single reaction involving biofragments with a relatively low reaction barrier (ΔG‡ = 23.6 kcal/mol), which ought to be competitive with all but the most reactive reactions involving strained dipolarophiles. This reaction involves a relatively exotic motif corresponding to (a fragment of) NADH/NADPH. While these compounds are intermediates of important biochemical processes57, their concentrations tend to be relatively low; NADH can typically be found in mammalian cells at concentrations of approximately 10⁻⁴–10⁻⁵ M, with most of it bound to proteins and hence only partially available for reaction58,59. Consequently, even for synthetic reactions exhibiting similar activation energies, these native reactions will likely not disrupt the bio-orthogonal click function completely: unless the synthetic reaction is (significantly) slower and/or the synthetic dipolarophile is (much) less prevalent/available for the dipole partner, the synthetic reaction will still dominate.
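
The competition argument above can be made semi-quantitative with a short sketch. Assuming both the synthetic and the native reaction are second order (rate = k [dipole][dipolarophile]) and follow transition state theory with equal prefactors, the dipole concentration cancels and the ratio of on-target to off-target rates depends only on the barrier difference and the concentration ratio of the two dipolarophiles. The function name `rate_ratio`, the synthetic barrier of 22.0 kcal/mol, and the synthetic dipolarophile concentration are hypothetical illustration values; only the 23.6 kcal/mol native barrier and the 10⁻⁴ M upper bound for NADH are taken from the text, and protein binding of NADH is ignored.

```python
import math

R_KCAL = 1.987204259e-3  # gas constant, kcal/(mol*K)


def rate_ratio(dg_synthetic, dg_native, conc_synthetic, conc_native,
               temperature=310.15):
    """Ratio of the synthetic (on-target) to the native (off-target) rate.

    Both reactions are assumed to be second order in a shared dipole, so the
    dipole concentration cancels. Barriers in kcal/mol, concentrations in M.
    """
    ddg = dg_native - dg_synthetic  # positive if the synthetic barrier is lower
    k_ratio = math.exp(ddg / (R_KCAL * temperature))
    return k_ratio * conc_synthetic / conc_native


# Hypothetical scenario: synthetic barrier of 22.0 kcal/mol and a synthetic
# dipolarophile present at the same concentration as free NADH (assumed 1e-4 M).
ratio = rate_ratio(dg_synthetic=22.0, dg_native=23.6,
                   conc_synthetic=1e-4, conc_native=1e-4)
print(f"on-target / off-target rate ratio: {ratio:.1f}")
```

A ratio above one means the synthetic reaction outpaces the native one; lowering the available NADH concentration (for instance because most of it is protein-bound) increases the ratio proportionally.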

The test case above underscores the subtleties, and the limits, associated with bio-orthogonal click chemistry: even for tried and tested synthetic reactions, the difference in reactivity compared to the fastest native reactions tends to be rather small. Nevertheless, our computational approach enables us to retrieve the main reactivity trends for these [3 + 2] cycloadditions and thereby to identify promising/suitable bio-orthogonal click reactions.

Accuracy and reproducibility of the data

Some additional tests to probe the accuracy and reproducibility of the data obtained through our workflow were performed on the azide reaction list described in the previous subsection. First, we considered the robustness and reproducibility of the cycloaddition reaction profiles generated with autodE in general. Because the conformer generation algorithm used in RDKit is stochastic in nature, two consecutive runs of autodE on the same reaction SMILES will not necessarily yield exactly the same result: the sampled conformers can differ slightly.

To assess the magnitude of the resulting uncertainty, barrier heights and reaction energies for the azide reaction list were computed twice with the default autodE settings, i.e., conformers were generated with RDKit, after which the unique ones were optimized with DFT and the lowest-energy one was selected, and the resulting values were compared. As can be seen from Fig. 10a,b, activation and reaction energies for our test set are reproduced well (MAE ~ 0.7 kcal/mol and RMSE ~ 1.2 kcal/mol for the activation energies and MAE ~ 0.8 kcal/mol and RMSE ~ 1.3 kcal/mol for the reaction energies).
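
The two error metrics quoted here can be computed from paired values of two runs as in the following minimal sketch. The helper names `mae` and `rmse` and the four example barriers are placeholders for illustration and are not values from the dataset.

```python
import math


def mae(run_a, run_b):
    """Mean absolute error between paired values of two runs."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(run_a, run_b)) / len(run_a)


def rmse(run_a, run_b):
    """Root-mean-square error between paired values of two runs."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(run_a, run_b)) / len(run_a))


# Placeholder activation energies (kcal/mol) for the same reactions in two runs
run_1 = [21.0, 23.5, 25.0, 18.0]
run_2 = [21.5, 23.0, 26.0, 18.0]

print(f"MAE  = {mae(run_1, run_2):.2f} kcal/mol")
print(f"RMSE = {rmse(run_1, run_2):.2f} kcal/mol")
```

Since the RMSE weights large deviations more heavily, it is never smaller than the MAE; an RMSE that clearly exceeds the MAE, as in the values reported above, points to a few reactions with larger run-to-run deviations rather than a uniform scatter.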

Fig. 10

(a) Correlation between the activation energies (ΔG‡) and (b) the reaction energies (ΔGᵣ) for two consecutive autodE runs with conformer selection at DFT level of theory for the azide test reactions. (c) Correlation between the activation energies and (d) the reaction energies for an autodE run with conformer selection at DFT level of theory and a consecutive run with conformer selection at GFN2-xTB level of theory. (e) Correlation between activation energies and (f) the reaction energies for two consecutive autodE runs with conformer selection at GFN2-xTB level of theory.

Next, we assessed the effect of the approximation in conformer selection applied during the dataset generation. More specifically, all the barriers and reaction energies for the azide reaction list were again computed twice: in one run, conformers were selected based on GFN2-xTB conformer energies; in the second run, conformer selection was done in the default manner, i.e., based on DFT energies. As can be observed from Fig. 10c,d, the deviation between the values obtained in the two runs is reasonably close to the reproducibility error obtained with the most accurate conformer selection criterion (MAE ~ 1.0 kcal/mol and RMSE ~ 1.7 kcal/mol for the activation energies and MAE ~ 0.8 kcal/mol and RMSE ~ 1.3 kcal/mol for the reaction energies).

Finally, the reproducibility of the workflow with conformer selection at GFN2-xTB level of theory was checked. As can be observed from Fig. 10e,f, the reproducibility errors with and without the approximation are closely in line (MAE ~ 0.7 kcal/mol and RMSE ~ 1.1 kcal/mol for the activation energies and MAE ~ 0.7 kcal/mol and RMSE ~ 1.2 kcal/mol for the reaction energies).

Taking everything together, one can conclude that selecting conformers based on GFN2-xTB energies introduces only a small additional error relative to selection based on DFT energies, and thus we decided to consistently apply this approximation since this accelerates the workflow significantly, facilitating a broader exploration of the defined chemical space.

Usage Notes

The code used to generate the reaction SMILES and automate the autodE workflow is available on GitHub. The repository contains several Python scripts and Jupyter Notebooks:

  • Notebooks to generate the respective search spaces and extract samples from them.

  • A script to generate reaction SMILES with RDChiral and set the stereotags of the chiral centers associated with the dipolarophiles.

  • A script and auxiliary modules to launch high-throughput reaction profile computation with autodE in a parallelized manner.

  • A script to correct stereochemical incompatibilities in the selected reactant dipole conformers.

  • A script and auxiliary modules to launch high-throughput DFT optimization and single-point fine-tuning of corrected dipole conformers.