Introduction

Designing molecules with desired properties from scratch is a “needle in the haystack” problem. The chemical universe – estimated to comprise up to \(10^{60}\) small molecules1 – remains largely uncharted. Generative deep learning offers unprecedented opportunities to explore the chemical universe in a time- and cost-efficient manner2, by enabling the production of desirable molecules without the need for hand-crafted design rules. In particular, chemical language models (CLMs) have yielded experimentally-validated bioactive designs3,4,5,6,7 and stood out as powerful molecular generators2,8,9,10,11,12,13.

CLMs adapt algorithms developed for sequence processing to learn the “chemical language”, that is, how to generate molecules that are chemically valid (syntax) and possess desired properties (semantics)7. This is achieved by representing molecular structures as string notations, such as the Simplified Molecular Input Line Entry System (SMILES14, Fig. 1a), among others15,16. These molecular strings are then used for model training and for the subsequent generation of molecules in textual form. Compared to generative methods based on molecular graphs17, CLMs capture complex molecular properties better8 and generate increasingly larger molecules more efficiently18,19. These aspects have made CLMs one of the de facto approaches for de novo drug design.
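As a minimal illustration of the syntax aspect, the snippet below uses RDKit (the cheminformatics toolkit employed in the Methods) to check whether a SMILES string corresponds to a chemically valid molecule and to map it to a canonical form; the example molecule and the helper name are illustrative, not part of the original pipeline.

```python
from rdkit import Chem

def canonicalize(smiles: str):
    """Return the canonical SMILES if the string is chemically valid, else None."""
    mol = Chem.MolFromSmiles(smiles)  # parses the syntax and checks valence rules
    return Chem.MolToSmiles(mol) if mol is not None else None

# Two textual variants of aspirin map to the same canonical string,
# while a syntactically broken string is rejected.
print(canonicalize("CC(=O)Oc1ccccc1C(=O)O"))   # valid -> canonical SMILES
print(canonicalize("O=C(O)c1ccccc1OC(C)=O"))   # same molecule, same canonical form
print(canonicalize("CC(=O)Oc1ccccc1C(=O"))     # unbalanced branch -> None
```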

Fig. 1: Key concepts of structured state space sequence (S4) models for chemical language modeling.
figure 1

a Simplified Molecular Input Line Entry System (SMILES) strings14, used as the chemical language. SMILES strings are obtained by traversing the molecular graph and annotating atom types, rings, and bond types in the form of text. b S4 for de novo SMILES design. During training, S4 is formulated as a global convolution and processes the whole molecular string simultaneously. The global convolution filter \(\overline{\boldsymbol{K}}\) is parameterized via the matrices \(\overline{\boldsymbol{A}}\), \(\overline{\boldsymbol{B}}\), and \(\overline{\boldsymbol{C}}\) (Eq. (2)). During generation, S4 switches to the recurrent formulation (via the same parameters, Eq. (1)) and produces SMILES strings element-by-element for more efficient and effective chemical space exploration. c Computational pipeline, where S4 was used to learn from known SMILES strings and generate new molecules de novo.

Several CLM architectures have been proposed for de novo design20, the most popular of which are long short-term memory (LSTM)3,4,5,21,22 models. LSTMs are trained to produce molecular strings element-by-element and have fast generation capabilities. However, their iterative structure forces these models to compress the sequence into an information bottleneck and challenges the learning of global sequence properties23,24,25. Transformers26 – such as generative pretrained transformers (GPTs) – are a more recent architecture that overcomes this bottleneck by processing the entire input molecular string at once27,28. LSTMs and GPTs present different – and somewhat complementary – strengths and weaknesses when it comes to de novo molecule design25,29,30,31,32. The recurrent nature of LSTMs allows learning local properties better than GPTs, while GPTs capture global properties better thanks to their ‘holistic’ processing25. Moreover, while LSTMs remain efficient, Transformers become increasingly compute-intensive when generating progressively longer SMILES strings, which might limit their broad applicability in the chemical sciences. These aspects make it necessary to stretch the boundaries of current CLM approaches further, to chart the chemical space more effectively in search of bioactive molecules25.

Structured state space sequence models (S4s) are a recent member of the fast-growing family of state space architectures33,34,35,36, which are gathering increasing attention in the deep learning community37,38,39,40. S4s showed outstanding performance in audio, image, and text generation35 and have a “dual nature”: they (1) are trained over the entire input sequences to learn complex global properties and (2) generate one string element at a time – thereby combining some respective strengths of Transformers and LSTMs. Motivated by such “best of two worlds” behavior, here we ask the following question: Can S4 advance the current state-of-the-art in chemical language modeling? We find evidence that it can.

Here, we apply S4 to chemical language modeling on SMILES strings and benchmark it on various tasks relevant to drug design – from learning bioactivity to chemical space exploration and natural-product design. Moreover, we further corroborate the promise of S4 via the prospective de novo design of kinase inhibitors, validated using molecular dynamics simulations. Our results show the potential of S4 for chemical language modeling, especially in capturing bioactivity and complex molecular properties. To the best of our knowledge, this is the first time that state space models have been applied to molecular tasks, and we expect their relevance for chemical language modeling to increase in the future.

Results and discussion

Structured state space sequence model (S4)

S4s are an extension of discrete state space models, which are widely adopted in control engineering41. Discrete state space models map an input sequence u to an output sequence y, through the learnable parameters \(\overline{\boldsymbol{A}}\in \mathbb{R}^{N\times N}\), \(\overline{\boldsymbol{B}}\in \mathbb{R}^{N\times 1}\), \(\overline{\boldsymbol{C}}\in \mathbb{R}^{1\times N}\), and \(\overline{\boldsymbol{D}}\in \mathbb{R}^{1\times 1}\), as follows:

$$x_{k} = \overline{\boldsymbol{A}}\,x_{k-1}+\overline{\boldsymbol{B}}\,u_{k}\\ y_{k} = \overline{\boldsymbol{C}}\,x_{k}+\overline{\boldsymbol{D}}\,u_{k}.$$
(1)

In other words, discrete state space models define a “linear recurrence”: at any step k, the k-th element of the input sequence uk is fed into the model and used to update the hidden state xk and to generate an output, yk. The matrices \(\overline{\boldsymbol{A}}\), \(\overline{\boldsymbol{B}}\), \(\overline{\boldsymbol{C}}\), and \(\overline{\boldsymbol{D}}\) control how the input and the hidden state are combined to provide an output (Fig. 1b).

Besides the recurrent formulation, discrete state space models can be expressed as a convolution with the same set of parameters. It can be demonstrated that, by “unrolling” the linear recurrence (Eq. (1)), the output sequence y can be obtained via a learnable convolution over the input sequence u:

$$y=u * \overline{\boldsymbol{K}},$$
(2)

where \(\overline{\boldsymbol{K}}\) is the convolution filter, parameterized via \(\overline{\boldsymbol{A}}\), \(\overline{\boldsymbol{B}}\), and \(\overline{\boldsymbol{C}}\) (see Supplementary Eqs. (1)–(4) for a detailed derivation). This convolutional representation reveals a key aspect of state space models: they learn explicitly from the entire sequence (via global convolution) while preserving recurrent generation capabilities (Fig. 1b).
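For concreteness, the sketch below implements Eqs. (1) and (2) directly in NumPy with randomly initialized (untrained) placeholder parameters and checks that the recurrent and convolutional views produce identical outputs. It is a minimal sketch of the duality, not the S4 implementation used in this work, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                                   # state dimension
A = rng.standard_normal((N, N)) * 0.1    # placeholder for the trained matrix A
B = rng.standard_normal((N, 1))          # placeholder for B
C = rng.standard_normal((1, N))          # placeholder for C
D = rng.standard_normal((1, 1))          # placeholder for D

def ssm_recurrence(u):
    """Linear recurrence of Eq. (1): one input element processed per step."""
    x = np.zeros((N, 1))
    ys = []
    for u_k in u:
        x = A @ x + B * u_k                   # x_k = A x_{k-1} + B u_k
        ys.append((C @ x + D * u_k).item())   # y_k = C x_k + D u_k
    return np.array(ys)

def ssm_convolution(u):
    """Global convolution of Eq. (2), with the unrolled filter K = (CB, CAB, CA^2B, ...)."""
    L = len(u)
    K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
    y = np.array([np.dot(K[:k + 1], u[k::-1]) for k in range(L)])  # causal convolution
    return y + D.item() * np.asarray(u)       # the D*u_k "skip" term of Eq. (1)

u = rng.standard_normal(20)
assert np.allclose(ssm_recurrence(u), ssm_convolution(u))  # both formulations agree
```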

Learning the optimal parameters of a discrete state space system, however, introduces vanishing gradients and numerical instabilities in the recurrent and convolutional formulations, respectively. Structured state space sequence models (S4s)35 tackle those issues by introducing additional structure to the model parameters (via the so-called high-order polynomial projection operators33) and by reducing the unstable computations to the evaluation of the stable Cauchy kernel42 (see ref. 35 for more detail). Ablation studies35 have shown the relevance of the added structure for achieving computational feasibility and performance on long sequences. Moreover, this reduction allows S4 to address the numerical instabilities encountered in model training and has made S4 state-of-the-art in several generative tasks that require learning long-distance relationships33,34,35. Motivated by its performance in other domains and the potential benefits of its dual structure, here we introduce S4 to the molecular sciences for the first time.

We evaluated S4 for its ability to learn from and generate drug-like molecules and natural products in an array of tasks, and in terms of multiple molecular properties. LSTMs and Generative Pretrained Transformers (GPTs) were used as benchmarks, since they are the de facto approaches in chemical language modeling for de novo design2,7,8,25. Furthermore, LSTM (recurrent training and generation) and GPT (holistic training and generation) constitute the ideal benchmarks for S4, due to S4’s dual formulation (convolution during training and recurrence during generation), which allows inspecting the effect of each of these aspects on the overall performance. Finally, the prospective de novo design of putative mitogen-activated protein kinase 1 (MAPK1) inhibitors, corroborated by molecular dynamics simulations, was performed to test the potential of S4 in real-world drug discovery scenarios.

Designing drug-like molecules

S4 was analyzed for its ability to design drug-like small molecules (SMILES strings of fewer than 100 tokens) extracted from the ChEMBL database43, focusing on (1) learning the chemical syntax, (2) capturing structural features relevant for bioactivity, and (3) designing structurally diverse molecules.

Learning the SMILES syntax

All investigated CLMs were trained on 1.9M canonical SMILES strings extracted from ChEMBL v3143. The generated strings were evaluated according to their (1) validity, i.e., the number (and frequency) of SMILES corresponding to chemically valid molecules; (2) uniqueness, which captures the number (and frequency) of structurally-unique molecules among the designs; and (3) novelty, corresponding to the number (and frequency) of unique and valid designs that are not included in the training set. A high number of “chemically-valid” designs suggests that the model has learned how to generate plausible molecules, while high uniqueness and novelty values indicate little redundancy among the designs and with the training set, respectively. Although these metrics are vulnerable to trivial baselines44, they provide insights into a model’s capacity to learn the SMILES “syntax”.
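As a minimal sketch (not the exact evaluation code of this study), the three metrics can be computed from a list of generated SMILES strings with RDKit as follows; the helper name and the normalization convention (e.g., novelty relative to the unique designs) are assumptions that may differ from the ones used here.

```python
from rdkit import Chem

def syntax_metrics(designs, training_smiles):
    """Validity, uniqueness, and novelty of generated SMILES strings.

    designs: list of generated SMILES strings
    training_smiles: iterable of canonical SMILES strings used for training
    """
    mols = (Chem.MolFromSmiles(s) for s in designs)
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]   # chemically valid designs
    unique = set(valid)                                            # canonical form removes duplicates
    novel = unique - set(training_smiles)                          # valid, unique, and unseen in training
    return {
        "validity": len(valid) / len(designs),
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
    }
```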

All CLMs generated more than 91% valid, 91% unique and 81% novel molecules (Table 1). Moreover, their designs approximated the training and test sets in terms of selected molecular properties (i.e., octanol-water partition coefficient45, quantitative estimate of drug-likeness46, Bertz complexity47, and synthetic accessibility48,49) with no notable differences among architectures (Supplementary Fig. 1 and Supplementary Table 1). These results agree with the literature on CLMs (e.g., refs. 2,50) and demonstrate the robustness of the model training procedure. S4 designs the highest number of valid, unique, and novel molecules (approximately 4000 to 12,000 more novel molecules than the benchmarks) and displays a good ability to learn the “chemical syntax” of SMILES strings. The potential of S4 in comparison with existing de novo design approaches was further corroborated on the MOSES benchmark51, where S4 consistently scored among the top-performing deep learning approaches (Supplementary Table 2).

Table 1 Designing drug-like molecules de novo with S4

To shed additional light on the strengths and limitations of S4 in comparison with the benchmarks, we analyzed the sources of invalid molecule generation for all methods in terms of branching and ring errors, erroneous bond assignment, and other (miscellaneous) syntax issues (Fig. 2). Interestingly, each method shows different types of errors leading to SMILES invalidity. LSTM struggles the most with branching and performs the best with bond assignment, while GPT struggles the most with rings and bond assignment, and has intermediate performance otherwise. S4 struggles more than LSTM with bond assignment, and generates remarkably fewer errors than both benchmarks in branching and ring design. Our hypothesis is that correct bond assignment reflects good learning of “short-range” dependencies, while branching and ring opening and closure require better capturing of “long-range” relationships. This suggests that S4 captures relatively “long-distance” relationships well, in agreement with existing evidence in other domains33,34,35.

Fig. 2: SMILES design errors, grouped by category and CLM architecture.
figure 2

Each CLM was trained on ChEMBL and used to design 102,400 SMILES strings. The invalid designs were categorized per error, and the reported values indicate the number of errors in each category.

Capturing bioactivity

We evaluated S4 for its ability to learn elements of bioactivity. With CLMs, this is often achieved with transfer learning52, which allows transferring knowledge acquired on one task to another task with less available data. Via transfer learning, after pre-training a CLM on a large corpus of SMILES strings, the model can then be “fine-tuned” on a smaller, task-focused set (e.g., bioactive molecules) by additional training22. Here, we performed five fine-tuning campaigns, focusing on distinct macromolecular targets from the LIT-PCBA53 dataset: (1) pyruvate kinase muscle isoform 2 (PKM2), (2) mitogen-activated protein kinase 1 (MAPK1), (3) glucocerebrosidase (GBA), (4) mechanistic target of rapamycin (mTORC1), and (5) cellular tumor antigen p53 (TP53).

Evaluating the bioactivity of de novo designs (besides synthesis and wet-lab testing) is non-trivial, since this property cannot be fully captured by traditional molecular descriptors, and might not be accurately predicted by quantitative structure-activity relationship models54,55. Hence, we used experimentally-tested molecules to evaluate the capacity of a CLM to learn elements of bioactivity retrospectively. Several studies have shown that the likelihoods learned by a CLM during fine-tuning can be used to prioritize designs with high chances of being bioactive6,56,57. Based on the same principle, here we used the likelihoods learned by the CLMs to rank existing molecules and evaluate their capacity to prioritize bioactive compounds over inactive ones.

For each of the selected targets, bioactive molecules (Supplementary Table 3) were used for fine-tuning, with ten random training-validation-test splits. After fine-tuning the CLMs on each target, for each training-test split, we proceeded as follows:

  1. With each fine-tuned model and for each target, we predicted the likelihoods (Eq. (4)) of the SMILES strings in the respective test set. The considered test sets resemble a real-world scenario in terms of hit rate, comprising 11 (mTORC1) to 56 (PKM2) active molecules and 10,240 inactive molecules (except for TP53, which contains 3301 inactive molecules; Supplementary Table 3);

  2. We ranked the molecules of the test set according to the predicted likelihoods (Eq. (5));

  3. For each target and each test set, we computed the fraction of actives ranked among the top 10, top 50, and top 100 molecules (a minimal sketch of this computation is shown after the list). The more active molecules a CLM ranks in the early portions of the test set, the better the model has learned what is relevant for bioactivity on the investigated target after fine-tuning.
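A minimal sketch of step (3) is given below: given per-molecule scores (Eq. (5)) and activity labels for a test set, it returns the fraction of the known actives ranked within the top-k molecules. The function name and the normalization by the total number of actives are our assumptions.

```python
import numpy as np

def actives_in_top_k(scores, is_active, k):
    """Fraction of known actives that the model ranks among its k best-scored molecules.

    scores: per-molecule scores from the fine-tuned CLM (higher = better, Eq. (5))
    is_active: boolean array marking the experimentally active molecules
    """
    top_k = np.argsort(scores)[::-1][:k]          # indices of the k highest-scoring molecules
    return is_active[top_k].sum() / is_active.sum()

# e.g., actives_in_top_k(scores, labels, k=10) for the "top 10" metric in Fig. 3
```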

Our results show variable performance depending on the target (Fig. 3). The most challenging target is TP53, on which no model could consistently retrieve actives among the top 10 scoring molecules. Notably, this target has the most challenging test set, where inactive molecules are similar to the actives of both the training and the test sets (Supplementary Fig. 2), potentially indicating the presence of activity cliffs58. MAPK1 and mTORC1 also challenge the CLMs; here, S4 retrieved more active molecules than the benchmarks, especially in the early portions of the test set. PKM2 and GBA are the easiest datasets; here, all CLMs identified bioactive molecules in their top 10, with S4 achieving the highest median across the board. A Wilcoxon signed-rank test59 on the pooled scores across datasets supports the superior performance of S4 compared to the benchmarks (p [top 10] = 8.41e−6, p [top 50] = 2.93e−7, p [top 100] = 1.45e−7 compared to LSTM, and p [top 10] = 2.33e−3, p [top 50] = 3.72e−3, p [top 100] = 2.61e−2 compared to GPT), and of GPT compared to LSTM (p [top 10] = 5.22e−3, p [top 50] = 3.75e−5, p [top 100] = 2.02e−6).

Fig. 3: Retrospective enrichment analysis for all models across five selected macromolecular targets.
figure 3

The fine-tuned models were used to rank the held-out actives and inactives of the respective protein targets. The percentage of known actives ranked per considered number of test set molecules (10, 50, 100) was computed across ten runs. Bar heights report the median across runs and error bars report the first and third quartiles (n = 10). Source data are provided as a Source Data file.

Under the constraints of the study design, these results indicate that processing the input SMILES “holistically” (as GPT and S4 do) leads to a better capture of complex properties such as bioactivity, with S4 performing best.

Chemical space exploration

We analyzed the ability of S4 to explore the chemical space, in terms of generating structurally diverse and bioactive molecules. To this end, we employed a strategy commonly used with CLMs, that is, varying the sampling temperature (T) to control chemical diversity60. T controls the randomness of the weighted sampling used to pick each string element (Eq. (3)). When T → 0, the most likely element (based on the CLM prediction) is selected as the next element of the sequence, while the higher the T, the more random the selections. T = 1 corresponds to using the CLM predictions as the sampling probability of each element at each generation step.

We experimented with increasing sampling temperatures (from T = 1.0 to T = 2.0 with a step of 0.25). Each T value was used to generate 10,240 SMILES strings per model across the five chosen targets and all training-test splits. Then, we evaluated the designs based on three metrics (Fig. 4):

  • The validity of the generated strings, which captures how robust the model is to increasing degrees of randomness in preserving a correct syntax. The higher the validity, the better.

  • Rediscovery rate: de novo design models are often evaluated for their capacity to reproduce existing molecules with experimentally verified biological activities50,55. For this purpose, we used the held-out actives previously described for each target. To “relax” the rediscovery criterion, a held-out active was also considered rediscovered if a de novo design had a substructure similarity to it higher than 60% (as computed via Tanimoto similarity on extended connectivity fingerprints61). Higher rediscovery rates at increased temperature values indicate that the model can explore regions related to bioactivity despite increased randomness.

  • Scaffold diversity: designing molecules with novel scaffolds bears relevance in lead identification62, and can be used as a proxy to evaluate CLMs51. Here, to better evaluate what constitutes a novel scaffold, the novel designs were grouped into clusters based on their scaffold similarity. This was achieved via hierarchical clustering, grouping designs whose Bemis-Murcko scaffolds63 had a Tanimoto similarity (computed on the corresponding extended connectivity fingerprints61) higher than 60%. Only novel and unique scaffolds were considered. We then counted the number of scaffold clusters obtained (the higher, the better); a sketch of this clustering step is shown after the list.
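The sketch below (referenced in the last bullet) extracts Bemis-Murcko scaffolds with RDKit, builds a Tanimoto distance matrix on extended connectivity fingerprints, and applies hierarchical clustering with a 0.4 distance cutoff, corresponding to the 60% similarity threshold. The single-linkage criterion, fingerprint parameters, and all names are our assumptions; the clustering settings used in this study may differ.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def count_scaffold_clusters(smiles_list, cutoff=0.4):
    """Number of scaffold clusters among valid designs (similarity > 60% merges scaffolds)."""
    scaffolds = set()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            scaffolds.add(MurckoScaffold.MurckoScaffoldSmiles(mol=mol))  # Bemis-Murcko scaffold
    mols = [Chem.MolFromSmiles(s) for s in scaffolds]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 3, nBits=2048) for m in mols if m is not None]
    if len(fps) < 2:
        return len(fps)
    # pairwise Tanimoto distances between scaffold fingerprints
    dist = np.array([[1.0 - s for s in DataStructs.BulkTanimotoSimilarity(fp, fps)] for fp in fps])
    # single-linkage hierarchical clustering; scaffolds closer than the cutoff share a cluster
    labels = fcluster(linkage(squareform(dist, checks=False), method="single"),
                      t=cutoff, criterion="distance")
    return len(set(labels))
```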

Fig. 4: Model performance when varying the temperature value.
figure 4

Each model was analyzed for its performance when varying the temperature values from 1.0 to 2.0, with a step of 0.25, and by sampling 10,240 molecules. a Analysis of the SMILES validity across temperatures. b Variation of rediscovery rate. The models were evaluated for their capability to rediscover bioactive molecules not used for model training or to design molecules similar in structure (i.e., with a Tanimoto similarity on extended connectivity fingerprints higher than 60%). c Analysis of the number of diverse groups of scaffolds generated per method. Scaffolds were clustered together if they had a Tanimoto similarity (computed on extended connectivity fingerprints) larger than 60%. For each plot, the solid line indicates the median obtained across the five analyzed protein targets (PKM2, MAPK1, GBA, mTORC1, and TP53) with ten runs each (n = 50), and the shaded area indicates the inter-quartile range. The statistics per individual target can be found in Supplementary Fig. 3. Source data are provided as a Source Data file.

The models display similar trends with increasing T values for all the analyzed factors across datasets, with varying magnitude (Fig. 4). In general, validity decreases with increasing temperature (as previously observed60), with the strongest effect observed for GPT (median validity across training setups dropping below 40%, Fig. 4a).

Both S4 and LSTM show higher robustness than GPT to increasing temperature values (with LSTM performing slightly better for T ≥ 1.75), suggesting that sequential generation can boost chemical space exploration. S4 outperforms LSTM in terms of rediscovery rate (Fig. 4b), in agreement with our previous results on bioactivity (Fig. 3). We also computed the exact rediscovery rate (identical molecular structure) and observed that no model can consistently generate held-out actives. When it comes to the diversity of the designs (Fig. 4c), LSTM generates the highest number of structurally unique scaffolds (median across datasets and setups: 6602 clusters, T = 1.75) and S4 is the close second-best model (6520 clusters, T = 1.75). While GPT obtains a suboptimal performance across the board, LSTM seems preferable for chemical space exploration when bioactivity is not the main objective, whereas S4 better captures bioactivity while preserving good chemical space exploration, combining the strengths of the two benchmarks through its dual structure. These results confirm the promise of S4 when it comes to generating structurally diverse and bioactive drug-like molecules.

Designing natural products

S4 was further tested on more challenging molecular entities than drug-like molecules. To this end, we evaluated its capacity to design natural products (NPs), which are invaluable sources of inspiration for medicinal chemistry64,65. Compared to synthetic small molecules, NPs tend to possess more intricate molecular structures and ring systems, as well as a larger fraction of sp3-hybridized carbon atoms and chiral centers66,67,68. These characteristics correspond to longer SMILES sequences on average, with more long-range dependencies, and make natural products a challenging test case for CLMs19,69.

We trained the CLMs on large natural products (32,360 SMILES strings with length > 100, chosen to complement the previous analysis) from the COlleCtion of Open Natural ProdUcTs (COCONUT) database70. We then used the CLMs to design 102,400 SMILES strings de novo and computed the fraction of valid, unique, and novel designs (Table 2). All CLMs can design natural products, albeit with lower performance compared to drug-like molecules. S4 designs the highest number of valid molecules (approximately 6000 to 12,000 more than the benchmarks, corresponding to a 7–13% improvement), while LSTM achieves the highest novelty (approximately 2000 more novel molecules than S4, a 2% difference).

Table 2 Natural product design with CLMs

To further investigate the characteristics of the designs, we computed the natural-product likeness71, which captures how similar a molecule is to the chemical space covered by natural products in terms of its substructures (the higher the NP-likeness, the more similar). The novel designs of S4 have significantly higher NP-likeness values than those of the benchmarks (Mann–Whitney U test, p = 1.41e−53 compared to LSTM, and p = 1.02e−82 compared to GPT), and are closer to the values of the training and test sets on average (Table 2). Moreover, the NP-likeness values better match the distribution of the COCONUT molecules in terms of Kolmogorov–Smirnov (KS) distance72, which quantifies how much the cumulative distributions of two samples differ (between 0% and 100%; the lower, the closer the distributions).

In addition to NP-likeness, we evaluated the novel designs in terms of several structural properties important for natural products66,67,68, namely: the number of sp3-hybridized carbon atoms, aliphatic rings, spiro atoms and heavy atoms, as well as the molecular weight and the size of the largest fused ring system. These properties provide additional evidence on the molecular characteristics of the designs, and their structural complexity in comparison with the training natural products. Here, S4 achieved the lowest KS distance to the training and test sets across properties, indicating that its designs match the training natural products best. These results confirm the ability of S4 to learn complex molecular properties for de novo design.
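As a minimal sketch (with our own helper names and only a subset of the properties listed above), such property distributions can be computed with RDKit and compared via the KS distance using SciPy:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
from scipy.stats import ks_2samp

def structural_properties(smiles):
    """A subset of the natural-product-relevant properties discussed in the text."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "sp3_carbons": sum(1 for a in mol.GetAtoms()
                           if a.GetSymbol() == "C" and a.GetHybridization() == Chem.HybridizationType.SP3),
        "aliphatic_rings": rdMolDescriptors.CalcNumAliphaticRings(mol),
        "spiro_atoms": rdMolDescriptors.CalcNumSpiroAtoms(mol),
        "heavy_atoms": mol.GetNumHeavyAtoms(),
        "molecular_weight": Descriptors.MolWt(mol),
    }

def ks_distance_percent(designed_values, reference_values):
    """KS distance (0-100%) between a property distribution of the designs and a reference set."""
    return 100.0 * ks_2samp(designed_values, reference_values).statistic
```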

Finally, we analyzed the training and generation speed of the CLM architectures with increasing SMILES length, to test their practical applicability when designing bigger molecules, like natural products. Our analysis highlighted that S4 is as fast as GPT during training (both are approximately 1.3 times faster than LSTM), and the fastest in terms of generation (Supplementary Fig. 4), thanks to its dual formulation. This further advocates for the introduction of S4 as an efficient approach for molecule design that “makes the best of both worlds” of GPT and LSTM.

Prospective de novo design

We conducted a prospective in silico study with S4, focused on designing inhibitors of mitogen-activated protein kinase 1 (MAPK1), a relevant target for oncological therapies73. The putative bioactivity of the designs was then evaluated via molecular dynamics (MD).

The S4 model previously pre-trained on ChEMBL was fine-tuned with the SMILES strings of 68 manually-curated inhibitors from ChEMBL with an experimental inhibition constant (Ki) lower than 1 μM on MAPK1. The models from the last five fine-tuning epochs were then used to generate 256K molecules (51,200 designs per T value, with T ranging from 1.0 to 2.0 in steps of 0.25).

The designs were ranked and filtered via the log-likelihood score (Eq. (5)) and scaffold similarity to the training set (see “Materials and methods” for further details). The ten top-scoring molecules (1–10, Fig. 5a and Table 3) were considered for further characterization using MD simulations. As a reference for evaluation, we also performed MD simulations for the closest fine-tuning neighbor of each considered design (compounds 11–16, selected based on scaffold similarity; Fig. 5a and Table 3). The absolute protein-ligand binding free energy (expressed as ΔG – the lower, the stronger the predicted binding) for molecules 1–16 was computed via Umbrella Sampling74 (Table 3). The computed ΔG values for the known bioactive molecules (11–16) correspond well with the experimental Ki values from ChEMBL (Table 3), confirming the validity of the chosen MD protocol.

Fig. 5: Prospective de novo design of putative MAPK1 inhibitors with S4.
figure 5

a Selected de novo designs (molecules 1–10) for further characterization. For each de novo design, its most similar training MAPK1 inhibitor (as reported in Table 3) is depicted (compounds 11–16, gray box). The ligand binding pose (obtained via Umbrella Sampling) of selected designs interacting with MAPK1 (PDB-ID = 2Y9Q), in comparison with their most similar bioactive molecule from the fine-tuning set, is also depicted: b Design 2 (green) compared with compound 12 (gray). c Design 3 (yellow) compared with compound 13 (gray). d Design 9 (blue) compared with compound 13 (gray).

Table 3 In silico prospective study on designing mitogen-activated protein kinase (MAPK1) inhibitors with S4

Eight out of ten designs (all except 1 and 5) showed a high predicted affinity (Table 3), with ΔG values ranging from ΔG = −10.3 ± 0.6 kcal mol−1 (7) to ΔG = −23 ± 4 kcal mol−1 (2). Interestingly, these affinities are comparable to, or even surpass, those of the closest active neighbors (ΔG = −9.1 ± 0.8 kcal mol−1 to ΔG = −13 ± 2 kcal mol−1). The global substructure similarity (measured on extended connectivity fingerprints) of the designs to their closest neighbor ranges from 31% (10) to 87% (4, Table 3).

The most potent design according to MD predictions is molecule 2 (ΔG = −23 ± 4 kcal mol−1, Table 3). This molecule – which is the largest one among the designs (Fig. 5a) – engages extensively with the binding pocket of MAPK1 (Fig. 5b), which explains the remarkably favorable predicted affinity. Design 2 has a limited substructure similarity to its closest bioactive neighbor (molecule 12, similarity equal to 57%); however, its synthetic accessibility may be limited. Design 3 has the second-highest predicted affinity (ΔG = −19.6 ± 0.9 kcal mol−1) and shares the same scaffold as compound 13. Design 3 differs from 13 by the replacement of the ether and hydroxy moieties with two fluorine atoms, and the addition of a methoxy group (Fig. 5a, global similarity equal to 65%). Interestingly, this structural modification leads to an improvement of the predicted ΔG value (of approximately −10 kcal mol−1), possibly due to its ability to penetrate more deeply into the binding pocket thanks to the fluorine atoms (Fig. 5c). Halogens are, in fact, favorable for MAPK1, as evident from the fine-tuning molecules (91% of them containing at least one halogen) and existing literature (e.g., refs. 75,76,77,78). Evidence of a favorable positioning of halogens is shown on both the “top”75,76 and “bottom”77,78 of the binding pocket, further supporting the predicted affinity of compound 3.

Design 9 (ΔG = −17 ± 2 kcal mol−1) features halogens on both sides, unlike its closest neighbor, molecule 13 (ΔG = −10.5 ± 0.7 kcal mol−1, global similarity equal to 33%), from which it also differs in the moiety attached to the pyridone ring (Fig. 5a). When inspecting the predicted binding pose, it can be observed that the aromatic ring with halogen substituents, the hydroxyl, and the carbonyl of the pyridone are situated in the same region of the binding groove (Fig. 5d). The difference in ΔG values (approximately 6.5 kcal mol−1 in favor of design 9) could be ascribed, as with molecule 3, to the presence of halogens in the lower binding pocket region. This might also explain the high predicted affinity of design 10 (ΔG = −15 ± 2 kcal mol−1), which differs from 9 by a carbonyl and a methyl group.

With 8 out of 10 designs predicted by MD to be bioactive on the intended target, with affinities comparable to or higher than those of their closest fine-tuning molecules, these results further corroborate the potential of S4 for de novo drug design.

Opportunities for molecular S4

In conclusion, this study pioneered the introduction of state space models into chemical language modeling, with a focus on structured state spaces (S4s). The unique dual nature of S4s, involving convolution during training and recurrent generation, makes them particularly intriguing for de novo design starting from SMILES strings.

Our systematic comparison with GPT and LSTM on a variety of drug discovery tasks revealed S4’s strengths: while recurrent generation (LSTM and S4) is superior in learning the chemical syntax and exploring diverse scaffolds, learning holistically from the entire SMILES sequence (GPT and S4) excels in capturing certain complex properties, like bioactivity. S4, with its dual nature, “makes the best of both worlds”: it demonstrated comparable or better performance than LSTM in designing valid and diverse molecules, and systematically outperformed both benchmarks in capturing complex molecular properties – all while maintaining computational efficiency.

The application of S4 to MAPK1 inhibition, validated by MD simulations, further showcases its potential to design potent bioactive molecules. In the future, we will apply S4 prospectively in combination with wet-lab experiments to enhance its impact in the field. Strategies to increase the structural diversity of the considered designs, such as SMILES augmentation79 and improved ranking protocols, could further boost its potential in medicinal chemistry.

Several aspects of S4 await to be explored in the molecular sciences, such as its potential with longer sequences (e.g., macrocyclic peptides and protein sequences) and on additional molecular tasks (e.g., organic reaction planning80 and structure-based drug design81).

In the future, we envision the relevance of S4 for molecule discovery to increase, and to potentially replace widely established chemical language models like LSTM and GPT. We believe that the provided open-access code will contribute to the adoption and expansion of S4, to further stretch the boundaries of chemical language modeling.

Methods

Designing drug-like molecules

Data curation

The pre-training set was generated starting from ChEMBL v3143. Fine-tuning datasets were extracted from LIT-PCBA53. All sets were generated by (1) retaining molecules containing only selected atoms (C, H, O, N, S, P, F, Cl, Br, and I), (2) removing salts and disconnected structures, as well as stereochemistry annotations and charges, and (3) retaining molecules whose canonicalized SMILES strings contained 100 tokens or fewer. After sanitization, canonicalization, label encoding, and padding (to 100 tokens), molecules were randomly split into training, validation, and test sets. For ChEMBL, this led to a training set of 1,900,000 molecules, and validation and test sets of 100,000 and 23,680 molecules, respectively. The number of compounds for each fine-tuning campaign is reported in Supplementary Table 3.
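A minimal sketch of these curation steps with RDKit is given below. The tokenization (a simple regular expression) and the standardization helpers are assumptions and may differ from the exact procedure used in this work.

```python
import re
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

ALLOWED_ATOMS = {"C", "H", "O", "N", "S", "P", "F", "Cl", "Br", "I"}
# simple SMILES tokenizer (bracket atoms and two-letter halogens as single tokens);
# an assumption, the tokenization used in the paper may differ
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def curate(smiles, max_tokens=100):
    """Atom filter, salt/fragment removal, charge and stereo removal, canonicalization,
    and a token-length filter. Returns the canonical SMILES or None if the molecule is rejected."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)  # drop salts / disconnected parts
    mol = rdMolStandardize.Uncharger().uncharge(mol)             # neutralize charges
    Chem.RemoveStereochemistry(mol)                              # drop stereo annotations
    if any(a.GetSymbol() not in ALLOWED_ATOMS for a in mol.GetAtoms()):
        return None
    canonical = Chem.MolToSmiles(mol)
    return canonical if len(TOKEN_RE.findall(canonical)) <= max_tokens else None
```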

Training

Pretraining

The hyper-parameters of the LSTM and GPT were tuned with random search within a 5-day limit on a single NVIDIA A100 40GB GPU, with the hyper-parameter space defined based on previous work27,57,60,82 (Supplementary Table 4). In total, 40 LSTM and 35 GPT models were optimized within this limit. The hyper-parameter search was conducted to maximize SMILES validity during pre-training.

To account for the lack of previous information on optimal hyper-parameters for molecule generation with S4, we implemented a two-step procedure for hyper-parameter tuning. First, 242 models were trained to prioritize hyper-parameters (see Supplementary Table 4). High-performing hyper-parameter values in terms of validation accuracy were advanced to the second phase, where 108 experiments were conducted. Hyper-parameter search was conducted for 10 days on multiple NVIDIA A100 40GB GPUs to maximize the validity during pre-training.

Fine-tuning

Five fine-tuning campaigns were conducted on five targets: PKM2, MAPK1, GBA, mTORC1, and TP53. For each target, ten runs with different training (80%), validation (10%), and test (10%) splits were performed (except for PKM2, where we used 70%–15%–15% due to limited data). Early stopping on the validation cross-entropy was adopted with a patience of five epochs and a tolerance of \(10^{-5}\).

Temperature sampling

The sampling probability (p) of each i-th element at any step of the sequence was computed as follows:

$$p_{i}=\frac{e^{y_{i}/T}}{\sum_{j}e^{y_{j}/T}}$$
(3)

where yi is the predicted probability of the ith element, T is the sampling temperature, and j runs over all tokens in the vocabulary.
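A minimal sketch of one sampling step with Eq. (3) is shown below; here y stands for the model's per-token outputs at the current position, and the function name is ours.

```python
import numpy as np

def sample_next_token(y, T=1.0, rng=None):
    """Sample the index of the next SMILES token with temperature T (Eq. (3)).

    Small T approaches greedy selection of the most likely token;
    larger T makes the choice increasingly random.
    """
    rng = rng or np.random.default_rng()
    z = np.asarray(y) / T
    z = z - z.max()                      # subtract the max for numerical stability
    p = np.exp(z) / np.exp(z).sum()      # Eq. (3)
    return rng.choice(len(p), p=p)
```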

Molecule ranking with log-likelihoods

The molecules were ranked based on the joint likelihood of the tokens (i.e., SMILES characters) they contain82. For each test molecule, the joint log-likelihood (\(\mathcal{L}\)) under a model M was computed as:

$$\mathcal{L}(\mathbf{M})=\sum\limits_{i}\log p(t_{i})$$
(4)

where ti is the ith token of the SMILES string of a given test molecule and p(ti) is the probability of that token as predicted by the model M; i runs over all the elements in the molecular string.

To only consider the fine-tuning information and remove potential pre-training bias (as previously observed82), the pre-training log-likelihood was subtracted from the fine-tuning log-likelihood, to obtain a final score:

$$\mathcal{L}_{\mathrm{score}}(\mathbf{M})=\mathcal{L}(\mathbf{M}_{\mathrm{ft}})-\mathcal{L}(\mathbf{M}_{\mathrm{pt}})$$
(5)

where Mft is the fine-tuned model and Mpt is the pre-trained model. The obtained \(\mathcal{L}_{\mathrm{score}}\) was used to rank each test molecule; the higher the \(\mathcal{L}_{\mathrm{score}}\), the better the rank.
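Putting Eqs. (4) and (5) together, a minimal sketch of the scoring step is shown below, given per-token probabilities from the fine-tuned and pre-trained models; the function names are ours.

```python
import numpy as np

def log_likelihood(token_probs):
    """Joint log-likelihood of a SMILES string (Eq. (4)), from its per-token probabilities."""
    return float(np.sum(np.log(token_probs)))

def likelihood_score(finetuned_probs, pretrained_probs):
    """Ranking score of Eq. (5): fine-tuning minus pre-training log-likelihood (higher = better rank)."""
    return log_likelihood(finetuned_probs) - log_likelihood(pretrained_probs)
```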

Natural product design

The COlleCtion of Open Natural ProdUcTs (COCONUT)70 database was used for model training. Salts, disconnected structures, stereochemistry, and charge annotations were removed. Molecules with canonical SMILES strings longer than 100 characters were used to train the models. A random search strategy was adopted to tune the hyper-parameters of all models, as previously explained. The models were given a 5-day limit on a cloud NVIDIA A100 GPU and 1024 strings were generated by each model. The models with the highest SMILES validity were selected for further evaluation.

Prospective de novo design

Data curation

Fine-tuning data were collected from ChEMBL v3343. All annotations for MAPK1 were retained (target ID: CHEMBL4040). Available assay descriptions were manually inspected and analyzed. Molecules whose inhibitory constant (Ki) was lower than 1 μM on reliable inhibition assays (CHEMBL3412886, CHEMBL917079) were retained. SMILES canonicalization and removal of stereochemistry and duplicates led to a set of 68 unique SMILES strings for fine-tuning (SMILES strings available in the dedicated GitHub repository).

Model fine-tuning and de novo design

The fine-tuning dataset was split into ten training and validation splits (80–20%) to find the optimal number of fine-tuning epochs. Early stopping on the validation loss was used, with a patience of five epochs and a tolerance of \(10^{-5}\). The experiments suggested 45 epochs to be optimal; the pre-trained model was accordingly fine-tuned on the whole dataset.

The models of the last five fine-tuning epochs were used to design molecules. For each of the five temperature values (ranging from 1.0 to 2.0 with a step size of 0.25) and each of the five models, 10,240 designs were generated, totaling 5 × 5 × 10,240 = 256K designs. The novel and unique molecules among those designs were ranked by their fine-tuning log-likelihood (Eq. (4)) and the top 5000 molecules were selected for further analysis.

The 5000 top-scoring molecules were divided into two groups, based on their similarity to the fine-tuning set. The similarity was measured via Tanimoto similarity on the extended connectivity fingerprints61 of the Bemis-Murcko scaffolds63 (using a radius of 3 bonds and 2048 bits), with a threshold of 60% similarity. The designs in each list were grouped by their most similar training molecule and ranked by the log-likelihood score (Eq. (5)). The highest-scoring molecule in each group was picked. The top five molecules of each design list (i.e., 1–5 for higher similarity, and 6–10 for lower similarity) and their most similar actives (11–16, based on scaffold similarity, Table 3) were selected for molecular dynamics simulations.

Molecular dynamics simulation

The protein structure of MAPK1 was sourced from the Protein Data Bank under the accession code 2Y9Q, with a resolution of 1.55 Å and an R-value (free) of 0.177. Initial complex structures resulted from docking simulations using Vina83, establishing the binding pose for subsequent investigation via Umbrella Sampling. The simulation system was set up with the CHARMM-GUI web-based graphical interface84. The system was solvated in a cubic water box with an edge distance of 13 Å, supplemented with a 0.15 M NaCl solution for neutralization. All simulations were run with Gromacs version 202185 on the Dutch supercomputer Snellius. The solvated systems were energy-minimized through a sequence of steps using the steepest descent method and the conjugate gradient algorithm. Equilibration was then performed with 1 ns in the NVT (constant number of atoms, volume, temperature) ensemble, followed by 5 ns in the NPT (constant number of atoms, pressure, temperature) ensemble.

Binding free energy calculation

The last conformations of the equilibration phase were used as the starting structures for the ligand unbinding simulations. Distance-based Steered MD (center-of-mass pulling) was used to pull the ligand away from the protein by approximately 30 Å over the course of 4 ns, applying a 1000 kJ/(mol nm2) force constant along the reaction coordinate (ξ) with a pulling speed (ν) of 0.001 nm/ps. Snapshots taken at 10 ps intervals yielded 400 configurations from these pulling simulations. Depending on the ligand, between 22 and 28 conformations were extracted along the reaction coordinate (ξ) at approximately 0.1 nm intervals. These configurations were then employed as the starting points for the individual Umbrella Sampling simulations, whose number thus varied with the specific ligand under study. Each conformation underwent independent NPT equilibration for 5 ns, followed by a 20 ns MD run in triplicate for each ligand. The potential of mean force (PMF) was determined via the weighted histogram analysis method (WHAM)86, a component of Gromacs. The resulting PMF profiles report the free energy (in kcal mol−1) required to dissociate the ligand from the binding pocket as a function of the corresponding distance. The binding free energy (ΔG) of each ligand was computed by comparing the plateau region of the PMF curve to the energy minimum obtained from each simulation. In total, the Umbrella Sampling simulations spanned 400 to 560 ns per replicate and comprised 3 replicates, thereby accumulating simulation times ranging from 1.2 μs to 1.6 μs for each ligand.
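As an illustrative sketch of the final arithmetic step only (reading a PMF profile and comparing its plateau to its minimum), the snippet below makes assumptions about the file layout (two columns: reaction coordinate and PMF in kcal mol−1, with "#"/"@" comment lines), the fraction of the profile treated as the plateau, and the sign convention; it is not the exact analysis code used in this work.

```python
import numpy as np

def binding_free_energy(pmf_file, plateau_fraction=0.2):
    """Estimate ΔG as the PMF minimum minus the PMF plateau (negative = favorable binding).

    pmf_file: two-column text file with reaction coordinate (nm) and PMF (kcal/mol);
              columns, units, and comment markers are assumptions about the WHAM output.
    plateau_fraction: fraction of the largest-distance windows averaged as the plateau.
    """
    xi, pmf = np.loadtxt(pmf_file, comments=("#", "@"), unpack=True)
    pmf = pmf[np.argsort(xi)]                       # order the profile along the reaction coordinate
    n_plateau = max(1, int(len(pmf) * plateau_fraction))
    plateau = pmf[-n_plateau:].mean()               # unbound region, far from the binding pocket
    return pmf.min() - plateau                      # ΔG of binding relative to the unbound plateau
```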

Software and code

Data preprocessing, scaffold determination, and molecular fingerprint and descriptor calculation were performed with default settings (unless otherwise noted), using RDKit v2020.09.01 in a Python environment. LSTM and GPT were implemented in Keras v2.7.0 (TensorFlow v2.7.1). The S4 code was extracted from the existing PyTorch Lightning v1.15.0 implementation35 and simplified to rely solely on PyTorch v1.13.1.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.