Introduction

Designing molecules with desired properties from scratch is a “needle in the haystack” problem. The chemical universe – estimated to comprise up to \(10^{60}\) small molecules1 – remains largely uncharted. Generative deep learning offers unprecedented opportunities to explore the chemical universe in a time- and cost-efficient manner2, by enabling the production of desirable molecules without the need for hand-crafted design rules. In particular, chemical language models (CLMs) have yielded experimentally-validated bioactive designs3,4,5,6,7 and stood out as powerful molecular generators2,8,9,10,11,12,13.

CLMs adapt algorithms developed for sequence processing to learn the “chemical language”, that is, how to generate molecules that are chemically valid (syntax) and possess desired properties (semantics)7. This is achieved by representing molecular structures as string notations, such as the Simplified Molecular Input Line Entry System (SMILES14, Fig. 1a), among others15,16. These molecular strings are then used for model training and for the subsequent generation of molecules in textual form. Compared to generative methods based on molecular graphs17, CLMs capture complex molecular properties better8 and generate increasingly larger molecules more efficiently18,19. These aspects have made CLMs one of the de facto approaches for de novo drug design.
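As a minimal illustration of the syntax aspect, the snippet below uses RDKit (the cheminformatics toolkit employed in the Methods) to check whether a SMILES string corresponds to a chemically valid molecule and to map it to a canonical form; the example molecule and the helper name are illustrative, not part of the original pipeline.

```python
from rdkit import Chem

def canonicalize(smiles: str):
    """Return the canonical SMILES if the string is chemically valid, else None."""
    mol = Chem.MolFromSmiles(smiles)  # parses the syntax and checks valence rules
    return Chem.MolToSmiles(mol) if mol is not None else None

# Two textual variants of aspirin map to the same canonical string,
# while a syntactically broken string is rejected.
print(canonicalize("CC(=O)Oc1ccccc1C(=O)O"))   # valid -> canonical SMILES
print(canonicalize("O=C(O)c1ccccc1OC(C)=O"))   # same molecule, same canonical form
print(canonicalize("CC(=O)Oc1ccccc1C(=O"))     # unbalanced branch -> None
```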

Fig. 1: Key concepts of structured state space sequence (S4) models for chemical language modeling.
figure 1

a Simplified Molecular Input Line Entry System (SMILES) strings14, used as the chemical language. SMILES strings are obtained by traversing the molecular graph and annotating atom types, rings, and bond types in the form of text. b S4 for de novo SMILES design. During training, S4 is formulated as a global convolution and processes the whole molecular string simultaneously. The global convolution filter \(\overline{\boldsymbol{K}}\) is parameterized via the matrices \(\overline{\boldsymbol{A}}\), \(\overline{\boldsymbol{B}}\), and \(\overline{\boldsymbol{C}}\) (Eq. (2)). During generation, S4 switches to the recurrent formulation (via the same parameters, Eq. (1)) and produces SMILES strings element-by-element for more efficient and effective chemical space exploration. c Computational pipeline, where S4 was used to learn from known SMILES strings and generate new molecules de novo.

Several CLM architectures have been proposed for de novo design20, the most popular of which are long short-term memory (LSTM)3,4,5,21,22 models. LSTMs are trained to produce molecular strings element-by-element and have fast generation capabilities. However, their iterative structure forces these models to compress the sequence into an information bottleneck and challenges the learning of global sequence properties23,24,25. Transformers26 – such as generative pretrained transformers (GPTs) – are a more recent architecture that overcomes this bottleneck by processing the entire input molecular string at once27,28. LSTMs and GPTs present different – and somewhat complementary – strengths and weaknesses when it comes to de novo molecule design25,29,30,31,32. The recurrent nature of LSTMs allows learning local properties better than GPTs, while GPTs capture global properties better thanks to their ‘holistic’ processing25. Moreover, while LSTMs remain efficient, Transformers become increasingly compute-intensive when generating progressively longer SMILES strings, which might limit their broad applicability in the chemical sciences. These aspects make it necessary to stretch the boundaries of current CLM approaches further, to chart the chemical space more effectively in search of bioactive molecules25.

Structured state space sequence models (S4s) are a recent member of the fast-growing family of state space architectures33,34,35,36, which are gathering increasing attention in the deep learning community37,38,39,40. S4s showed outstanding performance in audio, image, and text generation35 and have a “dual nature”: they (1) are trained over the entire input sequences to learn complex global properties and (2) generate one string element at a time – thereby combining some respective strengths of Transformers and LSTMs. Motivated by such “best of two worlds” behavior, here we ask the following question: Can S4 advance the current state-of-the-art in chemical language modeling? We find evidence that it can.

Here, we apply S4 to chemical language modeling on SMILES strings and benchmark it on various tasks relevant to drug design – from learning bioactivity to chemical space exploration and natural-product design. Moreover, we further corroborate the promise of S4 via the prospective de novo design of kinase inhibitors, validated using molecular dynamics simulations. Our results show the potential of S4 for chemical language modeling, especially in capturing bioactivity and complex molecular properties. To the best of our knowledge, this is the first time that state space models have been applied to molecular tasks, and we expect their relevance for chemical language modeling to increase in the future.

Results and discussion

Structured state space sequence model (S4)

S4s are an extension of discrete state space models, which are widely adopted in control engineering41. Discrete state space models map an input sequence u to an output sequence y, through the learnable parameters \(\overline{\boldsymbol{A}}\in \mathbb{R}^{N\times N}\), \(\overline{\boldsymbol{B}}\in \mathbb{R}^{N\times 1}\), \(\overline{\boldsymbol{C}}\in \mathbb{R}^{1\times N}\), and \(\overline{\boldsymbol{D}}\in \mathbb{R}^{1\times 1}\), as follows:

$$x_{k} = \overline{\boldsymbol{A}}\,x_{k-1}+\overline{\boldsymbol{B}}\,u_{k}\\ y_{k} = \overline{\boldsymbol{C}}\,x_{k}+\overline{\boldsymbol{D}}\,u_{k}.$$
(1)

In other words, discrete state space models define a “linear recurrence”: at any step k, the k-th element of the input sequence uk is fed into the model and used to update the hidden state xk and to generate an output, yk. The matrices \(\overline{\boldsymbol{A}}\), \(\overline{\boldsymbol{B}}\), \(\overline{\boldsymbol{C}}\), and \(\overline{\boldsymbol{D}}\) control how the input and the hidden state are combined to provide an output (Fig. 1b).

Besides the recurrent formulation, discrete state space models can be expressed as a convolution with the same set of parameters. It can be demonstrated that, by “unrolling” the linear recurrence (Eq. (1)), the output sequence y can be obtained via a learnable convolution over the input sequence u:

$$y=u * \overline{\boldsymbol{K}},$$
(2)

where \(\overline{\boldsymbol{K}}\) is the convolution filter, parameterized via \(\overline{\boldsymbol{A}}\), \(\overline{\boldsymbol{B}}\), and \(\overline{\boldsymbol{C}}\) (see Supplementary Eqs. (1)–(4) for a detailed derivation). This convolutional representation reveals a key aspect of state space models: they learn explicitly from the entire sequence (via global convolution) while preserving recurrent generation capabilities (Fig. 1b).
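For concreteness, the sketch below implements Eqs. (1) and (2) directly in NumPy with randomly initialized (untrained) placeholder parameters and checks that the recurrent and convolutional views produce identical outputs. It is a minimal sketch of the duality, not the S4 implementation used in this work, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                                   # state dimension
A = rng.standard_normal((N, N)) * 0.1    # placeholder for the trained matrix A
B = rng.standard_normal((N, 1))          # placeholder for B
C = rng.standard_normal((1, N))          # placeholder for C
D = rng.standard_normal((1, 1))          # placeholder for D

def ssm_recurrence(u):
    """Linear recurrence of Eq. (1): one input element processed per step."""
    x = np.zeros((N, 1))
    ys = []
    for u_k in u:
        x = A @ x + B * u_k                   # x_k = A x_{k-1} + B u_k
        ys.append((C @ x + D * u_k).item())   # y_k = C x_k + D u_k
    return np.array(ys)

def ssm_convolution(u):
    """Global convolution of Eq. (2), with the unrolled filter K = (CB, CAB, CA^2B, ...)."""
    L = len(u)
    K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
    y = np.array([np.dot(K[:k + 1], u[k::-1]) for k in range(L)])  # causal convolution
    return y + D.item() * np.asarray(u)       # the D*u_k "skip" term of Eq. (1)

u = rng.standard_normal(20)
assert np.allclose(ssm_recurrence(u), ssm_convolution(u))  # both formulations agree
```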

Learning the optimal parameters of a discrete state space system, however, introduces vanishing gradients and numerical instabilities in the recurrent and convolutional formulations, respectively. Structured state space sequence models (S4s)35 tackle those issues by introducing additional structure to the model parameters (via the so-called high-order polynomial projection operators33) and by reducing the unstable computations to the evaluation of the stable Cauchy kernel42 (see ref. 35 for more detail). Ablation studies35 have shown the relevance of the added structure for achieving computational feasibility and performance on long sequences. Moreover, this reduction allows S4 to address the numerical instabilities encountered in model training and has made S4 state-of-the-art in several generative tasks that require learning long-distance relationships33,34,35. Motivated by its performance in other domains and the potential benefits of its dual structure, here we introduce S4 to the molecular sciences for the first time.

We evaluated S4 for its ability to learn from and generate drug-like molecules and natural products in an array of tasks, and in terms of multiple molecular properties. LSTMs and Generative Pretrained Transformers (GPTs) were used as benchmarks, since they are the de facto approaches in chemical language modeling for de novo design2,7,8,25. Furthermore, LSTM (recurrent training and generation) and GPT (holistic training and generation) constitute the ideal benchmarks for S4, due to S4’s dual formulation (convolution during training and recurrence during generation), which allows inspecting the effect of each of these aspects on the overall performance. Finally, the prospective de novo design of putative mitogen-activated protein kinase 1 (MAPK1) inhibitors, corroborated by molecular dynamics simulations, was performed to test the potential of S4 in real-world drug discovery scenarios.

Designing drug-like molecules

S4 was analyzed for its ability to design drug-like small molecules (SMILES strings of fewer than 100 tokens) extracted from the ChEMBL database43, focusing on (1) learning the chemical syntax, (2) capturing structural features relevant for bioactivity, and (3) designing structurally diverse molecules.

Learning the SMILES syntax

All investigated CLMs were trained on 1.9M canonical SMILES strings extracted from ChEMBL v3143. The generated strings were evaluated according to their (1) validity, i.e., the number (and frequency) of SMILES corresponding to chemically valid molecules; (2) uniqueness, which captures the number (and frequency) of structurally-unique molecules among the designs; and (3) novelty, corresponding to the number (and frequency) of unique and valid designs that are not included in the training set. A high number of “chemically-valid” designs suggests that the model has learned how to generate plausible molecules, while high uniqueness and novelty values indicate little redundancy among the designs and with the training set, respectively. Although these metrics are vulnerable to trivial baselines44, they provide insights into a model’s capacity to learn the SMILES “syntax”.
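As a minimal sketch (not the exact evaluation code of this study), the three metrics can be computed from a list of generated SMILES strings with RDKit as follows; the helper name and the normalization convention (e.g., novelty relative to the unique designs) are assumptions that may differ from the ones used here.

```python
from rdkit import Chem

def syntax_metrics(designs, training_smiles):
    """Validity, uniqueness, and novelty of generated SMILES strings.

    designs: list of generated SMILES strings
    training_smiles: iterable of canonical SMILES strings used for training
    """
    mols = (Chem.MolFromSmiles(s) for s in designs)
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]   # chemically valid designs
    unique = set(valid)                                            # canonical form removes duplicates
    novel = unique - set(training_smiles)                          # valid, unique, and unseen in training
    return {
        "validity": len(valid) / len(designs),
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
    }
```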

All CLMs generated more than 91% valid, 91% unique and 81% novel molecules (Table 1). Moreover, their designs approximated the training and test sets in terms of selected molecular properties (i.e., octanol-water partition coefficient45, quantitative estimate of drug-likeness46, Bertz complexity47, and synthetic accessibility48,49) with no notable differences among architectures (Supplementary Fig. 1 and Supplementary Table 1). These results agree with the literature on CLMs (e.g., refs. 2,50) and demonstrate the robustness of the model training procedure. S4 designs the highest number of valid, unique, and novel molecules (approximately 4000 to 12,000 more novel molecules than the benchmarks) and displays a good ability to learn the “chemical syntax” of SMILES strings. The potential of S4 in comparison with existing de novo design approaches was further corroborated on the MOSES benchmark51, where S4 consistently scored among the top-performing deep learning approaches (Supplementary Table 2).

Table 1 Designing drug-like molecules de novo with S4

To shed additional light on the strengths and limitations of S4 in comparison with the benchmarks, we analyzed the sources of invalid molecule generation for all methods in terms of branching and ring errors, erroneous bond assignment, and other (miscellaneous) syntax issues (Fig. 2). Interestingly, each method shows different types of errors leading to SMILES invalidity. LSTM struggles the most with branching and performs the best with bond assignment, while GPT struggles the most with rings and bond assignment, and has intermediate performance otherwise. S4 struggles more than LSTM with bond assignment, and generates remarkably fewer errors than both benchmarks in branching and ring design. Our hypothesis is that correct bond assignment reflects good learning of “short-range” dependencies, while branching and ring opening and closure require better capturing of “long-range” relationships. This suggests that S4 captures relatively “long-distance” relationships well, in agreement with existing evidence in other domains33,34,35.

Fig. 2: SMILES design errors, grouped by category and CLM architecture.
figure 2

Each CLM was trained on ChEMBL and used to design 102,400 SMILES strings. The invalid designs were categorized per error, and the reported values indicate the number of errors in each category.

Capturing bioactivity

We evaluated S4 for its ability to learn elements of bioactivity. With CLMs, this is often achieved with transfer learning52, which allows transferring knowledge acquired on one task to another task with less available data. Via transfer learning, after pre-training a CLM on a large corpus of SMILES strings, the model can then be “fine-tuned” on a smaller, task-focused set (e.g., bioactive molecules) by additional training22. Here, we performed five fine-tuning campaigns, focusing on distinct macromolecular targets from the LIT-PCBA53 dataset: (1) pyruvate kinase muscle isoform 2 (PKM2), (2) mitogen-activated protein kinase 1 (MAPK1), (3) glucocerebrosidase (GBA), (4) mechanistic target of rapamycin (mTORC1), and (5) cellular tumor antigen p53 (TP53).

Evaluating the bioactivity of de novo designs (besides synthesis and wet-lab testing) is non-trivial, since this property cannot be fully captured by traditional molecular descriptors, and might not be accurately predicted by quantitative structure-activity relationship models54,55. Hence, we used experimentally-tested molecules to evaluate the capacity of a CLM to learn elements of bioactivity retrospectively. Several studies have shown that the likelihoods learned by a CLM during fine-tuning can be used to prioritize designs with high chances of being bioactive6,56,57. Based on the same principle, here we used the likelihoods learned by the CLMs to rank existing molecules and evaluate their capacity to prioritize bioactive compounds over inactive ones.

For each of the selected targets, bioactive molecules (Supplementary Table 3) were used for fine-tuning, with ten random training-validation-test splits. After fine-tuning the CLMs on each target, for each training-test split, we proceeded as follows:

  1. With each fine-tuned model and for each target, we predicted the likelihoods (Eq. (4)) of the SMILES strings in the respective test set. The considered test sets resemble a real-world scenario in terms of hit rate, comprising 11 (mTORC1) to 56 (PKM2) active molecules and 10,240 inactive molecules (except for TP53, which contains 3301 inactive molecules; Supplementary Table 3);

  2. We ranked the molecules of the test set according to the predicted likelihoods (Eq. (5));

  3. For each target and each test set, we computed the fraction of actives ranked among the top 10, top 50, and top 100 molecules (a minimal sketch of this computation is shown after the list). The more active molecules a CLM ranks in the early portions of the test set, the better the model has learned what is relevant for bioactivity on the investigated target after fine-tuning.
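A minimal sketch of step (3) is given below: given per-molecule scores (Eq. (5)) and activity labels for a test set, it returns the fraction of the known actives ranked within the top-k molecules. The function name and the normalization by the total number of actives are our assumptions.

```python
import numpy as np

def actives_in_top_k(scores, is_active, k):
    """Fraction of known actives that the model ranks among its k best-scored molecules.

    scores: per-molecule scores from the fine-tuned CLM (higher = better, Eq. (5))
    is_active: boolean array marking the experimentally active molecules
    """
    top_k = np.argsort(scores)[::-1][:k]          # indices of the k highest-scoring molecules
    return is_active[top_k].sum() / is_active.sum()

# e.g., actives_in_top_k(scores, labels, k=10) for the "top 10" metric in Fig. 3
```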

Our results show variable performance depending on the target (Fig. 3). The most challenging target is TP53, on which no model could consistently retrieve actives among the top 10 scoring molecules. Notably, this target has the most challenging test set, where inactive molecules are similar to the actives of both the training and the test sets (Supplementary Fig. 2), potentially indicating the presence of activity cliffs58. MAPK1 and mTORC1 also challenge the CLMs; here, S4 retrieved more active molecules than the benchmarks, especially in the early portions of the test set. PKM2 and GBA are the easiest datasets; here, all CLMs identified bioactive molecules in their top 10, with S4 achieving the highest median across the board. A Wilcoxon signed-rank test59 on the pooled scores across datasets supports the superior performance of S4 compared to the benchmarks (p [top 10] = 8.41e−6, p [top 50] = 2.93e−7, p [top 100] = 1.45e−7 compared to LSTM, and p [top 10] = 2.33e−3, p [top 50] = 3.72e−3, p [top 100] = 2.61e−2 compared to GPT), and of GPT compared to LSTM (p [top 10] = 5.22e−3, p [top 50] = 3.75e−5, p [top 100] = 2.02e−6).

Fig. 3: Retrospective enrichment analysis for all models across five selected macromolecular targets.
figure 3

The fine-tuned models were used to rank the held-out actives and inactives of the respective protein targets. The percentage of known actives ranked per considered number of test set molecules (10, 50, 100) was computed across ten runs. Bar heights report the median across runs and error bars report the first and third quartiles (n = 10). Source data are provided as a Source Data file.

Under the constraints of the study design, these results indicate that processing the input SMILES “holistically” (as GPT and S4 do) leads to a better capture of complex properties such as bioactivity, with S4 performing best.

Chemical space exploration

We analyzed the ability of S4 to explore the chemical space, in terms of generating structurally diverse and bioactive molecules. To this end, we employed a strategy commonly used with CLMs, that is, varying the sampling temperature (T) to control chemical diversity60. T controls the randomness of the weighted sampling used to pick each string element (Eq. (3)). When T → 0, the most likely element (based on the CLM prediction) is selected as the next element of the sequence, while the higher the T, the more random the selections. T = 1 corresponds to using the CLM predictions as the sampling probability of each element at each generation step.

We experimented with increasing sampling temperatures (from T = 1.0 to T = 2.0 with a step of 0.25). Each T value was used to generate 10,240 SMILES strings per model across the five chosen targets and all training-test splits. Then, we evaluated the designs based on three metrics (Fig. 4):

  • The validity of the generated strings, which captures how robust the model is to increasing degrees of randomness in preserving a correct syntax. The higher the validity, the better.

  • Rediscovery rate: de novo design models are often evaluated for their capacity to reproduce existing molecules with experimentally verified biological activities50,55. For this purpose, we used the held-out actives previously described for each target. To “relax” the rediscovery criterion, a held-out active was also considered rediscovered if a de novo design had a substructure similarity to it higher than 60% (as computed via Tanimoto similarity on extended connectivity fingerprints61). Higher rediscovery rates at increased temperature values indicate that the model can explore regions related to bioactivity despite increased randomness.

  • Scaffold diversity: designing molecules with novel scaffolds bears relevance in lead identification62, and can be used as a proxy to evaluate CLMs51. Here, to better evaluate what constitutes a novel scaffold, the novel designs were grouped into clusters based on their scaffold similarity. This was achieved via hierarchical clustering, grouping designs whose Bemis-Murcko scaffolds63 had a Tanimoto similarity (computed on the corresponding extended connectivity fingerprints61) higher than 60%. Only novel and unique scaffolds were considered. We then counted the number of scaffold clusters obtained (the higher, the better); a sketch of this clustering step is shown after the list.
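The sketch below (referenced in the last bullet) extracts Bemis-Murcko scaffolds with RDKit, builds a Tanimoto distance matrix on extended connectivity fingerprints, and applies hierarchical clustering with a 0.4 distance cutoff, corresponding to the 60% similarity threshold. The single-linkage criterion, fingerprint parameters, and all names are our assumptions; the clustering settings used in this study may differ.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def count_scaffold_clusters(smiles_list, cutoff=0.4):
    """Number of scaffold clusters among valid designs (similarity > 60% merges scaffolds)."""
    scaffolds = set()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            scaffolds.add(MurckoScaffold.MurckoScaffoldSmiles(mol=mol))  # Bemis-Murcko scaffold
    mols = [Chem.MolFromSmiles(s) for s in scaffolds]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 3, nBits=2048) for m in mols if m is not None]
    if len(fps) < 2:
        return len(fps)
    # pairwise Tanimoto distances between scaffold fingerprints
    dist = np.array([[1.0 - s for s in DataStructs.BulkTanimotoSimilarity(fp, fps)] for fp in fps])
    # single-linkage hierarchical clustering; scaffolds closer than the cutoff share a cluster
    labels = fcluster(linkage(squareform(dist, checks=False), method="single"),
                      t=cutoff, criterion="distance")
    return len(set(labels))
```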

Fig. 4: Model performance when varying the temperature value.
figure 4

Each model was analyzed for its performance when varying the temperature values from 1.0 to 2.0, with a step of 0.25, and by sampling 10,240 molecules. a Analysis of the SMILES validity across temperatures. b Variation of rediscovery rate. The models were evaluated for their capability to rediscover bioactive molecules not used for model training or to design molecules similar in structure (i.e., with a Tanimoto similarity on extended connectivity fingerprints higher than 60%). c Analysis of the number of diverse groups of scaffolds generated per method. Scaffolds were clustered together if they had a Tanimoto similarity (computed on extended connectivity fingerprints) larger than 60%. For each plot, the solid line indicates the median obtained across the five analyzed protein targets (PKM2, MAPK1, GBA, mTORC1, and TP53) with ten runs each (n = 50), and the shaded area indicates the inter-quartile range. The statistics per individual target can be found in Supplementary Fig. 3. Source data are provided as a Source Data file.

The models display similar trends with increasing T values for all the analyzed factors across datasets, with varying magnitude (Fig. 4). In general, validity decreases with increasing temperature (as previously observed60), with the strongest effect observed for GPT (median validity across training setups dropping below 40%, Fig. 4a).

Both S4 and LSTM show higher robustness than GPT to increasing temperature values (with LSTM performing slightly better for T ≥ 1.75), suggesting that sequential generation can boost chemical space exploration. S4 outperforms LSTM in terms of rediscovery rate (Fig. 4b), in agreement with our previous results on bioactivity (Fig. 3). We also computed the exact rediscovery rate (identical molecular structure) and observed that no model can consistently generate held-out actives. When it comes to the diversity of the designs (Fig. 4c), LSTM generates the highest number of structurally unique scaffolds (median across datasets and setups: 6602 clusters, T = 1.75) and S4 is the close second-best model (6520 clusters, T = 1.75). While GPT obtains a suboptimal performance across the board, LSTM seems preferable for chemical space exploration when bioactivity is not the main objective, whereas S4 better captures bioactivity while preserving good chemical space exploration, combining the strengths of the two benchmarks through its dual structure. These results confirm the promise of S4 when it comes to generating structurally diverse and bioactive drug-like molecules.

Designing natural products

S4 was further tested on more challenging molecular entities than drug-like molecules. To this end, we evaluated its capacity to design natural products (NPs), which are invaluable sources of inspiration for medicinal chemistry64,65. Compared to synthetic small molecules, NPs tend to possess more intricate molecular structures and ring systems, as well as a larger fraction of sp3-hybridized carbon atoms and chiral centers66,67,68. These characteristics correspond to longer SMILES sequences on average, with more long-range dependencies, and make natural products a challenging test case for CLMs19,69.

We trained the CLMs on large natural products (32,360 SMILES strings with length > 100, chosen to complement the previous analysis) from the COlleCtion of Open Natural ProdUcTs (COCONUT) database70. We then used the CLMs to design 102,400 SMILES strings de novo and computed the fraction of valid, unique, and novel designs (Table 2). All CLMs can design natural products, albeit with lower performance compared to drug-like molecules. S4 designs the highest number of valid molecules (approximately 6000 to 12,000 more than the benchmarks, corresponding to a 7–13% improvement), while LSTM achieves the highest novelty (approximately 2000 more novel molecules than S4, a 2% difference).

Table 2 Natural product design with CLMs

To further investigate the characteristics of the designs, we computed the natural-product likeness71, which captures how similar a molecule is to the chemical space covered by natural products in terms of its substructures (the higher the NP-likeness, the more similar). The novel designs of S4 have significantly higher NP-likeness values than those of the benchmarks (Mann–Whitney U test, p = 1.41e−53 compared to LSTM, and p = 1.02e−82 compared to GPT), and are closer to the values of the training and test sets on average (Table 2). Moreover, the NP-likeness values better match the distribution of the COCONUT molecules in terms of Kolmogorov–Smirnov (KS) distance72, which quantifies how much the cumulative distributions of two samples differ (between 0% and 100%; the lower, the closer the distributions).

In addition to NP-likeness, we evaluated the novel designs in terms of several structural properties important for natural products66,67,68, namely: the number of sp3-hybridized carbon atoms, aliphatic rings, spiro atoms and heavy atoms, as well as the molecular weight and the size of the largest fused ring system. These properties provide additional evidence on the molecular characteristics of the designs, and their structural complexity in comparison with the training natural products. Here, S4 achieved the lowest KS distance to the training and test sets across properties, indicating that its designs match the training natural products best. These results confirm the ability of S4 to learn complex molecular properties for de novo design.
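As a minimal sketch (with our own helper names and only a subset of the properties listed above), such property distributions can be computed with RDKit and compared via the KS distance using SciPy:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
from scipy.stats import ks_2samp

def structural_properties(smiles):
    """A subset of the natural-product-relevant properties discussed in the text."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "sp3_carbons": sum(1 for a in mol.GetAtoms()
                           if a.GetSymbol() == "C" and a.GetHybridization() == Chem.HybridizationType.SP3),
        "aliphatic_rings": rdMolDescriptors.CalcNumAliphaticRings(mol),
        "spiro_atoms": rdMolDescriptors.CalcNumSpiroAtoms(mol),
        "heavy_atoms": mol.GetNumHeavyAtoms(),
        "molecular_weight": Descriptors.MolWt(mol),
    }

def ks_distance_percent(designed_values, reference_values):
    """KS distance (0-100%) between a property distribution of the designs and a reference set."""
    return 100.0 * ks_2samp(designed_values, reference_values).statistic
```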

Finally, we analyzed the training and generation speed of the CLM architectures with increasing SMILES length, to test their practical applicability when designing bigger molecules, like natural products. Our analysis highlighted that S4 is as fast as GPT during training (both are approximately 1.3 times faster than LSTM), and the fastest in terms of generation (Supplementary Fig. 4), thanks to its dual formulation. This further advocates for the introduction of S4 as an efficient approach for molecule design that “makes the best of both worlds” of GPT and LSTM.

Prospective de novo design

We conducted a prospective in silico study with S4, focused on designing inhibitors of mitogen-activated protein kinase 1 (MAPK1), a relevant target for oncological therapies73. The putative bioactivity of the designs was then evaluated via molecular dynamics (MD).

The S4 model previously pre-trained on ChEMBL was fine-tuned with the SMILES strings of 68 manually-curated inhibitors from ChEMBL with an experimental inhibition constant (Ki) lower than 1 μM on MAPK1. The models from the last five fine-tuning epochs were then used to generate 256K molecules (51,200 designs per T value, with T ranging from 1.0 to 2.0 in steps of 0.25).

The designs were ranked and filtered via the log-likelihood score (Eq. (5)) and scaffold similarity to the training set (see “Materials and methods” for further details). The ten top-scoring molecules (1–10, Fig. 5a and Table 3) were considered for further characterization using MD simulations. As a reference for evaluation, we also performed MD simulations for the closest fine-tuning neighbor of each considered design (compounds 11–16, selected based on scaffold similarity; Fig. 5a and Table 3). The absolute protein-ligand binding free energy (expressed as ΔG – the lower, the stronger the predicted binding) for molecules 1–16 was computed via Umbrella Sampling74 (Table 3). The computed ΔG values for the known bioactive molecules (11–16) correspond well with the experimental Ki values from ChEMBL (Table 3), confirming the validity of the chosen MD protocol.

Fig. 5: Prospective de novo design of putative MAPK1 inhibitors with S4.
figure 5

a Selected de novo designs (molecules 1–10) for further characterization. For each de novo design, its most similar training MAPK1 inhibitor (as reported in Table 3) is depicted (compounds 11–16, gray box). The ligand binding pose (obtained via Umbrella Sampling) of selected designs interacting with MAPK1 (PDB-ID = 2Y9Q), in comparison with their most similar bioactive molecule from the fine-tuning set, is also depicted: b Design 2 (green) compared with compound 12 (gray). c Design 3 (yellow) compared with compound 13 (gray). d Design 9 (blue) compared with compound 13 (gray).

Table 3 In silico prospective study on designing mitogen-activated protein kinase (MAPK1) inhibitors with S4

Eight out of ten designs (all except 1 and 5) showed a high predicted affinity (Table 3), with ΔG values ranging from ΔG = −10.3 ± 0.6 kcal mol−1 (7) to ΔG = −23 ± 4 kcal mol−1 (2). Interestingly, these affinities are comparable to, or even surpass, those of the closest active neighbors (ΔG = −9.1 ± 0.8 kcal mol−1 to ΔG = −13 ± 2 kcal mol−1). The global substructure similarity (measured on extended connectivity fingerprints) of the designs to their closest neighbor ranges from 31% (10) to 87% (4, Table 3).

The most potent design according to MD predictions is molecule 2 (ΔG = −23 ± 4 kcal mol−1, Table 3). This molecule – which is the largest one among the designs (Fig. 5a) – engages extensively with the binding pocket of MAPK1 (Fig. 5b), which explains the remarkably favorable predicted affinity. Design 2 has a limited substructure similarity to its closest bioactive neighbor (molecule 12, similarity equal to 57%); however, its synthetic accessibility may be limited. Design 3 has the second-highest predicted affinity (ΔG = −19.6 ± 0.9 kcal mol−1) and shares the same scaffold as compound 13. Design 3 differs from 13 by the replacement of the ether and hydroxy moieties with two fluorine atoms, and the addition of a methoxy group (Fig. 5a, global similarity equal to 65%). Interestingly, this structural modification leads to an improvement of the predicted ΔG value (of approximately −10 kcal mol−1), possibly due to its ability to penetrate more deeply into the binding pocket thanks to the fluorine atoms (Fig. 5c). Halogens are, in fact, favorable for MAPK1, as evident from the fine-tuning molecules (91% of them containing at least one halogen) and existing literature (e.g., refs. 75,76,77,78). Evidence of a favorable positioning of halogens is shown on both the “top”75,76 and “bottom”77,78 of the binding pocket, further supporting the predicted affinity of compound 3.

Design 9 (ΔG = −17 ± 2 kcal mol−1) features halogens on both sides, unlike its closest neighbor, molecule 13 (ΔG = −10.5 ± 0.7 kcal mol−1, global similarity equal to 33%), from which it also differs in the moiety attached to the pyridone ring (Fig. 5a). When inspecting the predicted binding pose, it can be observed that the aromatic ring with halogen substituents, the hydroxyl, and the carbonyl of the pyridone are situated in the same region of the binding groove (Fig. 5d). The difference in ΔG values (approximately 6.5 kcal mol−1 in favor of design 9) could be ascribed, as with molecule 3, to the presence of halogens in the lower binding pocket region. This might also explain the high predicted affinity of design 10 (ΔG = −15 ± 2 kcal mol−1), which differs from 9 by a carbonyl and a methyl group.

With 8 out of 10 designs predicted by MD to be bioactive on the intended target, with affinities comparable to or higher than those of their closest fine-tuning molecules, these results further corroborate the potential of S4 for de novo drug design.

Opportunities for molecular S4

In conclusion, this study pioneered the introduction of state space models into chemical language modeling, with a focus on structured state spaces (S4s). The unique dual nature of S4s, involving convolution during training and recurrent generation, makes them particularly intriguing for de novo design starting from SMILES strings.

Our systematic comparison with GPT and LSTM on a variety of drug discovery tasks revealed S4’s strengths: while recurrent generation (LSTM and S4) is superior in learning the chemical syntax and exploring diverse scaffolds, learning holistically from the entire SMILES sequence (GPT and S4) excels in capturing certain complex properties, like bioactivity. S4, with its dual nature, “makes the best of both worlds”: it demonstrated comparable or better performance than LSTM in designing valid and diverse molecules, and systematically outperformed both benchmarks in capturing complex molecular properties – all while maintaining computational efficiency.

The application of S4 to MAPK1 inhibition, validated by MD simulations, further showcases its potential to design potent bioactive molecules. In the future, we will apply S4 prospectively in combination with wet-lab experiments to enhance its impact in the field. Strategies to increase the structural diversity of the considered designs, such as SMILES augmentation79 and improved ranking protocols, could further boost its potential in medicinal chemistry.

Several aspects of S4 await to be explored in the molecular sciences, such as its potential with longer sequences (e.g., macrocyclic peptides and protein sequences) and on additional molecular tasks (e.g., organic reaction planning80 and structure-based drug design81).

In the future, we envision the relevance of S4 for molecule discovery to increase, and to potentially replace widely established chemical language models like LSTM and GPT. We believe that the provided open-access code will contribute to the adoption and expansion of S4, to further stretch the boundaries of chemical language modeling.

Methods

Designing drug-like molecules

Data curation

The pre-training set was generated starting from ChEMBL v3143. Fine-tuning datasets were extracted from LIT-PCBA53. All sets were generated by (1) retaining molecules containing only selected atoms (C, H, O, N, S, P, F, Cl, Br, and I), (2) removing salts and disconnected structures, as well as stereochemistry annotations and charges, and (3) retaining molecules whose canonicalized SMILES strings contained 100 tokens or fewer. After sanitization, canonicalization, label encoding, and padding (to 100 tokens), molecules were randomly split into training, validation, and test sets. For ChEMBL, this led to a training set of 1,900,000 molecules, and validation and test sets of 100,000 and 23,680 molecules, respectively. The number of compounds for each fine-tuning campaign is reported in Supplementary Table 3.
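A minimal sketch of these curation steps with RDKit is given below. The tokenization (a simple regular expression) and the standardization helpers are assumptions and may differ from the exact procedure used in this work.

```python
import re
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

ALLOWED_ATOMS = {"C", "H", "O", "N", "S", "P", "F", "Cl", "Br", "I"}
# simple SMILES tokenizer (bracket atoms and two-letter halogens as single tokens);
# an assumption, the tokenization used in the paper may differ
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def curate(smiles, max_tokens=100):
    """Atom filter, salt/fragment removal, charge and stereo removal, canonicalization,
    and a token-length filter. Returns the canonical SMILES or None if the molecule is rejected."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)  # drop salts / disconnected parts
    mol = rdMolStandardize.Uncharger().uncharge(mol)             # neutralize charges
    Chem.RemoveStereochemistry(mol)                              # drop stereo annotations
    if any(a.GetSymbol() not in ALLOWED_ATOMS for a in mol.GetAtoms()):
        return None
    canonical = Chem.MolToSmiles(mol)
    return canonical if len(TOKEN_RE.findall(canonical)) <= max_tokens else None
```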

Training

Pretraining

The hyper-parameters of the LSTM and GPT were tuned with random search within a 5-day limit on a single NVIDIA A100 40GB GPU, with the hyper-parameter space defined based on previous work27,57,60,82 (Supplementary Table 4). In total, 40 LSTM and 35 GPT models were optimized within this limit. The hyper-parameter search was conducted to maximize SMILES validity during pre-training.

To account for the lack of previous information on optimal hyper-parameters for molecule generation with S4, we implemented a two-step procedure for hyper-parameter tuning. First, 242 models were trained to prioritize hyper-parameters (see Supplementary Table 4). High-performing hyper-parameter values in terms of validation accuracy were advanced to the second phase, where 108 experiments were conducted. Hyper-parameter search was conducted for 10 days on multiple NVIDIA A100 40GB GPUs to maximize the validity during pre-training.

Fine-tuning

Five fine-tuning campaigns were conducted on five targets: PKM2, MAPK1, GBA, mTORC1, and TP53. For each target, ten runs with different training (80%), validation (10%), and test (10%) splits were performed (except for PKM2, where we used 70%–15%–15% due to limited data). Early stopping on the validation cross-entropy was adopted with a patience of five epochs and a tolerance of \(10^{-5}\).

Temperature sampling

The sampling probability (p) of each i-th element at any step of the sequence was computed as follows:

$$p_{i}=\frac{e^{y_{i}/T}}{\sum_{j}e^{y_{j}/T}}$$
(3)

where yi is the predicted probability of the ith element, T is the sampling temperature, and j runs over all tokens in the vocabulary.
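A minimal sketch of one sampling step with Eq. (3) is shown below; here y stands for the model's per-token outputs at the current position, and the function name is ours.

```python
import numpy as np

def sample_next_token(y, T=1.0, rng=None):
    """Sample the index of the next SMILES token with temperature T (Eq. (3)).

    Small T approaches greedy selection of the most likely token;
    larger T makes the choice increasingly random.
    """
    rng = rng or np.random.default_rng()
    z = np.asarray(y) / T
    z = z - z.max()                      # subtract the max for numerical stability
    p = np.exp(z) / np.exp(z).sum()      # Eq. (3)
    return rng.choice(len(p), p=p)
```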

Molecule ranking with log-likelihoods

The molecules were ranked based on the joint likelihood of the tokens (i.e., SMILES characters) they contain82. For each test molecule, the joint log-likelihood (\(\mathcal{L}\)) under a model M was computed as:

$$\mathcal{L}(\mathbf{M})=\sum\limits_{i}\log p(t_{i})$$
(4)

where ti is the ith token of the SMILES string of a given test molecule and p(ti) is the probability of that token as predicted by the model M; i runs over all the elements in the molecular string.

To only consider the fine-tuning information and remove potential pre-training bias (as previously observed82), the pre-training log-likelihood was subtracted from the fine-tuning log-likelihood, to obtain a final score:

$$\mathcal{L}_{\mathrm{score}}(\mathbf{M})=\mathcal{L}(\mathbf{M}_{\mathrm{ft}})-\mathcal{L}(\mathbf{M}_{\mathrm{pt}})$$
(5)

where Mft is the fine-tuned model and Mpt is the pre-trained model. The obtained \(\mathcal{L}_{\mathrm{score}}\) was used to rank each test molecule; the higher the \(\mathcal{L}_{\mathrm{score}}\), the better the rank.
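Putting Eqs. (4) and (5) together, a minimal sketch of the scoring step is shown below, given per-token probabilities from the fine-tuned and pre-trained models; the function names are ours.

```python
import numpy as np

def log_likelihood(token_probs):
    """Joint log-likelihood of a SMILES string (Eq. (4)), from its per-token probabilities."""
    return float(np.sum(np.log(token_probs)))

def likelihood_score(finetuned_probs, pretrained_probs):
    """Ranking score of Eq. (5): fine-tuning minus pre-training log-likelihood (higher = better rank)."""
    return log_likelihood(finetuned_probs) - log_likelihood(pretrained_probs)
```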

Natural product design

The COlleCtion of Open Natural ProdUcTs (COCONUT)70 database was used for model training. Salts, disconnected structures, stereochemistry, and charge annotations were removed. Molecules with canonical SMILES strings longer than 100 characters were used to train the models. A random search strategy was adopted to tune the hyper-parameters of all models, as previously explained. The models were given a 5-day limit on a cloud NVIDIA A100 GPU and 1024 strings were generated by each model. The models with the highest SMILES validity were selected for further evaluation.

Prospective de novo design

Data curation

Fine-tuning data were collected from ChEMBL v3343. All annotations for MAPK1 were retained (target ID: CHEMBL4040). Available assay descriptions were manually inspected and analyzed. Molecules whose inhibitory constant (Ki) was lower than 1 μM on reliable inhibition assays (CHEMBL3412886, CHEMBL917079) were retained. SMILES canonicalization and removal of stereochemistry and duplicates led to a set of 68 unique SMILES strings for fine-tuning (SMILES strings available in the dedicated GitHub repository).

Model fine-tuning and de novo design

The fine-tuning dataset was split into ten training and validation splits (80–20%) to find the optimal number of fine-tuning epochs. Early stopping on the validation loss was used, with a patience of five epochs and a tolerance of \(10^{-5}\). The experiments suggested 45 epochs to be optimal; the pre-trained model was accordingly fine-tuned on the whole dataset.

The models of the last five fine-tuning epochs were used to design molecules. For each of the five temperature values (ranging from 1.0 to 2.0 with a step size of 0.25) and each of the five models, 10,240 designs were generated, totaling 5 × 5 × 10,240 = 256K designs. The novel and unique molecules among those designs were ranked by their fine-tuning log-likelihood (Eq. (4)) and the top 5000 molecules were selected for further analysis.

The 5000 top-scoring molecules were divided into two groups, based on their similarity to the fine-tuning set. The similarity was measured via Tanimoto similarity on the extended connectivity fingerprints61 of the Bemis-Murcko scaffolds63 (using a radius of 3 bonds and 2048 bits), with a threshold of 60% similarity. The designs in each list were grouped by their most similar training molecule and ranked by the log-likelihood score (Eq. (5)). The highest-scoring molecule in each group was picked. The top five molecules of each design list (i.e., 1–5 for higher similarity, and 6–10 for lower similarity) and their most similar actives (11–16, based on scaffold similarity, Table 3) were selected for molecular dynamics simulations.

Molecular dynamics simulation

The protein structure of MAPK1 was sourced from the Protein Data Bank under the accession code 2Y9Q, with a resolution of 1.55 Å and an R-value (free) of 0.177. Initial complex structures resulted from docking simulations using Vina83, establishing the binding pose for subsequent investigation via Umbrella Sampling. The simulation system was set up with the CHARMM-GUI web-based graphical interface84. The system was solvated in a cubic water box with an edge distance of 13 Å, supplemented with a 0.15 M NaCl solution for neutralization. All simulations were run with Gromacs version 202185 on the Dutch supercomputer Snellius. The solvated systems were energy-minimized through a sequence of steps using the steepest descent method and the conjugate gradient algorithm. Equilibration was then performed with 1 ns in the NVT (constant number of atoms, volume, temperature) ensemble, followed by 5 ns in the NPT (constant number of atoms, pressure, temperature) ensemble.

Binding free energy calculation

The last conformations of the equilibration phase were used as the starting structures for the ligand unbinding simulations. Distance-based Steered MD (center-of-mass pulling) was used to pull the ligand away from the protein by approximately 30 Å over the course of 4 ns, applying a 1000 kJ/(mol nm2) force constant along the reaction coordinate (ξ) with a pulling speed (ν) of 0.001 nm/ps. Snapshots taken at 10 ps intervals yielded 400 configurations from these pulling simulations. Depending on the ligand, between 22 and 28 conformations were extracted along the reaction coordinate (ξ) at approximately 0.1 nm intervals. These configurations were then employed as the starting points for the individual Umbrella Sampling simulations, whose number thus varied with the specific ligand under study. Each conformation underwent independent NPT equilibration for 5 ns, followed by a 20 ns MD run in triplicate for each ligand. The potential of mean force (PMF) was determined via the weighted histogram analysis method (WHAM)86, a component of Gromacs. The resulting PMF profiles report the free energy (in kcal mol−1) required to dissociate the ligand from the binding pocket as a function of the corresponding distance. The binding free energy (ΔG) of each ligand was computed by comparing the plateau region of the PMF curve to the energy minimum obtained from each simulation. In total, the Umbrella Sampling simulations spanned 400 to 560 ns per replicate and comprised 3 replicates, thereby accumulating simulation times ranging from 1.2 μs to 1.6 μs for each ligand.
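As an illustrative sketch of the final arithmetic step only (reading a PMF profile and comparing its plateau to its minimum), the snippet below makes assumptions about the file layout (two columns: reaction coordinate and PMF in kcal mol−1, with "#"/"@" comment lines), the fraction of the profile treated as the plateau, and the sign convention; it is not the exact analysis code used in this work.

```python
import numpy as np

def binding_free_energy(pmf_file, plateau_fraction=0.2):
    """Estimate ΔG as the PMF minimum minus the PMF plateau (negative = favorable binding).

    pmf_file: two-column text file with reaction coordinate (nm) and PMF (kcal/mol);
              columns, units, and comment markers are assumptions about the WHAM output.
    plateau_fraction: fraction of the largest-distance windows averaged as the plateau.
    """
    xi, pmf = np.loadtxt(pmf_file, comments=("#", "@"), unpack=True)
    pmf = pmf[np.argsort(xi)]                       # order the profile along the reaction coordinate
    n_plateau = max(1, int(len(pmf) * plateau_fraction))
    plateau = pmf[-n_plateau:].mean()               # unbound region, far from the binding pocket
    return pmf.min() - plateau                      # ΔG of binding relative to the unbound plateau
```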

Software and code

Data preprocessing, scaffold determination, and molecular fingerprint and descriptor calculation were performed with default settings (unless otherwise noted), using RDKit v2020.09.01 in a Python environment. LSTM and GPT were implemented in Keras v2.7.0 (TensorFlow v2.7.1). The S4 code was extracted from the existing PyTorch Lightning v1.15.0 implementation35 and simplified to rely solely on PyTorch v1.13.1.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.