SELF-EdiT: Structure-constrained molecular optimisation using SELFIES editing transformer

Structure-constrained molecular optimisation aims to improve the target pharmacological properties of input molecules through small perturbations of the molecular structures. Previous studies have exploited various optimisation techniques to satisfy the requirements of structure-constrained molecular optimisation tasks. However, several studies have encountered difficulties in producing property-improved and synthetically feasible molecules. To achieve both property improvement and synthetic feasibility of molecules, we proposed a molecular structure editing model called SELF-EdiT that uses self-referencing embedded strings (SELFIES) and Levenshtein transformer models. SELF-EdiT generates new molecules that resemble the seed molecule by iteratively applying fragment-based deletion-and-insertion operations to SELFIES. SELF-EdiT exploits a grammar-based SELFIES tokenization method and the Levenshtein transformer model to efficiently learn deletion-and-insertion operations for editing SELFIES. Our results demonstrated that SELF-EdiT outperformed existing structure-constrained molecular optimisation models by a considerable margin in success and total scores on two benchmark datasets. Furthermore, through edit-path analysis, we confirmed that the proposed model could improve the pharmacological properties without large perturbations of the molecular structures. Moreover, our fragment-based approach significantly relieved the SELFIES collapse problem compared to the existing SELFIES-based model. SELF-EdiT is the first attempt to apply editing operations to SELFIES to design an effective editing-based optimisation, which can be helpful for fellow researchers planning to utilise SELFIES.


Introduction
Drug discovery is a challenging process to overcome the long struggle between humans and diseases. Discovering drug [...] Several machine-readable representations of molecules have been developed to utilise various deep generative models. Widely used molecular representation methods include the simplified molecular-input line-entry system (SMILES) [11], self-referencing embedded strings (SELFIES) [12], and molecular graph representations. The molecular graph representation is the most intuitive approach because it resembles Kekulé diagrams with atoms and bonds. In a molecular graph representation, each molecule is depicted as an undirected graph in which atoms are mapped to nodes and bonds to edges. The advantages of molecular graph representations include abundant structural information and high interpretability. However, graph-based deep generative models require significant storage space and memory for graph data processing, resulting in low efficiency of molecular generation [13]. In contrast, string-based representations, such as SMILES and SELFIES, enable efficient computations with relatively less storage. SMILES is an ASCII string notation that encodes atoms, bonds, and chemical structures using a strict grammar. SELFIES was designed to guarantee 100% chemically valid molecular generation by enforcing formal grammar rules (Fig. 1).
Using various optimisation techniques, many studies have proposed molecular optimisation models that efficiently generate new molecules with improved properties. These studies succeeded in generating novel molecules with improved properties. However, they did not consider the synthetic feasibility of the generated molecules, resulting in molecules with improved properties that were synthetically infeasible [6]. The synthesis of individual compounds typically involves experienced chemists who conduct several validations to assess their synthetic feasibility. However, with the growing number of compounds requiring estimation, new metrics have been designed to predict synthetic feasibility using trained models [14,15]. To achieve property improvement and synthetic feasibility, recent studies have exploited scaffold-based generation and editing-based optimisation to slightly modify molecular structures while retaining property-related parts [16][17][18][19][20].
Although scaffold-based and editing-based approaches predominantly exploit molecular graph representations because of their high interpretability, a recent comparative study revealed that string-based representations, despite their complex grammar, do not exhibit any evident shortcomings when applied to molecule optimisation tasks [21]. In addition, the string-based representation exhibits a slightly higher generation efficiency.
To design an effective SELFIES-based editing approach for structure-constrained molecule optimisation, the following criteria should be considered:
• C1: A tokenization method should be implemented to address the complex grammar of SELFIES.
• C2: The editing model should process molecular structural information at fine to coarse scales.
• C3: The outputs of the editing process must be chemically valid, property-improved, and structurally similar to the corresponding input molecules.
The existing string-based editing methods primarily tokenize molecules in atomic units [20]. Although this approach reduces the size of the token dictionary, it weakens the preservation of the structural features in the molecules. Furthermore, many SELFIES-based methods that utilise rule-based algorithms frequently suffer from a challenge known as SELFIES collapse during the editing process [21]. SELFIES collapse refers to the phenomenon wherein different SELFIES strings containing grammatically incorrect substrings collapse into a single truncated SELFIES string (Fig. 2). Owing to SELFIES collapse, SELFIES-based models have difficulty in generating diverse molecular structures.
Following the concept of fragment-based drug design [22], which involves utilising molecular fragments for stepwise optimisation, we propose a SELFIES Editing Transformer (SELF-EdiT) as a simple and efficient editing method based on SELFIES for structure-constrained molecule optimisation. To the best of our knowledge, SELF-EdiT is the first attempt to apply fragment-based editing operations to SELFIES. The main idea is to start with a seed molecule and generate candidates by deleting and inserting fragments of the SELFIES string. To ensure that these edits guaran- [...] (C2). To edit a SELFIES string, SELF-EdiT uses a Levenshtein transformer (LevT) [24], which can iteratively perform deletion-and-insertion operations on SELFIES strings (C3).

SELFIES
In SELFIES [12], atoms are represented by symbols enclosed in square brackets, such as "[C]". As shown in Fig. 1a, the branch of the 2nd oxygen bonded to the 1st carbon is written with the symbol "[=Branch1]", indicating a double-bond-type branch. Because the value of N is 1, the first symbol "[C]" on the right side of "[=Branch1]" is interpreted as an indicator rather than a carbon. Using the SELFIES indexing table, we determine that this is a double-bond-type branch of length one.
Although the interpretation of rings is similar to that of branches, the search directions differ. For example, "[Ring1][=Branch1]" is a simple ring structure traversed between the 4th carbon and the 9th carbon (Fig. 1a). Because the value of N is 1, the first symbol "[=Branch1]" on the right side of "[Ring1]" is interpreted as an indicator that returns the length of this ring structure.
Overall, the branches begin the search from the last successor and read sequentially based on their length, whereas the rings begin from the first predecessor and read backward.
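The index reading described above can be illustrated with a small sketch. It assumes that the first entries of the SELFIES index table follow the ordering "[C]" = 0, "[Ring1]" = 1, "[Ring2]" = 2, "[Branch1]" = 3, "[=Branch1]" = 4, and that the encoded length is the index value plus one; the helper names and the example string are hypothetical and only cover the N = 1 case.

```python
import re
from typing import List

# Assumption: partial index table (symbol -> index value); only the entries needed here.
INDEX_TABLE = {"[C]": 0, "[Ring1]": 1, "[Ring2]": 2, "[Branch1]": 3, "[=Branch1]": 4}


def split_symbols(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\[\]]+\]", selfies)


def read_length(symbols: List[str], pos: int) -> int:
    """Length encoded by the single indicator symbol that follows the
    Branch1/Ring1 symbol at position `pos` (N = 1 case only)."""
    indicator = symbols[pos + 1]
    return INDEX_TABLE[indicator] + 1


# Hypothetical example string: a carbon carrying a double-bonded branch.
branch_example = split_symbols("[C][=Branch1][C][=O][O]")
branch_length = read_length(branch_example, 1)  # indicator "[C]"

ring_example = split_symbols("[Ring1][=Branch1]")
ring_length = read_length(ring_example, 0)      # indicator "[=Branch1]"
```

Under these assumptions the branch indicator "[C]" gives a branch of length one, and the ring indicator "[=Branch1]" gives a span of five, which matches the distance between the 4th and 9th carbons mentioned in the text.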

Transformer
The Transformer [25] is an attention-based neural network architecture with an encoder-decoder structure. The encoder converts the input sequence into a sequence of hidden states, and the decoder decodes the output sequence from these hidden states. The decoding process is implemented in an autoregressive manner, where the next word in the sequence is predicted based on the previous words. The attention mechanism plays a crucial role in the transformer, allowing the model to learn complex syntax and capture long-range dependencies in the input sequence. This is essential for a sequence-to-sequence task requiring an understanding of the relationship between distant words.
The LevT [24] is a variant of the transformer that incorporates the ability to insert and delete tokens during the decoding process. A significant difference between LevT and the original transformer lies in the decoder: instead of generating tokens one by one from left to right, the LevT decoder refines a whole sequence through deletion and insertion operations, which are trained using labels derived from the Levenshtein distance [26] between the current and target sequences. This enables LevT to generate and revise output sequences in a non-autoregressive manner and to change the sequence length dynamically during decoding. This capability is vital for tasks such as post-editing, in which an existing sequence must be modified into an output of a different length.

Related works
The existing methodologies for molecular optimisation can be grouped into four categories: 1) reinforcement learning, 2) Bayesian optimisation, 3) evolutionary algorithms, and 4) fragment-based optimisation. Reinforcement learning involves a generative model that randomly generates molecules and an oracle that calculates the reward for the generated molecules based on their estimated molecular property scores. The generative model is then fine-tuned using a policy gradient algorithm to maximize the expected reward and generate desirable molecules [27,28]. However, the use of a reinforcement learning scheme is not straightforward because of the high variance in rewards [29].
Instead of fine-tuning generative models, as in reinforcement learning, molecular optimisation can be achieved by exploring latent chemical spaces. A popular exploration technique is the Bayesian optimisation method, which is normally coupled with a trained variational autoencoder (VAE) that learns the latent space corresponding to specific data in a probabilistic manner. More specifically, the process starts with known latent vectors of existing molecules with desired properties and structural similarities. Then, a surrogate model and acquisition function are updated to determine the direction that is most likely to optimise the seed molecules in the latent space [30,31]. Although many studies have exploited Bayesian optimisation methods to explore the latent chemical space, designing an appropriate acquisition function can be challenging because of the high dimensionality and nonlinear features of the latent space [32].
Evolutionary optimisation approaches such as genetic algorithms and particle swarm optimisation determine the optimised molecular structures by fusing different molecules [20,33]. The genetic algorithm is one of the most widely used evolutionary techniques and consists of two components: a set of mutation operations and a fitness function. This algorithm optimises molecules through mutations and/or crossover to perturb the mating pool containing a set of candidate molecules. At each iteration, new candidates are generated and evaluated based on their properties and structural similarities using a fitness function. As the algorithm progresses, unqualified candidates with low fitness scores are eliminated, allowing the most promising candidate molecules to survive in the mating pool and evolve towards an optimal molecule. However, this approach can sometimes become trapped in regions of local optima [34].
While the previous three categories focus on optimisation algorithms, fragment-based optimisation focuses more on directly modifying the structure of the molecule. This approach can be traced back to fragment-based drug design (FBDD) in traditional drug design [35]. Because fragments can be grown, merged, or linked to other fragments, FBDD optimises fragments by adding functional fragments or linking two independent fragments in an iterative process to improve their pharmacological properties [22,36]. In line with this concept, recent studies have explored two main types of fragment-based optimisation: scaffold-based generation and editing-based optimisation. Scaffold-based generation represents a molecule as a tree of fragments that are then assembled in a fine-grained manner to optimise the molecules [16][17][18][19], whereas editing-based optimisation utilises addition and deletion operations to directly edit the internal fragments of the molecule [20].

SELFragment tokenization
We first converted the original data from the SMILES representation $(x_{smi}, y_{smi})$ to the SELFIES representation $(x_{sel}, y_{sel})$, and then tokenized the SELFIES data into substructure-based fragments based on the SELFIES grammar. Because molecules can be considered combinations of branches and rings, the SELFIES strings can be tokenized into multiple fragments (SELFragments) that provide complete substructure information in accordance with the grammar.
Notably, because branches and rings are searched in different directions, a split operation must be performed twice during the tokenization process. More precisely, given a SELFIES string (Fig. 4a), the tokenizer first sequentially splits the string into branch-based fragments (Fig. 4b). However, relying solely on branch substructures for tokenization results in the ring substructure being either broken up or enclosed within a branch, leading to loss of molecular structural information. Hence, the tokenizer performs a backward search for the ring substructure in the obtained fragment sequence and rearranges the fragments via splitting or merging operations (Fig. 4c).
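The first, branch-based pass can be sketched roughly as follows. This is only an illustration under several assumptions that are not stated in the text: the index alphabet ordering shown in the code, a branch length equal to the decoded index value plus one, unknown index symbols counting as zero, and every non-branch symbol forming its own fragment. The backward ring pass that splits or merges fragments is not covered, and `split_by_branch` is a hypothetical name.

```python
import re
from typing import List

# Assumption: index alphabet ordering; the position in the tuple is the index value.
INDEX_ALPHABET = (
    "[C]", "[Ring1]", "[Ring2]", "[Branch1]", "[=Branch1]", "[#Branch1]",
    "[Branch2]", "[=Branch2]", "[#Branch2]", "[O]", "[N]", "[=N]",
    "[=C]", "[#C]", "[S]", "[P]",
)
INDEX = {symbol: value for value, symbol in enumerate(INDEX_ALPHABET)}
BRANCH = re.compile(r"\[[=#]?Branch([123])\]")


def split_by_branch(selfies: str) -> List[List[str]]:
    """Group each branch symbol with its index symbols and branch body."""
    symbols = re.findall(r"\[[^\[\]]+\]", selfies)
    fragments: List[List[str]] = []
    i = 0
    while i < len(symbols):
        match = BRANCH.fullmatch(symbols[i])
        if match is None:
            # Assumption: a non-branch symbol is kept as a single-symbol fragment.
            fragments.append([symbols[i]])
            i += 1
            continue
        n_index = int(match.group(1))          # number of index symbols to read
        value = 0
        for symbol in symbols[i + 1 : i + 1 + n_index]:
            value = value * 16 + INDEX.get(symbol, 0)
        length = value + 1                      # number of symbols in the branch body
        end = min(len(symbols), i + 1 + n_index + length)
        fragments.append(symbols[i:end])
        i = end
    return fragments


fragments = split_by_branch("[C][=Branch1][C][=O][O]")
```

For this hypothetical input the sketch yields three fragments: the leading carbon, the branch fragment "[=Branch1][C][=O]", and the trailing oxygen.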

SELFragment embeddings
To efficiently deal with numerous SELFragments, we exploited SimCSE, an embedding model trained using contrastive learning. Contrastive learning aims to generate embedded representations that pull similar data closer to each other while pushing dissimilar data far apart [37]. Similarity for contrastive learning should be well-defined depending on the task. SimCSE uses dropout masks as a data augmentation method to construct semantically similar positive pairs. Specifically, embeddings derived from the same input with different dropout masks are regarded as positive instances, whereas embeddings derived from different inputs are treated as negative instances (Fig. 3). For any SELFragment $v$, the objective function of SimCSE, $L(v)$, is computed as

$$L(v) = -\log \frac{\exp\left(\mathrm{sim}(\xi_v, \xi_{v'})/\tau\right)}{\sum_{u \in B} \exp\left(\mathrm{sim}(\xi_v, \xi_{u'})/\tau\right)}$$

where $v'$ is a SELFragment derived from $v$ using a dropout mask, $\xi_v$ is an embedding vector of $v$, $\tau$ is a temperature, the sum runs over the SELFragments $u$ in the same mini-batch $B$, and $\mathrm{sim}(a, b)$ is the cosine similarity $\frac{a^\top b}{\|a\| \|b\|}$.
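As a rough illustration of this kind of contrastive objective, the following sketch computes an in-batch loss from cosine similarities in plain Python. The batch layout (one second dropout view per fragment), the function names, and the default temperature of 0.05 are assumptions for illustration and are not taken from the original implementation.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    index: int,
    tau: float = 0.05,  # assumed default temperature
) -> float:
    """Loss for anchors[index]; positives[j] is the second dropout view of
    fragment j, so positives[index] is the positive and the rest are negatives."""
    logits = [cosine(anchors[index], p) / tau for p in positives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[index]


views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(views_a, views_b, index=0, tau=1.0)
swapped = contrastive_loss(views_a, views_b[::-1], index=0, tau=1.0)
```

The loss is smaller when the two views of the same fragment point in the same direction (`aligned`) than when the positive is orthogonal to the anchor (`swapped`).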

SELF-EdiT
After tokenizing the input data pair and retrieving the corresponding embeddings, the model begins to edit the seed molecules. Toward this goal, we leveraged LevT, an edit-based neural machine translation model. LevT operates by starting with a source string and iteratively performing deletion and insertion operations on the sequence. During the training process, the deletion and insertion labels were obtained by calculating the Levenshtein distance between the source molecule x and the target molecule y. The Levenshtein distance is a string metric that efficiently measures the difference between two sequences using dynamic programming. Figure 3 shows the architecture of SELF-EdiT, which consists of three transformers.
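One way such labels could be derived is sketched below. The sketch aligns the source and target fragment sequences with a longest-common-subsequence dynamic programme, so that every difference is expressed as a deletion or an insertion. It assumes that both sequences are wrapped in unique `<s>` and `</s>` boundary tokens; the function name and the toy fragments are hypothetical, and the original label extraction may differ.

```python
from typing import List, Tuple


def edit_labels(src: List[str], tgt: List[str]) -> Tuple[List[int], List[int], List[str]]:
    """Return (deletion label per source token, number of insertions between
    each adjacent pair of kept tokens, inserted tokens in order).
    Assumes src and tgt both start with "<s>" and end with "</s>"."""
    n, m = len(src), len(tgt)
    # lcs[i][j] = length of the longest common subsequence of src[i:] and tgt[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if src[i] == tgt[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    deletions: List[int] = []
    counts: List[int] = []
    inserted: List[str] = []
    pending, seen_kept = 0, False
    i = j = 0
    while i < n and j < m:
        if src[i] == tgt[j]:                    # keep this source token
            if seen_kept:
                counts.append(pending)
            pending, seen_kept = 0, True
            deletions.append(0)
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:    # delete the source token
            deletions.append(1)
            i += 1
        else:                                   # insert the target token
            inserted.append(tgt[j])
            pending += 1
            j += 1
    return deletions, counts, inserted


source = ["<s>", "A", "B", "C", "</s>"]
target = ["<s>", "A", "X", "C", "</s>"]
deletions, counts, inserted = edit_labels(source, target)
edit_distance = sum(deletions) + len(inserted)
```

In this toy case the fragment "B" is marked for deletion and one fragment, "X", is inserted between "A" and "C", giving an insertion/deletion edit distance of two.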

Word-Deletion Transformer (WDT)
The WDT first scans the input sequence and assigns a binary label $l_i$ to each $v_i$: $l_i = 0$ indicates that the i-th fragment is retained and $l_i = 1$ indicates the deletion of the i-th fragment. The start token <s> and end token </s> are excluded from the deletion process to ensure the integrity of the sequence boundaries. The WDT predictions are computed as follows:

$$\hat{l}_{del} = \mathrm{softmax}(W_{WDT} \cdot h_i)$$

where $h_i$ is the hidden state of the i-th SELFragment in x, $\hat{l}_{del}$ is a predicted deletion label, and $W_{WDT} \in \mathbb{R}^{2 \times d_{model}}$.

Placeholder-Insertion Transformer (PIT)
After WDT deletes fragments from the sequence x based on the deletion labels, PIT predicts the number of tokens to be inserted between each adjacent fragment pair $(v_i, v_{i+1})$, where $v_0 =$ <s> and $\hat{l}_{plh}$ is a predicted integer for the insertion of placeholder tokens [...]

Word-Insertion Transformer (WIT)
Given the masked sequence x generated by PIT, WIT predicts the actual fragment for each inserted placeholder as follows:

$$\hat{l}_{ins} = \mathrm{softmax}(W_{WIT} \cdot h_i)$$

where $\hat{l}_{ins}$ is a predicted fragment that replaces the inserted <PLH> and $W_{WIT} \in \mathbb{R}^{|V| \times d_{model}}$. Following LevT, instead of training modules with different weights, we implemented three modules that share the same transformer backbone to share useful features across the different edit operations. In contrast to the original model, we tweaked the training process, as in Algorithm 1, to better align with our task.
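To make the interplay of the three modules concrete, the following sketch applies one deletion, placeholder-insertion, and word-insertion step to a fragment sequence, given labels that are already predicted. It contains no neural network; `apply_edit`, its argument layout, and the toy fragments are hypothetical and only illustrate the data flow.

```python
from typing import List

PLH = "<PLH>"


def apply_edit(
    tokens: List[str],
    deletion_labels: List[int],
    placeholder_counts: List[int],
    fill_tokens: List[str],
) -> List[str]:
    """Apply one edit iteration to a sequence wrapped in <s> ... </s>."""
    # Deletion step: keep the fragments whose label is 0.
    kept = [tok for tok, label in zip(tokens, deletion_labels) if label == 0]
    if len(placeholder_counts) != len(kept) - 1:
        raise ValueError("one placeholder count is needed per adjacent pair")

    # Placeholder step: insert the predicted number of <PLH> between neighbours.
    masked: List[str] = []
    for k, tok in enumerate(kept):
        masked.append(tok)
        if k < len(kept) - 1:
            masked.extend([PLH] * placeholder_counts[k])
    if masked.count(PLH) != len(fill_tokens):
        raise ValueError("one fill token is needed per placeholder")

    # Word-insertion step: replace every placeholder with a predicted fragment.
    fills = iter(fill_tokens)
    return [next(fills) if tok == PLH else tok for tok in masked]


edited = apply_edit(
    ["<s>", "A", "B", "C", "</s>"],
    deletion_labels=[0, 0, 1, 0, 0],   # boundary tokens are never deleted
    placeholder_counts=[0, 1, 0],
    fill_tokens=["X"],
)
```

Here fragment "B" is removed, a single placeholder is opened between "A" and "C", and the placeholder is filled with "X".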

Inference procedure
Once the training is completed, SELF-EdiT iteratively edits the source molecules in the format of SELFragments by alternating deletion and insertion operations. This procedure terminates when the modification count reaches a user-defined threshold, whose optimal value is selected heuristically. Algorithm 2 outlines the inference process.

Datasets
The proposed method was trained and evaluated on two widely used benchmark datasets, dopamine receptor D2 (DRD2) and quantitative estimate of drug-likeness (QED), as described in [16]. Each dataset consists of training and testing sets, formed by pairwise data with specified property ranges and structural similarities sim(x, y) ≥ δ. Specifically, the DRD2 property score represents the probability that a compound is active against DRD2; these score values were evaluated using the trained model provided in [38]. In the DRD2 dataset, the source molecules had values < 0.05 and the paired target molecules had values > 0.5. The QED scores [39] measure how drug-like a molecule is, and the open-source cheminformatics toolkit RDKit [40] was used to compute the value. In the QED task, the goal was to optimise the source molecules in the range [0.7, 0.8] to a higher range of [0.9, 1.0]. To measure the structural similarity between the paired data, we utilised the Tanimoto similarity [41] over Morgan fingerprints and applied a similarity constraint δ = 0.4 to all datasets.
We employed SMILES randomisation as the data augmentation method to enhance the efficiency of our model training. SMILES randomisation is a straightforward method that returns a diverse set of new SMILES strings by scanning a molecule starting from different atoms while retaining its structural integrity. We augmented as much data as possible to expand the training sets, whereas for the testing sets, each dataset was augmented 20 times to accommodate the quantitative analysis requirements.
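The similarity constraint can be illustrated with a minimal sketch in which a fingerprint is represented as the set of its active bit indices. In practice the bits would come from Morgan fingerprints computed with a cheminformatics toolkit; the toy bit sets and function names below are hypothetical.

```python
from typing import Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto similarity: shared bits divided by the bits set in either fingerprint."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def satisfies_constraint(fp_a: Set[int], fp_b: Set[int], delta: float = 0.4) -> bool:
    """Check the structural similarity constraint sim(x, y) >= delta."""
    return tanimoto(fp_a, fp_b) >= delta


source_bits = {1, 2, 3}
target_bits = {2, 3, 4}
similarity = tanimoto(source_bits, target_bits)  # 2 shared bits out of 4 in total
```

With two of four bits shared, the toy pair has a similarity of 0.5 and therefore passes the δ = 0.4 constraint.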

Baseline methods
We compared our SELF-EdiT model with the following baselines:
• MMPA [42]: A rule-based molecular transformation method that extracts several rules from a dataset. During inference, the seed molecules are translated multiple times using different matching transformation rules.
• Junction Tree VAE (JT-VAE) [17]: A Bayesian optimisation-based model that represents the molecular graph as a junction tree that is cycle-free and easier to generate. The encoder maps both the molecular graph and junction tree into latent variables. The decoder first generates a junction tree as a blueprint, which is then reconstructed into a specific molecular graph.
• GCPN [27]: A reinforcement learning-based model that iteratively modifies a molecule by adding or deleting atoms and bonds. The model also adopts adversarial learning to enhance the naturalness of optimised molecules.
Both the encoder and decoder use the GRU as the neural architecture and have been successfully applied to other molecule generation tasks.
• UGMMT [43]: A method that utilises dual learning to optimise molecules. To implement bidirectional conversion between the embedding spaces, each translation network was trained separately for one-way conversion.
• VJTNN(+GAN) [16]: An improved method based on the JT-VAE that treats molecular optimisation as a graph-to-graph translation task. The method uses adversarial learning instead of Bayesian optimisation, while maintaining the junction tree encoder-decoder for learning.
• HierG2G [17]: A structural motif-based model that utilises a hierarchical graph encoder-decoder model to optimise molecules. The encoder generates a multiresolution representation in a fine-to-coarse manner. Throughout the generation process, the autoregressive decoder progressively adds motifs in a coarse-to-fine manner.
• T-S-Polish [18]: A method that proposes an optimisation paradigm called Graph Polish, which aims to optimise molecules by maximizing the preserved portions of the source molecule through a Teacher and Student framework. The Teacher component identifies the optimisation center and provides information on the preservation, removal, and addition of other parts. The Student component learns this knowledge and applies it to optimise the molecules.
• STONED [20]: A rule-based method that edits SELFIES by replacing the source and target molecules. STONED has demonstrated its superiority in virtual screening for designing photovoltaic-like molecular structures and offers interpretability by drawing chemical paths from the source to the target molecules.

Molecular optimisation performance
To evaluate the overall optimisation performance of the proposed method, we compared SELF-EdiT with the baseline models using the following metrics:
• Success [16]: For each source molecule in a test dataset, the model generates K optimised molecules. We determined whether the optimisation was successful by checking if, among the K candidates, at least one molecule satisfied the similarity constraint and fell within the target range for the corresponding property. Finally, the success score was defined as the ratio of successful optimisation counts to the number of test molecules.
• Property [16]: The average property value of all the generated molecules.
• Similarity [16]: The average Tanimoto similarity between the source molecules in the test dataset and the corresponding generated target molecules.
• Novelty [16]: The proportion of molecules among the generated molecules that did not appear in the training sets. Novelty measures the potential of a model for the design of new molecules.
• SF Score (synthetic feasibility score): The proportion of molecules that simultaneously satisfy the similarity constraint, property improvement, and synthetic feasibility among the overall generated molecules. The synthetic feasibility was measured using GASA [15], a prediction framework that evaluates the synthetic feasibility of small molecules by classifying them as either 0 (easy to synthesize) or 1 (hard to synthesize).
• Total Score: The weighted sum of the SF score, property, similarity, and novelty, which comprehensively reflects the overall performance of the model. To properly consider the property improvement, structural similarity, and synthetic feasibility captured by the SF score, we assigned a weight of 1.5 to the SF score and a weight of 1 to the other metrics.
We first calculated the success scores of SELF-EdiT and the baseline models because the success score evaluates the model performance in generating molecules with both property improvement and high structural similarity. As shown in Fig. 5, SELF-EdiT exhibited the highest success scores of 0.596 and 0.822 for QED and DRD2, respectively, compared with the baseline models. This demonstrates that SELF-EdiT is an effective tool for structure-constrained molecular optimisation.
To evaluate the overall generative performance of SELF-EdiT, we compared the total scores of the SELF-EdiT and baseline models (Fig. 6). SELF-EdiT outperformed the baseline models in terms of the total scores, achieving scores of 2.791 and 2.520 for QED and DRD2, respectively. The total scores of SELF-EdiT exceeded those of the baselines by margins ranging from 0.035 to 0.884, demonstrating the effectiveness of SELF-EdiT for structure-constrained molecular optimisation. To better understand the performance differences between the SELF-EdiT and baseline models, all metric scores are provided in Supplementary Tables S1 and S2. Although SELF-EdiT was not the best for each [...]
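The success and total scores defined above can be written down in a few lines. In this sketch each source molecule comes with a list of `(similarity, property)` pairs for its K candidates; this data layout, the function names, and the toy numbers are assumptions made only for illustration.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[float, float]  # (similarity to the source, property value)


def success_score(
    results: List[List[Candidate]],
    in_target_range: Callable[[float], bool],
    delta: float = 0.4,
) -> float:
    """Fraction of source molecules with at least one candidate that meets
    the similarity constraint and falls within the target property range."""
    hits = sum(
        any(sim >= delta and in_target_range(prop) for sim, prop in candidates)
        for candidates in results
    )
    return hits / len(results)


def total_score(
    sf: float, prop: float, sim: float, novelty: float, sf_weight: float = 1.5
) -> float:
    """Weighted sum: the SF score is weighted by 1.5, the other metrics by 1."""
    return sf_weight * sf + prop + sim + novelty


# Toy QED-style example with two source molecules and target range [0.9, 1.0].
toy_results = [
    [(0.5, 0.95), (0.2, 0.99)],  # first candidate succeeds
    [(0.3, 0.95)],               # fails the similarity constraint
]
toy_success = success_score(toy_results, lambda p: 0.9 <= p <= 1.0)
toy_total = total_score(sf=0.5, prop=0.8, sim=0.4, novelty=1.0)
```

Only the first toy source molecule has a qualifying candidate, so the success score is 0.5, and the toy total score is 1.5 × 0.5 + 0.8 + 0.4 + 1.0.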

Hyperparameter analysis
The quality of the output molecules generated by SELF-EdiT varies depending on the number of edit iterations. To investigate the optimal number of edit iterations, we compared the success rates evaluated for different edit count values. We conducted experiments by generating molecules with edit counts ranging from one to five. As shown in Fig. 7, the best success score for the DRD2 task was obtained when the edit was performed twice (Fig. 7b), whereas in the case of the QED task, there was no significant difference in the success scores across the different edit count values (Fig. 7a). Based on these results, molecular editing with a high number of iterations may lead to low efficiency. For structure-constrained molecular optimisation, we identified an optimal edit count value of two.

SELFIES collapse evaluation
We confirmed that our edit-based SELFIES optimisation approach is more effective in mitigating SELFIES collapse than existing SELFIES-based approaches. Owing to the characteristics of SELFIES, any syntax conflicts are skipped during the conversion to SMILES, leading to SELFIES collapse. Therefore, we measured the SELFIES collapse rate by reconstructing the SELFIES strings using the official SELFIES-SMILES converter [12]. We decoded each given SELFIES string into the corresponding SMILES string and then encoded the SMILES string back into a SELFIES string. If the given SELFIES string is grammatically correct, the original and reconstructed SELFIES strings are equal, and we can conclude that there was no collapse. Using this method, we computed and compared the Levenshtein distance-based collapse rates of SELF-EdiT and STONED, a state-of-the-art SELFIES optimisation method, on the two property test sets (Figs. 8-9). Compared to the collapse rates of STONED on the QED and DRD2 datasets, which were 64.9% and 49.9%, respectively, our model achieved much lower collapse rates of 11.4% and 23.3%. To understand the collapse phenomenon better, we measured the number of collapses occurring at each Levenshtein distance between the generated and reconstructed SELFIES. The collapse frequency of STONED is considerably higher than that of our model for the same edit distance. In summary, SELF-EdiT exhibited a comparatively lower collapse rate than STONED, demonstrating that the proposed method does not ignore the grammar of SELFIES for the editing task.

Fig. 8 The Levenshtein distance between source and target SELFIES. The difference between the two SELFIES is highlighted in red colour
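The round-trip check described above can be sketched as follows. The decode-then-encode step is passed in as a callable so that the example stays self-contained; with the `selfies` package it could plausibly be built from `selfies.decoder` and `selfies.encoder`. The function names and the toy round trip, which simply truncates at a made-up invalid symbol, are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, Counter as CounterType, List, Sequence, Tuple


def split_symbols(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\[\]]+\]", selfies)


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Symbol-level Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def collapse_stats(
    selfies_list: List[str], roundtrip: Callable[[str], str]
) -> Tuple[float, CounterType[int]]:
    """Return the collapse rate and a histogram of collapses per distance."""
    histogram: CounterType[int] = Counter()
    for original in selfies_list:
        rebuilt = roundtrip(original)
        if rebuilt != original:  # the string changed, so it counts as a collapse
            histogram[levenshtein(split_symbols(original), split_symbols(rebuilt))] += 1
    return sum(histogram.values()) / len(selfies_list), histogram


def toy_roundtrip(selfies: str) -> str:
    """Stand-in for decode + encode: drop everything from a fake invalid symbol on."""
    return selfies.split("[BAD]")[0]


rate, histogram = collapse_stats(["[C][O]", "[C][BAD][O][N]"], toy_roundtrip)
```

In the toy run one of the two strings is truncated by the stand-in round trip, giving a collapse rate of 0.5 with a single collapse at Levenshtein distance three.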

Edit path visualisation
SELF-EdiT exhibits explainability, as we can analyse the edit paths generated during the optimisation process to understand the specific structural modifications preferred by the model. Figure 10a shows an edit path with two iterations to improve DRD2. In the first iteration, SELF-EdiT identifies the base structure by removing unnecessary parts and adding a chain-like substructure. In the second iteration, no substructure is deleted, and the added chain part is refined. These edit path analyses may provide researchers with an opportunity to discover novel and vital substructures related to specific property optimisation. Figure 10b shows the simplified edit paths drawn using SELF-EdiT for the QED and DRD2 tasks. These results demonstrate that SELF-EdiT optimises molecular properties while retaining important substructures (e.g., scaffolds) of the source molecules.

Conclusion
In this study, we proposed SELF-EdiT, a SELFIES-based editing model, to efficiently optimise molecules under structural constraints by alternating between deletion and insertion operations. Our proposed model outperformed the baseline models on two widely used benchmark tasks. Furthermore, we confirmed that SELF-EdiT relieves the SELFIES collapse problem more effectively than the existing SELFIES-based models. We believe that our approach based on SELFIES editing offers a novel perspective on structure-constrained molecule optimisation, with potential applications in drug design and other related tasks. Although our proposed model showed promising results, there are limitations that need to be addressed in future research. One such limitation is that although we have demonstrated how SELF-EdiT edits molecules step-by-step through the editing path, the black-box characteristic of the neural network makes it unclear how the model selects fragments at each editing operation. Therefore, future research should focus on developing a quantitative evaluation of the fragment decision to improve the interpretability of the model, which is expected to make it a more reliable tool for chemical researchers to improve the efficiency of molecular optimisation.

Fig. 3 Overall process of SELF-EdiT. The proposed method consists of four steps: rule-based SELFIES tokenization, contrastive learning for SELFIES fragment embedding, Levenshtein transformer for SELFIES editing operations, and molecular optimisation by iteratively editing SELFIES

Fig. 4 Example of the rule-based SELFIES tokenization: (a) an initial SELFIES string; (b) tokenized SELFIES based on its branch symbols and grammar rules for branch; (c) rearranged SELFragments by considering ring symbols and grammar rules for ring; SELFragments and the corresponding structures are highlighted per colours

Fig. 5 The success scores of the molecular optimisation performance on (a) QED and (b) DRD2 datasets. The x-axis and y-axis indicate the success scores and the baseline models, respectively

Fig. 7 Success scores for each number of edit iterations. (a) QED and (b) DRD2. The x-axis indicates the number of edit iterations and the y-axis indicates the success scores

Fig. 9 Distributions of SELFIES collapse over Levenshtein distances. (a) QED and (b) DRD2. The x-axis and y-axis indicate the Levenshtein distance and the number of SELFIES collapses at each distance value, respectively