Introduction

Polymers have versatile properties and a wide range of applications1,2,3. The optimization of polymeric materials and the development of new polymers are, however, time-consuming processes. Machine learning (ML) techniques have been demonstrated to significantly accelerate the discovery process by predicting polymer properties4,5 or, more recently, by enabling the automated design and generation of new polymers with predefined target properties6,7,8,9. Despite these advances, computational polymer discovery still faces major obstacles. Polymers are macromolecules that are formed by linking up smaller molecular units. Their synthesis typically involves various polymerization steps, with a multitude of possible links between monomer units. How the monomeric units link up with each other, forming the repeat units, largely determines a polymer’s properties10. For example, the thermal degradation of poly(methyl methacrylate) (PMMA) is enhanced by head-to-head linkage11. Poly(3-alkylthiophene)s (P3ATs) exhibit superior electrical conductivity, light-emitting capability, and field-effect mobility in head-to-tail linkage as compared with head-to-head linkage12. Isomerism can also affect gas permeabilities in carbazole-containing diamides (2,7-CPPI)13. Meta-connected M-2,7-CPPI has a less-ordered chain structure and weaker hydrogen bonding than para-connected 2,7-CPPI, which results in loose chain stacking and increased free volumes of M-2,7-CPPI. Higher free volumes promote the solubility and diffusivity of gas in M-2,7-CPPI. As a result, the meta-linked M-2,7-CPPI shows a lower gas barrier than its para-linked analogue. Therefore, the prediction of a polymerization reaction is only complete with the assignment of atoms that will form bonds between repeat units throughout the polymerization process as well as the linkage arrangement between them.
Other obstacles in computational polymer discovery, such as the prediction of thermodynamically stable polymer candidates and the determination of a polymer’s synthesizability14, are still affected by critical methodological limitations.

Recently, Caddeo et al.15 reported ML and atomistic approaches for modeling the thermodynamic stability of polymer blends while Chen et al.16 demonstrated a data-driven approach to automated retro-synthesis of target polymers. Kim et al.17 demonstrated the combination of ML-model-based generation of new polymer candidates with a synthesizability analysis based on known polymerization reactions and commercially available reactants.

Despite the encouraging progress, significant gaps still exist in both methods and data domains. Currently, ML models do not exist for conducting retro-synthesis analysis on a range of co-polymers, polymer blends, ladder, cross-linked, and metal-containing polymers. Previous research has predominantly focused on homo-polymers, which can be easily represented as strings using the simplified molecular-input line-entry system (SMILES)18,19,20. The recent development of advanced string representations for polymers21,22 opens up new opportunities for modeling co-polymers21 as well as comb, branched, brushed, and star polymers9,22,23,24.

Another critical issue is that the available polymer reaction datasets do not consider the influence of solvents, catalysts, and experimental conditions. In addition, the data used to train ML models are not always made available publicly, compromising the reproducibility of model predictions. Overall, the lack of open data and open models severely hinders the advancement of computational polymer discovery.

In this work, we report an extension of a transformer-based language model25,26 to polymerization reactions by leveraging transfer learning on a curated reaction dataset for vinyl polymers. We fine-tune the polymerization models for both forward and backward prediction tasks, addressing both homo-polymers and co-polymers consisting of up to two monomers. Our model predicts reactants, as well as reagents, solvents, and catalysts for each step of the retro-synthesis. Finally, we show that our models are able to perform two essential tasks, as visualized in Fig. 1: (i) given a set of precursors, to predict a polymer product and (ii) given a polymer, to suggest potential disconnections for synthetic strategies.

Fig. 1: Problem representation.

A Molecular Transformer model is created to answer the following questions: “Given a set of reactants, which polymer could be obtained as product?” and “Given a certain polymer, how could it be synthesized?” Blue arrows represent forward synthesis, and green arrows represent backward retro-synthesis. The labels “*:1” and “*:2” represent the connection points of the polymer repeat units (head and tail).

Results and Discussion

Dataset preparation

In Fig. 2, we visualize the end-to-end workflow for predicting polymerization reactions. The workflow includes dataset preparation and the training of reaction and retrosynthesis prediction models. The training dataset was generated from the publicly available USPTO reaction dataset27,28, which contains chemical reactions of organic compounds extracted from US patents issued between 1976 and 2016. For extracting polymerization reactions from the dataset, we have designed a Python tool (see code availability section) that operates based on specific keywords. To ensure that only polymerization reactions are selected, we have employed a manual curation process in which each reaction chosen by the automated procedure is reviewed individually. Overall, we have analyzed 795 data entries for vinyl homo-polymers and co-polymers, resulting in two distinct datasets containing 3932 and 2965 reactions, respectively. These datasets cover all the possible combinations of the 795 reaction examples (details can be found in the Methods section).

Fig. 2: Methodology flowchart.

The polymerization prediction workflow for reactions (forward) and retro-synthesis (backward) comprises data preparation and curation, head and tail assignment with two different methodologies (HTA in blue and M2P in orange), as well as model training and prediction in the forward and backward directions. For forward reaction prediction, the user provides the reaction in SMILES string format as input, and the output is the repeat unit (in the case of the HTA algorithm) or an oligomer with head and tail assigned (in the case of the M2P algorithm). For the backward model, or retro-synthesis analysis, the user provides the repeat unit as input, and the output is the set of reactants.

In general, polymer properties are determined to a large extent by how the monomer units are interconnected. For the purpose of our study, we have chosen linear chains as topological representations. For accurately predicting polymerization reactions, it is essential to correctly identify and label head and tail positions of the repeat units. To that end, we have adopted two distinct strategies. In the first approach, we have adapted an existing tool for assigning head and tail atoms, referred to as Monomers-to-Polymer (M2P)29. In the second approach, we have developed a Python tool for Head-and-Tail assignment (HTA). We have provided extensive descriptions of both HTA and M2P workflows in the Methods section. By using the two techniques, we have assigned head and tail atoms to constituent units within our polymer reaction dataset. We find that the HTA algorithm identifies members of the polyvinyl class with an accuracy of 100% and predicts head and tail atom positions with an accuracy of 100%. The M2P algorithm likewise reaches an accuracy of 100% for both class prediction and head and tail atom positions. We have then trained models on the two distinct datasets, labeled HTA and M2P, for comparative analysis of their predictive performance.

The modified M2P method can be applied to oligomers and assigns the positions of head and tail atoms in linkage bonds. The HTA method assigns head and tail atoms within monomers, thus defining the polymeric repeat unit. For facilitating the comparison of the ML models trained with the HTA and M2P datasets, respectively, we have also performed head and tail assignments in oligomers using the HTA routine. Throughout the training phase, the HTA dataset contained both monomers and oligomers, while the M2P dataset contained only oligomers. The inclusion of monomers within the HTA dataset enables the ML model to predict monomeric units of both homopolymers and copolymers. As the M2P dataset contains only oligomers, the respective model is not expected to predict repeat units in forward mode correctly.

Fine tuning machine learning model

For reaction and retrosynthesis prediction modeling, we have fine-tuned the Molecular Transformer architecture introduced by Schwaller et al.25,26. In brief, the model is based on a vanilla transformer architecture30 trained on textual representations of molecules. A Molecular Transformer casts chemical reaction prediction as a language modeling task31. We have encoded chemical reactions as sentences using the reaction SMILES representation18 of reactants, reagents, solvents, and catalysts, along with the products. We have modeled forward- or retro-reaction predictions as a translation task from one language, i.e., reactants-reagents, to another language, i.e., products. For training purposes, we have formally divided the reaction SMILES into source (reactants and reagents) and target (products) instances. Since HTA and M2P datasets include different target outcomes for the same source instance, we have performed splitting solely based on the target products. For model training, we have split the datasets by product into 90% for training, 5% for validation, and 5% for testing, ensuring that no polymer (product) appears in more than one split.
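The product-based split described above can be illustrated with a minimal sketch. It assumes the reactions are available as (source, target) string pairs; the function name `split_by_product`, the seed, and the rounding of subset sizes are choices made for this illustration and are not taken from the released code.

```python
import random
from collections import defaultdict


def split_by_product(reactions, train=0.90, valid=0.05, seed=0):
    """Split (source, target) pairs so that each product lands in exactly one subset."""
    by_product = defaultdict(list)
    for source, target in reactions:
        by_product[target].append((source, target))

    # Shuffle the unique products, not the individual reactions.
    products = sorted(by_product)
    random.Random(seed).shuffle(products)

    n_train = int(round(train * len(products)))
    n_valid = int(round(valid * len(products)))
    buckets = {
        "train": products[:n_train],
        "valid": products[n_train:n_train + n_valid],
        "test": products[n_train + n_valid:],
    }
    # Expand each product back into all reactions that lead to it.
    return {
        name: [pair for prod in prods for pair in by_product[prod]]
        for name, prods in buckets.items()
    }
```

Because whole products are assigned to a subset, the reaction-level proportions only approximate 90/5/5 when products have different numbers of associated reactions.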

Model performance

To assess the performance of the Molecular Transformer trained on the two training datasets, we have used the Top-k accuracy metric for both forward and backward prediction models, following the method reported in ref. 26. We have calculated the model accuracy by counting the number of exact matches between the predicted canonical SMILES and the ground truth in the datasets. The Top-k accuracy is the fraction of examples for which the ground truth canonical SMILES is found within the first k suggestions of the model. For example, if the ground truth target is found as the first suggestion in 70 out of 100 examples, the Top-1 accuracy is 70%. While round-trip accuracy is the generally preferred measure for verifying the performance of single-step retro-synthetic models26, the datasets analyzed in our work link precursors to multiple products. In this case, the round-trip accuracy could be misleading, as multiple forward predictions are still valid for a precursor set and multiple products map to the same precursors. To avoid this, we have used Top-k accuracy for evaluating the performance of both forward and backward models.
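As an illustration of the metric, the following sketch computes Top-k accuracy from ranked predictions. It assumes that predictions and ground truth are already canonical SMILES strings, so that exact string comparison is meaningful; the function name is chosen for this example only.

```python
def top_k_accuracy(predictions, ground_truth, k):
    """Fraction of examples whose ground-truth string is among the first k predictions.

    predictions: list of ranked lists of strings, one ranked list per example.
    ground_truth: list of strings, one per example.
    """
    hits = sum(truth in ranked[:k] for ranked, truth in zip(predictions, ground_truth))
    return hits / len(ground_truth)
```

With this definition, Top-k accuracy can never decrease as k grows, since every hit within the first k suggestions is also a hit within the first k + 1.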

In Fig. 3, we show the prediction model performance obtained for the two datasets. The M2P dataset shows better performance overall in both forward and backward models, see Fig. 3a, b. In backward predictions, we observe the general trend that the higher the number of training steps, the higher the model accuracy. For forward predictions, this trend only manifests in certain intervals of the Top-k range. The accuracy increases monotonically with k in both forward and backward models, albeit with different slopes. We observe a sharp accuracy increase in the forward model for M2P around Top-3 and for HTA around Top-4. This could be explained by the number of possible reaction outcomes. While M2P provides n reaction outcomes as oligomers built from combinations of reagent monomers, HTA also provides the repeat units as products of the polymerization. This means that HTA provides n + 1 or n + 2 results, depending on the number of reagent monomers involved in the reaction. On average, M2P returns 4 possible reaction outcomes while HTA returns 5 or 6.

Fig. 3: Prediction model performance in terms of accuracy of top-k predictions.

a Polymerization reaction prediction (forward model) accuracy. b Retro-synthesis prediction (backward model) accuracy. The results for the HTA and the M2P datasets are plotted in blue and orange colors, respectively. The training steps are represented as solid lines with symbols in blue or orange colors for the HTA and M2P datasets, respectively.

The observation that the M2P dataset yields superior model performance could be due to the simpler learning process of polymerization rules within this dataset. The M2P algorithm polymerizes monomers at all possible functional groups and chooses a representative structure randomly. Due to the random character of the M2P algorithm, different realizations result in different choices of representative structures, affecting the ML training performance. In comparison, the HTA algorithm identifies reactive sites through the analysis of nucleophile and electrophile atoms, applying Mulliken’s scheme32,33,34 for identifying the most probable structure according to chemical rules. Due to the deterministic character of the HTA algorithm in assigning head and tail linkage bonds, the repeat units are kept the same within different ML model realizations. In other words, M2P structures are a combination of all possible bond connections between monomers, while HTA structures are combinations of all possible connections between reacting sites.

To clarify this point, let us consider how the repeat units in the HTA dataset are linked up to form oligomers. A bond between two vinyl monomers with only secondary carbon atoms may be formed as visualized in the example shown in Fig. 4a. We note that the polymeric repeat unit generated by HTA was considered for inclusion in the dataset; however, it was disregarded in the distribution analysis. This is also the case for oligomers with tertiary carbons.

Fig. 4: Data representation.

a 2D representation of the vinyl bond formation. b Comparative distribution of HTA data. c Comparative distribution of M2P data. d Examples of butadiene isoprene and allyl methacrylate. The “2X” label means that the same structure appears twice.

In case 1, the bond is formed between the carbon atoms at the end of the monomers in the chain. As a result, both head and tail are localized at external atoms of the reaction site. We refer to this connection type as tail-tail. In case 2, the head and tail are localized at internal and external carbon positions, respectively. We refer to this connection type as head-tail. Finally, in case 3, the bond occurs between secondary carbon atoms of the double bond. Once polymerized, both head and tail atoms are located at internal carbon atom sites. We refer to this connection type as head-head. By analyzing the case distribution in the dataset for model training, see Fig. 4b, we find that each case accounts for 1/3 of the HTA dataset for oligomers with 3 different combinations, while the ratio is 1/2/1 for oligomers with 4 different combinations. The latter can be explained by the twofold possibility of bond formation in case 2 due to the presence of two monomers. Note that the M2P dataset does not have a fixed case ratio. This is because M2P performs the polymerization for all possible functional groups of the molecular structure, see Fig. 4c.
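The 1/1/1 and 1/2/1 ratios reported for HTA can be reproduced by a simple counting argument, sketched below. The sketch treats each monomer as an abstract unit with one head and one tail end and counts the distinct unordered links between two units; it is a combinatorial illustration only and does not operate on molecular structures. The function name and the labels "A" and "B" are placeholders.

```python
from collections import Counter
from itertools import product


def linkage_case_ratio(monomer_a, monomer_b):
    """Return (tail-tail, head-tail, head-head) counts of distinct two-unit links."""
    links = set()
    for end_a, end_b in product(("head", "tail"), repeat=2):
        # A bond is unordered, so sort the two (monomer, end) labels before storing.
        links.add(tuple(sorted([(monomer_a, end_a), (monomer_b, end_b)])))
    # Classify each distinct link by the pair of ends it joins.
    kinds = Counter("-".join(sorted(end for _, end in link)) for link in links)
    return kinds["tail-tail"], kinds["head-tail"], kinds["head-head"]


print(linkage_case_ratio("A", "A"))  # (1, 1, 1): one monomer type, three distinct links
print(linkage_case_ratio("A", "B"))  # (1, 2, 1): two monomer types, four distinct links
```

With a single monomer type the two head-tail links coincide, which gives three combinations; with two different monomers they are distinct, which doubles the head-tail count.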

These differences in the distribution can be seen in the examples in Fig. 4d. For the butadiene isoprene polymer with its four potential polymerizations, the ratio of vinyl bond cases 1, 2, and 3 (see Fig. 4a) is 1/2/1 for HTA and 0/2/2 for M2P. Similarly, for allyl methacrylate, we obtain the case ratio 1/2/1 for HTA and 0/2/2 for M2P. In the case of M2P, the polymerization is performed by considering all the functional groups of the monomer. The results observed in Fig. 3a, b could indicate that the model has learned this pattern efficiently. The larger spread of accuracy values observed in the retro-synthesis model could be due to the specifics of the oligomers.

While we obtain overall better modeling results with M2P, both datasets reveal interesting insights. Despite showing a Top-1 accuracy below 10%, the forward model exhibits Top-4 and Top-6 accuracy around 80%, which suggests a direct relation with the way the two datasets have been compiled. Indeed, by construction, the same set of reactants is associated with multiple polymers. The backward model has a Top-1 accuracy of about 60% for M2P and 40% for HTA. The lower accuracy observed for HTA could be explained by the ease with which the model may have learned the polymerization pattern represented in the M2P data, as explained previously. We will expand this analysis in the following paragraphs by investigating the usefulness of the model outputs from a materials science perspective.

Representative polymerization reaction predictions

For our domain applicability analysis, see Methods section for details, we have selected representative polymers from the literature35,36,37,38,39,40,41,42. A comparison of these reactions reveals product similarities ranging from 0 to 0.3 for HTA and M2P datasets, while reactant similarities range from 0 to 0.12, see Supplementary Table 1. Copolymers show increased similarity values in M2P, about 0.03-0.06 higher, attesting to their representation in the training data. Homo-polymers exhibit increased similarity of about 0.04 in HTA as the dataset includes monomer representations.

Overall, the two models together correctly predicted 6 out of 8 reactions in Top-4 and could suggest at least one correct monomer in all the examples studied. The HTA-based model correctly predicted 3 out of 8 reactions in Top-1 and 4 out of 8 reactions in Top-4, while the M2P-based model correctly predicted 1 out of 8 reactions in Top-1 and 2 out of 8 reactions in Top-4. Note that the HTA-based model predominantly matches homo-polymers while the M2P-based model matches mainly co-polymers. The pattern is plausible as the HTA dataset contains the monomers of all polymers while the M2P dataset contains only oligomers.

For the polymerization example of styrene, a homopolymer (see Fig. 5a), the HTA-based model achieves a full SMILES match at Top-1, as well as the representation of a possible oligomer structure, with two connected repeat units, at Top-3. In the case of the M2P-based model, we do not obtain an exact match for the actual product (repeat unit). For the oligomer representation, we obtain an M2P-based match at Top-3 and Top-4. For the polymerization of the co-polymer p(SBMA-nBA), see Fig. 5b, the M2P-based model predicts an exact product match at Top-1, along with all other bond formation possibilities at Top-2 to Top-4. This means that the model is able to correctly predict the connections in the polymerization reactions. While the HTA-based model failed to predict the actual result, it was able to identify the correct head and tail positions of one of the repeat units (Top-1). In addition, the model suggested fragments of the monomer at Top-2 and Top-4. More examples of forward polymerization reactions are provided in the supplementary material (see Supplementary Figures 2-4).

Fig. 5: Representative examples.

Model predictions using the Molecular Transformer trained on HTA and M2P datasets, respectively. a Polystyrene. b p(SBMA-nBA) copolymer. Catalysts, solvents, and stoichiometry are not shown. In the 2D molecular representations, carbon atoms are displayed in black, oxygen atoms and hydroxyl groups in red, nitrogen atoms in dark blue, and sulfur atoms in yellow. The connection points of polymer repeat units are represented by Rn atoms.

For the curated examples, the HTA-based model predicts a higher number of exact matches for the polymer structures in Top-1 (3 out of 8) and Top-4 (4 out of 8). In cases of incorrect predictions, the model delivered at least one of the monomers correctly. The model trained with M2P data had limitations regarding homopolymers, as expected. Nevertheless, the M2P model correctly predicts complex copolymers and a very close match for the p(tC-tBuM) copolymer, a pattern not represented in the training dataset. The two models appear to have complementary performance, together predicting exact matches for 6 out of 8 reactions and suggesting at least one correct monomer for all the examples studied. To increase the likelihood of a suitable prediction outcome, we therefore recommend the joint utilization of both HTA and M2P-based models for domain-specific applications.

In summary, we have reported the curation of a vinyl polymerization reaction dataset and the fine-tuning of a Molecular Transformer algorithm for predicting polymerization (forward) and retro-synthesis (backward) reactions. For dataset curation, we have introduced two algorithms for assigning head and tail positions, named HTA and M2P. We have applied both algorithms to process 795 data entries for vinyl homopolymers and copolymers and produced two separate datasets with 3932 and 2965 reactions, respectively, representing all possible combinations of the 795 reaction examples. Upon performing transfer learning on polymerization reactions, the Molecular Transformer exhibits a forward-model (Top-4 and Top-6) accuracy around 80% for both datasets. The retro-model exhibits a Top-1 accuracy of about 60% for the M2P dataset and 40% for the HTA dataset.

We have showcased the capabilities of the models through a case study involving eight reactions. These reactions were selected based on examples provided in the literature. Together, the two models have predicted 6 out of 8 reactions as exact matches at Top-4 and suggested at least one correct monomer for all the examples studied. The models work in a complementary manner, as the model trained with the HTA dataset produces better results for homo-polymers while the model trained with the M2P dataset predicts better matches for co-polymers.

The Molecular Transformer approach presented in this work is an extension of transformer-based language models to polymerization reactions for both forward and retrosynthesis tasks. We consider this study a promising step towards the development of new computational tools for automated analysis of reaction pathways. The polymerization reaction dataset created in this work can help to overcome the lack of publicly available data. Also, the tools that assign heads and tails to monomers facilitate the generation of polymerization reaction datasets. Current limitations include the choice of polymerization classes as well as the size of the training datasets used for building the models. Based on our analysis of the strengths and limitations of the Molecular Transformer approach, we expect that extending the model to include other polymer classes in the transfer learning phase will broaden model applicability and further increase the robustness of prediction outcomes. The lack of available data on polymerization reactions and of tools for head and tail assignment were major challenges we encountered in this work. Therefore, we have made our curated datasets and tools publicly available for reuse and validation.

Methods

Polymerization dataset

The polymerization reactions and polymer names were extracted from a publicly available dataset27 derived from the patent mining work of Lowe28. This dataset contains approximately 1.8M chemical reactions, extracted from USPTO patents granted between 1976 and 2016. A Python script was developed to automate the data extraction. Only chemical reactions and molecule names whose experimental descriptions contained the keyword “polymerization” were selected. After the automated step was completed, a manual validation was performed to remove data entries in which the “polymerization” keyword was related to any information not compatible with the reaction type. In this step, the number of possible polymerization reactions was reduced from 8668 to 3286. In the Lowe28 dataset, the head and tail atom positions for defining the polymer repeat units are missing. Since there was no established methodology for performing the head and tail assignment in polymer structures represented by SMILES notation, we developed Python tools with two different approaches to perform this task. In the first approach, we developed a tool for assigning head and tail atoms referred to as HTA. Details are provided in the HTA algorithm section below. In the second approach, we developed a modified version of the Monomers-to-Polymer (M2P)29 tool for assigning the head and tail atoms. For details, see the M2P algorithm section below. The two approaches resulted in two datasets, containing 795 data entries related to vinyl homo-polymers and co-polymers with 2 monomers, which were cleaned by removing duplicates and erroneous reactions. Besides the head and tail assignment, another two datasets were generated by describing all the possible product outcomes, which are represented by one or two products and the different types of bond formation between monomers. Bond formations were performed by combining monomers using the rdkit.Chem.rdChemReactions module.
To that end, all monomer combinations of the M2P and HTA algorithms were considered. In the case of the HTA algorithm, monomers were also considered as possible outcomes of the reaction. As a result, the number of outcomes is n for M2P and n + 1 or n + 2 for HTA, depending on the number of monomers involved in the reaction. This increased the number of reactions from 795 to 3932 in the case of HTA and to 2965 in the case of M2P. Overall, four datasets were generated and two datasets were used to train our model: the datasets for HTA and M2P, respectively, that combine all monomers.
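The keyword-based pre-selection step can be sketched as follows. The record layout (dictionaries with `reaction_smiles` and `description` keys), the function name, and the toy records in the usage note are assumptions made for this illustration; the released extraction tool may be organized differently.

```python
def select_polymerization_candidates(records, keyword="polymerization"):
    """Keep records whose experimental description mentions the keyword.

    records: iterable of dicts with a 'description' text field (assumed layout).
    The match is case-insensitive; the result still requires manual curation.
    """
    needle = keyword.lower()
    return [rec for rec in records if needle in rec.get("description", "").lower()]
```

For example, given toy records with the descriptions "Polymerization was initiated with AIBN." and "The mixture was filtered and dried.", only the first record is kept. A keyword hit does not guarantee that the entry is a polymerization reaction, which is why the manual validation described above follows the automated step.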

Data distribution

Both M2P and HTA datasets were sorted by polymer name and repeating unit, the latter alphabetically and by length. All results for the same polymer were grouped in lists during pre-processing. The modified M2P tool assigned head and tail atom positions (linkage bonds) in oligomers and the HTA tool in monomers, respectively, defining the polymeric repeat unit. To avoid any bias between the two datasets during the ML model training, we also performed head and tail assignments with the HTA tool in oligomers. This added another level of complexity with regard to how the repeat units were linked. There are three possible cases: (i) tail-tail; (ii) head-tail; and (iii) head-head. To extract the distribution of cases, we defined a SMARTS43 pattern for each polymerization case and, following a dearomatization process, compared all SMILES18 to the SMARTS set using the RDKit44 library. SMARTS43 is a chemical structure query language for describing molecular patterns. RDKit can import SMARTS queries for use in searching SMILES patterns. Cases that deviated from the standard SMARTS query pattern, i.e., tertiary carbons that could cause uncertainties in the algorithm, were not considered. After post-processing, both datasets were merged, as only polymers present in both were considered for the comparison, and a distribution chart was built with the results.

Applicability domain analysis

The representative polymers used in this case study were manually extracted from the literature35,36,37,38,39,40,41,42 (see Supplementary Table 1). The SMILES representations of polymers were canonicalized using the RDKit44 package. We calculated fingerprints for both input datasets using RDKFingerprint44 and then compared the two resulting datasets. The fingerprint of each representative polymer input was compared with the fingerprints of the whole training data. The similarity was calculated using the RDKit FingerprintSimilarity45 function, which employs the Tanimoto similarity metric46. Here, the similarity between two fingerprint vectors is expressed by a number ranging from 0 (no similarity) to 1 (identity)46. For each comparison, we recorded the mean and the maximum of the resulting list of similarity values. This process was performed separately for reactants/reagents and products.
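To make the similarity summary concrete, the sketch below computes the Tanimoto similarity between fingerprints represented as sets of on-bit indices, then reports the mean and maximum against a training set. It is a plain-Python stand-in for the RDKit functions named above, not a call into RDKit; the function names and the convention of returning 0 for two empty fingerprints are assumptions of this illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Convention assumed here: two empty fingerprints have similarity 0.
    return len(a & b) / union if union else 0.0


def similarity_summary(query_bits, training_bits):
    """Return (mean, maximum) Tanimoto similarity of a query against a training set."""
    scores = [tanimoto(query_bits, train) for train in training_bits]
    return sum(scores) / len(scores), max(scores)
```

A maximum close to 1 indicates that a near-identical structure is present in the training data, while a low mean indicates that the query is dissimilar to most training entries.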

HTA algorithm

We have developed the HeadTailAssigner (HTA) algorithm for identifying the positions of a polymer’s head and tail atoms by analyzing the reactivity of the functional groups in the monomer. The algorithm checks for the presence of functional groups that occur in specific polymer classes and, using quantum chemical calculations, ranks their reactivities to identify which functional groups are responsible for the polymerization. In the next step, the algorithm identifies the most likely polymerization class and mechanism and, consequently, tags head and tail atoms of the monomer structure. As input, the algorithm accepts both reaction SMILES and monomer SMILES. Following the pre-processing analysis, the most probable monomer in the reaction string is defined by comparing the products with the reactants. This step is performed by a fingerprint similarity analysis, using the RDKFingerprint44 with maxPath = 7, and a comparison using Tanimoto similarity44,47. The vinyl class is the focus of this work, but the algorithm may also identify and assign the head and tail of polyamides, polyesters, polyurethanes, and polyethers. To define the polymer class, the algorithm searches for all the possible functional groups in the molecular structure by substructure match with the SMARTS pattern of each organic function. The most common functional group promoting polymerization in the poly-vinyl class is the alkene group. A monomer is, therefore, classified as poly-vinyl if it contains an alkene group. To broaden the poly-vinyl class definition, the presence of an alkyne group is also considered for inclusion. In the next step, the algorithm compares the atomic index of nucleophilicity48 and the functional groups extracted from the monomer. If the monomer SMILES has only one functional group, a SMARTS pattern is acquired to classify the polymerization mechanism.
If the monomer SMILES has two or more functional groups, the priority of polymerization is decided based on the atomic index of nucleophilicity48. The atomic index of nucleophilicity of an atom X involving only the highest occupied molecular orbital (HOMO) n is defined as48:

$$R_{X}=\frac{\sum_{\alpha}^{X}{\left\vert C_{\alpha ,n}\right\vert }^{2}}{(1-\varepsilon_{n})} \tag{1}$$

where Cα,n are the molecular orbital expansion coefficients of αth atomic orbital on molecular orbital n (HOMO) and εn is the HOMO energy.
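Equation (1) can be evaluated directly once the HOMO coefficients of the atomic orbitals centered on atom X and the HOMO energy are available. The sketch below assumes these values have already been extracted from the quantum chemistry output; the function name is illustrative, and the energy is taken in whatever units that output uses, since the text does not state them.

```python
def nucleophilicity_index(homo_coefficients, homo_energy):
    """Evaluate R_X = sum_alpha |C_{alpha,n}|^2 / (1 - eps_n) for one atom X.

    homo_coefficients: HOMO expansion coefficients of the atomic orbitals on atom X.
    homo_energy: HOMO energy eps_n (units as given by the quantum chemistry output).
    """
    population = sum(abs(c) ** 2 for c in homo_coefficients)
    return population / (1.0 - homo_energy)
```

For a fixed molecule the denominator is the same for every atom, so the ranking of atoms by R_X is determined by the HOMO population on each atom.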

RX was calculated with the STO-3G basis set and Mulliken’s population analysis32,33,34 scheme. All quantum state functions were calculated at the restricted Hartree-Fock (RHF) level of theory, using the standard ab initio quantum chemistry package GAMESS49, version 2021 R2.

In general, the higher the atomic population value in an atom, the higher the atomic index of nucleophilicity RX, which means that the atom has a higher probability of being the polymerization site48. The condition is set depending on the relation between polymerization class and the functional groups present in the structure. If one atom has a higher RX but its functional group is not represented in any polymer class, the algorithm keeps searching until it finds an atom that is represented in an existing polymer class. After obtaining a match, the functional groups are concatenated until a match is obtained with a previously defined class. The mechanism is defined depending on the polymer class described previously. If the class is vinyl and the algorithm detects the presence of a specific catalyst, it may also define whether the mechanism is anionic, cationic, or radical. With all the information obtained previously, the algorithm defines the head and tail by assigning the atom ID of the respective nucleophile and electrophile to the functional group responsible for the polymerization.
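The search described above, in which atoms are visited in order of decreasing RX until one belongs to a functional group with a known polymer class, can be sketched as follows. The tuple layout, the function name, and the group-to-class mapping are hypothetical and serve only to illustrate the control flow; they are not taken from the HTA source code.

```python
def pick_polymerization_site(atoms, group_to_class):
    """Return (atom_index, polymer_class) for the most nucleophilic eligible atom.

    atoms: list of (atom_index, functional_group, r_x) tuples (assumed layout).
    group_to_class: mapping from functional group name to polymer class.
    Returns None if no atom belongs to a functional group with a known class.
    """
    # Visit atoms from highest to lowest nucleophilicity index.
    for atom_index, group, _r_x in sorted(atoms, key=lambda a: a[2], reverse=True):
        if group in group_to_class:
            return atom_index, group_to_class[group]
    return None
```

In this sketch, an atom with the highest RX is skipped when its functional group has no entry in the mapping, which mirrors the behavior described in the text.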

For vinyl polymers, the polymerization should occur at the double bonds and, in some cases, at triple bonds. Using atom mappings, the most nucleophilic atom is selected as the head by convention. If the electrophilic atom is located in the same organic function, which is the case in vinyl polymerization, the tail is selected from that same organic function. If the most electrophilic atom is located in a different organic function, the tail is selected from a complementary organic function. For example, if an amide functional group is ranked as the group with the highest atomic index of nucleophilicity, and a carboxylic acid or acid halide group exists in the molecule, the class will be assigned as polyamide and the tail will be assigned to the carboxylic acid or acid halide group. The atomic index of nucleophilicity derived by HOMO electronic population analysis is sufficient for determining the nucleophilic atom with the highest probability of donating electrons. Once the polymerization reaction mechanism that occurs in the functional group is identified, the head and tail assignments follow straightforwardly. In Supplementary Fig. 1, we provide an example illustrating how the HTA algorithm works.
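
The selection logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the HTA implementation: the rule table, the atom records (`idx`, `group`, `site`, `r_x`), and the use of the lowest RX among candidate atoms as a stand-in for the most electrophilic atom are all hypothetical.

```python
# Hypothetical rule table: nucleophilic group -> (polymer class, complementary groups).
# An empty tuple means the tail lies in the same functional group as the head.
CLASS_RULES = {
    "alkene": ("vinyl", ()),
    "alkyne": ("vinyl", ()),
    "amide": ("polyamide", ("carboxylic_acid", "acid_halide")),
}


def assign_head_tail(atoms):
    """Return (polymer_class, head_idx, tail_idx), or None if no class matches.

    atoms: list of dicts with keys 'idx', 'group' (group type),
    'site' (which group instance the atom belongs to) and 'r_x'.
    """
    for atom in sorted(atoms, key=lambda a: a["r_x"], reverse=True):
        rule = CLASS_RULES.get(atom["group"])
        if rule is None:
            continue  # group not tied to any polymer class: keep searching
        polymer_class, partners = rule
        if partners:
            candidates = [a for a in atoms if a["group"] in partners]
        else:
            candidates = [
                a for a in atoms
                if a["site"] == atom["site"] and a["idx"] != atom["idx"]
            ]
        if candidates:
            # Assumption of this sketch: lowest R_X marks the electrophilic atom.
            tail = min(candidates, key=lambda a: a["r_x"])
            return polymer_class, atom["idx"], tail["idx"]
    return None


vinyl_monomer = [
    {"idx": 0, "group": "alkene", "site": 0, "r_x": 0.30},
    {"idx": 1, "group": "alkene", "site": 0, "r_x": 0.20},
    {"idx": 5, "group": "aryl", "site": 1, "r_x": 0.35},  # highest R_X, no class
]
amide_monomer = [
    {"idx": 0, "group": "amide", "site": 0, "r_x": 0.40},
    {"idx": 7, "group": "carboxylic_acid", "site": 1, "r_x": 0.05},
]

vinyl_result = assign_head_tail(vinyl_monomer)
amide_result = assign_head_tail(amide_monomer)
print(vinyl_result, amide_result)
```

In the first example the atom with the highest RX is skipped because its group belongs to no polymer class, so the head and tail both fall on the alkene; in the second, the tail is taken from the complementary carboxylic acid group.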

M2P algorithm

For the head and tail assignment using Monomers to Polymers (M2P)50, we have created a modified version of the M2P algorithm. According to the authors, “The library can generate multiple replicate structures to create polymer chains represented at the atom and bond level. RDKit44 reaction SMARTS43 are used to manipulate the molecular structures and perform in silico reactions. The polymer chemistries available include vinyls, acrylates, esters, amides, imides, and carbonates”50. Within the source code, the algorithm was modified to generate head and tail assignments for 2 and 3 monomers (vinyl polymerization) only if the user sets the head and tail creation parameter to TRUE. The original M2P algorithm compares SMARTS-SMILES patterns and performs a chemical reaction on a sequence of reactant molecules to return polymerization products. In a first step, the polymerization type is defined by comparing the SMILES input with a library of SMARTS patterns of functional groups. SMARTS43 is a representation for describing molecular patterns that allows specification of substructures with rules that are straightforward extensions of SMILES. After a functional group that belongs to a pre-defined polymerization class is matched, the input is passed to the reaction process. The reaction process follows the SMARTS reaction pattern for chemical transformations, identifying which atom should be displaced and where it should be located in the products. In vinyl polymerization, for example, SMARTS is used for breaking the double bond and for adding two R groups to the carbon atoms that formed the double bond. Noble gas atoms are used as tokens for identifying the bond formation site. The polymerization mechanism comprises initiation, propagation, and termination steps. During the termination step, the tokens are deleted, which leaves only the polymer product as a result.
In our modified version of the algorithm, token atoms (Kr, Xe, and Rn) are added in the initiation, propagation, and termination steps to represent the positions of head and tail atoms. At the end of the polymerization process, these tokens remain on the structure to represent the head and tail assignments. This treatment was also extended to co-polymers with 3 monomers.

Validation of HTA and M2P algorithms

The validation dataset contains 206 data points: 149 polymers that undergo vinyl polymerization, 17 in the polyamide class, 25 in the polyester class, 12 in the polyether class, and 3 in the polyurethane class. Specifically, 57 polymer names and polymer SMILES with assigned heads and tails belonging to the polyamide, polyester, polyether, or polyurethane classes were manually retrieved from Polymerdatabase.com51. The 149 polymer names and (some of) the polymer SMILES with assigned heads and tails that undergo vinyl polymerization were extracted manually from reference52. The dataset was then modified by manually transforming each polymer product into its precursor (monomer). For validation, the algorithm was then applied to detect the reaction center of the polymerization.

The validation of our methodology was performed by comparing the ground-truth head and tail positions in the SMILES with the positions predicted by the HTA and M2P algorithms. The HTA algorithm produces results as repeating units for each monomer. Therefore, the co-polymer results could not be compared automatically and were analyzed manually. To compare the results for homopolymers, both datasets were sorted by polymer name and by monomers with assigned heads and tails (mon-HTA/mon-M2P). The SMILES were canonicalized using RDKit to ensure that the labeling system identified each compound bijectively. We then checked whether each mon-HTA/mon-M2P entry had the same canonical SMILES in the ground-truth and predicted datasets. The results were compiled as a Boolean series and the mon-HTA/mon-M2P structures were visualized.
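
The homopolymer comparison can be sketched in a few lines. The sketch assumes that both datasets have already been canonicalized and are keyed by polymer name; the function name and the labeled SMILES strings are placeholders for illustration and have not been checked against RDKit's canonical form or against the notation used in the actual datasets.

```python
def compare_assignments(ground_truth, predicted):
    """Match entries by polymer name and flag identical labeled SMILES.

    Both arguments map polymer name -> canonical SMILES with head/tail labels.
    Returns a name-sorted list of (name, is_match) pairs (the Boolean series).
    """
    results = []
    for name in sorted(ground_truth):
        results.append((name, predicted.get(name) == ground_truth[name]))
    return results


# Placeholder strings (assumption): noble-gas atoms mark head and tail.
ground_truth = {
    "polystyrene": "[Kr]CC([Xe])c1ccccc1",
    "poly(vinyl chloride)": "[Kr]CC([Xe])Cl",
}
predicted = {
    "polystyrene": "[Kr]CC([Xe])c1ccccc1",
    "poly(vinyl chloride)": "[Xe]CC([Kr])Cl",  # head and tail swapped
}

results = compare_assignments(ground_truth, predicted)
accuracy = sum(match for _, match in results) / len(results)
print(results, accuracy)
```

An entry missing from the predictions counts as a mismatch here, which is a design choice of the sketch rather than something stated in the text.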

Model training for forward and backward reaction prediction

As the base model for both forward and backward reaction prediction, we used the Molecular Transformer proposed by Schwaller et al.25. Encoders and decoders follow a standard transformer architecture with 6 layers and with word vectors and hidden size of dimension 512 (rnn_size parameter in OpenNMT53). The gradient was accumulated 8 times with a maximum vector norm of 0.0, and Adam was used as the optimizer (β1 = 0.9, β2 = 0.998), with the maximum number of fine-tuning steps set to 20000 (no early stopping applied). The batch size was set to 4096, and the batch type as well as the gradient normalisation method were set to tokens. The learning rate was set to 2.0 with noam as the decay method. Dropout and label smoothing (ϵ) were set to 0.1. Parameter initialisation was disabled and position encoding enabled. All models were trained using a version of OpenNMT adapted for the Molecular Transformer54, using the aforementioned fixed set of hyper-parameters for fine-tuning. Compared with the standard Molecular Transformer, we extended the model and tokenizer to handle head and tail representations using noble gases as additional tokens. We trained models on two datasets generated by the HTA and the M2P algorithm, respectively, and compared both backward and forward model performance.
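
To illustrate how noble-gas atoms can appear as separate tokens, the sketch below applies the regular expression commonly used for SMILES tokenization with the Molecular Transformer. The example SMILES string and the choice of Kr and Xe as markers are assumptions made for illustration; in this sketch the bracket-atom rule is what captures the markers, and the extended tokenizer used in this work may differ.

```python
import re

# Regular expression commonly used for SMILES tokenization; bracket atoms
# such as [Kr] or [Xe] are matched as single tokens by the first alternative.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens and check nothing was dropped."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles}")
    return tokens


# Hypothetical labeled repeat unit: [Kr] and [Xe] stand in for head and tail markers.
labeled = "[Kr]CC([Xe])c1ccccc1"
tokens = tokenize(labeled)
print(" ".join(tokens))
```

The space-separated output is the form in which sequences are usually passed to OpenNMT-style models, so each marker becomes one vocabulary entry.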