Predicting polymerization reactions via transfer learning using chemical language models

Ferrari, Brenda S.; Manica, Matteo; Giro, Ronaldo; Laino, Teodoro; Steiner, Mathias B.

doi:10.1038/s41524-024-01304-8

Article
Open access
Published: 04 June 2024

Volume 10, article number 119, (2024)

npj Computational Materials

Abstract

Polymers are candidate materials for a wide range of sustainability applications such as carbon capture and energy storage. However, computational polymer discovery lacks automated analysis of reaction pathways and stability assessment through retro-synthesis. Here, we report an extension of transformer-based language models to polymerization for both reaction and retrosynthesis tasks. To that end, we have curated a polymerization dataset for vinyl polymers covering reactions and retrosynthesis for representative homo-polymers and co-polymers. Overall, we obtain a forward model Top-4 accuracy of 80% and a backward model Top-4 accuracy of 60%. We further analyze the model performance with representative polymerization examples and evaluate its prediction quality from a materials science perspective. To enable validation and reuse, we have made our models and data available in public repositories.

Introduction

Polymers have versatile properties and a wide range of applications^1,2,3. The optimization of polymeric materials and the development of new polymers are, however, time-consuming processes. Machine Learning (ML) techniques have been demonstrated to significantly accelerate the discovery process by predicting polymer properties^4,5 or, more recently, by enabling the automated design and generation of new polymers with predefined target properties^6,7,8,9. Despite these advances, computational polymer discovery still faces major obstacles. Polymers are macromolecules that are formed by linking up smaller molecular units. Their synthesis typically involves various polymerization steps, with a multitude of possible links between monomer units. How the monomeric units link up with each other, forming the repeat units, largely determines a polymer’s properties¹⁰. For example, the thermal degradation of poly(methyl methacrylate) - PMMA - is enhanced by head-to-head linkage¹¹. Poly(3-alkylthiophene)s (P3ATs) exhibit superior electrical conductivity, light-emitting capability, and field-effect mobility in head-to-tail linkage as compared with head-to-head linkage¹². Isomerism can also affect gas permeabilities in carbazole-containing diamides (2,7-CPPI)¹³. Meta-connected M-2,7-CPPI has a less-ordered chain structure and weaker hydrogen bonding than para-connected 2,7-CPPI, which results in loose chain stacking and increased free volumes of M-2,7-CPPI. Higher free volumes promote the solubility and diffusivity of gas in M-2,7-CPPI. As a result, the meta-linked M-2,7-CPPI shows a lower gas barrier than its para-linked analogue. Therefore, the prediction of a polymerization reaction is only complete with the assignment of atoms that will form bonds between repeat units throughout the polymerization process as well as the linkage arrangement between them.
Other obstacles in computational polymer discovery, such as the prediction of thermodynamically stable polymer candidates and the determination of a polymer’s synthesizability¹⁴, are still affected by critical methodological limitations.

Recently, Caddeo et al.¹⁵ reported ML and atomistic approaches for modeling the thermodynamic stability of polymer blends while Chen et al.¹⁶ demonstrated a data-driven approach to automated retro-synthesis of target polymers. Kim et al.¹⁷ demonstrated the combination of ML-model-based generation of new polymer candidates with a synthesizability analysis based on known polymerization reactions and commercially available reactants.

Despite the encouraging progress, significant gaps still exist in both methods and data domains. Currently, ML models do not exist for conducting retro-synthesis analysis on a range of co-polymers, polymer blends, ladder, cross-linked, and metal-containing polymers. Previous research has predominantly focused on homo-polymers, which can be easily represented as strings using the simplified molecular-input line-entry system (SMILES)^18,19,20. The recent development of advanced string representations for polymers^21,22 opens up new opportunities for modeling co-polymers²¹ as well as comb, branched, brushed, and star polymers^9,22,23,24.

Another critical issue is that the available polymer reaction datasets do not consider the influence of solvents, catalysts, and experimental conditions. In addition, the data used to train ML models are not always made available publicly, compromising the reproducibility of model predictions. Overall, the lack of open data and open models severely hinders the advancement of computational polymer discovery.

In this work, we report an extension of a transformer-based language model^25,26 to polymerization reactions by leveraging transfer learning on a curated reaction dataset for vinyl polymers. We fine-tune the polymerization models for both forward and backward prediction tasks, addressing both homo-polymers and co-polymers consisting of up to two monomers. Our model predicts reactants, as well as reagents, solvents, and catalysts for each step of the retro-synthesis. Finally, we show that our models are able to perform two essential tasks, as visualized in Fig. 1: (i) given a set of precursors, to predict a polymer product and (ii) given a polymer, to suggest potential disconnections for synthetic strategies.

Results and Discussion

Dataset preparation

In Fig. 2, we visualize the end-to-end workflow for predicting polymerization reactions. The workflow includes dataset preparation and training of reaction and retrosynthesis prediction models, respectively. The training dataset was generated based on the publicly available USPTO reaction dataset^27,28 which contains chemical reactions of organic compounds extracted from US patents issued between 1976 and 2016. For extracting polymerization reactions from the dataset, we have designed a Python tool (see code availability section) that operates based on specific keywords. To ensure the selection of polymerization reactions only, we have employed a manual curation process that involves an individual review step of the reactions chosen by the automated procedure. Overall, we have analyzed 795 data entries for vinyl homo-polymers and co-polymers, respectively, resulting in two distinct datasets containing 3932 and 2965 reactions. These datasets cover all the possible combinations of the 795 reaction examples (details can be found in the Methods section).

In general, polymer properties are determined to a large extent by how the monomer units are interconnected. For the purpose of our study, we have chosen linear chains as topological representations. For accurately predicting polymerization reactions, it is essential to correctly identify and label head and tail positions of the repeat units. To that end, we have adopted two distinct strategies. In the first approach, we have adapted an existing tool for assigning head and tail atoms, referred to as Monomers-to-Polymer (M2P)²⁹. In the second approach, we have developed a Python tool for Head-and-Tail assignment (HTA). We have provided extensive descriptions related to both HTA and M2P workflows in the Methods section. By using the two techniques, we have assigned head and tail atoms to constituent units within our polymer reaction dataset. We find that the accuracy of the HTA algorithm in identifying members of the polyvinyl class is 100%. Also, the accuracy for the prediction of head and tail atom positions is 100%. In the case of the M2P algorithm, the accuracies for both class prediction and of head and tail atom positions are 100%. We have then trained models on the two distinct datasets, labeled HTA and M2P, for comparative analysis of their predictive performance.

The modified M2P method can be applied to oligomers and assigns the positions of head and tail atoms in linkage bonds. The HTA method assigns head and tail atoms within monomers, thus defining the polymeric repeat unit. For facilitating the comparison of the ML models trained with the HTA and M2P datasets, respectively, we have also performed head and tail assignments in oligomers using the HTA routine. Throughout the training phase, the HTA dataset contained both monomers and oligomers, while the M2P dataset contained only oligomers. The inclusion of monomers within the HTA dataset enables the ML model to predict monomeric units of both homopolymers and copolymers. As the M2P dataset contains only oligomers, the respective model is not expected to predict repeat units in forward mode correctly.

Fine tuning machine learning model

For reaction and retrosynthesis prediction modeling, we have fine-tuned the Molecular Transformer architecture introduced by Schwaller et al.^25,26. In brief, the model is based on a vanilla transformer architecture³⁰ trained on textual representations of molecules. A Molecular Transformer casts chemical reaction prediction as a language modeling task³¹. We have encoded chemical reactions as sentences using the reaction SMILES representation¹⁸ of reactants, reagents, solvents, and catalysts, along with the products. We have modeled forward- or retro-reaction predictions as a translation task from one language, i.e., reactants-reagents, to another language, i.e., products. For training purposes, we have formally divided the reaction SMILES into source (reactants and reagents) and target (products) instances. Since HTA and M2P datasets include different target outcomes for the same source instance, we have performed splitting solely based on the target products. For model training, we have split the datasets by product into 95% for training/validation (more specifically, 90% training and 5% validation) and 5% for testing, ensuring that no polymer (product) appears in more than one split.
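
The product-based split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_product`, the `(source, target)` tuple layout, and the fixed random seed are assumptions made for the example. The fractions are applied to the set of unique products, so the number of reactions per split only approximates 90/5/5.

```python
import random
from collections import defaultdict


def split_by_product(reactions, train=0.90, valid=0.05, seed=0):
    """Split (source, target) pairs so that each target lands in one split only.

    `reactions` is a list of (source, target) string pairs, where `source`
    holds reactants and reagents and `target` holds the product.
    """
    # Group all reactions that share the same product.
    groups = defaultdict(list)
    for source, target in reactions:
        groups[target].append((source, target))

    # Shuffle the unique products reproducibly, then cut by fraction.
    products = sorted(groups)
    random.Random(seed).shuffle(products)
    n_train = round(train * len(products))
    n_valid = round(valid * len(products))
    parts = {
        "train": products[:n_train],
        "valid": products[n_train:n_train + n_valid],
        "test": products[n_train + n_valid:],
    }
    # Expand each product back into its reactions.
    return {name: [r for p in prods for r in groups[p]]
            for name, prods in parts.items()}


# Hypothetical toy data: 60 reactions mapping onto 20 distinct products.
toy = [(f"src{i}", f"prod{i % 20}") for i in range(60)]
splits = split_by_product(toy)
```

Grouping before shuffling is what keeps every reaction of a given polymer in the same split; shuffling individual reactions would leak products across splits.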

Model performance

To assess the performance of the Molecular Transformer trained on the two training datasets, we have used the Top-k accuracy metric for both forward and backward prediction models following the method reported in²⁶. We have calculated the model accuracy by considering the number of exact matches between the predicted canonical SMILES and the ground truth in the datasets. The Top-k accuracy considers that the ground truth canonical SMILES was found within the first k suggestions of the model. For example, if the ground truth target was found as the first suggestion in 70 out of 100 examples, it means Top-1 is 70%. While round-trip is the generally preferred method for verifying the performance in the context of single-step retro-synthetic models²⁶, the datasets analyzed in our work link precursors to multiple products. In this case, the round-trip accuracy could be misleading, as multiple forward predictions are still valid for a precursor set and multiple products map to the same precursors. To avoid this, we have used Top-k accuracy for evaluating the performance of both forward and backward models.
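
As a minimal sketch of the Top-k metric described above, the function below counts an example as a hit when the ground-truth string appears among the first k ranked suggestions. It assumes that predictions and targets are already canonical SMILES strings, so exact string comparison is meaningful; the function name and the toy data are hypothetical.

```python
def top_k_accuracy(predictions, targets, k):
    """Fraction of examples whose target is among the first k predictions.

    `predictions` is a list of ranked suggestion lists (best first) and
    `targets` is the list of ground-truth strings, in the same order.
    """
    hits = sum(1 for ranked, target in zip(predictions, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Hypothetical toy example with three test reactions.
predictions = [["CC", "CCC"], ["CO", "CCO"], ["C=C", "CC"]]
targets = ["CC", "CCO", "CN"]

top1 = top_k_accuracy(predictions, targets, k=1)
top2 = top_k_accuracy(predictions, targets, k=2)
```

Here the first target is found at rank 1 and the second at rank 2, while the third is never suggested, so Top-1 is 1/3 and Top-2 is 2/3.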

In Fig. 3, we show the prediction model performance obtained for the two datasets. The M2P dataset shows better performance overall in both forward and backward models, see Fig. 3a, b. In backward predictions, we observe the general trend that the higher the number of training steps, the higher the model accuracy. For forward predictions, this trend only manifests in certain intervals of the Top-k range. The accuracy increases monotonically with k in both forward and backward models, albeit with different slopes. We observe a sharp accuracy increase in the forward model for M2P around Top-3 and for HTA around Top-4. This could be explained by the number of possible reaction outcomes. While M2P provides n reaction outcomes as oligomers built from combinations of reagent monomers, HTA also provides the repeat units as products of the polymerization. This means that HTA provides n + 1 or n + 2 results, depending on the number of reagent monomers involved in the reaction. On average, M2P returns 4 possible reaction outcomes while HTA returns 5 or 6.

**Fig. 3: Prediction model performance in terms of accuracy of top-k predictions.**

The observation that the M2P dataset yields superior model performance could be due to the simpler learning process of polymerization rules within this dataset. The M2P algorithm polymerizes monomers at all possible functional groups and chooses a representative structure randomly. Due to the random character of the M2P algorithm, different realizations result in different choices of representative structures, affecting the ML training performance. In comparison, the HTA algorithm identifies reactive sites through the analysis of nucleophile and electrophile atoms, applying Mulliken’s scheme^32,33,34 to identify the most probable structure according to chemical rules. Due to the deterministic character of the HTA algorithm in assigning head and tail linkage bonds, the repeat units are kept the same within different ML model realizations. In other words, M2P structures are a combination of all possible bond connections between monomers, while HTA structures are combinations of all possible connections between reacting sites.

To clarify this point, let us consider how the repeat units in the HTA dataset are linked up to form oligomers. A bond between two vinyl monomers with only secondary carbon atoms may be formed as visualized in the example shown in Fig. 4a. We note that the polymeric repeat unit generated by HTA was considered for inclusion into the dataset, however, it was disregarded in the distribution analysis. This is also the case for oligomers with tertiary carbons.

In case 1, the bond is formed between the carbon atoms at the end of the monomers in the chain. As a result, both head and tail are localized at external atoms of the reaction site. We refer to this connection type as tail-tail. In case 2, the head and tail are localized at internal and external carbon positions, respectively. We refer to this connection type as head-tail. Finally, in case 3, the bond occurs between secondary carbon atoms of the double bond. Once polymerized, both head and tail atoms are located at internal carbon atom sites. We refer to this connection type as head-head. By analyzing the case distribution in the dataset for model training, see Fig. 4b, we find that the HTA dataset contains 1/3 of each case for oligomers with 3 different combinations while the ratio is 1/2/1 for oligomers with 4 different combinations. The latter can be explained by the twofold possibility of bond formation in case 2 due to the presence of two different monomers. Note that the M2P dataset does not have a fixed case ratio. This is because M2P performs the polymerization for all possible functional groups of the molecular structure, see Fig. 4c.
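
The 1/1/1 and 1/2/1 ratios follow from simple counting, which the sketch below illustrates. It is a combinatorial illustration only and involves no chemistry: each monomer is reduced to a name with two ends, and a linkage is labelled by the pair of ends that bond. That labelling is an assumption of this example and need not coincide with the case naming used above, which refers to where head and tail sit in the resulting oligomer; the counts are symmetric, so the ratios are the same either way.

```python
from collections import Counter
from itertools import product

ENDS = ("head", "tail")


def linkage_cases(monomer_a, monomer_b):
    """Count the distinct single-bond linkages between two monomers by type."""
    distinct_bonds = set()
    for end_a, end_b in product(ENDS, repeat=2):
        # A bond is an unordered pair of (monomer, end) labels, so for two
        # identical monomers head-tail and tail-head collapse into one linkage.
        distinct_bonds.add(frozenset([(monomer_a, end_a), (monomer_b, end_b)]))

    counts = Counter()
    for bond in distinct_bonds:
        ends = sorted(end for _, end in bond)
        if len(ends) == 1:
            # Same monomer and same end on both sides, e.g. head-head.
            ends = ends * 2
        counts["-".join(ends)] += 1
    return counts


homo = linkage_cases("A", "A")  # two identical monomers
co = linkage_cases("A", "B")    # two different monomers
```

For identical monomers there are three distinct linkages, one of each type. For two different monomers the mixed head-tail linkage can be formed in two ways, which gives four linkages in the ratio 1/2/1.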

These differences in the distribution can be seen in the examples in Fig. 4d. For the butadiene isoprene polymer with its four potential polymerizations, the vinyl bond case ratio (cases 1/2/3, see Fig. 4a) is 1/2/1 for HTA and 0/2/2 for M2P. Similarly, in the case of allyl methacrylate, we obtain the case ratio 1/2/1 for HTA and 0/2/2 for M2P. In the case of M2P, the polymerization is performed by considering all the functional groups of the monomer. The results observed in Fig. 3a, b could indicate that the model has learned this pattern efficiently. The larger spread of accuracy values observed in the retro-synthesis model could be due to the specifics of the oligomers.

While we obtain overall better modeling results with M2P, both datasets reveal interesting insights. Despite showing a Top-1 accuracy below 10%, the forward model exhibits Top-4 and Top-6 accuracy around 80%, which suggests a direct relation with the way the two datasets have been compiled. Indeed, by construction, the same set of reactants is associated with multiple polymers. The backward model has a Top-1 accuracy of about 60% for M2P and 40% for HTA. The lower accuracy observed for HTA could be explained by the ease with which the model may have learned the polymerization pattern represented in the M2P data, as explained previously. We will expand this analysis in the following paragraphs by investigating the usefulness of the model outputs from a materials science perspective.

Representative polymerization reactions predictions

For our domain applicability analysis, see Methods section for details, we have selected representative polymers from the literature^{35,36,37,38,39,40,41,42}. A comparison of these reactions reveals product similarities ranging from 0 to 0.3 for HTA and M2P datasets while reactant similarities range from 0 to 0.12, see Supplementary Table 1. Copolymers show increased similarity values in M2P, about 0.03-0.06 higher, attesting to their representation in the training data. Homo-polymers exhibit increased similarity of about 0.04 in HTA as the dataset includes monomer representations.

Overall, the two models together correctly predicted 6 out of 8 reactions in Top-4 and could suggest at least one correct monomer in all the examples studied. The HTA-based model correctly predicted 3 out of 8 reactions in Top-1 and 4 out of 8 reactions in Top-4, while the M2P-based model correctly predicted 1 out of 8 reactions in Top-1 and 2 out of 8 reactions in Top-4. Note that the HTA-based model predominantly matches homo-polymers while M2P matches mainly co-polymers. The pattern is plausible as HTA contains the monomers of all polymers while M2P contains only oligomers.

For the polymerization example of styrene - a homopolymer, see Fig. 5a - the HTA-based model achieves a full SMILES match at Top-1, as well as the representation of a possible oligomer structure, with 2 connected repeat units, at Top-3. In the case of the M2P-based model, we do not obtain an exact match for the actual product (repeat unit). For the oligomer representation, we obtain an M2P-based match at Top-3 and Top-4. For the polymerization of the co-polymer p(SBMA-nBA), see Fig. 5b, the M2P-based model predicts an exact product match at Top-1, along with all other bond formation possibilities at Top-2 to Top-4. This means that the model is able to correctly predict the connections in the polymerization reactions. While the HTA-based model failed to predict the actual result, it was able to identify the correct head and tail positions of one of the repeat units (Top-1). In addition, the model suggested fragments of the monomer at Top-2 and Top-4. More examples of forward polymerization reactions are provided in the supplementary material (see Supplementary Figures 2–4).

For the curated examples, the HTA-based model predicts a higher number of exact matches for the polymer structures in Top-1 (3 out of 8) and Top-4 (4 out of 8), respectively. In cases of incorrect predictions, the model delivered at least one of the monomers correctly. The model trained with M2P data had limitations regarding homopolymers, as expected. Nevertheless, the M2P model correctly predicts complex copolymers and a very close match for the p(tC-tBuM) copolymer, a pattern not represented in the training dataset. The two models appear to have complementary performance, together predicting exact matches for 6 out of 8 reactions and suggesting at least one correct monomer for all the examples studied. To increase the likelihood of a suitable prediction outcome, we therefore recommend the joint utilization of both HTA and M2P-based models for domain-specific applications.

In summary, we have reported the curation of a vinyl polymerization reaction dataset and the fine-tuning of a Molecular Transformer algorithm for predicting polymerization (forward) and retro-synthesis (backward) reactions. For dataset curation, we have introduced two algorithms for assigning head and tail positions, named HTA and M2P. We have applied both algorithms to process 795 data entries for vinyl homopolymers and copolymers and produced two separate datasets with 3932 and 2965 reactions, respectively, representing all possible combinations of the 795 reaction examples. Upon performing transfer learning on polymerization reactions, the Molecular Transformer exhibits a forward-model (Top-4 and Top-6) accuracy around 80% for both datasets. The retro-model exhibits a Top-1 accuracy of about 60% for the M2P dataset and 40% for the HTA dataset.

We have showcased the capabilities of the models through a case study involving eight reactions. These reactions were selected based on examples provided in the literature. Both models have predicted 6 out of 8 reactions as exact match at Top-4, and suggested at least one correct monomer for all the examples studied. The models work in a complementary manner, as the model trained with the HTA dataset produces better results for homo-polymers while the model trained with the M2P dataset predicts better matches for co-polymers.

The Molecular Transformer approach presented in this work is an extension of transformer-based language models to polymerization reactions for both forward and retrosynthesis tasks. We consider this study a promising step towards the development of new computational tools for automated analysis of reaction pathways. The polymerization reaction dataset created in this work can help to overcome the lack of publicly available data. Also, the tools that assign heads and tails to monomers facilitate the generation of polymerization reaction datasets. Current limitations include the choice of polymerization classes as well as the size of the training datasets used for building the models. Based on our analysis of the strengths and limitations of the Molecular Transformer approach, we expect that extending the model to include other polymer classes in the transfer learning phase will broaden model applicability and further increase the robustness of prediction outcomes. The lack of available data on polymerization reactions and of tools for head and tail assignment were major challenges we encountered in this work. Therefore, we have made our curated datasets and tools publicly available for reuse and validation.

Methods

Polymerization dataset

The polymerization reactions and polymer names were extracted from a publicly available dataset²⁷ derived from the patent mining work of Lowe²⁸. This dataset contains approximately 1.8M chemical reactions, extracted from USPTO patents granted between 1976 and 2016. A Python script was developed to automate the data extraction. Only chemical reactions and molecule names whose experimental descriptions contained the keyword “polymerization” were selected. After the automated step was completed, a manual validation was performed to remove data entries in which the “polymerization” keyword was related to any information not compatible with the reaction type. In this step, the number of possible polymerization reactions was reduced from 8,668 to 3,286. In the Lowe²⁸ dataset, the head and tail atom positions for defining the polymer repeat units are missing. Since there was no established methodology for performing the head and tail assignment in polymer structures represented by SMILES notation, we developed Python tools with two different approaches to perform this task. In the first approach, we developed a tool for assigning head and tail atoms referred to as HTA. Details are provided in the HTA algorithm section below. In the second approach, we developed a modified version of the Monomers-to-Polymer (M2P)²⁹ tool for assigning the head and tail atoms. For details, see M2P algorithm section below. The two approaches resulted in two datasets, containing 795 data entries related to vinyl homo-polymers and co-polymers with 2 monomers, which were cleaned by removing duplicates and erroneous reactions. Besides the head and tail assignment, another two datasets were generated by describing all the possible product outcomes, which are represented by one or two products and the different types of bond formation between monomers. Bond formations were performed by combining monomers using the rdkit.Chem.rdChemReactions module.
To that end, all monomer combinations of M2P and HTA algorithms were considered. In the case of the HTA algorithm, monomers were also considered as possible outcomes of the reaction. As a result, the number of outcomes is n for M2P and n + 1 or n + 2 for HTA, respectively. This increased the number of reactions from 795 to 3932 in the case of HTA and 2965 in the case of M2P. Overall, four datasets were generated and two datasets were used to train our model: the datasets for HTA and M2P, respectively, that combine all monomers.
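
The automated keyword step described above might look like the sketch below. It is not the authors' extraction script: the record layout (dictionaries with `name` and `description` fields) and the function name are assumptions made for illustration. The sketch only selects candidates; the manual validation step that follows it is still required.

```python
def select_candidates(records, keyword="polymerization"):
    """Keep records whose experimental description mentions the keyword."""
    return [r for r in records if keyword in r["description"].lower()]


# Hypothetical records standing in for entries of the patent dataset.
records = [
    {"name": "polystyrene",
     "description": "Radical polymerization of styrene was carried out at 70 C."},
    {"name": "ethyl acetate",
     "description": "The ester was obtained after reflux and distillation."},
    {"name": "methyl acrylate",
     "description": "A polymerization inhibitor was added before distillation."},
]

candidates = select_candidates(records)
```

The third record shows why manual review matters: it contains the keyword but does not describe a polymerization reaction, so a plain substring match keeps it as a false positive.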

Data distribution

Both M2P and HTA datasets were sorted by polymer name and repeating unit, the latter alphabetically and by length. All results for the same polymer were grouped in lists during pre-processing. The modified M2P tool assigned head and tail atom positions (linkage bonds) in oligomers and the HTA tool in monomers, respectively, defining the polymeric repeat unit. With the purpose of avoiding any bias between the two datasets during the ML model training, we also performed head and tail assignments with the HTA tool in oligomers. This added another level of complexity with regard to how the repeat units were linked. There are three possible cases: (i) tail-tail; (ii) head-tail and (iii) head-head. For the extraction of the distribution of cases, we used SMARTS⁴³ for each polymerization case, and following a dearomatization process, all SMILES¹⁸ were compared to the SMARTS set, using the RDKit⁴⁴ library. SMARTS⁴³ is a chemical structure query language for describing molecular patterns. RDKit can import SMARTS queries for use in searching for SMILES patterns. Cases that deviated from the standard SMARTS query pattern, i.e., tertiary carbons that could cause uncertainties in the algorithm, were not considered. After post-processing, both datasets were merged, as only polymers present in both were considered for the comparison, and a distribution chart was built with the results.

Applicability domain analysis

The representative polymers used in this case study were manually extracted from the literature^{35,36,37,38,39,40,41,42} (see Supplementary Table 1). The SMILES representations of polymers were canonicalized using the RDKit⁴⁴ package. We calculated fingerprints for both input datasets using RDKFingerprint⁴⁴ and then compared the two resulting datasets. The fingerprint of each representative polymer was compared with the fingerprints of the whole training data. The similarity was calculated using the RDKit FingerprintSimilarity⁴⁵ function, which employs the Tanimoto similarity metric⁴⁶. Here, the similarity between two fingerprint vectors is expressed by a number ranging from 0 (no similarity) to 1 (identity)⁴⁶. For each representative polymer, we report the mean and the maximum of the resulting list of similarity values. This process was performed separately for reactants/reagents and products.
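
The comparison described above reduces to a Tanimoto coefficient between bit vectors, followed by a mean and a maximum over the training set. The sketch below shows that arithmetic in plain Python, with each fingerprint represented as a set of on-bit indices. It does not call RDKit; the toy fingerprints are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen for this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    # Convention assumed here: two empty fingerprints have similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


def similarity_summary(query_bits, training_bits):
    """Mean and maximum similarity of one query against a training set."""
    scores = [tanimoto(query_bits, t) for t in training_bits]
    return sum(scores) / len(scores), max(scores)


# Hypothetical on-bit sets standing in for real molecular fingerprints.
query = {1, 2, 3, 4}
training = [{1, 2}, {3, 4, 5, 6}, {7, 8}]

mean_sim, max_sim = similarity_summary(query, training)
```

With these toy sets the individual similarities are 2/4, 2/6 and 0, so the maximum is 0.5 and the mean is 5/18. The mean indicates how typical the query is of the training data, while the maximum indicates whether any single close analogue exists.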

HTA algorithm

We have developed the HeadTailAssigner (HTA) algorithm for identifying the position of a polymer’s head and tail atom, respectively, by analyzing the reactivity of the functional groups in the monomer. The algorithm checks for the presence of functional groups that occur in specific polymer classes and, using quantum chemical calculations, rank-orders their reactivities to identify which functional groups are responsible for the polymerization. In the next step, the algorithm identifies the most likely polymerization class and mechanism and, consequently, tags head and tail atoms of the monomer structure. As input, the algorithm accepts both reaction SMILES and monomer SMILES. Following the pre-processing analysis, the most probable monomer in the reaction string is defined by comparing the products with the reactants. The last step is performed by a fingerprint similarity analysis, using the RDKFingerprint⁴⁴ with maxPath = 7, and a comparison using Tanimoto similarity^44,47. The vinyl class is the focus of this work, but the algorithm may also identify and assign the head and tail of polyamides, polyesters, polyurethanes, and polyethers. To define the polymer class, the algorithm searches all the possible functional groups on the molecular structure by substructure match with the SMARTS pattern of each organic function. The most common functional group promoting polymerization in the poly-vinyl class is the alkene group. A monomer is, therefore, classified as poly-vinyl if it contains an alkene group. To broaden the poly-vinyl class definition, the presence of an alkyne group is also considered for inclusion. In the next step, the algorithm compares the atomic index of nucleophilicity⁴⁸ and the functional groups extracted from the monomer. If the monomer SMILES has only one functional group, a SMARTS pattern is acquired to classify the polymerization mechanism.
If the monomer SMILES has two or more functional groups, the priority of polymerization is decided based on the atomic index of nucleophilicity⁴⁸. The atomic index of nucleophilicity of an atom X involving only the highest occupied molecular orbital (HOMO) n is defined as⁴⁸:

$$R_{X}=\frac{\sum_{\alpha}^{X}\left| C_{\alpha,n}\right|^{2}}{1-\varepsilon_{n}} \qquad (1)$$

where C_{α,n} are the molecular orbital expansion coefficients of the αth atomic orbital on molecular orbital n (HOMO) and ε_n is the HOMO energy.
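
As a worked illustration of Eq. (1), the sketch below evaluates R_X for a few atoms and picks the one with the largest value. All numbers are hypothetical and are not taken from any quantum chemical calculation; in practice the coefficients and the HOMO energy would come from the output of an electronic structure package. The function and variable names are assumptions made for the example.

```python
def nucleophilicity_index(homo_coeffs, homo_energy):
    """Evaluate R_X = sum_alpha |C_{alpha,n}|^2 / (1 - eps_n) for one atom X.

    `homo_coeffs` holds the HOMO expansion coefficients of the atomic
    orbitals centred on atom X, and `homo_energy` is the HOMO energy eps_n.
    """
    return sum(abs(c) ** 2 for c in homo_coeffs) / (1.0 - homo_energy)


# Hypothetical HOMO coefficients grouped by atom, and a hypothetical HOMO energy.
homo_by_atom = {
    "C1": [0.10, 0.62],
    "C2": [0.05, 0.48],
    "O3": [0.02, 0.11],
}
homo_energy = -0.30

indices = {atom: nucleophilicity_index(coeffs, homo_energy)
           for atom, coeffs in homo_by_atom.items()}
most_nucleophilic = max(indices, key=indices.get)
```

In this toy case atom C1 carries the largest share of the HOMO and therefore the largest R_X. Because the denominator is the same for every atom of a molecule, the ranking within one molecule depends only on the summed squared coefficients.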

R_X was calculated with the STO-3G basis set and Mulliken’s population analysis^32,33,34 scheme. All quantum state functions were calculated at the RHF level of theory, using the standard ab initio quantum chemistry package GAMESS⁴⁹ version 2021 R2.

In general, the higher the atomic population value in an atom, the higher the atomic index of nucleophilicity R_X, which means that the atom has a higher probability of being the polymerization site⁴⁸. The condition is set depending on the relation between polymerization class and the functional groups present in the structure. If one atom has a higher R_X but its functional group is not represented in any polymer class, the algorithm keeps searching until it finds an atom that is represented in an existing polymer class. After obtaining a match, the functional groups are concatenated until a match is obtained with a previously defined class. The mechanism is defined depending on the polymer class described previously. If the class is vinyl and the algorithm detects the presence of a specific catalyst, it may also define whether the mechanism is anionic, cationic, or radical. With all the information obtained previously, the algorithm defines the head and tail by assigning the atom index of the respective nucleophile and electrophile to the functional group responsible for the polymerization.

For vinyl polymers, the polymerization should occur at the double bonds and, in some cases, triple bonds. Using atom mappings, the most nucleophilic atom is selected as head by convention. In case the electrophilic atom is located at the same organic function, which is the case in vinyl polymerization, the tail is selected from the same organic function. If the most electrophilic atom is located at a different organic function, the tail is selected from a complementary organic function. For example, if an amide functional group is ranked as the group with the highest atomic index of nucleophilicity, and a carboxylic acid or acid halide group exists in the molecule, the class will be assigned as polyamide and the tail will be assigned to the carboxylic acid or acid halide group. The atomic index of nucleophilicity derived by HOMO electronic population analysis is sufficient for determining the nucleophilic atom with the highest probability to donate electrons. Once the polymerization reaction mechanism that occurs in the functional group is identified, the head and tail assignments are straightforward. In Supplementary Fig. 1, we have provided an example illustrating how the HTA algorithm works.

M2P algorithm

For the head and tail assignment using Monomers to Polymers (M2P)⁵⁰, we have created a modified version of the M2P algorithm. According to the authors, “The library can generate multiple replicate structures to create polymer chains represented at the atom and bond level. RDKit⁴⁴ reaction SMARTS⁴³ are used to manipulate the molecular structures and perform in silico reactions. The polymer chemistries available include vinyls, acrylates, esters, amides, imides, and carbonates”⁵⁰. Within the source code, the algorithm was modified to generate head and tail assignments for 2 and 3 monomers (vinyl polymerization) only if the user sets the head and tail creation parameter to TRUE. The original M2P algorithm compares SMARTS-SMILES patterns and performs a chemical reaction on a sequence of reactant molecules to return polymerization products. In a first step, the polymerization type is defined by comparing the SMILES input with a library of SMARTS patterns of functional groups. SMARTS⁴³ is a representation for describing molecular patterns that allows specification of substructures with rules that are straightforward extensions of SMILES. After a match is found with a functional group that belongs to a pre-defined polymerization class, the input is passed to the reaction process. The reaction process follows the SMARTS reaction pattern for chemical transformations, identifying which atom should be displaced and where it should be located in the products. In vinyl polymerization, for example, SMARTS is used for breaking up the double bond and for adding two R groups to the carbon atoms that formed the double bond. Noble-gas atom representations are used as tokens for identifying the bond formation site. The polymerization mechanism comprises initiation, propagation, and termination steps. During the termination step, the tokens are deleted, which leaves only the polymer product as a result.
In our modified version of the algorithm, token atoms (Kr, Xe and Rn) are added to the initiation, propagation, and termination steps for representing the positions of head and tail atoms. At the end of the polymerization process, these tokens remain on the structure to represent the head and tail assignments. This treatment was also extended for co-polymers with 3 monomers.

Validation of HTA and M2P algorithms

The validation dataset contains 206 data points: 149 polymers that undergo vinyl polymerization, 17 in the polyamide class, 25 in the polyester class, 12 in the polyether class, and 3 in the polyurethane class. Specifically, 57 polymer names and polymer SMILES with assigned heads and tails belonging to the polyamide, polyester, polyether, or polyurethane classes were manually retrieved from Polymerdatabase.com⁵¹. A further 149 polymer names and (some of) the polymer SMILES with assigned heads and tails that undergo vinyl polymerization were extracted manually from reference⁵². The dataset was then modified by manually transforming each polymer product into its precursor (monomer). For validation, the algorithm was then applied to detect the reaction center of the polymerization.

The validation of our methodology was performed by comparing the ground-truth head and tail positions in the SMILES with the positions predicted by the HTA and M2P algorithms. The HTA algorithm produces results as repeating units for each monomer. Therefore, the co-polymer results could not be compared automatically and were analyzed manually. To compare the results for homopolymers, both datasets were sorted by polymer name and by monomers with assigned heads and tails (mon-HTA/mon-M2P). SMILES were canonicalized using RDKit to ensure that the labeling system identified each compound bijectively. We then checked whether each mon-HTA/mon-M2P entry had the same canonical SMILES in the ground-truth and predicted datasets. The results were compiled as a Boolean series and the mon-HTA/mon-M2P structures were visualized.
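The homopolymer comparison can be illustrated with a short Python sketch. It is not the validation script used in this work: the function name, the dictionary layout, and the example SMILES strings are hypothetical, and the canonicalizer is passed in as an argument. The default used here only strips whitespace so that the sketch runs without dependencies; in practice one would supply an RDKit-based canonicalizer, for example one built on `Chem.MolFromSmiles` and `Chem.MolToSmiles`.

```python
from typing import Callable, Dict


def compare_assignments(
    ground_truth: Dict[str, str],
    predicted: Dict[str, str],
    canonicalize: Callable[[str], str] = str.strip,
) -> Dict[str, bool]:
    """Map each polymer name to True if the predicted head/tail-labelled
    SMILES equals the ground truth after canonicalization."""
    matches = {}
    for name in sorted(ground_truth):  # sort by polymer name
        truth = canonicalize(ground_truth[name])
        guess = predicted.get(name)
        matches[name] = guess is not None and canonicalize(guess) == truth
    return matches


# Illustrative entries only; these strings are not taken from the dataset.
truth = {"polymer A": "[Kr]CC[Xe]", "polymer B": "[Kr]CC(C)[Xe]"}
preds = {"polymer A": " [Kr]CC[Xe] ", "polymer B": "[Xe]CC(C)[Kr]"}

series = compare_assignments(truth, preds)
accuracy = sum(series.values()) / len(series)
print(series, accuracy)
```

Keeping the canonicalizer as a parameter makes the assumption explicit: string equality is only meaningful once both datasets use the same canonical form.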

Model training for forward and backward reaction prediction

As the base model for both forward and backward reaction prediction, we used the Molecular Transformer proposed by Schwaller et al.²⁵. Encoders and decoders follow a standard transformer architecture with 6 layers and with word vectors and hidden size of dimension 512 (rnn_size parameter in OpenNMT⁵³). Gradients were accumulated over 8 steps with a maximum gradient vector norm of 0.0, and Adam was used as the optimizer (β₁ = 0.9, β₂ = 0.998), with the maximum number of fine-tuning steps set to 20000 (no early stopping applied). The batch size was set to 4096, and both the batch type and the gradient normalisation method were set to tokens. The learning rate was set to 2.0 with noam as the decay method. Dropout and label smoothing (ϵ) were set to 0.1. Parameter initialisation was disabled and position encoding enabled. All models were trained using a version of OpenNMT adapted for the Molecular Transformer⁵⁴ with the aforementioned fixed set of hyper-parameters for fine-tuning. Compared with the standard Molecular Transformer, we extended the model and tokenizer to handle head and tail representations using noble gases as additional tokens. We trained models on two datasets generated by the HTA and the M2P algorithm, respectively, and compared both backward and forward model performance.
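To illustrate how noble-gas atoms can act as head and tail tokens at the input level, the sketch below tokenizes a SMILES string with a regular expression of the kind commonly used for SMILES tokenization in transformer-based reaction models. It is an assumption that the tokenizer in this work behaves this way; the function name and the example string, including where the tokens are placed, are purely illustrative. Because a bracketed atom such as `[Kr]` is matched as a single unit, the marker becomes one vocabulary token rather than a sequence of characters.

```python
import re

# Regex of the kind commonly used for SMILES tokenization; assumed here,
# not copied from the code of this work.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~"
    r"|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

# Noble-gas atoms named in the text as head/tail markers.
HEAD_TAIL_TOKENS = {"[Kr]", "[Xe]", "[Rn]"}


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens; fail loudly if anything is dropped."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES losslessly: {smiles!r}")
    return tokens


# Illustrative repeat unit with marker atoms; placement is hypothetical.
tokens = tokenize("[Kr]CC([Xe])c1ccccc1")
print(" ".join(tokens))
print([t for t in tokens if t in HEAD_TAIL_TOKENS])  # ['[Kr]', '[Xe]']
```

The round-trip check guards against silently losing characters that the pattern does not cover, which would otherwise corrupt the training data.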

Data availability

D. Lowe’s dataset 1976_Sep2016_USPTOgrants_cml.7z, used to extract the polymerization reaction data, is available under doi:10.6084/m9.figshare.5104873.v1 at https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873?file=8664364. The following files are available under doi:10.24435/materialscloud:ef-4j at https://archive.materialscloud.org/record/2024.40:
hta_dataset_all_combinations.csv: training dataset containing polymerization reactions in SMILES format, with the head and tail atoms assigned by the HTA tool.
m2p_dataset_all_combinations.csv: training dataset containing polymerization reactions in SMILES format, with the head and tail atoms assigned by the modified version of the M2P tool.
trained_models.zip: the trained machine-learning models for both forward and backward directions.
input.csv: ground-truth dataset containing monomers and polymer repeat units, used to validate the head and tail assignment by the HTA and M2P algorithms.

Code availability

The code for extracting the polymerization reaction data from Daniel Lowe’s dataset is available at: https://github.com/IBM/XLMExtractor-chem-reaction. The code for assigning the head and tail atoms (version 1.0.0) using quantum chemistry and polymerization mechanisms information is available at: https://github.com/IBM/HeadTailAssign. The code for assigning the head and tail atoms based on the Monomers to Polymers (M2P) tool is available at: https://github.com/IBM/m2o-head-tail-assign. The code for model training is available at: https://github.com/rxn4chemistry/rxn-models-for-polymerization.

References

1. Arshad, M., Zubair, M., Rahman, S. S. & Ullah, A. Polymers for advanced applications. In Polymer Science and Nanotechnology, 325–340 (Elsevier, 2020). https://doi.org/10.1016/b978-0-12-816806-6.00014-5.
2. Namazi, H. Polymers in our daily life. BioImpacts 7, 73–74 (2017).
3. Patel, V. K., Kant, R., Chauhan, P. S. & Bhattacharya, S. Introduction to applications of polymers and polymer composites. In Trends in Applications of Polymers and Polymer Composites, 1–6 (AIP Publishing, 2022). https://doi.org/10.1063/9780735424555_001.
4. Kim, C., Chandrasekaran, A., Huan, T. D., Das, D. & Ramprasad, R. Polymer genome: A data-powered polymer informatics platform for property predictions. J. Phys. Chem. C. 122, 17575–17585 (2018).
5. Tran, H. D. et al. Machine-learning predictions of polymer properties with polymer genome. J. Appl. Phys. 128, 171104 (2020).
6. Kim, C., Batra, R., Chen, L., Tran, H. & Ramprasad, R. Polymer design using genetic algorithm and machine learning. Comput. Mater. Sci. 186, 110067 (2021).
7. Batra, R. et al. Polymers for extreme conditions designed using syntax-directed variational autoencoders. Chem. Mater. 32, 10489–10500 (2020).
8. Giro, R. et al. AI powered, automated discovery of polymer membranes for carbon capture. npj Comput. Mater. 9. https://doi.org/10.1038/s41524-023-01088-3 (2023).
9. Park, N. H. et al. Artificial intelligence driven design of catalysts and materials for ring opening polymerization using a domain-specific language. Nat. Commun. 14, 3686 (2023).
10. Zhou, H., Badashah, A., Luo, Z., Liu, F. & Zhao, T. Preparation and property comparison of ortho, meta, and para autocatalytic phthalonitrile compounds with amino group. Polym. Adv. Technol. 22, 1459–1465 (2011).
11. Sazali, N. et al. A short review on polymeric materials concerning degradable polymers. IOP Conf. Ser. Mater. Sci. Eng. 788, 012047 (2020).
12. Wang, Q., Takita, R., Kikuzaki, Y. & Ozawa, F. Palladium-catalyzed dehydrohalogenative polycondensation of 2-bromo-3-hexylthiophene: An efficient approach to head-to-tail poly(3-hexylthiophene). J. Am. Chem. Soc. 132, 11420–11421 (2010).
13. Liu, Y. et al. The effect of molecular isomerism on the barrier properties of polyimides: Perspectives from experiments and simulations. Polymers 13, 1749 (2021).
14. Ohno, M., Hayashi, Y., Zhang, Q., Kaneko, Y. & Yoshida, R. Smipoly: Generation of a synthesizable polymer virtual library using rule-based polymerization reactions. J. Chem. Inf. Model. 63, 5539–5548 (2023).
15. Caddeo, C., Ackermann, J. & Mattoni, A. A theoretical perspective on the thermodynamic stability of polymer blends for solar cells: From experiments to predictive modeling. Sol. RRL 6, 2200172 (2022).
16. Chen, L., Kern, J., Lightstone, J. P. & Ramprasad, R. Data-assisted polymer retrosynthesis planning. Appl. Phys. Rev. 8, 031405 (2021).
17. Kim, S., Schroeder, C. M. & Jackson, N. E. Open macromolecular genome: Generative design of synthetically accessible polymers. ACS Polymers Au. https://doi.org/10.1021/acspolymersau.3c00003 (2023).
18. Weininger, D. SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).
19. Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).
20. Weininger, D. SMILES. 3. DEPICT. graphical depiction of chemical structures. J. Chem. Inf. Model. 30, 237–243 (1990).
21. Lin, T.-S. et al. BigSMILES: A structurally-based line notation for describing macromolecules. ACS Cent. Sci. 5, 1523–1531 (2019).
22. Lin, T.-S. et al. PolyDAT: A generic data schema for polymer characterization. J. Chem. Inf. Model. 61, 1150–1163 (2021).
23. Guo, M. et al. Polygrammar: Grammar for digital polymer representation and generation. Adv. Sci. 9, 2101864 (2022).
24. Mohapatra, S., An, J. & Gómez-Bombarelli, R. Chemistry-informed macromolecule graph representation for similarity computation, unsupervised and supervised learning. Mach. Learn. Sci. Technol. 3, 015028 (2022).
25. Schwaller, P. et al. Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
26. Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci. 11, 3316–3325 (2020).
27. Lowe, D. Chemical reactions from US patents (from 1976 to September 2016). https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873. Accessed: 2022-11-9.
28. Lowe, D. M. Extraction of chemical structures and reactions from the literature. Ph.D. thesis, University of Cambridge (2012).
29. Wilson, N., St John, P. & Crowley, M. m2p (monomers to polymers). Tech. Rep., National Renewable Energy Lab. (NREL), Golden, CO (United States) (2020).
30. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (2017).
31. Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler-Doty, M. & Grzybowski, B. A. Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angew. Chem. Int. Ed. 53, 8108–8112 (2014).
32. Mulliken, R. S. Electronic population analysis on LCAO–MO molecular wave functions. I. J. Chem. Phys. 23, 1833–1840 (1955).
33. Mulliken, R. S. Electronic population analysis on LCAO–MO molecular wave functions. II. Overlap populations, bond orders, and covalent bond energies. J. Chem. Phys. 23, 1841–1846 (1955).
34. Mulliken, R. S. Electronic population analysis on LCAO–MO molecular wave functions. IV. Bonding and antibonding in LCAO and valence-bond theories. J. Chem. Phys. 23, 2343–2346 (1955).
35. Saleh, N. et al. Surface modifications enhance nanoiron transport and NAPL targeting in saturated porous media. Environ. Eng. Sci. 24, 45–57 (2007).
36. Francisco-Vieira, L., Benavides, R., Cuara-Diaz, E. & Morales-Acosta, D. Styrene-co-butyl acrylate copolymers with potential application as membranes in PEM fuel cell. Int. J. Hydrog. Energy 44, 12492–12499 (2019).
37. Concilio, M., Nguyen, N. & Becer, C. R. Oxazoline-methacrylate graft-copolymers with upper critical solution temperature behaviour in yubase oil. Polym. Chem. https://doi.org/10.1039/d1py00534k (2021).
38. Atta, A. M., Brostow, W., Lobland, H. E. H., Hasan, A.-R. M. & Perez, J. M. Porous polymer oil sorbents based on PET fibers with crosslinked copolymer coatings. RSC Adv. 3, 25849 (2013).
39. Chen, X.-P. & Qiu, K.-Y. “Living” radical polymerization of styrene with AIBN/FeCl3/PPh3 initiating system via a reverse atom transfer radical polymerization process. Polymer Int. 49, 1529–1533 (2000).
40. Ogieglo, W., Wormeester, H., Eichhorn, K.-J., Wessling, M. & Benes, N. E. In situ ellipsometry studies on swelling of thin polymer films: A review. Prog. Polym. Sci. 42, 42–78 (2015).
41. Dena, A. S. A., Ali, A. M. & El-Sherbiny, I. M. Surface-imprinted polymers (sips): Advanced materials for bio-recognition. J. Nat. Sci. Publish. Cor (2020).
42. Ibrahim, K. Towards more controlled poly(n-butyl methacrylate) by atom transfer radical polymerization. Eur. Polym. J. 39, 939–944 (2003).
43. SMARTS - a language for describing molecular patterns. https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html.
44. RDKit: open-source cheminformatics. https://www.rdkit.org. https://doi.org/10.5281/zenodo.591637.
45. Fingerprintsimilarity function. https://github.com/rdkit/rdkit-orig/blob/master/rdkit/DataStructs/__init__.py.
46. Rácz, A., Bajusz, D. & Héberger, K. Life beyond the Tanimoto coefficient: similarity measures for interaction fingerprints. J. Cheminform. 10. https://doi.org/10.1186/s13321-018-0302-y (2018).
47. Tanimoto, T. T. Elementary mathematical theory of classification and prediction (International Business Machines Corp., 1958).
48. Szczepanik, D. W. & Mrozek, J. Nucleophilicity index based on atomic natural orbitals. J. Chem. 2013, 1–6 (2013).
49. Barca, G. M. J. et al. Recent developments in the general atomic and molecular electronic structure system. J. Chem. Phys. 152, 154102 (2020).
50. Wilson, N., St John, P. & Crowley, M. Monomers to polymers (m2p) - github. https://github.com/NREL/m2p (2022).
51. Polymerdatabase.com. https://www.polymerdatabase.com/main.html. Accessed: 2023-05-09.
52. Bicerano, J. Prediction of polymer properties (CRC Press, 2002).
53. Klein, G., Kim, Y., Deng, Y., Senellart, J. & Rush, A. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, 67–72 (Association for Computational Linguistics, Vancouver, Canada, 2017). https://doi.org/10.18653/v1/P17-4012.
54. IBM RXN. ONMT adaptation for rxn4chemistry. https://github.com/rxn4chemistry/OpenNMT-py.

Acknowledgements

T. L. acknowledges support from the NCCR Catalysis (grant number 180544), a National Centre of Competence in Research funded by the Swiss National Science Foundation.

Author information

Authors and Affiliations

IBM Research, Av. República do Chile, 330, CEP 20031-170, Rio de Janeiro, Rio de Janeiro, Brazil
Brenda S. Ferrari & Mathias B. Steiner
IBM Research Europe, Säumerstrasse 4, 8803, Rüschlikon, Switzerland
Matteo Manica & Teodoro Laino
IBM Research, Rd J Fco Aguirre Proença Km 9 SP 101, CEP 13186-900, Hortolândia, São Paulo, Brazil
Ronaldo Giro
National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zürich, Switzerland
Teodoro Laino

Contributions

B.S.F. created and curated the polymerization reaction dataset and co-wrote the manuscript. M.M. developed Machine-Learning models and co-wrote the manuscript. R.G. conceived the work and co-wrote the manuscript. T. L. conceived the work and co-wrote the manuscript. M.B.S. conceived the work and co-wrote the manuscript.

Corresponding authors

Correspondence to Ronaldo Giro or Mathias B. Steiner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ferrari, B.S., Manica, M., Giro, R. et al. Predicting polymerization reactions via transfer learning using chemical language models. npj Comput Mater 10, 119 (2024). https://doi.org/10.1038/s41524-024-01304-8

Received: 20 October 2023
Accepted: 25 May 2024
Published: 04 June 2024
DOI: https://doi.org/10.1038/s41524-024-01304-8
Springer Nature Limited
