Introduction

Recent advancements in artificial intelligence (AI) have catalyzed significant progress in cheminformatics, particularly in predicting materials’ properties and discovering materials with optimized properties1,2. Machine learning models are increasingly proving their efficacy as alternatives to predecessors such as quantitative structure-property relationship (QSPR) models, demonstrating a remarkable ability to accurately capture the relationship between materials’ chemical structures and their inherent properties. Their applications extend across various properties of inorganic compounds, such as electronic properties2,3,4, thermodynamic properties5,6, and crystal structures7,8. Deep learning models, in particular, have gained widespread attention for their accurate predictions and effective handling of intricate chemical structures and properties, owing to their flexible and adaptive feature extraction capabilities9.

With the wide range of applications for polymers, from everyday commodities like packaging and clothing to advanced applications in electronics, energy, and drug delivery, it is no surprise that the use of deep learning in polymer science has sparked interest9,10,11. However, the substantial dataset required to optimize network parameters presents a significant hurdle in polymer informatics. Unlike for inorganic compounds, the availability of large, comprehensive, and standardized datasets for polymers is limited, owing to the complexity and diversity of polymer structures, decentralized polymer synthesis and characterization, a lack of standardization in polymer characterization, and privacy and intellectual property considerations12,13.

To address these challenges, alongside efforts to acquire experimental databases, researchers are generating virtual datapoints to supplement experimental data via computational approaches such as density functional theory (DFT), QSPR, or molecular simulations14,15,16. Yet, even with these approaches, if the generated data do not match the actual values and variability of real experimental data within their uncertainty, they can dilute the information derived from the experimental data, leading to a potential decline in the network’s predictive performance. Thus, it is critical to develop robust methodologies capable of handling the intricate challenges presented by polymer informatics while operating effectively even with limited data.

One way to enhance performance is by implementing an efficient feature representation that can encode polymer structures into a format suitable for deep learning. Traditionally, various methods have been used for this purpose. For example, molecular fingerprinting encodes the structure of a molecule into a bit string or count vector of descriptors. Another approach is through string representations like SMILES (Simplified Molecular-Input Line-Entry System), which provides a textual representation of a molecular structure and composition using specific grammatical rules. String-based representations offer distinct advantages by enabling direct correlation between structural elements and properties, thereby facilitating both generative modeling and insightful analysis of the contribution of specific chemical groups to material properties. A more intricate approach utilizing graph neural networks has also been applied to represent molecular structure and connectivity in graphs, with atoms mapped as nodes and bonds as edges. The effectiveness of these representations can heavily depend on the specific task and model being used.
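As an illustration of these three families of representations, the sketch below (not from the paper) featurizes one small molecule in each style, assuming the open-source RDKit toolkit is available:

```python
# Illustrative sketch: three common featurizations of the same small molecule.
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin, as a SMILES string
mol = Chem.MolFromSmiles(smiles)

# 1) Molecular fingerprint: a fixed-length bit vector of substructure descriptors.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# 2) String representation: canonical SMILES encodes atoms and bonds textually.
canonical = Chem.MolToSmiles(mol)

# 3) Graph representation: atoms as nodes, bonds as edges (adjacency matrix).
adjacency = Chem.GetAdjacencyMatrix(mol)

print(fp.GetNumBits(), canonical, adjacency.shape)
```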

However, while these methods are effective for small molecules, they may struggle with the complexities of polymers, which often have bulky side groups and intricate chain architectures with branching and cross-linking. Although efforts have been made to adapt these methods to polymer structures, such as SMILES-based17,18 or graph representations of repeat unit structures9, BigSMILES for crosslinked polymers17,19,20, and polyBERT for automatic fingerprint generation from canonical PSMILES21, they still often lead to overly complicated and lengthy representations for complex polymers, posing challenges in data interpretation and handling. The choice of representation for polymers is therefore crucial to enable effective training of deep learning networks and to reduce the volume of data needed for network optimization.

Drawing parallels with polymer simulations, where mesoscale coarse-grained models are developed and often preferred over detailed atomistic simulations for describing certain polymer characteristics, we propose adopting a coarser-scale representation for deep-learning-based property prediction. The aim is to encapsulate the essential features of polymers rather than detail every atomistic connection. Our HAPPY (Hierarchically Abstracted rePeat unit of PolYmers) is a simplified, subgroup-based string representation. HAPPY assigns distinct characters to common functional groups and ring structures, and represents their connections and chemical linkages with connectors. These connectors stand as independent characters, fully meaningful without requiring any corresponding or complementary characters. This facilitates encoding more direct and condensed information into brief string representations, as discussed subsequently. To validate our approach, we evaluated the performance of deep learning models using the HAPPY representation in predicting key bulk properties of polymer melts, such as dielectric constant, glass transition temperature, thermal conductivity, solubility, and density, and compared its performance with that of SMILES. The robustness of the method against variations in weight initialization and data size was also validated. Furthermore, we propose Topo-HAPPY, an extended version of our representation that incorporates topological details such as covalent bond types and connection sites. This enhancement maintains the condensed nature of HAPPY while enabling diversification in the string representation of different monomers with identical chemical compositions.

Results

HAPPY (Hierarchically Abstracted rePeat unit of PolYmers)

This section outlines HAPPY (Hierarchically Abstracted rePeat unit of PolYmers), our proposed string-based representation that converts the chemical composition and bond structure information of polymers into a machine-readable HAPPY format suitable for deep learning applications. We provide illustrative examples to help elucidate the conversion process.

The key features of HAPPY for creating a condensed representation are coarsened chemical detail and simplified grammatical rules. HAPPY achieves this through a multi-level simplification of polymer repeat unit structures. The initial step involves lower-level abstraction, where groups of atomistic-level structures are abstracted and assigned as constitutional components denoted by alphabetical characters. In contrast to SMILES, which denotes atoms and connectors individually as symbols, HAPPY expands its character set through additional subgrouping. Even though this integration increases the number of characters, the abstraction decreases the overall string length, enabling a more concise representation of polymers while retaining key information. The general principles we follow in defining subgroups are as follows, with Supplementary Table 1 displaying the chemical structures of these defined subgroups and offering a comparative representation of these structures in both HAPPY and SMILES:

  • Cyclic structures are treated as subgroups, and any substructure comprising polycyclic structures is defined as a single subgroup (entries 1–7 in Supplementary Table 1).

  • Common functional groups, such as ethers, alkenes, and amines, are grouped together to form distinct subgroups (entries 8–9 in Supplementary Table 1).

  • Structures that appear frequently in the experimental dataset are assigned their own subgroups (entries 10–11 in Supplementary Table 1).

For example, consider the ‘9-octyl-3-(9-octylcarbazol-3-yl)carbazole’ structure, which contains a carbazole ring. In SMILES notation, this ring is represented as ‘C1=CC=C2C(=C1)C3=CC=CC=C3N2’, using 27 characters to describe the constituent atoms, chemical bond types, and connectivity. SMILES introduces special connectors, such as ‘=’, ‘(’, and ‘)’, to describe the connectivity between atoms. Certain connectors, like parentheses, adhere to syntax rules that make them interdependent. In contrast, HAPPY renders this carbazole group as a single component named ‘$Nbb’. The symbol ‘$’ indicates a character notation pertaining to subgroups. By employing this subgrouping approach, HAPPY alleviates the need for complex descriptors for intricate structures, circumventing the challenge of representing atoms that are close in the actual molecule as distant entities in the string notation. Additionally, hydrogen atoms are disregarded in HAPPY, as is typical in other molecular structure representations. These rules produce more succinct strings that effectively retain the essential information.
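To make the subgrouping concrete, the short sketch below (our illustration, not code from the paper) lists the example tokens mentioned in the text and contrasts the string lengths:

```python
# A minimal sketch of the lower-level abstraction: a lookup table mapping
# chemical substructures to HAPPY subgroup tokens. The token names come from
# the paper's examples; the table itself is an illustrative assumption.
SUBGROUPS = {
    "$B":   "benzene (aromatic) ring",
    "$E":   "ester functional group",
    "$Ct":  "2-methylpropane subgroup",
    "$Nbb": "carbazole (polycyclic) ring system",
}

# The coarsening pays off in string length: the carbazole ring needs a
# 27-character SMILES fragment but a single 4-character HAPPY token.
carbazole_smiles = "C1=CC=C2C(=C1)C3=CC=CC=C3N2"
print(len(carbazole_smiles), "SMILES characters ->", "$Nbb")
```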

Figure 1a illustrates the conversion of a sample polymer repeat unit structure into the HAPPY representation. In the low-level abstraction, specific structures such as the aromatic rings, the ester functional group, and 2-methylpropane are assigned the shorthand notations $B, $E, and $Ct, respectively, based on the predefined subgrouping rules. High-level abstraction divides the repeat unit into mainline and sideline components. The mainline comprises the atoms and subgroups along the shortest path between the head and tail of the repeat unit; any branch connected to the mainline is considered a sideline. This division ensures sequential notation along the mainline path. The final step is string encoding, which proceeds through sideline and mainline encoding. The connectors ‘@’ and ‘#’ are introduced as grammatical characters to indicate one and two connected constituents, respectively. To denote structures with three connected constituents, the grammatical system can be extended with an additional symbol, such as ‘%’. These connectors are self-referencing and eliminate the need for interdependent characters such as the parentheses used in SMILES notation. In the example structure shown in Fig. 1a, sideline [S1] contains an aromatic ring ($B) connected to a methyl group (C) and an ester functional group ($E). This structure is represented in HAPPY as ‘$B#C$E’ using the connector ‘#’. The encoding process employs these connectors in the mainline as well: for each mainline component linked to sideline structures, the presence of the connected sideline is denoted using the ‘@’ and ‘#’ characters, depending on the number of connected sidelines. In the example structure, two carbon atoms are connected to sidelines [S1] and [S2]. The mainline encoding sequentially lists every mainline component from head to tail, resulting in ‘HCCCC@[S1]C$BCC@[S2]CT’. The first sideline [S1] is encoded as ‘$B#C$E’, and the second sideline [S2], which comprises the 2-methylpropane structure, is denoted as ‘$Ct’. Combining the sideline and mainline encodings yields the final HAPPY representation ‘HCCCC@$B#C$EC$BCC@$CtCT’.
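The substitution of sideline encodings into the mainline can be illustrated with a short, hypothetical Python sketch; the function name and structure are ours, but the worked strings follow Fig. 1a:

```python
# Hypothetical sketch of the final string-encoding step in Fig. 1a: sideline
# encodings are spliced into the mainline template at their '[S1]', '[S2]', ...
# markers. Names are ours, not from the paper's code.
def encode_happy(mainline, sidelines):
    """Replace each sideline marker in the mainline with its encoding."""
    for marker, encoding in sidelines.items():
        mainline = mainline.replace(f"[{marker}]", encoding)
    return mainline

# The worked example from Fig. 1a:
mainline = "HCCCC@[S1]C$BCC@[S2]CT"
sidelines = {"S1": "$B#C$E",   # aromatic ring bonded to a methyl and an ester
             "S2": "$Ct"}      # 2-methylpropane subgroup
print(encode_happy(mainline, sidelines))   # -> HCCCC@$B#C$EC$BCC@$CtCT
```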

Fig. 1: Conversion of repeat unit structures into HAPPY and Topo-HAPPY representations.
figure 1

a Schematic illustration of the process of converting polymer repeat unit structures into the HAPPY representation. HAPPY proceeds in three steps: low-level abstraction, which creates subgroups out of distinct chemical structures; high-level abstraction, which divides the repeat unit structure into mainline and sideline; and string encoding, which sequentially denotes all components of the structure, starting from H (head) and ending with T (tail). b Demonstration of how subgroups and elements with identical HAPPY representations are denoted differently based on their connectivity in Topo-HAPPY. c Schematic illustration of the Topo-HAPPY representation, an extension of HAPPY that incorporates different alphabetical characters to differentiate constitutional components with different connectivity.

Topo-HAPPY: a bond-specific extension of HAPPY

The base HAPPY representation significantly reduces the complexity of polymer repeat unit structures by representing a collection of atoms as a single subgroup. However, this abstraction may disregard certain intricacies, such as the precise types and locations of bonds, which have significant implications for a range of polymer properties. To address these limitations, we enhanced the HAPPY representation and introduced Topological HAPPY (Topo-HAPPY). Topo-HAPPY includes structural information differentiating types of covalent bonds and identifying connection sites between components. The HAPPY representation denotes an aromatic ring as $B, regardless of its connectivity with other constituents. As shown in Fig. 1b, Topo-HAPPY further differentiates it depending on the number and positions of bonds to neighboring subgroups, as $B, $Bp, $Bm, or $Bpy. All extended topological subgroups are listed in Supplementary Table 1. A similar extension distinguishes chemical bond types. For instance, in the HAPPY representation, an oxygen atom connected to a carbon atom is annotated as C@O; in Topo-HAPPY, if the connection is via a double bond, it is denoted as C@Od. The encoding process for repeat unit structures in Topo-HAPPY largely follows the same procedure as in HAPPY. Figure 1c depicts the Topo-HAPPY representation of the same example structure used to illustrate the HAPPY process: the two aromatic rings are distinguished based on their connectivity, represented as $Bpy and $Bp. The potential of Topo-HAPPY is further showcased in the upcoming results sections, where we compare the R-squared scores for predicting solubility and glass transition temperature using SMILES, HAPPY, and Topo-HAPPY.
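The token selection in Topo-HAPPY can be sketched as follows; the token variants ($B/$Bp/$Bm/$Bpy, the ‘d’ suffix in C@Od) come from the text, while the lookup logic and the meaning assigned to each variant here are our illustrative assumptions:

```python
# Hedged sketch of Topo-HAPPY token diversification; not the paper's code.
AROMATIC_VARIANTS = {
    "default": "$B",    # connectivity-agnostic base token (plain HAPPY)
    "para":    "$Bp",   # assumed: neighbors at para positions
    "meta":    "$Bm",   # assumed: neighbors at meta positions
    "py":      "$Bpy",  # assumed: a further bonding-pattern variant
}

def ring_token(bonding_pattern="default"):
    """Pick the connectivity-specific aromatic-ring token."""
    return AROMATIC_VARIANTS.get(bonding_pattern, "$B")

def connect(a, b, double_bond=False):
    """Join two constituents with '@', tagging double bonds with 'd'."""
    return f"{a}@{b}{'d' if double_bond else ''}"

print(ring_token("para"))        # -> $Bp
print(connect("C", "O", True))   # -> C@Od
```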

Dielectric constant prediction with limited experimental data

Experimental dielectric constant values for 56 different polymer repeat unit structures were retrieved from published data22, notably from Bicerano23. The dataset covers a broad range of dielectric constants, from 2.1 to 4.5. To ensure an unbiased evaluation of the network’s performance, we carefully selected a small subset of data points for testing based on dielectric constant values and the length of the corresponding SMILES representations. Figure 2a shows where the 11 samples of the selected test set are located on the distribution curves of the normalized dielectric constant and the SMILES string length for the entire dataset, as indicated by the colored dots at the bottom of the figure. The testing subset encompasses a wide range of character counts and dielectric constant values, enabling a thorough evaluation of property prediction performance.

Fig. 2: Distribution of dataset and testing of dielectric constant prediction.
figure 2

a Normalized distribution of dielectric constant values and character count of the SMILES representation of the repeating unit for the 45 training experimental data points. The bottom lines indicate the positions of the 11 data points from the test dataset within the distribution plots, marked with diamond symbols corresponding to each molecular structure. b Repeating unit structures of the test set in the leftmost column, with a comparison between true-labeled dielectric constant values (marked in green) and network-predicted values using HAPPY (marked in red) and SMILES (marked in blue). c Eleven subgroups were defined from the repeat unit structures in the experimental dataset for the HAPPY representation. The first column displays the specific structure of each subgroup, whereas the next column presents the total counts of each subgroup appearing in the training and testing datasets.

Figure 2b shows the predicted dielectric constant values for the test set using an RNN-based deep learning network taking either HAPPY or SMILES as input. The prediction error for each polymer is displayed by placing the predicted and true-labeled experimental values along a single axis. The figure demonstrates that the values predicted by the network trained with the HAPPY representation are closer to the experimental values than those predicted by the network taking the SMILES representation as input. To quantitatively evaluate the two networks, we measured their R-squared scores and training speeds. The R-squared score for the HAPPY network was 81.21%, notably higher than the SMILES network’s 69.48%. To confirm these findings, we conducted a 5-fold validation on this dataset, resulting in R-squared scores of 74.04% for HAPPY and 54.30% for SMILES (Supplementary Fig. 1). Furthermore, the condensed HAPPY representation enabled faster training, at 0.0358 s/epoch, approximately a two-fold improvement over the SMILES representation.

Figure 2c shows the occurrence statistics of the eleven subgroups defined for the HAPPY representation in the experimental dataset, analyzed separately for the training and testing datasets. Owing to the limited size of the dataset, some subgroups showed very low occurrence in the test set. Notably, one subgroup, marked in red, was present in the test set but not in the training set. The prediction error for polyetherimide (PEI), which contains this subgroup, was relatively high at 11%, as indicated by the shaded structure in Fig. 2b, prompting us to augment the training data in the next section.

Dielectric constant prediction with augmented training dataset

To inflate the training set, we generated 200 additional structures through arbitrary combinations of constituent elements from the original dataset, as detailed in the Methods section. The physical properties of these generated structures were estimated using the Synthia QSPR scheme. This resulted in a 10.73-fold increase in the occurrence counts of the eleven subgroups in the training set, as shown in Fig. 3a. Importantly, the fourth subgroup, absent from the original experiment-only training set, now appears 43 times in the augmented training dataset (shaded in yellow). The augmentation resulted in a significant improvement in the network’s predictive performance (Fig. 3b). Both the SMILES- and HAPPY-based networks achieved an approximately 6–7% improvement in prediction R-squared score, with the HAPPY-based network reaching a high R-squared score of 89.29% and the SMILES-based network reaching 75.73%. However, despite the increase in dataset size, the SMILES-based network trained with the augmented dataset could not outperform the HAPPY-based network (81.21% R-squared score) trained with only 45 experimental data points. This highlights the efficacy of the HAPPY representation in capturing the essential features of polymer structures.

Fig. 3: Comparison of network performance and training time for HAPPY and SMILES representation.
figure 3

a Frequency counts of subgroups in the augmented training dataset compared to the experimental training dataset, highlighting the increase in the occurrence of subgroups. b Improvement in prediction performance (R-squared score) for dielectric constant values after dataset augmentation, demonstrating the increase in the predictive capability of the networks. c Training time per epoch for networks trained with SMILES and HAPPY, with and without data augmentation.

Figure 3c illustrates the time required for the network to complete one epoch with and without data augmentation. As expected, augmenting the dataset increases the training time for both the HAPPY- and SMILES-based networks. The training time for the SMILES-based network with the augmented dataset was 2.8 times longer than with the experimental dataset alone, whereas the increase for the HAPPY-based network was far less pronounced. Figure 3c also shows that training a network on the augmented dataset using HAPPY requires approximately the same time as training the SMILES-based network on the experimental dataset, yet the former achieved a 20% higher R-squared score.

Robustness and versatility of HAPPY-based property prediction network

To further assess the potential and robustness of the HAPPY-based network, we extended our analysis by using the full set of 56 experimental data points as a test set, beyond the 11 used in the previous section. Both the HAPPY- and SMILES-based networks were trained exclusively on the numerically generated datapoints, and the number of training datapoints was gradually increased up to a final size of 400. The networks were trained five times with different initial weight values, and the resulting R-squared scores were averaged to obtain the final performance metrics. Despite being trained solely on computationally generated data, the HAPPY-based network achieved a maximum averaged R-squared score of 72.8%, as shown in Fig. 4a. When the training dataset was relatively small, i.e., below 200 points, the performance of both HAPPY- and SMILES-based networks could be inconsistent from one training trial to the next, resulting in fluctuations in their performance metrics. However, once a sufficient amount of training data was secured, HAPPY-based networks consistently outperformed SMILES-based networks, suggesting that the HAPPY representation captures the key features of molecular structure better than SMILES. The robustness and generalizability of the HAPPY-based network were also highlighted by the narrow standard deviation of the R-squared scores, shown as the shaded area in Fig. 4a. The relatively small average standard deviation of the R-squared score across all training-set sizes for HAPPY (0.038) indicated that the HAPPY-based network produces reliable predictions with high consistency, whereas the SMILES-based network showed an average standard deviation of 0.090.
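For concreteness, this evaluation protocol can be sketched as follows; the data arrays (x_train, y_train, x_test, y_test) are placeholders for the one-hot-encoded inputs and labels described in Methods, and the model is a deliberately simplified stand-in for the full architecture:

```python
# Minimal sketch (our reconstruction, not the paper's code): train the same
# architecture five times with different initial weights and report the mean
# and spread of the test R-squared score.
import numpy as np
import tensorflow as tf
from sklearn.metrics import r2_score

def build_model(input_shape):
    # Simplified stand-in for the two-layer LSTM network in Methods.
    return tf.keras.Sequential([
        tf.keras.layers.LSTM(128, activation="tanh", input_shape=input_shape),
        tf.keras.layers.Dense(1),
    ])

scores = []
for seed in range(5):
    tf.keras.utils.set_random_seed(seed)        # vary weight initialization
    model = build_model(x_train.shape[1:])
    model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(1e-5))
    # epochs shortened here; the paper trains far longer with early stopping
    model.fit(x_train, y_train, batch_size=32, epochs=300, verbose=0)
    scores.append(r2_score(y_test, model.predict(x_test).ravel()))

print(f"R2 = {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```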

Fig. 4: Performance evaluation of networks based on HAPPY and SMILES representations on various polymer properties.
figure 4

a Averaged R-squared scores for dielectric constant prediction when 100% of the experimental dataset is tested while training is conducted solely on the generated dataset. The graph is plotted as a function of the number of training data points, with shaded areas representing the standard deviation over five networks trained with different initial weight values. b–d Scatter plots showing the prediction performance for thermal conductivity, density, and solubility, respectively. e–g R-squared scores for each property prediction (thermal conductivity, density, and solubility) of networks trained with 400 generated datapoints using HAPPY and SMILES representations.

Additionally, we sought to demonstrate the versatility of the HAPPY representation by expanding the property prediction task to three additional properties: thermal conductivity, density, and solubility. For this analysis, we employed the same 400 training and 56 testing data points as in Fig. 4a. However, in contrast to the previous result, which relied on experimental dielectric constant values for the test dataset, we calculated the physical properties of both the training and testing data using the Synthia QSPR module implemented in Materials Studio 2022. Figure 4b–d depicts the scatter plots of prediction performance for thermal conductivity, density, and solubility on the 56 testing repeat units. The corresponding R-squared scores are presented in Fig. 4e–g.

The prediction performance of the HAPPY-based network was superior to that of SMILES for all physical properties. In particular, for thermal conductivity, as illustrated in Fig. 4b, the HAPPY-based network’s predictions are tightly clustered around the line of best fit, while the SMILES-based network’s predictions are more widely distributed. The R-squared scores in Fig. 4e confirm that HAPPY achieved a substantially higher score of 0.84, whereas SMILES scored only 0.49. For density, networks based on both representations produced predictions close to the line of best fit, as shown in Fig. 4c, f; however, HAPPY still outperformed SMILES in terms of R-squared score.

In the case of solubility (Fig. 4d, g), the network using HAPPY performed marginally better than the one using SMILES, but neither demonstrated satisfactory performance. A closer examination of the repeat unit structures in the test dataset revealed that the main source of HAPPY’s poor performance could be attributed to its limited treatment of oxygen atoms. With the SMILES representation, the errors for structures containing oxygen atoms averaged 1.54 \(\mathrm{J\,cm^{-3}}\), 15% higher than for structures without oxygen atoms (1.33 \(\mathrm{J\,cm^{-3}}\)). With the HAPPY representation, however, the errors for oxygen-containing structures averaged 1.78 \(\mathrm{J\,cm^{-3}}\), 104% higher than for structures without oxygen atoms (0.87 \(\mathrm{J\,cm^{-3}}\)). Since HAPPY encodes the structure by accounting for constituent elements other than hydrogen and disregards the connectivity between them, it cannot differentiate between -OH and =O groups. The dipole moments of -OH and =O groups differ significantly, influencing dipole-dipole interactions and, in turn, the cohesive energy of the substance. Since cohesive energy plays a crucial role in determining solubility, the inability of the HAPPY representation to distinguish between -OH and =O can seriously impact solubility prediction. This suggests that an expanded HAPPY representation incorporating additional chemical or topological details could improve prediction performance, especially where this added information is highly correlated with the physical property, as in the case of solubility. Therefore, in the latter part of this section, we present results using Topo-HAPPY.
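The oxygen-related error breakdown above can be reproduced with a short sketch; the function and variable names are ours, and the data arrays are placeholders:

```python
# Sketch of the error analysis: mean absolute solubility error for test
# structures with and without oxygen atoms, using the SMILES strings as a
# convenient oxygen indicator.
import numpy as np

def mean_error_by_oxygen(smiles_list, y_true, y_pred):
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    has_o = np.array(["O" in s or "o" in s for s in smiles_list])
    return errors[has_o].mean(), errors[~has_o].mean()

# e.g. with the HAPPY-based predictions this yielded 1.78 vs 0.87 J cm^-3.
```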

Topological HAPPY implementation and evaluation

Topological HAPPY (Topo-HAPPY) supplements the relationship between the constituent elements of subgroups with topological details, distinguishing them based on covalent bond types and the positions of connection sites. This enhanced connectivity information allows further diversification in the string representation of different monomers with the same chemical composition. However, the expanded set of constituents within Topo-HAPPY adds complexity in optimizing the weights of the larger deep learning networks. We assessed the effectiveness of Topo-HAPPY by comparing its performance with HAPPY in predicting thermal conductivity, density, and solubility, using the R-squared score as the metric (Fig. 5a–c). Applying Topo-HAPPY did not degrade network performance for thermal conductivity and density. This stability could be attributed to the added detail being highly correlated with the property being predicted, and to the enhanced consistency between the constituents and target properties. The benefits of the Topo-HAPPY representation were particularly evident in solubility prediction, as evidenced by a 25% increase in the R-squared score, from 0.492 with HAPPY to 0.615 with the Topo-HAPPY-based network (Fig. 5c).

Fig. 5: Performance analysis on networks trained with Topo-HAPPY (yellow), an extended version of HAPPY that supplements connectivity information within the repeat unit structure, compared with HAPPY (red) and SMILES (blue).
figure 5

a–c R-squared scores for each property prediction (thermal conductivity, density, and solubility) of networks trained with 400 generated datapoints using Topo-HAPPY, HAPPY, and SMILES representations. d, e Scatter plots for solubility prediction are shown for HAPPY and Topo-HAPPY, highlighting outliers (marked in gray) in the test dataset. f Detailed comparison of outliers in solubility prediction, showcasing the lower error achieved by Topo-HAPPY. g Comparison of R-squared scores for glass transition temperature (Tg) prediction using the three different representations. h–j Scatter plots illustrating outliers in glass transition temperature (Tg) prediction using SMILES, HAPPY, and Topo-HAPPY, respectively, highlighting outliers (marked in gray) in the test dataset.

We further examined the impact of Topo-HAPPY on solubility prediction, focusing on datapoints with a prediction error larger than 2.071 \(\mathrm{J\,cm^{-3}}\). This outlier threshold was set at 10% of the average solubility of all 56 answer labels in the test dataset. The scatter plots in Fig. 5d, e illustrate the prediction performance of HAPPY and Topo-HAPPY, respectively, on the test dataset. The HAPPY-based prediction showed seven outliers (shaded in gray in Fig. 5d), six of which contained oxygen atoms. By contrast, Topo-HAPPY (Fig. 5e) yielded five outliers, four of which overlapped with those found using the HAPPY-based network. Over its outliers, HAPPY had an average error of 3.499 \(\mathrm{J\,cm^{-3}}\), whereas Topo-HAPPY demonstrated a 17.02% lower error. Figure 5f compares the eight outliers identified by either network with the corresponding Topo-HAPPY predictions (yellow). Particularly for molecules containing oxygen atoms, Topo-HAPPY’s solubility predictions closely matched the answer labels (green), underscoring the improved R-squared score achieved by Topo-HAPPY in solubility prediction.
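Written out explicitly, the outlier criterion is a simple threshold on absolute error; in this sketch, y_test and y_pred are placeholder arrays of the 56 test labels and predictions:

```python
# The outlier criterion above as code (our sketch): a test point is an
# outlier if its absolute error exceeds 10% of the mean of all test labels.
import numpy as np

threshold = 0.10 * np.mean(y_test)              # = 2.071 J cm^-3 here
outliers = np.abs(y_test - y_pred) > threshold
print(int(outliers.sum()), "outliers")
```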

To corroborate the performance of Topo-HAPPY, we extended our prediction task to glass transition temperature (Tg). We utilized an experimental dataset of 214 polyimide structures obtained from Volgin et al., with 40 structures allocated for testing and the rest for training. For a fair comparison, the SMILES notation applied in all instances of Tg prediction is a version of SMILES that also expresses the location of the polymerization point with [*]. As depicted in Fig. 5g, the R-squared scores showed that while SMILES and HAPPY performed similarly, Topo-HAPPY outperformed both. The scatter plots (Fig. 5h–j) display outliers under a more stringent criterion of 27.66 K (5% of the average Tg in the test dataset, 553.21 K). Under this criterion, 15 outliers were identified for SMILES, 11 for HAPPY, and only 5 for Topo-HAPPY. Topo-HAPPY achieved an R-squared score of 0.86, surpassing HAPPY (0.717) and the SMILES-based network (0.73), demonstrating its effective enhancement of prediction performance, especially for Tg and solubility.

To further substantiate the efficacy of HAPPY and Topo-HAPPY, and to enable a comprehensive comparison with other machine learning algorithms, we expanded our initial polyimide dataset by integrating Tg values for an additional 312 polymer repeat unit structures obtained from the published work of Bicerano. This increased our dataset to a total of 527 data points. Using this enhanced dataset, we conducted a thorough evaluation of our Tg prediction network’s performance. Detailed findings regarding performance, training, and evaluation time are presented in Supplementary Fig. 2.

Discussion

Our study proposed HAPPY, a meaningful stride toward an efficient polymer representation for deep learning applications in polymer informatics. In comparison with the conventional atomistic string representation, SMILES, HAPPY consistently delivers superior performance in predicting various polymer properties, including dielectric constant, density, thermal conductivity, solubility, and glass transition temperature. Additionally, the coarse-grained subgrouping in HAPPY facilitates assessing the impact of distinct chemical groups on physical properties, as illustrated in Supplementary Fig. 4 through SHAP (SHapley Additive exPlanations) analysis24,25 of Tg predictions using the expanded dataset. The SHAP analysis sheds light on the specific contribution of each subgroup to Tg, revealing trends that align with actual experimental observations.

Despite its satisfactory performance, the current iteration of the HAPPY-based network still has areas for improvement. The primary limitation is that properties were predicted solely from repeat unit structures, as in most previous studies9,10,26,27. The sequence and arrangement of these repeat units can significantly affect polymer properties. Therefore, the development of deep learning algorithms capable of distinguishing different repeat unit sequences and chain architectures is paramount for advancing polymer informatics.

Recognizing this limitation, we reviewed recent studies to identify promising developments in this field. One notable study by Yan et al. utilized fingerprinting derived from BigSMILES in a generative model to discover thermoset shape-memory polymers. They employed Variational Auto-Encoders (VAEs) and a weighted combination network to address copolymer molar ratios, and applied transfer learning techniques, fine-tuning a model pretrained on a vast small-molecule dataset with a smaller set of experimental data19,20,28. Another alternative is the TransPolymer approach, which predicted copolymer properties (crystallization tendency, refractive index, power conversion efficiency, etc.) by combining the SMILES of each constituent repeating unit along with their ratios and arrangements. This representation incorporated a tokenizer separating SMILES strings from additional molecular descriptors, such as the ratio of repeat units, molecular weight, and glass transition temperature, thus enhancing prediction performance18. However, enhancing string representations by appending additional molecular descriptors, as in TransPolymer and BigSMILES, enlarges the input dimension and would not be optimal for more complex structures, as it obscures atom placements and fails to capture crucial structural details. HAPPY addresses the challenges associated with sequencing ring structures by abstracting them into subgroups represented by alphabetical characters. However, certain difficulties remain in representing branching structures as strings, particularly when a sideline structure needs to be placed adjacent to the mainline structure.

To address this, we contemplate the potential adoption of graph representations in our methodology. A graph representation expresses molecular structures more naturally by mapping atoms (objects) as nodes and bonds (connections) as edges18,29,30. Recent studies have further enhanced graph representations by encoding geometric contributions of a molecule, such as pairwise distances, bond angles, and bond torsions, in the embedded feature vector9,10. In alignment with this advancement, more specialized techniques are being developed to improve graph representations. For instance, Patel et al. coarse-grained the representation so that each node contained fingerprint embeddings of a repeat unit within the polymer, with edges indicating how repeat units are topologically connected31. Additionally, the introduction of stochastic edges by Aldeghi et al., as a set of bonds weighted by the probability of monomeric connections, enabled the representation of the recurrent nature of polymer chains and the ensemble of possible topologies32. Building upon these insights, we suggest integrating HAPPY with graph representations. We believe this combination could further improve property prediction by taking advantage of HAPPY’s simple and coarse-grained nature while incorporating topological information. This proposition is supported by our findings with the extended version, Topo-HAPPY, which demonstrated improved prediction for properties such as glass transition temperature and solubility.

Lastly, looking toward future prospects, the advantageous combination of HAPPY with generative models deserves attention. Generative models have the potential to propose polymer structures with desired properties by learning from known polymers. However, a significant challenge is the generation of invalid SMILES strings or molecules, with defects such as missing brackets, incorrect bond symbols, and inconsistent atom valence33. The exclusion of interdependent grammatical characters makes HAPPY a robust representation that can generate structures with a high percentage of validity. Validity refers to adherence to the specific grammar rules of the chosen format. For instance, in SMILES strings, parentheses must be correctly paired, and numeric notations are used to denote ring closures; an unpaired parenthesis or a lone numeric indicator renders a SMILES string invalid. Validity in HAPPY involves, for example, adherence to the designated ‘Head’ (H) and ‘Tail’ (T) markers signifying the start and end of a polymer chain; strings that do not follow these formats, such as those with consecutive connector markers, are considered invalid. We confirmed this advantage by training a Variational Auto-Encoder (VAE) network to generate both SMILES and HAPPY strings. We trained the VAE on a mix of approximately 400 molecular structures from data augmentation and about 60 experimental data points, and sampled possible repeat unit structures from the model. The VAE model was implemented using PyTorch34, and the network structure was adapted from that of Gomez-Bombarelli et al.33; detailed model parameters and the scheme are provided in Supplementary Fig. 3. Out of 128 generated SMILES structures, only 27 were valid; when generating an equal number of structures using HAPPY, we obtained 113 valid structures. Overall, HAPPY significantly improved the percentage of valid structures, increasing it from a mere 21.09% with SMILES to an impressive 88.29%. Moreover, network evaluations using HAPPY demonstrate significant speed advantages: as shown in Supplementary Fig. 2 of our extended Tg prediction test, HAPPY-based network evaluations are 15.58% faster than those using SMILES, highlighting the efficiency and suitability of HAPPY for polymer design applications.
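A minimal validity check for HAPPY strings, based on the rules stated above, might look like the following sketch; the paper’s actual rule set may be more extensive:

```python
# Hedged sketch of a HAPPY validity check: the string must start with 'H' and
# end with 'T', and connector characters ('@', '#', '%') may not be
# consecutive or dangling.
def is_valid_happy(s):
    if not (s.startswith("H") and s.endswith("T")):
        return False                  # missing Head/Tail markers
    connectors = set("@#%")
    for i, ch in enumerate(s):
        if ch in connectors:
            nxt = s[i + 1] if i + 1 < len(s) else ""
            if nxt in connectors or nxt == "":
                return False          # consecutive or trailing connector
    return True

print(is_valid_happy("HCCCC@$B#C$EC$BCC@$CtCT"))   # -> True
print(is_valid_happy("HCC@@CT"))                   # -> False (double connector)
```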

Overall, our study affirms the efficacy of HAPPY as an efficient representation for polymer informatics in deep learning applications. While it outperforms the traditional representation in predicting various polymer properties, there remains room for further improvement. Future work will focus on integrating HAPPY with other methods of representation, such as graph representation, and on exploring its utility in generative models.

In conclusion, we introduced HAPPY (Hierarchically Abstracted rePeat unit of PolYmers), a string representation specifically designed for polymers. HAPPY leverages a hierarchical abstraction approach that accurately and efficiently transforms intricate polymer repeat units into a format compatible with machine learning applications. Dielectric constant prediction was performed with only 56 experimental datapoints, and despite the limited data, satisfactory performance was demonstrated, showcasing the promise of the HAPPY representation. This efficiency becomes increasingly evident as the volume of training data grows, demonstrating HAPPY’s scalability and suggesting its potential to handle even more complex tasks as the quantity and diversity of available data expand. Notably, one of the key strengths of the HAPPY-based network is its robustness, as evidenced by the minimal variance in R-squared scores, which underscores its generalizability across different datasets. Furthermore, we enhanced HAPPY’s capabilities by introducing Topo-HAPPY, an extension that incorporates additional connectivity information. This strategic inclusion further refines the predictive power of the representation, bolstering its application to physical property prediction in polymer informatics.

Methods

Dataset acquisition

For the construction of the neural network utilized in dielectric constant prediction, we employed an experimental dataset of 56 homopolymers with various repeat unit structures, listed in Supplementary Table 2. These experimental values were retrieved from published data22, notably from Bicerano’s work23 which informed the empirical Quantitative Structure-Property Relationship (QSPR) scheme successfully implemented in the Synthia module of commercial Materials Studio 2022 software35. The dataset includes room temperature (298 K) dielectric constant values for each homopolymer with a molecular weight of 500,000 amu. It covers a broad range of repeat unit structures and dielectric constant values, ranging from 2.1 to 4.5. We partitioned the dataset into 11 datapoints for testing and allocated the remaining 45 datapoints for training the network.

For the glass transition temperature, we utilized a dataset of 214 polyimide structures with various substituent groups, as collated by Volgin et al.14. The polyimides’ repeat unit structures are composed of atoms including C, H, O, N, F, S, and Cl. The dataset includes 607 Tg values associated with the 214 repeat unit structures; owing to variation in the experimental methods used to measure Tg, multiple values may be reported for the same repeat unit structure. We accounted for this variability by averaging the values across all conditions to obtain a single value per structure. For the testing dataset, we selected 40 datapoints, while the remaining datapoints were used for training the network.
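The label-averaging step can be sketched as follows, assuming the raw measurements are tabulated with one row per reported Tg value; the file name and column names here are hypothetical:

```python
# Sketch of the label-averaging and splitting steps (our reconstruction):
# multiple Tg values reported for the same repeat unit are collapsed to
# their mean before the train/test split.
import pandas as pd

# df has columns ["repeat_unit", "Tg"]: 607 rows over 214 unique structures.
df = pd.read_csv("polyimide_tg.csv")              # hypothetical file name
labels = df.groupby("repeat_unit")["Tg"].mean()   # one value per structure
test = labels.sample(n=40, random_state=0)        # held-out test set
train = labels.drop(test.index)
```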

Given the limited availability of open-source experimental datasets for polymer properties, we generated an additional dataset by constructing arbitrary polymer repeat unit structures, as outlined in Fig. 6. The physical properties of these structures were estimated using the Synthia QSPR scheme. To ensure relevance and prediction performance, we restricted the construction process to the subgroups present in the experimental dataset, as shown in Supplementary Table 1 and Fig. 6b. This approach allows us to emulate the structural diversity observed in the experimental dataset while generating additional polymer structures for training the network. We created these structures by assembling constituent elements of subgroups and atoms to form mainline structures, adhering to the rules established in the HAPPY representation. Figure 6c–f showcases the process of connecting sideline structures to one of the generated mainline structures, represented as “HCCCCC$BCCCT”, with a maximum of two sideline connections allowed per component. The third and sixth carbon atoms (Fig. 6d, marked in dark blue) are selected to branch out into sideline connections, resulting in the representation “HCCC@[S]CC$BC@[S]CCT”. We then attached randomly chosen atoms or subgroups from Fig. 6b to these mainline characters. For instance, Fig. 6e shows the case where the third and sixth carbons in the mainline are connected to an aromatic ring (‘$B’) and a 2-methylpropane subgroup (‘$Ct’), respectively, at the first-level connection. The same random selection process is used to choose additional subgroups or atoms for the second-level connection. Figure 6e shows the sideline subgroup $B (marked in dark red) connecting with two components, denoted as “HCCC@$B#[S][S]CC$BC@$CtCCT”. In Fig. 6f, the second-level connections to the sideline subgroup $B are a methyl group (‘C’) and an ester subgroup (‘$E’), resulting in the final HAPPY representation “HCCC@$B#C$ECC$BC@$CtCCT”. Sideline generation ceased at the second level at most, mirroring the majority of our experimental dataset.
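A simplified sketch of this generation procedure is given below; the subgroup pool, the branching probability, and the function names are our illustrative assumptions rather than the paper’s code:

```python
# Simplified sketch of the augmentation procedure: build a random mainline,
# pick components to branch, and attach randomly chosen first- and
# second-level sideline subgroups drawn from the allowed pool.
import random

POOL = ["C", "$B", "$E", "$Ct"]   # subset of the allowed subgroups

def random_repeat_unit(n_main=8, n_sidelines=2):
    mainline = ["H"] + random.choices(POOL, k=n_main) + ["T"]
    # choose mainline positions (excluding H and T) to branch out
    for pos in random.sample(range(1, n_main + 1), k=n_sidelines):
        sideline = random.choice(POOL)            # first-level connection
        if random.random() < 0.5:                 # optional second level
            sideline += "#" + random.choice(POOL) + random.choice(POOL)
        mainline[pos] = mainline[pos] + "@" + sideline
    return "".join(mainline)

print(random_repeat_unit())   # e.g. 'HC$BC@$B#C$ECC$CtC@$CtCCT'
```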

Fig. 6: Illustration of the process used to generate polymer repeat unit structure.
figure 6

a Several constructed mainline structures showcasing diverse combinations and counts of components. b List of subgroups for the random selection process used to determine the first- and second-level connections. c An example of a generated mainline structure. d Identification of mainline components (third and sixth carbon atoms, marked in dark blue) that are selected to branch out into sideline connections. e Representation of the first-level connection; the third and sixth carbons in the mainline link to an aromatic ring (‘$B’) and a 2-methylpropane subgroup (‘$Ct’). Subgroup $B in the sideline (marked in dark red) is designated to form second-level connections. f Illustration of the second-level connections, where the sideline subgroup $B is connected to both a methyl group (‘C’) and an ester subgroup (‘$E’).

We generated 200 repeat unit structures to augment the training dataset for dielectric constant prediction. We then increased the number of generated structures to 400 and computed the additional bulk properties of dielectric constant, thermal conductivity, density, and solubility. This expanded dataset served as the training set, while all the experimental data were reserved for testing. The same set of 400 generated structures was also employed as the training dataset for predicting thermal conductivity, density, and solubility. In addition, we utilized the 56 repeat unit structures from the experimental dataset and calculated their properties using Synthia to procure additional data points, thereby enhancing the training and prediction performance of the network. Notably, the errors in the properties predicted by Synthia are quantified as follows: the standard deviation is 0.0871 for the dielectric constant, 0.0174 \(\mathrm{W/(K\cdot m)}\) for thermal conductivity, and 0.0354 \(\mathrm{g/cm^{3}}\) for density.

Network architecture and training

The deep learning network was implemented using Python3 and the TensorFlow36 library. The input strings of characters representing the repeat unit structures in the dataset, derived from HAPPY (or Topo-HAPPY) or SMILES, were transformed into binary matrix representations using one-hot encoding. The input matrices follow the [j, k, l] format, where j is the total number of repeat unit structures in the dataset, k is the number of distinct constituent elements (subgroups, atoms, connectors, etc.) for each representation, and l is the length of the longest string-encoded repeat unit structure in the dataset. To ensure consistent dimensionality across all polymer structures, zero post-padding was applied so that each structure conformed to the same length. For example, in the dielectric constant prediction with the experimental dataset alone, the dimensions of the input matrices for HAPPY, Topo-HAPPY, and SMILES were [56, 24, 24], [56, 35, 35], and [56, 29, 84], respectively. Similarly, for the Tg prediction task, the dimensions were [214, 19, 20] for the HAPPY dataset, [214, 29, 29] for the Topo-HAPPY dataset, and [214, 27, 131] for the SMILES dataset.
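The input preparation can be sketched as follows; for brevity the tokenizer below is character-level, whereas multi-character subgroup tokens such as ‘$B’ would be treated as single constituents in practice:

```python
# Sketch of the one-hot encoding described above: each string is encoded over
# the vocabulary of distinct tokens and zero post-padded to the longest
# string, giving a [j, k, l] tensor.
import numpy as np

def one_hot_encode(strings):
    vocab = sorted({ch for s in strings for ch in s})   # k distinct tokens
    index = {ch: i for i, ch in enumerate(vocab)}
    max_len = max(len(s) for s in strings)              # l, longest string
    x = np.zeros((len(strings), len(vocab), max_len))   # [j, k, l]
    for j, s in enumerate(strings):
        for pos, ch in enumerate(s):
            x[j, index[ch], pos] = 1.0    # zero post-padding is implicit
    return x

x = one_hot_encode(["HCCCC@$B#C$EC$BCC@$CtCT", "HC$BCT"])
print(x.shape)
```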

These vectorized strings served as inputs to the neural network, which consisted of two LSTM layers, each with 128 units, activated by the hyperbolic tangent (tanh) function. To mitigate the risk of overfitting, a dropout layer with a rate of 20% was applied after the LSTM layers. The output sequences from the first LSTM layer were fed into the second LSTM layer, which returned a single output. This output was then passed through four fully connected dense layers with 64, 32, 16, and 8 units, respectively, all activated by rectified linear units (ReLU). The final output layer consisted of a single unit without an activation function. The network was compiled using mean squared error (MSE) as the loss function and the Adam optimizer with a learning rate of 1e−5. During training, the dataset was shuffled with a buffer size of 1024 and partitioned into batches of 32. Training was conducted for up to 30,000 epochs, supplemented by an early stopping mechanism to prevent overtraining and improve generalization.
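A Keras sketch reconstructing this architecture is shown below; the layer sizes and hyperparameters are taken from the text, while the code itself (and the early-stopping patience) is our reconstruction:

```python
# Sketch of the described architecture: two LSTM layers, dropout, four dense
# layers, and a linear output unit, compiled with MSE and Adam (lr = 1e-5).
import tensorflow as tf

def build_network(k, l):
    # input: one-hot matrix of shape [k, l] per repeat unit structure
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(k, l)),
        tf.keras.layers.LSTM(128, activation="tanh", return_sequences=True),
        tf.keras.layers.LSTM(128, activation="tanh"),   # returns single output
        tf.keras.layers.Dropout(0.2),                   # 20% regularization
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1),                       # linear output
    ])
    model.compile(loss="mse",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5))
    return model

model = build_network(k=24, l=24)   # e.g. the HAPPY dielectric-constant input
model.summary()

# Training per the text (x_train, y_train are placeholders; the EarlyStopping
# patience is our assumption):
# ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(1024).batch(32)
# model.fit(ds, epochs=30000,
#           callbacks=[tf.keras.callbacks.EarlyStopping(monitor="loss", patience=100)])
```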