Modeling glass transition temperatures of epoxy systems: a machine learning study

The use of machine learning (ML) models to screen new materials is becoming increasingly common as they accelerate material discovery and increase sustainability. In this work, the chemical structures of 16 epoxy resins and 19 curing agents were used to build an ML ensemble model to predict the glass transition (Tg\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T_g$$\end{document}) of 94 experimentally known thermosets. More than 1400 molecular descriptors were calculated for each molecule, of which 119 were chosen based on feature selection performed by principal component analysis. The quality of the trained model was evaluated using leave-one-out cross-validation, which yielded a mean absolute error of 16.15∘\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^{\circ }$$\end{document}C and an R2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2$$\end{document} value of 0.86. The trained model was also used to predict Tg\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T_g$$\end{document} for 4 randomly selected resin/hardener combinations for which no experimental data were available. The same combinations were then prepared and measured in the laboratory to further validate the ML model. Excellent agreement was found between experimental and predicted Tg\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T_g$$\end{document} values. The current ML model was created using only theoretical features, but could be further improved by adding experimental or quantum mechanical properties of the individual molecules as well as experimental processing parameters. The results presented here contribute to improving sustainability and accelerating the discovery of novel materials with desired target properties.


GRAPHICAL ABSTRACT Introduction
In recent years, the demand of epoxy resins in industry has increased continuously because of their great potential in various sectors like adhesives, coatings, construction applications, electronic packaging materials, or composites. Due to their superior physical-mechanical properties, chemical and temperature resistance, dimensional stability, versatility, and processability, they are especially used for applications with increased performance requirements [1][2][3][4]. This especially holds true for glass or carbon fiber reinforced composites used in aerospace, automotive, and sports industries where their higher specific strength is superior to metallic materials such as aluminum or steel [5,6]. The glass transition temperature (T g ) is considered a critical property for polymers because it is an important indicator of processing and performance properties such as stiffness, heat resistance, and adhesion. In addition, T g defines and limits the application temperature of the material. Typically, T g is measured using differential scanning calorimetry (DSC) or dynamic mechanical thermal analysis (DMTA) [7]. However, achieving a target value for T g of a resin/hardener system is so far largely a trial-and-error process based on the experience of the researcher. As a result, the development of new thermoset systems and process optimization is often an inefficient, costly, and timeconsuming process [8].
Due to the importance of T g , there exist several modeling approaches in polymer research to predict it. Weyland et al. developed the basis of the group additive property method (GAP), which predicts the T g of polymers from the sum of calibrated contributions with respect to typical monomer substructures [9]. This triggered a series of further works to improve the model's accuracy, which showed good predictive correlations, but were only applicable to polymers whose chemical structure groups had beforehand been experimentally investigated. Since the late 1980s various more general QSPR (quantitative structure-property relationships) models using descriptors for T g prediction were developed [10][11][12][13][14][15]. However, all of the previous discussed prediction models were built on the basis of homopolymers, even though thermoset systems like epoxy-amine systems are of great importance for many high-performance applications.
Nevertheless, there are some approaches to predict the T g of cross-linked thermoset systems, which, however, are limited to relatively few epoxy systems [16][17][18]. Bellenger et al. [16] presented a predictive model for the glass transition temperature of epoxy resin systems using an additivity law for copolymers and the contribution of cross-linked structures, using a total of 40 systems. Morrill et al. [18] developed a model that predicts the dependence of T g on the stoichiometric ratio of co-monomers in amine-cured resins of bisphenol A diglycidyl ether.
The use of machine learning in materials science is a growing area which shows large potential in predicting material properties, such as the T g of polymers. Data-driven approaches are a great way to optimize the thermoset development process in terms of efficiency, speed, and sustainability. In the field of homopolymers, there exist several promising attempts to predict the T g using machine learning [19][20][21][22][23]. However, in the field of thermosets, there are few approaches that use machine learning applications for T g prediction. These are trained either on very small datasets and/or predominantly on datasets obtained using molecular dynamics [24,25]. Jin et al. [24] described an optimization method for multicomponent epoxy resin systems using an artificial neural network (ANN), where molecular dynamics simulation was used to obtain the input data for the ANN. In the same work, T g was one of six predicted values of the model. In addition, only two different hardeners and resins based on 30 different data points were used, making it more difficult to use this model for screening more diverse resin/ hardener combinations. Higuchi et al. [25] established a model which predicts the glass transition temperatures of linear homo-/heteropolymers and cross-linked epoxy resins. In addition, a consensus model is presented that links the two different kinds of models. The thermoset-specific model was trained with just 50 data points and six different epoxy resins. The data came exclusively from the literature where different methods (DSC, DMTA, TMA) and different specimen geometries were used. The performance of their consensus model (homo-and heteropolymers) was good (R 2 = 0.848), while the thermoset-only model gave poorer results (R 2 = 0.687). Yan et al. [26,27] developed the state-of-theart model currently available for predicting T g of thermoset shape memory polymers (TSMPs), whose mean absolute percentage error for the test set was 13.91 %. The model is based on a variational autoencoder trained with drug molecules and adapted to TSMPs with transfer learning, combined with a weighted vector combination method and a convolutional neural network as regressor.
The use of atomistic molecular dynamics or multiscale simulations to add features to the dataset is another strategy to improve the performance of the model [28], but these theoretical methods are not trivial to implement for polymers, time-consuming, and their accuracy still needs to be improved, the latter possibly having a negative impact on the predictive ability of the model. On the other hand, using available experimental data from different works to extend and improve the dataset is also challenging, as the experimental techniques for preparing and measuring the T g of materials are not standardized, as often found in the literature.
This work aims to predict the T g for resin-hardener thermosets using an ML ensemble approach based solely on theoretical molecular descriptors that can be easily calculated for any molecular structure present in the thermoset. To this end, we will present a variety of systems studied in our group using, among others, standardized internal DMTA measurements of T g to improve the generalizability of the final ML model. The molecular descriptors are calculated from the SMILES representation of the resins and hardeners, whose information is initially used to train different linear and nonlinear individual ML models, prior to combining them into one single ML ensemble model. Finally, the experimental validation of the ensemble model will be demonstrated by preparing and measuring new thermosets whose compositions and T g were first suggested by the ML model itself.

Methods
The workflow for creating the machine learning model was inspired by the Crisp-DM model, the quasi-standard framework for data mining (DM) [29]. In total 6 steps were necessary to produce the results presented in this paper. The DM process is embedded in real experiments and measurements in the lab, both for data generation and for validation of the model, see Figure 1. The different steps of the workflow are chronologically described in detail in the next subsections.

Data generation/experiments
The experimental data were generated by DMTA carried out at a heating rate of 3 K/min in torsional mode at an elastic deformation of 0.1 % and an applied frequency of 1 Hz using a Rheometrics Scientific ARES RDA III (Germany). The specimens had a rectangular shape with 50x10x2 mm 3 according to standard ISO 6721-7. T g was determined via DMTA measurements by the peak value of the loss factor (tan(d)). For multiple measurements of the same material system, the average value for T g was used.
The resins and hardeners combinations were mixed stoichiometrically and cured in the oven to the maximum degree of cure under optimized thermal conditions. Subsequently, multiple measurements were carried out in the DMTA for one material. The epoxy equivalent weight (EEW) of the epoxy resins from the dataset has a range from 101 g/eq to 187 g/ eq, while the functionality ranges from 2 to 4.5. The amine equivalent weight (AHEW) of the curing agents has a range of 12.05 to 221.67, while the functionality is between 1 and 7. The technical data such as EEW were taken from the suppliers' technical data sheets. Figure 2 shows the distribution of T g values in the dataset corresponding to 94 combinations of 16 resins and 19 hardeners. All resins and hardeners used in this work were those available in our laboratory and are shown in the SI. The lowest T g value (77.91 C) is shown by the system DGEBF/ D230, while the system Tactix742/4,4-DDS exhibited the highest T g (333.51 C).

Data preparation and preprocessing
For each resin and hardener, the chemical structure of the molecule was drawn with the visualization program Avogadro and saved in the mol2 format, which is a plain text table format that represents a single or multiple chemical structures and contains atomic coordinates, chemical bonding information, and metadata of the molecule [30,31]. The mol2 files were then converted into the corresponding SMILES (Simplified Molecular Input Line Entry System) notation [32]. The cheminformatics libraries Mordred [33] and RDKit [34] were used to convert each SMILES notation into the corresponding 2D descriptors, providing an initial set of 1613 descriptors per molecule. RDKit is a Python library for building, manipulating, analyzing, and automatically designing molecules [34]. MordRed is a descriptor generator available on Github, which can calculate more than 1800 descriptors [33]. A molecular descriptor transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized  experiment. Molecular descriptors are used to present molecular characteristics in cheminformatics and therefore are useful for feature engineering in data mining [33,35]. The 1613 molecular descriptors available at the Mordred library for 2D structures were initially reduced to 1435 after removing nonnumerical (e.g., Boolean) and missing/non-computed values, which were subsequently reduced to 119 descriptors per molecule (resin or hardener) after feature selection (see the next subsection). The molecular descriptors for the resin/hardener pair DEN431/D230 before and after feature selection are shown in the SI.
At the end of this stage, the dataset consisted of 94 samples, each containing 2 x 1435 features (or independent variables) corresponding to each resin/ hardener pair, and the target property (or dependent variable), T g . 70 % of the dataset was used as training set, which was also used to optimize the hyperparameters of individual ML models by leave-one-out cross-validation to decrease overfitting. The final model was tested using 30 % of the dataset, as shown in the next sections. All features and target property were preprocessed to have unit variance and zero mean using the mean and variance of the training set.

Feature selection
In order to obtain the greatest possible information content for model training, feature selection was performed. Too many features lead to too much complexity of the dataset and may result in poorer performance. Therefore, feature selection was carried out via principal component analysis (PCA). PCA decomposes a multivariate dataset into a set of successive orthogonal components that explain a maximum of variance [36]. By using PCA, the descriptive variance of a feature can be determined over the dataset. By looking at the weight coefficient (also called loading) of each feature on the resulting uncorrelated linear combinations of each principal component, the amount of descriptive variance of the feature can be determined: The higher the weight of one feature, the greater its descriptive variance and thus the greater its information content. Features with little information content were filtered out of the dataset by setting a minimum threshold for the weight of all features. The evaluation of the model's performance for different thresholds allowed to determine the final descriptors to be used as features in the model, strongly reducing the total number of features from 1435 to 119 per molecule.

Modeling and validation
Modeling and evaluation do not take place separately in the workflow, but are performed iteratively (see Figure 1), as already shown in the previous subsection to find the best threshold for feature selection. To find the best ML model, the following three-step procedure was used: 1. Pre-screening of different linear and nonlinear ML models with unoptimized hyperparameters. 2. Hyperparameter optimization of the most accurate ML models found in a). 3. Creation of an optimized ensemble using the ML models found in b).
The leave-one-out validation method was used to improve statistics in all steps. Here, the learning algorithm is applied once for each data point, using all other instances as the training set and the selected data point as the test set with a single sample. The methodology is a special case of cross-validation where the number of folds is equal to the number of data points in the dataset [37] and is time-consuming for middle-to-large datasets, since one different ML model is trained for each sample. The mean absolute error (MAE), mean absolute percentage error (MAPE) and the coefficient of determination (R 2 score) were used as model evaluation metrics. MAE is given by: where n is the number of samples, and y i andŷ i are the true and predicted target property, respectively, for sample i. MAPE is given by: The coefficient of determination (R 2 score) is the quotient of the explained variation to the total variation in a regression model. It is often expressed as a percentage. Seven different regression models were prescreened, whereby their MAE and R 2 scores were compared via leave-one-out cross-validation. The ML methods are briefly summarized below and are described in more detail in refs [38,39]. Linear Regression (LR). The LR fits a linear model with coefficients w ¼ ðw 1 ; :::; w p Þ to decrease the residual sum of squares between the observed data points, and the predicted data points by the linear approximation [40].
K-nearest-neighbors regression (KNN). The KNN classification methodology selects a set of K objects in the training set that are closest to the test object and appoints a label based on the most dominant class in that neighborhood. Neighbors-based regression can be used in cases where the data labels are continuous [41].
Random Forest Regression (RFR). Random Forest is an ensemble learning approach in which ''weak learners'' collaborate to form ''strong learners'' using a large collection of uncorrelated decision trees. However, instead of building a solution based on the results of a single deep tree, Random Forest aggregates the results of a set of shallow trees. RFR is built by growing trees depending on a random vector such that the tree predictor takes on numerical values as opposed to class labels [42].
Gradient Boosting Regression (GBR). GBR builds an additive model step wisely, allowing the optimization of arbitrary differentiable loss functions. At each stage, a regression tree is fitted to the negative gradient of the given loss function [43].
Kernel Ridge Regression (KRR). KRR merges ridge regression [44] with the kernel trick [45]. Thus, it learns a linear function in the space generated by the kernel and the data. This corresponds to a nonlinear function in the original space of the dataset. Support Vector Regression (SVR). Support vector machines solve binary classification problems by formulating them as optimization problems to find the maximum edge that separates the hyperplane while correctly classifying as many training points as possible. The optimal hyperplane is represented with support vectors. SVR is achieved by formulating an optimization problem by defining a convex -insensitive loss function to be minimized and finding the flat tube that contains most of the training instances [46]. Lasso regression. Lasso regression is a linear model combined with the estimation of sparse coefficients via L1-regularization. This is particularly useful in the context of feature selection, as it tends to favor solutions with fewer nonzero coefficients, effectively reducing the number of features upon which the given solution depends [47].
The most promising models were followed up and had their hyperparameters optimized. In the last step, two or more single models were combined into an ensemble with the goal of minimizing MAE. For this purpose, the predictions of the individual ML models were combined via a weighted average whose weight coefficients were optimized to minimize MAE. Eq. 3 describes how to obtain the ensemble prediction: where m is the number of individual ML models, a is the weight coefficient (or importance) of each model inside the ensemble,ŷ ij is the predicted target property of sample i carried out using model j, andŷ i is the final ensemble prediction for sample i. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm was used to find the weighting factors of the individual ML models that minimize MAE of the ensemble predictions. All codes used in the workflow were written in Python (version 3) using Jupyter Notebook. The ML models were built using the Sklearn library [48].

Practical application
The optimized hyperparameters of the final ensemble model were then used to train a fresh model with the full dataset (training and test sets) and subsequently used to predict all 210 possible resin/hardener combinations for which no experimental data were available. Four of the new combinations were chosen to be prepared and measured in the laboratory to further validate the ML model. The first two formulations (Table 1) were chosen based on high and low target T g values randomly selected. The remaining ones (Table 2) were randomly chosen from the pool of 210 possible combinations. All four new systems were then prepared as follows. Resin and hardener were mixed stoichiometrically and degassed in a vacuum chamber below 50 mbar. After degassing, the mixtures were poured into steel molds and curing was performed according to the specific curing cycles shown in Tables 1 and 2.

Results and discussion
This work shows that it is not only possible to obtain a predictive model for T g of thermosets, but also to get some insight into the influence of molecular structures of resins and hardeners on the T g of thermosets. Feature selection using PCA was carried out to reduce the total number of descriptors from 1435 to 119. The GBR and KRR models provided the best individual predictions, with MAE/R 2 values of 16.64/0.84 and 17.82/0.83, respectively, evaluated for the test set. The detailed evaluations of these models, together with their optimized hyperparameters, are shown in the Supporting Information.

Optimized ensemble model
The composition of the optimized ensemble model in terms of individual ML models and their corresponding model importances were: GBR (53 %) and KRR (47 %). The performance of the trained ensemble model evaluated using the test set is shown in Figure 3, where MAE = 16.15 C and R 2 = 0.86. The ensemble model was more accurate and had a larger coefficient of determination than the best individual ML model (GBR), which shows that the ensemble technique described here (Eq. 4) is an effective way to increase the models' performance. The results are Table 1 Model-proposed thermoset systems for targeted (high and low) T g . The predicted T g shown in the last column was calculated using the trained ensemble model. The supplier and trading names are given

Final molecular descriptors
The most important descriptors affecting T g were identified using the Lasso regression trained using the entire dataset, since it does by default feature selection [47], meaning that only the most important features will be associated with nonzero weight coefficients. In addition, the weight coefficients (c i ) of Lasso enable one to easily interpret the relation between the features and the target property (e.g., property ¼ c 1 Á feature 1 þc 2 Á feature 2 þ ::: þ c n Á feature n þ b), which is not possible with other ML models like GBR or KRR. Table 3 shows the five descriptors that have the largest influence on the predicted T g . The column ''Molecule'' describes whether the descriptor refers to the resin or hardener component of the thermoset. In addition, in the column ''Effect,'' the characters (?, -) describe whether the descriptor has a positive or negative influence on T g . Descriptors directly related to aromaticity and polarity were also inside the pool of 119 descriptors used in the ML models, although their weight coefficients were not among the largest ones. The two most important descriptors were SlogP_VSA and SMR_VSA. The former is a metric of hydrophobicity and the latter accounts for molecular size and polarizability. The first two lines of Table 3 reveal that epoxy systems tend to exhibit higher T g when the resin counterpart is more hydrophobic and smaller and less polarizable. These properties can be linked indirectly to aromatic structures, which generally possess higher T g .
The interpretation of different descriptors is not always simple or intuitive, and the fact that the final ML model considers a great number of them, shows that this task is even more challenging. There seems to exist a trade-off between interpretability and prediction capability: simpler models are very easy to interpret but do not necessarily deliver the best predictions.

New T g predictions
Based on the dataset, which consists of 16 different resins and 19 different hardeners and experimental T g values available for only 94 out of 304 total resin/ MOE-type descriptors using Crippen-Wildman atom classification system for atom-based calculation of octanol-water partition coefficient (log P) (hydrophobicity) [49] Resin ?

PathCount
Total number of paths of length m in the molecular graph [49] Hardener -Autocorrelation ATS(0,'dv') Broto-Moreau autocorrelation of order 0 weighted by valence electrons [50] Hardener - hardener combinations, a total of 210 new resin/ hardener combinations were predicted by the trained ML ensemble model. In the histogram of predicted T g values, the distribution ranges from 81 to 318 C (Figure 4), which indicates that the model depicts the maximum variance of the available experimental glass transition temperatures.

Practical application
The comparison between the experimental and predicted T g for the four selected resin/hardener systems described in Tables 1 and 2 is depicted in Figure 5. The experimental trends were nicely reproduced by the predicted values. The predictions were much better for systems with middle to high T g , as expected based on the greater availability of data points in that region to train the ML models, compared to regions of low T g (see Fig. 2). In addition, the experimental T g used to train the ML models were in the range 78-334 C (Figure 2), and therefore high and low predicted T g values tend to accompany the same range. Using many experimental data points with T g smaller than 78 C to train the ML models would have led to better predictions below 78 C.

Final considerations
Predicting the final T g of a relatively complex polymeric material by simply using molecular descriptors of isolated resin/hardener molecules is a rough approximation assumed in this work. The first validation of this idea comes from the good model fittings shown in Fig. 3, which suggests that the property of individual molecules can influence their interaction and consequently the property exhibited by the final material. The second validation to support this approximation comes from Fig. 5, which shows that using molecular descriptors of individual molecules to predict the T g of novel polymeric materials is feasible, especially in the case of middleto-high T g epoxy systems. Fig. 5 also shows that there is room for model improvement (vide infra). Most importantly, the results suggest that cross-linking is implicitly taken into account by the ML model via linear and nonlinear combinations of molecular descriptors of individual (isolated) molecules since T g was predicted based on those descriptors. Some intuition on the approximation described here could be gained from a chemical point of view. For instance, if a molecular descriptor reveals that an (isolated) resin and its hardener counterpart have high aromaticities, one could expect to have a stabilizing interaction in the polymer between nearby molecular units, which can increase T g , as in general observed experimentally. The assumption made here would have benefited from a larger number of measurements (as opposed to only 94 data points available) since this is also expected to improve the model accuracy.

Conclusions
In this work, it was shown that it is possible to predict the glass transition of thermoset systems based solely on the chemical structures of 16 resins and 19 hardeners using ML models. The best individual ML models combined via an ensemble approach and trained with 94 resin/hardener combinations were GBR and KRR, which contributed to 53 and 47 %, respectively, of the final predicted T g value. The ML ensemble model performed quite well (MAE = 16 C and R 2 = 0.86). The analysis of the most important descriptors used in the Lasso regression has revealed Figure 5 Comparison between experimental (True) and predicted T g for the resin/hardener systems described in Tables 1 and 2. that T g of the thermoset is expected to increase for smaller, less polarizable, and more hydrophobic resins. The trained model was used to predict T g for the 210 remaining resin/hardener combinations that had not been previously investigated experimentally. Four new resin/hardener combinations were randomly chosen from the 210 new predictions and experimentally prepared and characterized, giving T g values in good agreement with the predictions: exp/ theo [ C] = 45/82, 330/310, 54/104, 243/259. This strategy allows for greatly accelerating the development of new resin/hardener systems in a sustainable fashion, as one avoids trial-and-error procedures to obtain new thermosets exhibiting any specific T g value. By describing the entire workflow, from data generation to ML modeling to the final practical application, this work demonstrates how machine learning can systematically unlock efficiencies in thermoset development and uncover new relationships. In conclusion, the results discussed help improve sustainability while accelerating the discovery of novel materials with desired target properties. Even though the rough approximation made here through which the features of isolated monomer units (here, hardener and resin) are used to predict the T g of a much more complex, 3D polymeric material provided a good initial ML model, we are currently working on its improvement by adding quantum mechanical properties of the molecules to the current dataset to further increase the model accuracy.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licen ses/by/4.0/.

Author's contributions
Sven Meier: Writing of the original draft, preparation of the dataset, discussion. Rodrigo Q. Albuquerque: ML modeling, paper correction, discussion. Martin Demleitner: Experimental setup and measurements, paper correction, discussion. Holger Ruckdäschel: Paper correction, discussion.

Funding
Open Access funding enabled and organized by Projekt DEAL.