Microstructure–property relation and machine learning prediction of hole expansion capacity of high-strength steels

The relationship between microstructure features and mechanical properties plays an important role in the design of materials and the improvement of their properties. Hole expansion capacity plays a fundamental role in defining the formability of metal sheets. Because the experimental procedure for testing hole expansion capacity is complex and many factors influence the resulting values, the relationship between microstructure features and hole expansion capacity is not yet fully understood. In the present study, an experimental dataset containing the phase constituents of 55 microstructures as well as corresponding properties, such as hole expansion capacity and yield strength, is collected from the literature. Statistical analysis of these data is conducted with a focus on hole expansion capacity in relation to individual phases, combinations of phases and number of phases. In addition, different machine learning methods are applied to predict hole expansion capacity from both phase fractions and chemical composition. Deep learning gives the best prediction accuracy. Meanwhile, the influence of different microstructure features on hole expansion capacity is revealed.

Introduction
Advanced high-strength steels are widely used in industrial applications. Besides high strength and good ductility, stretch-flangeability is an important mechanical property that controls the quality of shaping many metallic components. Hence, the hole expansion capacity (HEC), which describes the formability and edge cracking resistance of sheet metals, is one of the most important mechanical properties in, for instance, the automotive industry. Figure 1 shows the most common test procedure for the determination of HEC following standard ISO 16630 [1]. The sheet metal is first punched with an initial hole of diameter $D_0$ of 10 mm. The punched hole is then widened with a conical punch (60° angle) until the first through-thickness crack appears, at the final hole diameter $D_h$. The hole expansion capacity ($\lambda$) is then calculated with

$$\lambda = \frac{D_h - D_0}{D_0} \times 100\%$$

The results are considered useful when the thickness of the sheet material is below 2.5 mm, even though the standard allows thicknesses up to 6 mm. Due to the complexity of determining the hole expansion capacity, many testing factors, such as punch edge quality [2][3][4] and crack determination [5], could influence the testing result. The HEC is not yet well understood in terms of its relationship with the microstructure of the metal. Many studies have been performed on the relations between HEC and microstructure features, processing parameters and other mechanical properties, such as tensile strength and hardness [3][5][6][7][8], but the results either are not convincing due to the limited number of data or do not give an overall picture of the effects of multiple phases due to the specific choice of materials. Meanwhile, recent progress in the field of HEC of multi-phase steels has resulted in a better understanding of the relation between HEC and fracture toughness, which can be related to microstructure features through damage and fracture models [9,10].
It has been shown that HEC is closely related to the capacity to resist the initiation of micro-cracks and their propagation [11][12][13]. The connection of fracture behavior and microstructure features and heterogeneities can then be extended to the understanding of the HEC behavior [14][15][16][17]. The study from de Geus et al. [15] shows that fracture initiation correlates strongly with the local microstructural morphology. Meanwhile, the laminography observations performed by Kahziz et al. [18] reveal the damage evolution on both the punched and machined edges, which indicates the possibility of building predictive models based on physical understanding. Table 1 summarizes the present interpretation in the literature on the influence of different phases on the HEC from various studies on multi-phase steels. Except for ferrite and possibly austenite, all phases are reported to have a negative effect on the hole expansion property. These effects are often explained by the hardness difference between the hard phases and the soft phases, but with no clear physical explanation [2,6,19].
The information obtained from Table 1 is rather limited and qualitative, since the reported trends of HEC with the different phases are always derived from a very limited number of data points (i.e., fewer than 10). Besides, the results shown in Table 1 only concern the relation of a single microstructure feature with hole expansion capacity. For complex-phase steels, the combined effect of the phases has not yet been studied.
In order to study the relations between hole expansion capacity and microstructure features in more detail, 55 groups of data containing the composition of phases and chemical content corresponding to the HEC values are collected, as shown in Table 4 (see Appendix A), from a final report of a research project of the Research Fund for Coal and Steel [20]. As the original report does not make full use of these data, it is valuable to have a deep look into these data and to derive more comprehensive understanding in addition to Table 1. In the present paper, HEC is fully investigated on its relation to phase fractions individually, to the combination of phases and to the number of phases. To quantify the relations, different statistical regression methods are applied to enable prediction of the HEC on the basis of both phase fractions and chemical content, while also giving the importance ranking of different microstructure features.
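As a worked illustration of the test quantity introduced above, the following minimal sketch computes the hole expansion capacity from the initial and final hole diameters. It assumes the usual definition as the relative increase in hole diameter expressed in percent; the function name and the example diameters are hypothetical and are not taken from the dataset.

```python
def hole_expansion_capacity(d0_mm: float, dh_mm: float) -> float:
    """Return the hole expansion capacity in percent.

    d0_mm: initial punched hole diameter (10 mm in the test described above).
    dh_mm: hole diameter at the first through-thickness crack.
    """
    if d0_mm <= 0:
        raise ValueError("initial diameter must be positive")
    if dh_mm < d0_mm:
        raise ValueError("final diameter cannot be smaller than the initial one")
    return (dh_mm - d0_mm) / d0_mm * 100.0


# Hypothetical example: a 10 mm hole that cracks at a diameter of 14.8 mm.
example_hec = hole_expansion_capacity(10.0, 14.8)
print(f"HEC = {example_hec:.1f}%")
```

Because the result is normalized by the initial diameter, HEC values from tests with the same initial hole size can be compared directly.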

Data analysis
As shown in Table 4 (see Appendix A), the studied dataset contains the results for 55 complex-phase steel specimens. For each specimen, the hole expansion capacity (HEC, %), the phase fractions in volume percentage and the chemical composition in weight percentage are collected. The identified phases are martensite (M), ferrite (F), tempered martensite (TM), upper bainite (UB), lower bainite (LB), carbide-free bainite (CFB), bainite (B), pearlite (P) and retained austenite (RA). For these microstructures, ferrite, martensite and bainite can be present as matrix phases, while pearlite and retained austenite are always secondary phases. The average standard deviation of measuring HEC three times on the same steel grade is ±9%, which is calculated from the work by Chen et al. [22]. Meanwhile, there is also research showing a standard deviation of 15% on HEC values for martensitic steels [23].

Volume fraction of individual phases
Based on the obtained data, the individual influence of phase volume fractions on hole expansion capacity is shown in the scatter plots in Fig. 2. Figure 2a and b show the scatter plot of martensite (without tempered martensite) and ferrite fraction in relation to hole expansion capacity, while Fig. 2c and d show the total bainite (the sum of upper bainite, lower bainite, carbide-free bainite and bainite) and retained austenite volume fractions in relation to hole expansion capacity. The straight line in Fig. 2b is a linear fitting of all data points of ferrite volume fraction and hole expansion capacity. All curved lines in Fig. 2 are based on the scatter.smooth function in R [24], which uses the loess (local polynomial regression fitting) function [25]. The lines are merely a guide to the eye for the main trends.
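The guide-to-the-eye curves described above are local regression fits. The following sketch shows the basic idea of such a loess-type smoother in plain Python. It is a simplification under stated assumptions: tricube weights, a span of 2/3, local straight-line fits and no robustness iterations. The function name and the data points are hypothetical and do not reproduce Table 4 or Fig. 2.

```python
import math


def local_linear_smooth(xs, ys, x0, span=2 / 3):
    """Loess-type estimate at x0: tricube-weighted straight-line fit."""
    n = len(xs)
    k = min(n, max(2, math.ceil(span * n)))      # number of nearest points in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1]   # bandwidth: distance to the k-th nearest point
    if h > 0.0:
        weights = [max(0.0, 1.0 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
    else:
        weights = [0.0] * n
    sw = sum(weights)
    if sw == 0.0:
        # Degenerate window (all window points equally far from x0): use their mean.
        near = [y for x, y in zip(xs, ys) if abs(x - x0) <= h]
        return sum(near) / len(near)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return my + slope * (x0 - mx)


# Hypothetical martensite fractions (vol.%) and HEC values (%) with a valley in between.
martensite = [0, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
hec = [60, 55, 50, 35, 30, 28, 27, 30, 33, 50, 65, 75]

for x0 in range(0, 101, 20):
    print(x0, round(local_linear_smooth(martensite, hec, float(x0)), 1))
```

A smaller span follows the scatter more closely, while a larger span gives a smoother curve, which is why such lines should only be read as indications of the main trend.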
As shown in Fig. 2a, there is a clear valley in the plot of the relation between HEC and martensite volume fraction, which indicates either low martensite volume fraction (lower than 20%) or high martensite volume fraction (higher than 80%) tends to have the possibility to reach relatively high HEC. Meanwhile, HEC is always low when the martensite volume fraction falls between 20% and 70%. For ferrite in Fig. 2b, the relation is not as clear as for martensite, but a very distinct observation is that only low HEC values are found above 50%. When ferrite volume fraction is lower than 50%, there is no clear relation between HEC and ferrite volume fraction. Low HEC values occur in the region where ferrite volume fraction is higher than 50%, with only one exception: No. 23 in Table 4, that consists of a large volume fraction of ferrite and secondary phase pearlite. An opposite trend to martensite is shown in Fig. 2c when looking into the relation between the total bainite volume fraction and HEC. High HEC values are found only between 30% and 40% bainite. Figure 2d shows the relation between HEC and the secondary phase retained austenite. There are obviously two stages in the relation of HEC with retained austenite volume fraction in Fig. 2d. The lower volume fraction of retained austenite shows higher HEC than the group of higher volume fraction. In Fig. 2d, the bainite fractions are also indicated for the microstructures. Relating these values to Fig. 2c, the relation between HEC and bainite fraction, it shows that the microstructures with low RA fractions all lie in the optimum range of bainite fraction. The values of HEC for zero retained austenite fraction, with the average on the green line, lie within the shaded area in Fig. 2d, at the level of the values for 2-4% RA. The present data therefore do not give a conclusive view on the influence of retained austenite on HEC.
The dataset is unfortunately very limited on pearlite. Only three microstructures contain pearlite, of which one is the exceptional No. 23. The other two are No. 26 (15% P, 84% F, 1% M, $\lambda$ = 48%) and No. 29 (10% P, 80% F, 10% M, $\lambda$ = 28%). The difference between these two HEC values is therefore primarily the result of the difference in martensite and pearlite fractions. The reduction from $\lambda$ = 48% for 1% martensite to $\lambda$ = 28% for 10% martensite is stronger than the general trend in Fig. 2a, which points at a positive effect of pearlite on the HEC.

Difference between volume fraction of phases
Many researchers proposed that the HEC is closely related to the difference in mechanical behavior between hard and soft phases [2,6,7,19]. Here, we assume that the ferrite and retained austenite are soft phases, while martensite, bainite and pearlite are hard phases. The relation between HEC and the volume fraction difference is shown in Fig. 3a. The scatter plot shows an increase of HEC when the hard phase volume fraction is increasing. When the dataset is divided into two groups, as the boxplot in Fig. 3b shows, the microstructures in the group with more than 50% volume fraction of hard phase have significantly higher HEC than the group with more than 50% volume fraction of soft phase. This indicates that HEC displays a relation with the strength of materials. The lack of high HEC values for microstructures with a higher fraction of soft phases coincides with the observation in Fig. 2b. The one exception mentioned in ''Volume fraction of individual phases'' section, No. 23 in Table 4, is also marked in Fig. 3. It clearly shows that this No. 23 sample is an outlier with exceptionally high HEC while containing more soft phase, which is considered to be an artifact of the testing procedure. Hence, in the following statistical analysis, this No. 23 sample is deleted from the dataset.
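The grouping into hard-phase-dominated and soft-phase-dominated microstructures can be sketched as follows. The phase abbreviations follow the "Data analysis" section. Counting tempered martensite and all bainite variants as hard phases is an assumption of this sketch, as are the function names. The first two samples use the phase fractions quoted in the text for No. 26 and No. 29, and the third is a hypothetical fully martensitic sample.

```python
# Assumed classification: ferrite and retained austenite are soft, all others hard.
HARD_PHASES = ("M", "TM", "UB", "LB", "CFB", "B", "P")
SOFT_PHASES = ("F", "RA")


def hard_minus_soft(fractions):
    """Difference between hard and soft phase volume fractions (vol.%)."""
    hard = sum(fractions.get(p, 0.0) for p in HARD_PHASES)
    soft = sum(fractions.get(p, 0.0) for p in SOFT_PHASES)
    return hard - soft


def dominant_group(fractions):
    """Label a microstructure by the phase group that takes more than half of the volume."""
    diff = hard_minus_soft(fractions)
    if diff > 0:
        return "hard"
    if diff < 0:
        return "soft"
    return "balanced"


samples = {
    "No. 26": {"P": 15.0, "F": 84.0, "M": 1.0},
    "No. 29": {"P": 10.0, "F": 80.0, "M": 10.0},
    "hypothetical full martensite": {"M": 100.0},
}

for name, fractions in samples.items():
    print(name, hard_minus_soft(fractions), dominant_group(fractions))
```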

HEC in relation to combinations of phases
As discussed in the previous section, certain phases (ferrite and martensite) have a distinct impact on HEC. The phase compositions are plotted in order of increasing HEC in Fig. 4, with both combined and non-combined fractions of similar phases (applied for martensite and for bainite). Among the samples with relatively high HEC, two kinds of phase composition occur frequently: either fully or nearly fully martensitic, or a combination of ferrite, martensite and bainite with a volume ratio of around 2:1:1. This indicates the significant contribution of martensite and bainite to HEC. It is also found that most two-phase martensite/ferrite microstructures, especially those with a high ferrite fraction, have low HEC values.
In Table 2, the t-value is the estimate (second column, the coefficient for each input variable) divided by its standard error (third column). By comparing this t-value to the Student's t distribution, the p-value can be calculated [26]. A small p-value (typically below 0.05) indicates that there is a relation between the explanatory variable and the response variable. The intercept of the linear model corresponds to the one-phase category, which is thus set as the baseline. The model shows the change in HEC with an increasing number of phases relative to the one-phase category. Figure 5 shows that the one-phase category has the highest HEC values, while in Table 2, the p-values (last column) for one, two, three and four phases with non-combined phase fractions (a) and for one, two and three phases when combining all bainite and all martensite (b) are all below 0.05, which indicates that it is highly unlikely that the coefficient is equal to zero instead of the current value of the estimate [27,28]. Since the values of the estimate are all negative except for the one-phase condition, the one-phase category has the highest HEC.
In this dataset, only the pure martensite structure appears in the one-phase category; hence, the result suggests that for pure martensite structure, the hole expansion capacity is significantly higher with respect to HEC values for microstructures with two, three or four phases, as shown in Fig. 5 and Table 2. Only the five phases without combining and four phases with combining have increased HEC, since these structures belong to the ones mentioned in ''HEC in relation to combinations of phases'' section which have the combination of ferrite, martensite and bainite with the volume ratio around 2:1:1. These microstructures all have a low volume fraction of retained austenite, and both martensite and tempered martensite are present.
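The structure of such a regression on the number of phases can be sketched as follows. With one category as baseline, ordinary least squares on category indicators gives the baseline mean as intercept and the difference of group means as coefficients. The t-value is the estimate divided by its standard error. The HEC values below are hypothetical and do not reproduce Table 2, and the function name is illustrative. Converting a t-value into a p-value requires the Student's t distribution with the residual degrees of freedom, which is left out here.

```python
import math


def dummy_regression(groups, baseline):
    """OLS of y on category indicators.

    Returns {category: (estimate, standard_error, t_value)} and the residual
    degrees of freedom. The baseline row is the intercept.
    """
    n_total = sum(len(v) for v in groups.values())
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    rss = sum((y - means[g]) ** 2 for g, v in groups.items() for y in v)
    dof = n_total - len(groups)
    sigma2 = rss / dof                      # pooled residual variance
    n0 = len(groups[baseline])
    table = {}
    for g, v in groups.items():
        if g == baseline:
            estimate = means[g]
            std_error = math.sqrt(sigma2 / n0)
        else:
            estimate = means[g] - means[baseline]
            std_error = math.sqrt(sigma2 * (1.0 / len(v) + 1.0 / n0))
        table[g] = (estimate, std_error, estimate / std_error)
    return table, dof


# Hypothetical HEC values (%) grouped by number of phases.
hec_by_phase_count = {
    1: [70.0, 78.0, 74.0],
    2: [30.0, 42.0, 36.0, 40.0],
    3: [45.0, 50.0, 40.0],
}

table, dof = dummy_regression(hec_by_phase_count, baseline=1)
for category, (estimate, std_error, t_value) in table.items():
    print(category, round(estimate, 2), round(std_error, 2), round(t_value, 2))
```

Negative estimates for the non-baseline categories mean that those categories have a lower mean HEC than the baseline category.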

LASSO selection of important phases
Because of the large number of phases and a single target variable, hole expansion capacity, a statistical method called least absolute shrinkage and selection operator (LASSO) is employed as described comprehensively in previous work [29] and in Appendix B.1.
The LASSO regression is performed on only the matrix phases, i.e., martensite, ferrite and bainite. To avoid collinearity, the samples with only ferrite and martensite phases are excluded from this regression. Collinearity is a condition where two or more independent variables are highly correlated, which tends to inflate the coefficient for one variable and hence leads to wrong estimates of the coefficients [26]. In Fig. 6, as the LASSO penalty parameter $\log(\lambda_e)$ [29] decreases, more input variables (phases) are included in the linear regression. The first four phases showing up in Fig. 6 from the high-$\lambda_e$ side of the graph are lower bainite, martensite, upper bainite and ferrite. Since the LASSO analysis adopts just a linear function between HEC and the phase volume fractions, LASSO is not sufficient to fully explain the relationships, but it does indicate which phases make the most significant contribution to the influence on HEC, namely lower bainite, martensite, upper bainite and ferrite. Meanwhile, LASSO shows that lower bainite and upper bainite have a clear positive effect on hole expansion capacity and martensite has a negative effect. Here, the negative effect of martensite seems to be different from the trend shown in Fig. 2a. This is because the samples with only martensite and ferrite, which contain more than 50% martensite, are not included in the LASSO regression. Hence, the negative effect of martensite from LASSO only shows the effect for 0-50% martensite, which is therefore the same as the trend shown in Fig. 2a.
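How variables enter the model as the penalty decreases can be illustrated with a small coordinate descent solver. The sketch below minimizes the mean squared residual divided by two plus the penalty times the sum of absolute coefficients, on centred but unscaled inputs. This is one common convention and an assumption here, not necessarily the exact formulation of [29]. The three input columns and the target values are hypothetical and are not taken from Table 4.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become zero."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=500):
    """Return (intercept, coefficients) of a LASSO fit by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]   # centred inputs
    resid = [v - y_mean for v in y]                               # residual for beta = 0
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that ignores variable j.
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_beta
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


# Hypothetical phase fractions (vol.%): columns could stand for LB, M and F.
X = [[0, 50, 40], [10, 40, 45], [20, 30, 30], [30, 35, 20],
     [5, 60, 30], [25, 20, 50], [15, 45, 25], [35, 25, 35]]
# Hypothetical target built as 40 + 1.0 * column 0 - 0.2 * column 1.
y = [30, 42, 54, 63, 33, 61, 46, 70]

for lam in (200.0, 150.0, 50.0, 0.0):
    intercept, beta = lasso_coordinate_descent(X, y, lam)
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    print(lam, selected, [round(b, 3) for b in beta])
```

In this toy example the column with the strongest relation to the target enters first, which mirrors the way the order of appearance in a LASSO path is read as an importance ranking.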

Prediction of HEC with both phase fraction and chemical contents
Machine learning has been widely adopted in various applications in materials science due to its powerful data processing and high prediction performance [30][31][32][33][34][35][36]. In order to predict the HEC with both phase fractions and chemical content based on the data gathered in Table 4, we selected five different machine learning methods:
1. Linear regression (lm),
2. Linear regression with Elastic Net regularization (glmnet),
3. Conditional inference tree regression (ctree2),
4. Random forest regression (cforest),
5. Deep learning (keras).
Detailed information on these methods is given in Appendix B. The first four methods are applied using the 'caret' library [37] in the R [24] environment. For the first four methods, tenfold cross-validation is repeated five times. There is no tuning parameter in lm. For glmnet, the tuning grid consists of ten values from 0 to 1 for the mixing percentage $\alpha$ and 50 values from 0.0001 to 50 for the regularization parameter $\lambda$. For ctree2, the tuning grid consists of five values from 1 to 5 for the maximum tree depth maxdepth and ten values from 0 to 1 for the (1 minus p-value) threshold mincriterion. For cforest, the tuning grid consists of 15 values from 1 to 15 for the number of randomly selected predictors mtry. Deep learning is applied using the 'keras' library [38], which uses TensorFlow [39] as backend in Python. The network consists of two hidden layers, both dense, with 100 and 50 neurons, respectively. Both hidden layers use the activation function relu [40]. The model is compiled with the Adam optimizer [41]. The model is trained for 600 epochs with a batch size of 32 and a validation split of 5%. The modeling process follows a route consisting of five steps:
1. partition the data into a training and a testing set (random: 90% of the data in the training set and 10% in the testing set);
2. feed the training data to train the model;
3. predict the testing target (HEC) using the trained model;
4. calculate the performance (root mean square error, RMSE, on both training and testing data);
5. repeat steps 1-4 ten times (tenfold cross-validation) and calculate the mean performance, i.e., the average RMSE on both training and testing data over ten repeated runs.
In step 4, the RMSE is calculated on both the training dataset and the testing dataset based on the predicted hole expansion capacities $\lambda_{p,i}$, the real hole expansion capacities $\lambda_{r,i}$ and the number of samples $N$ in the dataset as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\lambda_{p,i} - \lambda_{r,i}\right)^2}$$

Machine learning model performance
The performance of all five machine learning models is shown in Fig. 7. The two linear regression methods (1 and 2) and the conditional inference tree regression (3) clearly perform the worst, with a high RMSE on the testing dataset. The deep learning model shows the best performance with the lowest RMSE. The HEC prediction accuracy of the deep learning model is ±16%. Compared with the hole expansion testing error range of the experimental data acquired by Chen et al. [22], where the average standard deviation of testing three times on the same steel grade is ±9%, and the 15% standard deviation of experimental HEC values for martensitic steels [23], which arises from various testing conditions such as edge surface quality and first crack determination timing, it can be concluded that deep learning predictions reach a similar degree of accuracy as experiments, where the 9% accuracy for the training dataset indicates an experimental accuracy of that magnitude. In Fig. 8, the deep learning-predicted HEC is plotted against the experimental HEC, with the experimental test error shown in the bottom-right corner. It can be seen that, based on the learning from the training data points, deep learning can give a confident prediction of the testing data points. With an improvement in the experimental data quality and an increase in the quantity of the data, the authors believe that the prediction accuracy can be further enhanced.
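The five-step evaluation route used for the deep learning model (random 90/10 partition, training, prediction, RMSE on both parts, ten repetitions) can be sketched in plain Python as follows. To keep the example self-contained, a simple least-squares line stands in for the neural network; this stand-in model, the function names and the generated data are assumptions of the sketch and do not correspond to the models or the data in this paper.

```python
import math
import random


def rmse(predicted, real):
    """Root mean square error between predicted and real values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, real)) / len(real))


def fit_line(xs, ys):
    """Stand-in model: least-squares straight line, returns (slope, intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return slope, my - slope * mx


def repeated_holdout(xs, ys, n_repeats=10, test_fraction=0.1, seed=0):
    """Average training and testing RMSE over repeated random partitions."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(1, round(test_fraction * n))
    train_scores, test_scores = [], []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)                                   # step 1: random partition
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])   # step 2: train
        def predict(x):
            return slope * x + intercept                   # step 3: predict
        train_scores.append(rmse([predict(xs[i]) for i in train_idx],
                                 [ys[i] for i in train_idx]))     # step 4: RMSE
        test_scores.append(rmse([predict(xs[i]) for i in test_idx],
                                [ys[i] for i in test_idx]))
    # step 5: mean performance over all repetitions
    return sum(train_scores) / n_repeats, sum(test_scores) / n_repeats


# Hypothetical data: 54 samples, hard-phase fraction (vol.%) versus HEC (%) with noise.
data_rng = random.Random(1)
hard_fraction = [data_rng.uniform(0.0, 100.0) for _ in range(54)]
hec = [30.0 + 0.4 * x + data_rng.gauss(0.0, 8.0) for x in hard_fraction]

train_rmse, test_rmse = repeated_holdout(hard_fraction, hec)
print(f"mean training RMSE = {train_rmse:.1f}, mean testing RMSE = {test_rmse:.1f}")
```

Reporting both values side by side makes it visible when a model fits the training data much better than unseen data, which is the main purpose of the repeated partitioning.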

Machine learning model interpretation
The conditional inference tree regression model and random forest regression model both give rise to a ranking of importance of the independent variables, which is shown in Fig. 9. For the conditional inference tree, the importance of a feature is calculated as the sum, over the nodes that split on that feature, of the reduction of variance with respect to the parent node, weighted by the probability of reaching that node. A higher value indicates a higher importance. Random forest averages the importance of each feature over all trees to obtain the importance ranking of all features. means no linear correlation [27,42]. The effect of Mn and Cr is also shown by Guo et al. [43], who state that Mn improves strength to a certain extent, while Cr improves the ductility of bainitic steels. Higher ductility has a positive effect on hole expansion capacity [44], while higher strength leads to lower hole expansion capacity [22]. With the conditional inference tree regression model, a decision tree can be built as shown in Fig. 10. At each node of the decision tree, one specific input variable is selected, according to the algorithms mentioned in Appendix B.2, to separate the dataset into two subsets. For each node, the separation criterion, the root mean square error of the samples in the node, the number of samples and the mean HEC value of all samples in the node are shown in the node box. The left arrow from the node box indicates that the condition for separation is true, while the right arrow indicates that it is false. Node 0 contains all 54 samples; its criterion is a martensite phase fraction smaller than 97.5%. This criterion is true for 48 samples with an average HEC of 45%, as shown in node 1, and it is false for six samples with an average HEC of 74%, as shown in node 16. The samples of each of these nodes are further separated on the basis of subsequent criteria. The color of the node indicates its average HEC value.
From the decision tree, the trend of the influence of different independent variables is evidenced. It shows that the change in HEC with different variables is not monotonic. Table 3 summarizes the information from the decision tree based on the range of the HEC values corresponding to the phase fractions and chemical contents. Node 13 and node 18 in Fig. 10 classify the highest HEC with either fully martensitic structure or the combination of martensite, lower bainite and ferrite. Meanwhile, node 8 in Fig. 10 classifies the lowest HEC with more than 31.5% ferrite, less than or equal to 13.7% lower bainite and a martensite volume fraction between 11.5% and 97.5%.
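The variance-reduction quantity that underlies the feature importance described above can be sketched as follows. Note the assumption: the split search shown here is a CART-style exhaustive search over thresholds, used only to illustrate the weighted variance reduction; it is not the significance-test-based selection of the conditional inference tree. The function names and the data are hypothetical and were chosen so that the sketch resembles the first split of a tree on martensite fraction.

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def weighted_variance_reduction(parent, left, right, n_total):
    """Variance reduction of a split, weighted by the probability of reaching the node."""
    n = len(parent)
    child = (len(left) * variance(left) + len(right) * variance(right)) / n
    return (n / n_total) * (variance(parent) - child)


def best_threshold(xs, ys):
    """Return (threshold, gain) of the split x <= threshold with the largest gain."""
    pairs = sorted(zip(xs, ys))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                                   # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = weighted_variance_reduction(ys, left, right, len(ys))
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical martensite fractions (vol.%) and HEC values (%).
martensite = [10, 30, 50, 70, 95, 100, 100, 100]
hec = [50, 35, 30, 40, 45, 72, 75, 76]

threshold, gain = best_threshold(martensite, hec)
print(f"split at martensite <= {threshold} vol.%, weighted variance reduction = {gain:.1f}")
```

Summing such gains over all nodes that split on the same variable gives one importance value per variable, and averaging those values over many trees gives a forest-level ranking.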

Discussion
The summary in Table 1 shows that only ferrite has a positive effect on hole expansion capacity, while all other phases have a negative effect. However, the analysis of a larger number of data, as presented in the present paper, shows more complicated effects that depend on the volume fractions of the different phases and are not simply positive or negative. This is mainly due to the limited range of data in the studies in Table 1. Most of the studies only observe a certain fraction range of certain phases, which does not represent the effect on HEC across the whole volume fraction range. The effect of phase fractions on HEC is complicated and cannot be expressed by simple monotonic functions.
Taking into account the analysis in the "Data analysis" section, in whichever way the data are looked at, the most important phases which contribute to HEC are ferrite, martensite and lower bainite. Considering that many studies relate the HEC to the difference between hard and soft phases, these three phases play the most important role among the hard and soft phases in steels, especially ferrite and martensite, which are most commonly seen as the softest and the hardest phase, respectively. Statistics show that the higher the fraction of the hard phase is, the higher the HEC is. This reflects that the HEC is a strength-related mechanical property. The HEC value shows a valley at intermediate volume fractions of martensite, which is possibly related to the minimum in fracture strain in dual-phase steels under similar martensite conditions [45,46]. This can be explained by damage nucleation and crack growth mechanics being favored by strength mismatch and the related increase in the local stress triaxiality. Meanwhile, certain combinations of phases also give high HEC, such as the combination of ferrite, martensite and bainite with a volume ratio of around 2:1:1. This high HEC can be accounted for by the accommodation of stress by this specific volume combination of hard and soft phases, where the hard phase gives the overall strength and the soft phase gives the ductility for expansion under stress without cracking. The ferrite/martensite combinations, however, do not perform very well. Although the analysis in this paper clearly shows the complicated relations between HEC and microstructure features, it is not possible to give a simple relation. However, with the help of deep learning, a reliable prediction (with an accuracy of ±16% on HEC, which is similar to the experimental accuracy) can be made from the combination of the volume fraction of each phase and the chemical content.
Still, the accuracy of the prediction model highly depends on the amount of the data gathered and the accuracy of the data. Even though the dataset used in this study is a large dataset in the context of materials science, it is definitely limited and small in the field of so-called big data and traditional machine learning. Nevertheless, the present study shows that meaningful results can also be achieved with limited datasets. The authors believe that significant improvement in the prediction model can be made if the data will be enhanced, both in the amount and in the quality.
In this study, since the obtained dataset only contains the phase volume fractions and the chemical composition, the data analysis and prediction of HEC are only based on these two microstructure features. Even without considering many other microstructure features, such as grain size distribution, texture and grain morphology, which are normally considered to have distinct impact on mechanical behavior, this study shows valuable results with limited materials information.

Conclusions
This study focuses on data acquired from the literature to investigate the relation of phase volume fractions and chemical compositions with hole expansion capacity. The findings in this paper can guide some new physical investigations to unravel the root causes of the HEC behavior and consequently to the development of better steels. The following conclusions are drawn based on the analysis from different perspectives.
- The effect of phase fractions on HEC is complicated and cannot be expressed by simple monotonic functions. For martensite, volume fractions between 20% and 70% will lead to a low HEC. HEC slightly decreases with an increasing volume fraction of ferrite. Around 30% bainite gives a high HEC.
- Certain phases make significant contribution to the HEC, most prominently, ferrite, martensite and lower bainite.
- The higher the volume fraction of harder phases is, the higher the HEC is.
- Purely martensitic microstructure or microstructure with lower bainite tends to have higher HEC compared to other combinations of phases. High HEC can also be achieved with the combination of ferrite, martensite and bainite with the volume ratio around 2:1:1.
- The applied deep learning model has better performance (with the prediction accuracy of ±16% on HEC) over the linear regression models and tree regression models on the prediction of HEC based on phase fraction and chemical content.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
In the linear prediction model, $x_j$ is the jth variable in the prediction point $x$ and $p$ is the number of independent variables. The estimate $\hat{\beta}_j$ is the corresponding coefficient in the Elastic Net, which minimizes an objective function consisting of the residual sum of squares and a shrinkage penalty. Here, $n$ is the number of data points, $x_{ij}$ is the ith observation corresponding to the jth variable and $y_i$ is the target mechanical property corresponding to the data point $x_i$. Different from the LASSO method used in the previous work [29], here the shrinkage penalty has two parts [47], namely the LASSO penalty (magnitude $\alpha$) and the Ridge penalty (magnitude $1 - \alpha$). The LASSO penalty is indifferent to the choice among a set of strong but correlated variables. The Ridge penalty, on the other hand, tends to shrink the coefficients of correlated variables toward each other. The Elastic Net penalty combines the two as a compromise [48]. The two regularization parameters ($\alpha$ and $\lambda_e$) are optimized within a certain tuning grid during the training process.
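For orientation, a form of the prediction model and of the Elastic Net objective that is consistent with the symbols defined above can be written as follows. The scaling factors $1/(2n)$ and $1/2$ follow a common software convention and are an assumption here; other conventions differ only by constant factors in the penalty.

```latex
% Linear prediction at a point x with p independent variables
\hat{y}(x) = \hat{\beta}_0 + \sum_{j=1}^{p} x_j \hat{\beta}_j

% Elastic Net objective: residual term plus LASSO (alpha) and Ridge (1 - alpha) penalties
\min_{\beta_0,\,\beta}\;
\frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^{2}
+ \lambda_e \Big[ \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
+ \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^{2} \Big]
```

Setting $\alpha = 1$ recovers the pure LASSO penalty and $\alpha = 0$ the pure Ridge penalty, so the mixing parameter moves continuously between variable selection and coefficient shrinkage.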

Conditional inference tree regression (ctree2)
A decision tree is a model in the form of a tree structure, which breaks the dataset into smaller and smaller subsets; hence, the tree structure is built up.
To build a tree structure, two main steps are needed: choosing the feature and finding the condition to split, i.e., the partitioning algorithm. The most popular implementations of recursive partitioning, such as 'CART' [49] and 'C4.5' [50], suffer from overfitting and from a selection bias toward covariates with many possible splits [51]. Therefore, conditional inference tree regression, also known as unbiased recursive partitioning, was introduced [51]. Instead of selecting the variable and deciding the split criterion based on Gini impurity [49] or information gain [50], it uses a significance test procedure, e.g., permutation tests [52]. Conditional inference trees have been shown to be well suited for both explanation and prediction.

Random forest regression (cforest)
A random forest is a meta-estimator (i.e., it combines the result of multiple predictions) which aggregates many decision trees. It is a bagging technique, i.e., bootstrap aggregation, which is done with random sampling with replacement and aggregation of the outputs at the end without preference to any model. Therefore, the cforest model used in this paper is the conditional random forest which can be simply seen as averaging multiple conditional inference tree results [48,51,53].
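The bagging principle described above can be sketched with very small trees. In the example below, each base model is a one-split regression tree (a stump) fitted by minimizing the squared error on a bootstrap sample, and the predictions are averaged with equal weight. This is a plain bagging illustration under these assumptions and not the conditional random forest of cforest; the function names and the data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, left_mean, right_mean)
    if best is None:
        # All x values in this sample are equal: predict the sample mean everywhere.
        mean_y = sum(ys) / len(ys)
        return float("inf"), mean_y, mean_y
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Fit each stump on a bootstrap sample (random sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_bagged(forest, x):
    """Aggregate by averaging all stump predictions without preference."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


# Hypothetical hard-phase fractions (vol.%) and HEC values (%).
hard_fraction = [10, 20, 30, 40, 60, 70, 80, 90]
hec = [30, 32, 31, 29, 60, 62, 61, 59]

forest = fit_bagged_stumps(hard_fraction, hec)
print(round(predict_bagged(forest, 15), 1), round(predict_bagged(forest, 85), 1))
```

Averaging over many resampled fits reduces the sensitivity of the prediction to individual data points, which is the motivation for using a forest instead of a single tree on a small dataset.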

Deep learning (keras)
Deep learning refers to deep neural networks. It is an artificial intelligence technique involving multiple units, called neurons, which are connected to each other like a web and process the data in a nonlinear way. The 15 input neurons build up the input layer, while the output layer, in this study, is just one neuron, i.e., hole expansion capacity. The hidden layers are in the middle. This kind of fully connected neural network is called a multilayer perceptron. Data flow from the input layer through the hidden layers and finally arrive at the output layer. The value $Y$ of each neuron is calculated from the neurons in the previous layer as [48,54]:

$$Y = F\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

where $w_i$ is the weight for the neuron with value $x_i$, $n$ is the number of neurons and $b$ is the bias of each layer, which is a constant for each layer. $F$ is the activation function, for which relu is used in this study; it adds complexity and nonlinearity to the neural network. While the network is trained by feeding it with input data, the weights and biases are adjusted to minimize the loss function by the technique called back propagation. The loss function, in this case, is the mean squared error between the HEC calculated by the network and the real HEC corresponding to the input microstructure.
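The forward calculation of such a multilayer perceptron can be sketched in plain Python. The layer sizes (15 inputs, hidden layers of 100 and 50 neurons, one output) follow the description in this paper. The random weight initialization, the per-neuron bias terms (as is usual for dense layers), the linear output neuron and the input values are assumptions of this sketch. Training by back propagation is not shown.

```python
import random


def relu(value):
    """Rectified linear unit: passes positive values, clips negative values to zero."""
    return value if value > 0.0 else 0.0


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: each output is F(sum_i w_i * x_i + b)."""
    outputs = []
    for weight_row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an assumed initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Propagate one input vector through all layers; relu on hidden layers only."""
    hidden = x
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        hidden = dense(hidden, weights, biases, None if is_output else relu)
    return hidden[0]


def mean_squared_error(predicted, real):
    """Loss function: mean squared difference between predicted and real values."""
    return sum((p - r) ** 2 for p, r in zip(predicted, real)) / len(real)


rng = random.Random(0)
layers = [init_layer(15, 100, rng), init_layer(100, 50, rng), init_layer(50, 1, rng)]

# Hypothetical input: 15 features (phase fractions and element contents), already scaled.
features = [rng.random() for _ in range(15)]
prediction = forward(features, layers)
print(prediction, mean_squared_error([prediction], [0.5]))
```

With untrained weights the output is arbitrary; training adjusts the weights and biases so that the loss over the training samples decreases.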