1 Introduction

Fatigue failure is the most prevalent failure mechanism in engineering structures that are subjected to long-term fluctuating loads [1, 2]. The fatigue life has previously been characterised as two independent processes, “crack initiation” and “crack growth” [3]. Fatigue crack growth (FCG) prediction is essential for the damage-tolerant design of engineering components [4, 5]. The stress intensity factor (SIF) is the main parameter characterising crack growth [6, 7]. Based on linear elastic fracture mechanics (LEFM), the earliest model was proposed by Paris and Erdogan [8] and had a profound impact on the field. However, Paris’ model performs poorly in the threshold and fast-fracture regions.

Various studies have proposed improvements to Paris’ model that account for additional factors governing the FCG rate; these can be separated into two types. In the first, an effective SIF (ΔKeff) based on the crack closure phenomenon was proposed to characterise the effect of the stress ratio on the FCG [9,10,11]. In the second, the FCG rate prediction model was established directly by numerical fitting, considering parameters such as the R-ratio, threshold value (ΔKth), and fracture toughness (KC) [12,13,14,15,16,17]. Although several semi-empirical models have been developed, they have some application restrictions, which are discussed in the following section. FCG rate prediction is a nonlinear multivariable problem.

Machine learning (ML) has powerful nonlinear processing and multivariate learning capabilities. It has been widely used for crack growth to solve complex nonlinear prediction problems [18,19,20,21,22,23,24,25]. Indeed, the radial basis function artificial neural network (RBF-ANN), backpropagation neural network (BPNN), extreme learning machine (ELM), fully connected neural network, random forest (RF), hidden Markov model (HMM), and long short-term memory (LSTM) all yield accurate life and crack growth predictions [26,27,28,29,30,31]. However, most current FCG rate models based on ML are uninterpretable black-box models that are difficult to apply in engineering practice. At the same time, interpretable models are often desirable because they give insight into the FCG process. Schmidt and Lipson [32] used symbolic regression (SR) to obtain interpretable formulas from pendulum motion test data. Even after the sin and cos components were removed from the initial function set, a Taylor series expansion of these two terms appeared in the final expression. In contrast to ordinary regression approaches, SR employs only one assumption: the relationship between the target and characteristic parameters can be expressed by combining various elementary functions using algebraic operations [33]. More importantly, SR is a white-box model whose output is composed of the predefined function symbols, constants, and input variables [34]. At present, SR has been applied in astronomy, materials science, and engineering [35,36,37,38], demonstrating the prospect of discovering interpretable models from data.

SR is a brute-force search method based on genetic programming [39]. The set of candidate mathematical function symbols, input variables, and constants is effectively infinite, and finding an equation that is both simple and accurate in such a large set is time-consuming. To constrain the search space, SR must rely on domain knowledge and insight. In this study, eight existing semi-empirical FCG rate models were analysed, and the domain knowledge contained in them was incorporated into the SR. An equation containing three variables, ΔK, R, and ΔKth, was derived through the SR from the FCG test data of Al-7055-T7511. The SR model is straightforward and suitable for engineering applications. This work also provides a fitting method for the SR model, which is used to fit the crack growth rate test results of six metallic materials. The accuracy and universality of the SR model were evaluated and compared with the prediction results of traditional FCG rate models.

2 Crack Growth Models

The generally accepted FCG theory, based on the relation between the FCG rates da/dN and ΔK, was proposed by Paris [8], and is given by Eq. (1).

$$\frac{{{\text{d}}a}}{{{\text{ d}}N}} = C \cdot ({\Delta }K)^{m} ,$$
(1)

where a and N represent the crack length and number of loading cycles, respectively, while C and m are material constants in Paris’ model. The range of the SIF is determined by ΔK = Kmax − Kmin, where Kmax and Kmin are the maximum and minimum stress intensity factors, respectively. In Figure 1, the FCG rates of identical materials at various R-ratios are shown as a series of essentially parallel curves associated with ΔK [14]. The FCG rate curve shifts to the left overall as the R-ratio increases. The R-ratio is defined by R = σmin/σmax, where σmax and σmin denote the maximum and minimum loading amplitudes, respectively. It is difficult to reflect the effect of the R-ratio on the FCG rate of a material using only ΔK. To overcome this issue, some well-known FCG rate models incorporate the contribution of the R-ratio; these are briefly reviewed in the following subsections. Furthermore, it is generally assumed that the crack does not grow when ΔK < ΔKth. Therefore, ΔKth in the FCG rate model is particularly important in practical engineering applications.
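As a minimal numerical illustration of Eq. (1), the following Python sketch evaluates the Paris relation and its log-linear form. The function names and the constants C and m are hypothetical values chosen only for illustration; they are not taken from any data set in this work.

```python
import math


def stress_intensity_range(k_max: float, k_min: float) -> float:
    """SIF range: delta_K = K_max - K_min."""
    return k_max - k_min


def paris_rate(delta_k: float, c: float, m: float) -> float:
    """Eq. (1): da/dN = C * (delta_K)^m."""
    return c * delta_k ** m


# Hypothetical material constants, for illustration only.
C_DEMO, M_DEMO = 1.0e-11, 3.0

delta_k = stress_intensity_range(k_max=12.0, k_min=2.0)
rate = paris_rate(delta_k, C_DEMO, M_DEMO)

# In log-log coordinates Eq. (1) is a straight line:
# ln(da/dN) = ln C + m * ln(delta_K)
log_rate = math.log(C_DEMO) + M_DEMO * math.log(delta_k)
```

The straight-line form in logarithmic coordinates is what makes the Paris region easy to fit, and it is also why ΔK alone cannot reproduce the shift of the curves with the R-ratio.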

Figure 1 Crack growth rate curves under different R-ratios

2.1 Forman’s Model

Forman et al. [12] modified Paris’ model to integrate the Paris region and fast-fracture region FCG behaviour, as shown in Eq. (2).

$$\frac{\text{d}a}{{\text{d}N}} = \frac{{C({\Delta }K)^{m} }}{{(1 - R)(K_{{\text{C}}} - K_{\max } )}},$$
(2)

where Kmax denotes the maximum SIF, and KC is the fracture toughness, which marks the transition to unstable crack growth. As Kmax approaches KC, the denominator approaches zero and the FCG rate increases significantly. Thus, the model matches the FCG behaviour in the fast-fracture region. However, Forman’s model is inadequate for forecasting FCG behaviour in the threshold region.
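The behaviour of Eq. (2) near KC can be checked with a short sketch. It assumes R = Kmin/Kmax with R < 1, so that Kmax = ΔK/(1 − R); the function name and all numerical values are hypothetical.

```python
def forman_rate(delta_k: float, r: float, k_c: float, c: float, m: float) -> float:
    """Eq. (2): da/dN = C * dK^m / ((1 - R) * (K_C - K_max))."""
    # Assumption: R = K_min / K_max and R < 1, hence K_max = dK / (1 - R).
    k_max = delta_k / (1.0 - r)
    if k_max >= k_c:
        raise ValueError("K_max has reached K_C: unstable crack growth")
    return c * delta_k ** m / ((1.0 - r) * (k_c - k_max))


# Hypothetical constants: the rate rises sharply as K_max approaches K_C = 30.
mid_rate = forman_rate(delta_k=10.0, r=0.0, k_c=30.0, c=1.0e-8, m=3.0)
near_fracture_rate = forman_rate(delta_k=29.0, r=0.0, k_c=30.0, c=1.0e-8, m=3.0)
```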

2.2 Elber’s Model

Elber [9] introduced the notion of crack closure and proposed ΔKeff as a driving force to characterise the influence of the R-ratio on the FCG rate. This model is expressed by Eqs. (3) and (4).

$$\frac{{{\text{d}}a}}{{{\text{ d}}N}} = C \cdot ({\Delta }K_{{{\text{eff}}}} )^{m} ,$$
(3)
$${\Delta }K_{{{\text{eff}}}} = K_{{{\text{max}}}} - K_{{{\text{op}}}} ,$$
(4)

where Kop is the SIF corresponding to the crack opening stress σop, which is determined by the load associated with a 2% deviation in the slope of the load–displacement curve [40]. The crack closure phenomenon has a more significant impact at a low R-ratio. With the crack opening Kop approaching the minimum SIF, the crack closure phenomenon can be ignored at a high R-ratio.

2.3 Kujawski’s Model

Kujawski [13] established a new driving force model for predicting the FCG rate of aluminium alloys. This model is not based on the crack closure effect, but selects the positive value of the SIF range (ΔK+) and Kmax as the driving force to explain the R-ratio effect on crack growth. This model can be described using Eqs. (5) and (6).

$$\frac{{{\text{d}}a}}{{{\text{ d}}N}} = C \cdot ({\Delta }K^{*} )^{m} ,$$
(5)
$${\Delta }K^{ * } = \left( {K_{\max } {\Delta }K^{ + } } \right)^{0.5} ,$$
(6)

where ΔK+ = ΔK when R ≥ 0, and ΔK+ = Kmax for R < 0. Subsequently, Eq. (6) is expanded into Eq. (7) based on the crack growth test data of the other metals [41, 42].

$${\Delta }K^{ * } = \left( {K_{\max } } \right)^{{\alpha_{{\text{K}}} }} \left( {{\Delta }K^{ + } } \right)^{{1 - \alpha_{{\text{K}}} }} ,$$
(7)

where 0 ≤ αK ≤ 1 is a fitting parameter determined by the material and geometry. Kujawski’s model is based on the assumption that when R < 0, negative stress does not contribute to crack growth. The impact of compressive loads on crack growth cannot be overlooked, following the observations of subsequent studies [43, 44].
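A minimal sketch of the driving force in Eqs. (6) and (7) is given below. The function name is hypothetical, and a positive Kmax is assumed so that R = Kmin/Kmax is well defined; with αK = 0.5 the expression reduces to Eq. (6).

```python
def kujawski_driving_force(k_max: float, k_min: float, alpha_k: float = 0.5) -> float:
    """Eq. (7): dK* = K_max^alpha_K * (dK+)^(1 - alpha_K)."""
    if k_max <= 0.0:
        raise ValueError("a positive K_max is assumed in this sketch")
    r = k_min / k_max
    # dK+ is the positive part of the SIF range: dK for R >= 0, K_max for R < 0.
    delta_k_plus = (k_max - k_min) if r >= 0.0 else k_max
    return k_max ** alpha_k * delta_k_plus ** (1.0 - alpha_k)


# R = 0.75: dK = 4, so dK* = sqrt(16 * 4) = 8.
high_r = kujawski_driving_force(k_max=16.0, k_min=12.0)
# R = -1: the compressive part is ignored, so dK* = K_max.
negative_r = kujawski_driving_force(k_max=10.0, k_min=-10.0)
```

The second call makes the model assumption explicit: for R < 0 the result does not depend on Kmin at all, which is the limitation noted above.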

2.4 Huang’s Model

To overcome the overestimated results in Kujawski’s model at high R-ratios, Huang et al. [14] developed an improved FCG model by introducing a correction factor M in the form of a piecewise function, as shown in Eqs. (8) and (9).

$$\frac{{{\text{d}}a}}{{{\text{ d}}N}} = C(M{\Delta }K)^{m} ,$$
(8)
$$M = \left\{ \begin{aligned} &(1 - R)^{{ - \beta_{1} }} { }( - 5{ \le }R < 0), \hfill \\& (1 - R)^{ - \beta } { }(0{ \le }R < 0.5), \hfill \\& \left( {1.05 - 1.4R + 0.6R^{2} } \right)^{ - \beta } { }(0.5{ \le }R{ < }1). \hfill \\ \end{aligned} \right.$$
(9)

where β and β1 are constants that depend on the material properties and environmental conditions, and β1 = 1.2 × β. For aluminium alloys and steels, the parameter β is set to approximately 0.7, while for Ti-6Al-4V, it is set to 0.5. Compared with Kujawski’s model, this model not only considers the contribution of the compressive load, but also provides more accurate results for R > 0.5. More accurate values of β and β1 can be acquired using numerous sets of test data. However, the FCG rate data with varying R-ratios are sometimes inadequate for engineering materials.
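The piecewise correction factor of Eq. (9) can be written directly as a function. The sketch below uses a hypothetical function name and the default β = 0.7 quoted above; it also shows that the two branches for R ≥ 0 meet continuously at R = 0.5.

```python
def huang_correction(r: float, beta: float = 0.7) -> float:
    """Eq. (9): correction factor M as a piecewise function of R."""
    beta_1 = 1.2 * beta
    if -5.0 <= r < 0.0:
        return (1.0 - r) ** (-beta_1)
    if 0.0 <= r < 0.5:
        return (1.0 - r) ** (-beta)
    if 0.5 <= r < 1.0:
        return (1.05 - 1.4 * r + 0.6 * r * r) ** (-beta)
    raise ValueError("Eq. (9) is defined only for -5 <= R < 1")


def huang_rate(delta_k: float, r: float, c: float, m: float, beta: float = 0.7) -> float:
    """Eq. (8): da/dN = C * (M * dK)^m."""
    return c * (huang_correction(r, beta) * delta_k) ** m


m_at_zero = huang_correction(0.0)      # M = 1, so Eq. (8) reduces to Eq. (1)
m_below_half = huang_correction(0.4999999)
m_at_half = huang_correction(0.5)      # both branches give 0.5^(-beta) here
```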

2.5 Zhan’s Model

Because the da/dN–ΔK curve resembles the parallel line under different R-ratios, Zhan et al. [15] proposed a simplified FCG rate prediction model, as shown in Eqs. (10) and (11).

$$\frac{\text{d}a}{{\text{d}N}} = C_{0} \cdot (\Phi \Delta K)^{{m_{0} }} ,$$
(10)
$${\Phi } = \exp (\alpha_{{\text{Z}}} R),\quad ( - 1 \le R < 1),$$
(11)

where C0 and m0 are the material constants corresponding to R = 0 in Huang’s model, and αZ is the correction factor of the R-ratio. It can be solved by Eq. (12).

$$\left. {\log \left( {\frac{\text{d}a}{{\text{d}N}}} \right)} \right|_{R \ne 0} - \left. {\log \left( {\frac{\text{d}a}{{\text{d}N}}} \right)} \right|_{R = 0} = \Delta \log \left( {\frac{\text{d}a}{{\text{d}N}}} \right) = m \cdot \alpha_{{\text{Z}}} \cdot R,$$
(12)

The solution of Zhan’s model is straightforward, which compensates for the limitation that Huang’s model is unsuitable for high-strength alloy steel. For most low-strength alloys, αZ is set to 0.65. For some high-strength alloys, such as titanium alloys, αZ is set to 0.75. Zhan suggested that the FCG rate curve under an arbitrary constant R-ratio could be used as the basic curve in the scarcity of test data with R = 0. This model is limited to Paris’ region, and the choice of a basic curve with varying R-ratios results in erroneous solutions of C0, m0, and αZ.
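Equations (10)–(12) can be verified numerically with the sketch below. The function name and the constants C0 and m0 are hypothetical, and the logarithm in Eq. (12) is assumed here to be the natural logarithm, for which the shift m0·αZ·R follows exactly from Eqs. (10) and (11).

```python
import math


def zhan_rate(delta_k: float, r: float, c0: float, m0: float,
              alpha_z: float = 0.65) -> float:
    """Eqs. (10)-(11): da/dN = C0 * (exp(alpha_Z * R) * dK)^m0."""
    if not -1.0 <= r < 1.0:
        raise ValueError("Eq. (11) is defined only for -1 <= R < 1")
    phi = math.exp(alpha_z * r)
    return c0 * (phi * delta_k) ** m0


# Hypothetical constants of the basic (R = 0) curve.
C0_DEMO, M0_DEMO = 1.0e-10, 3.0

base = zhan_rate(10.0, 0.0, C0_DEMO, M0_DEMO)
shifted = zhan_rate(10.0, 0.5, C0_DEMO, M0_DEMO)

# Eq. (12) with natural logarithms: the vertical shift between the two
# curves is m0 * alpha_Z * R, independent of dK (parallel curves).
log_shift = math.log(shifted) - math.log(base)
```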

2.6 NASGRO’s Model

Based on the crack closure theory, Mettu et al. [10] proposed the NASGRO model of crack growth rate suitable for the entire process of the threshold, Paris’, and fast-fracture regions. Its form is shown in Eq. (13).

$$\frac{\text{d}a}{{\text{d}N}} = C\left( {{\Delta }K_{{{\text{eff}}}} } \right)^{m} \frac{{\left( {1 - {\Delta }K_{{{\text{th}}}} /{\Delta }K} \right)^{p} }}{{\left( {1 - K_{\max } /K_{{\text{C}}} } \right)^{q} }},$$
(13)

where p and q describe the contribution parameters of the threshold and fast-fracture regions, respectively. Because only a small fraction of the fatigue life is spent in the fast-fracture region, Zhang et al. [45] simplified the NASGRO model using Eq. (14).

$$\frac{\text{d}a}{{\text{d}N}} = C\left({{\Delta }K_{{{\text{eff}}}} } \right)^{m} \left( {1 - \frac{{{\Delta }K_{{{\text{eff,th}}}} }}{{{\Delta }K_{{{\text{eff}}}} }}} \right)^{p} ,$$
(14)

where ΔKeff and ΔKeff,th denote the effective SIF and its threshold value, respectively. This can be expressed by Eqs. (15) and (16).

$${\Delta }K_{{{\text{eff}}}} = U{\Delta }K,$$
(15)
$${\Delta }K_{{\text{eff,th}}} = U{\Delta }K_{{{\text{th}}}} ,$$
(16)

where U is expressed in terms of Newman's crack closure function (f), and R in the form of Eq. (17) [11].

$$U = \left( {\frac{1 - f}{{1 - R}}} \right),$$
(17)

f is given by Eq. (18) as:

$$f = \left\{ {\begin{array}{*{20}c} {\max \left( {R,A_{0} + A_{1} R + A_{2} R^{2} + A_{3} R^{3} } \right){ }(R \ge 0),} \\ {A_{0} + A_{1} R{ }( - 2 \le R < 0),} \\ \end{array} } \right.$$
(18)

while Newman’s crack closure estimations are expressed by Eqs. (19)–(22).

$$A_{0} = \left( {0.8613 - 0.3387\alpha_{{\text{N}}} + 0.0465\alpha_{{\text{N}}}^{2} } \right)\left[ {\cos \left( {\frac{{(\pi /2)S_{\max } }}{{\sigma_{0} }}} \right)} \right]^{{\frac{1}{{\alpha_{{\text{N}}} }}}} ,$$
(19)
$$A_{1} = (1.047 - 0.233\alpha_{{\text{N}}} )\frac{{S_{\max } }}{{\sigma_{0} }},$$
(20)
$$A_{2} = 1 - A_{0} - A_{1} - A_{3} ,$$
(21)
$$A_{3} = 2A_{0} + A_{1} - 1,$$
(22)

Here, Smax represents the maximum stress during the load cycle. The flow stress σ0 was calculated as the average of the uniaxial yield strength and the ultimate tensile strength [46]. αN is the constraint factor used to account for the three-dimensional stress state [47] and depends on the material. The NASGRO model is highly accurate and is applicable to all three regions of the FCG process. However, it requires many parameters to be determined, and a large amount of test data is needed to establish it when crack closure is considered. The phenomena that cause crack closure include the crack plasticity region, crack surface roughness, fluid inside the crack, and corrosion deposits near the crack tip [48]. The crack closure phenomenon cannot be represented precisely in quantitative terms, because it is heavily influenced by slight variations in the crack path, ambient factors, loading conditions, and test methods [49]. Considering time and test costs, the model must be simplified to enhance its applicability to engineering problems.
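The chain of Eqs. (14)–(22) can be traced with the following sketch, which transcribes the coefficients exactly as printed above. The function names and the example values of Smax/σ0 and αN are hypothetical. Note that, because ΔKeff and ΔKeff,th share the same factor U, the ratio in Eq. (14) equals ΔKth/ΔK.

```python
import math


def newman_f(r: float, s_ratio: float, alpha_n: float) -> float:
    """Eqs. (18)-(22): crack closure function f; s_ratio = S_max / sigma_0."""
    a0 = ((0.8613 - 0.3387 * alpha_n + 0.0465 * alpha_n ** 2)
          * math.cos(0.5 * math.pi * s_ratio) ** (1.0 / alpha_n))
    a1 = (1.047 - 0.233 * alpha_n) * s_ratio
    a3 = 2.0 * a0 + a1 - 1.0
    a2 = 1.0 - a0 - a1 - a3
    if r >= 0.0:
        return max(r, a0 + a1 * r + a2 * r ** 2 + a3 * r ** 3)
    if -2.0 <= r < 0.0:
        return a0 + a1 * r
    raise ValueError("Eq. (18) is defined only for R >= -2")


def closure_ratio(r: float, s_ratio: float, alpha_n: float) -> float:
    """Eq. (17): U = (1 - f) / (1 - R)."""
    return (1.0 - newman_f(r, s_ratio, alpha_n)) / (1.0 - r)


def simplified_nasgro_rate(delta_k, delta_k_th, r, s_ratio, alpha_n, c, m, p):
    """Eqs. (14)-(16): da/dN = C * (U dK)^m * (1 - dK_th / dK)^p."""
    u = closure_ratio(r, s_ratio, alpha_n)
    return c * (u * delta_k) ** m * (1.0 - delta_k_th / delta_k) ** p


# Hypothetical inputs: S_max / sigma_0 = 0.3 and alpha_N = 2.0.
u_low_r = closure_ratio(0.1, 0.3, 2.0)
```

Even this reduced form needs Smax/σ0 and αN in addition to C, m and p, which illustrates the parameter burden discussed above.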

3 Symbolic Regression

3.1 High-Performance Symbolic Regression in Python

As previously mentioned, the existing FCG models are either solely applicable to the Paris region or excessively complex for considering the crack closure phenomenon, and require a large amount of test data for support. Recently, various scholars have used ML regression models to predict FCG rates. Both the ML and traditional regression models were based on specific model parameters. For example, the standard linear regression model is based on the linear relationship between the dependent and independent variables. As a nonlinear technique, artificial neural networks (ANN) also depend on the definition of the activation or transfer functions.

SR does not require assuming a specific form of the function between the independent and dependent variables in advance. Instead, it only requires that the connection between the independent and dependent variables can be described using function expressions. Mathematical operators, constants, and analytical functions are introduced as building blocks, and the optimal solution is sought among combinations of these blocks. Any equation in the SR can be expressed in the form of a binary tree. The SR tree structure comprises terminal nodes, which represent constants and variables, and internal nodes, which represent function and operation symbols. The SR tree representation of an equation is presented in Figure 2. The model becomes increasingly sophisticated as the length and depth of the tree increase. To avoid excessive tree development, the length and depth of the trees must be limited. SR is essentially genetic programming, which resembles Darwin's principle of natural selection. In genetic programming, the initial population is randomly generated from the defined functions, and individuals are screened according to their fitness. An individual with higher fitness is more likely to appear in the next generation. Individuals with higher fitness evolve through crossover and mutation. As shown in Figure 2, crossover refers to the random exchange of subtrees between the equations of the previous generation, which produces new offspring individuals. Mutation means that one or more nodes in the equation are randomly adjusted to maintain population variety and to explore equations that fit the data better. Fitness evaluation, screening, crossover, and mutation occur in each generation to produce new individuals, and this process repeats until the specified number of iterations is reached or the search is manually terminated. In this study, high-performance symbolic regression in Python (PySR) is used for the symbolic regression [50].
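The tree representation described above can be illustrated with a small sketch that stores an equation as nested tuples, evaluates it, and counts its nodes. This is not the PySR implementation; the tuple encoding, the function names, and the example equation are assumptions made only for illustration.

```python
import math
import operator

# Internal nodes: binary operators and one unary function (illustrative set).
BINARY = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}
UNARY = {"ln_abs": lambda v: math.log(abs(v))}


def evaluate(node, variables):
    """Recursively evaluate a tree; leaves are constants or variable names."""
    if isinstance(node, (int, float)):
        return float(node)
    if isinstance(node, str):
        return variables[node]
    op, *children = node
    args = [evaluate(child, variables) for child in children]
    return BINARY[op](*args) if len(args) == 2 else UNARY[op](*args)


def count_nodes(node):
    """Tree size: every operator, constant, and variable counts as one node."""
    if not isinstance(node, tuple):
        return 1
    return 1 + sum(count_nodes(child) for child in node[1:])


# Hypothetical individual: 2.0 * x0 + ln|x2|
tree = ("+", ("*", 2.0, "x0"), ("ln_abs", "x2"))
value = evaluate(tree, {"x0": 1.5, "x2": -math.e})
size = count_nodes(tree)
```

In this encoding, crossover would swap one tuple (subtree) between two individuals, and mutation would replace a single leaf or operator symbol.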

Figure 2 Structure trees of SR

The procedure for the domain knowledge-guided SR is shown in Figure 3. In PySR, each node has a Complexity of one, so the Complexity increases with the number of nodes: each variable, constant, or operator in the equation increases the Complexity by one. An equation must therefore be judged not only by its accuracy but also by its Complexity [51]. The SCORE is defined by Eq. (23).

$${\text{SCORE}} = \frac{{ - \left[ {\ln ({\text{curMSE}}) - \ln ({\text{lastMSE}})} \right]}}{{{\text{curComplexity}} - {\text{lastComplexity}}}},$$
(23)

where curMSE and lastMSE are the mean square errors (MSE) of the current and previous individuals, respectively, while curComplexity and lastComplexity represent the complexities of the current and previous individuals, respectively.
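Equation (23) can be evaluated with a one-line function. The sketch below uses a hypothetical function name and made-up MSE values; it shows that the SCORE is the drop in ln(MSE) per unit of added Complexity.

```python
import math


def score(cur_mse: float, last_mse: float,
          cur_complexity: int, last_complexity: int) -> float:
    """Eq. (23): negative change of ln(MSE) per unit of added Complexity."""
    return (-(math.log(cur_mse) - math.log(last_mse))
            / (cur_complexity - last_complexity))


# Made-up values: ln(MSE) drops by 2 while the Complexity grows by 2.
demo_score = score(cur_mse=1.0, last_mse=math.e ** 2,
                   cur_complexity=7, last_complexity=5)
```

A large SCORE therefore marks an equation that buys a substantial accuracy gain with few extra nodes, whereas a SCORE near zero indicates added nodes that barely reduce the loss.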

Figure 3 Procedure of domain knowledge-guided SR

The main advantage of SR is that its outputs are explicit formulae that can be interpreted. However, SR based on genetic programming is essentially a random search process with an almost limitless search space, and a brute-force search without any preconditions may consume a lot of time and memory. Therefore, it is necessary to study the existing semi-empirical FCG rate models to identify specific conditions that can be used to develop an FCG rate model based on symbolic regression.

3.2 Domain Knowledge-Guided

The parameters of an FCG rate model are usually solved after taking the logarithm of both sides of the equation. The corresponding results for the eight FCG rate models are shown in Table 1. The models are represented in SR tree form for convenience of analysis, as shown in Figure 4. Figure 4 demonstrates that, regardless of the type of FCG rate model, the constant term lnC is included as a compensation term whose value varies between models, so that the crack growth rate equation can be expressed as ln(da/dN) = g(•). The variables in g(•), all of which affect the crack growth rate of the material, include ΔK, Kmax, Kop, f, ΔK+, R, ΔKth and KC. Kmax and ΔK+ can each be represented by a function of ΔK and R. Kop and f are crack closure parameters that must be determined from FCG test data. As described above, a precise quantitative description of crack closure is not feasible [52]. The threshold and Paris regions dominate the crack growth process most of the time, whereas the fast-fracture region accounts for only a minor portion of the crack growth time. Therefore, the five parameters Kmax, ΔK+, Kop, f and KC were excluded from the present model. In the above eight FCG rate models, the ΔK and ΔKth terms usually appear in the form of lnΔK and ln(1-ΔKth/ΔK). The R term may appear in the form of ln(1-R), R or (1-R); therefore, three forms of the R term were tried to analyse the R-ratio contribution in establishing the FCG rate model. The remaining lnC, m, αZ, p and q are constant terms that can be generated by SR. Moreover, traditional semi-empirical FCG rate models contain Paris’ term, lnC + m × lnΔK. Therefore, the equation generated by SR should include Paris’ term.

Table 1 Eight equations of traditional FCG rates prediction models
Figure 4 SR trees of eight traditional models equation for FCG rates prediction and formula structures

Thus, the FCG rate model established by the domain knowledge-guided SR is mainly related to ΔK, ΔKth and R, which can be inferred from Eq. (24).

$$\ln \frac{{{\text{d}}a}}{{{\text{d}}N}} = g\left( {\ln \Delta K,\ln \left( {1 - R} \right){ }{\text{or }}\,R{ }\,{\text{or (1 - }}R),\ln (1 - \frac{{\Delta K_{{{\text{th}}}} }}{\Delta K})} \right),$$
(24)

The FCG model established in the present work primarily includes three variable parameters, which are represented by the SR tree, as shown in Figure 4. Previously, researchers proposed a connection between the three parameters based on experience or test data. Imposing such a relationship between the three parameters by hand may affect the accuracy of the resulting model. In this study, PySR was used to explore the relationships between the three variable parameters. Here, x0, x1, and x2 represent lnΔK, ln(1-R) or R or (1-R), and ln(1-ΔKth/ΔK), respectively.

In the present study, the SR model was established using PySR, and the model parameters are listed in Table 2. The data used were the Al-7055-T7511 FCG test data obtained from the literature [14]. Niteration denotes the number of iterations, which was set to a large value (2000) here. PySR reports the best offspring individuals in real time, and the training process was manually terminated after an interpretable and highly accurate individual had been identified. The operators ' + ', ' - ', ' × ', ' / ' and 'ln_abs()' were used. To prevent the individuals generated by SR from becoming overly complex, the maximum Complexity was set to 20, which means that the total number of operators, constants, and variables in an equation cannot exceed 20. A nested logarithmic term such as ln(ln(•)+•) does not appear in the semi-empirical FCG rate models; therefore, to obtain an interpretable solution, the argument of the ln_abs() function was constrained to contain at most one variable. The MSE was used as the loss function of the SR to judge fitness. The Pearson correlation coefficient (r) was adopted to evaluate the degree of fit between the predicted and test values [53, 54], and is defined by Eq. (25):
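The settings described above might be expressed in PySR roughly as follows. This is a hedged sketch rather than the configuration actually used in this work: the keyword names are those of `PySRRegressor` as we understand its public interface, the custom operator `ln_abs` is defined here by hand, and mapping "at most one variable inside ln_abs()" to `constraints={"ln_abs": 1}` is our assumption. The MSE loss is not set explicitly because squared error is, to our understanding, the PySR default.

```python
def pysr_settings() -> dict:
    """Keyword arguments mirroring the settings described in the text."""
    return {
        "niterations": 2000,                      # large; stop manually when satisfied
        "binary_operators": ["+", "-", "*", "/"],
        # Custom unary operator written as a Julia one-liner (assumed definition).
        "unary_operators": ["ln_abs(x) = log(abs(x))"],
        "maxsize": 20,                            # maximum Complexity of an equation
        # Assumption: argument of ln_abs limited to Complexity 1 (a single node).
        "constraints": {"ln_abs": 1},
    }


def build_regressor():
    """Create the regressor; requires the optional pysr and sympy packages."""
    import sympy
    from pysr import PySRRegressor

    return PySRRegressor(
        # Tells sympy how to render the custom operator in exported equations.
        extra_sympy_mappings={"ln_abs": lambda x: sympy.log(sympy.Abs(x))},
        **pysr_settings(),
    )
```

With the packages installed, `build_regressor().fit(X, y)` would be called with X holding the three columns (x0, x1, x2) and y holding ln(da/dN); the candidate equations on the Pareto front are then listed in the fitted model's `equations_` table.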

$$r = \frac{{\sum \left( {y_{{{\text{pre}}}} - \overline{y}_{{{\text{pre}}}} } \right)\left( {y_{{{\text{test}}}} - \overline{y}_{{{\text{test}}}} } \right)}}{{\sqrt {\sum \left( {y_{{{\text{pre}}}} - \overline{y}_{{{\text{pre}}}} } \right)^{2} } \sqrt {\sum \left( {y_{{{\text{test}}}} - \overline{y}_{{{\text{test}}}} } \right)^{2} } }},$$
(25)

where ypre and ytest denote the predicted and test values of the output separately, and \(\overline{{y}}\)pre and \(\overline{{y}}\)test denote the mean of two variables.
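For reference, the two evaluation quantities used here, the MSE and the Pearson correlation coefficient of Eq. (25), can be computed with the standard library alone. The function names and the sample values below are illustrative only.

```python
import math


def mean_squared_error(pred, test):
    """MSE between predicted and test values (the SR loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, test)) / len(pred)


def pearson_r(pred, test):
    """Eq. (25): Pearson correlation coefficient between two samples."""
    n = len(pred)
    mean_pred = sum(pred) / n
    mean_test = sum(test) / n
    cov = sum((p - mean_pred) * (t - mean_test) for p, t in zip(pred, test))
    dev_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in pred))
    dev_test = math.sqrt(sum((t - mean_test) ** 2 for t in test))
    return cov / (dev_pred * dev_test)


# Illustrative values of ln(da/dN): predictions offset from the test data.
y_test = [-20.0, -18.5, -17.0, -15.0]
y_pred = [-19.5, -18.0, -16.5, -14.5]

r_value = pearson_r(y_pred, y_test)
mse_value = mean_squared_error(y_pred, y_test)
```

The example also shows why both measures are reported: a constant offset leaves r equal to one while the MSE remains nonzero.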

Table 2 Basic settings for PySR

The input characteristics X were lnΔK, ln(1-R) or R or (1-R), and ln(1-ΔKth/ΔK), and the output characteristic Y was ln(da/dN). Because the output feature Y of PySR was the logarithmic form of the FCG rate, ln(da/dN), the quantitative evaluation parameters used in the present work were all based on the value of feature Y.
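The transformation from raw test quantities to the SR features can be sketched as below. The function name and the `r_form` switch are hypothetical conveniences for selecting one of the three forms of the R term.

```python
import math


def build_features(delta_k: float, r: float, delta_k_th: float,
                   r_form: str = "1-R") -> tuple:
    """Return (x0, x1, x2) = (ln dK, R-term, ln(1 - dK_th / dK))."""
    if delta_k <= delta_k_th:
        raise ValueError("x2 is defined only for dK > dK_th")
    x0 = math.log(delta_k)
    if r_form == "ln(1-R)":
        x1 = math.log(1.0 - r)
    elif r_form == "R":
        x1 = r
    elif r_form == "1-R":
        x1 = 1.0 - r
    else:
        raise ValueError("r_form must be 'ln(1-R)', 'R' or '1-R'")
    x2 = math.log(1.0 - delta_k_th / delta_k)
    return x0, x1, x2


def build_target(rate: float) -> float:
    """Output feature Y = ln(da/dN)."""
    return math.log(rate)


features = build_features(delta_k=10.0, r=0.5, delta_k_th=5.0, r_form="R")
```

Note that x2 is undefined at and below the threshold, so data points with ΔK ≤ ΔKth cannot be included in the feature matrix.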

Moreover, integrating other FCG rate models provided more information for the SR process. The eight collected FCG rate models were suitable for a wide range of R-ratios and exhibited a high prediction ability for a variety of materials. Therefore, the eight FCG rate models were considered reliable references for establishing the SR model.

4 Results and Extensions

4.1 Symbolic Regression Results and Analysis

In this section, the FCG test data of Al-7055-T7511 are used for the SR. The Pareto front in Figure 5 illustrates the trade-off between the equation complexity, defined by the number of nodes in the SR tree, and the MSE. To make the results more intuitive, the loss is plotted on a logarithmic scale. Figure 5 shows that the loss decreases as the equation complexity increases, which represents an improvement in the accuracy of the regression. As shown in Figure 5, the losses of the equations generated by the three approaches are nearly identical while the complexity is less than 11. When the complexity exceeds 12, the loss of the equations obtained with ln(1-R) exceeds that of the other two approaches, indicating that the results obtained with R or (1-R) are more accurate than those obtained with ln(1-R). Once the equation complexity reaches 14, the loss obtained using R or (1-R) decreases only marginally with more complex equations.

Figure 5 Pareto front of best equations showing Complexity and Loss

Tables 3, 4 and 5 show the detailed equations obtained by the SR. According to Tables 3, 4 and 5, the equations with a complexity of less than 11 obtained by the three methods have the same form, except for the equation of complexity 9. The equations of complexities 9 and 12 obtained with R or (1-R) can take the same form after numerical transformation. The complexity 6 equation has the highest SCORE, 7.903765, indicating that it offers the best ratio of precision gained to complexity added when compared with the other equations. From the standpoint of model efficiency, the complexity 6 equation achieves low complexity at acceptable accuracy and is the optimum solution for SR. However, this equation contains only one characteristic variable, x2, without x0 and x1, whereas this study aims to build an FCG rate model with three characteristic variables: ΔK, R and ΔKth. Furthermore, we found that this equation and the complexity 9 equation share some similarities with Zhan's model, which supports the reliability of the SR results. The forms of the equations derived by the three approaches differ when the complexity exceeds 14. The equations derived from the three subtrees are therefore analysed separately below.

Table 3 Best equation obtained by PySR using ln(1-R), where x0, x1, and x2 represent lnΔK, ln(1-R), and ln(1-ΔKth/ΔK), respectively

4.1.1 Symbolic Regression Results by ln(1-R)

Table 3 lists the detailed equations for ln(1-R). When the complexity is less than 11, the characteristic variables of each equation appear alone, and there is no x1 term related to R. Those equations are inconsistent with the purpose of this study. Other equations lack Paris’ term which is necessary for traditional semi-empirical equations. Thus, the equations obtained using ln(1-R) do not contain the objective equations of this study.

4.1.2 Symbolic Regression Results by R

As shown in Table 4, equations with a complexity of less than 11 are consistent with those obtained by ln(1-R) and are not analysed in this section. Only the fitting coefficient values differ between the equations of complexities 12 and 14: in the complexity 12 equation the constant coefficients of x0 and ln(|x2|) are the same, and allowing them to differ reduces the loss, giving a SCORE of 1.431851. Thus, the complexity 14 equation was chosen over the complexity 12 equation. Equations with complexity greater than 16 contain the Paris’ term necessary for the FCG rate model. However, their x1 term takes the form ln(|x1|) or 1/x1, which is singular at R = 0. Therefore, only the complexity 14 model can be regarded as an SR-undetermined model.

Table 4 Best equation obtained by PySR using R, where x0, x1, and x2 represent lnΔK, R, and ln(1-ΔKth/ΔK), respectively

4.1.3 Symbolic Regression Results by (1-R)

As shown in Table 5, the equations whose complexity is less than 12 are consistent with those obtained using R. Note that, following a numerical operation, the equations of complexities 12 and 14 given by R or (1-R) can assume the same form. As in Section 4.1.2, the complexity 14 model can be regarded as an SR-undetermined model. In addition, compared with the complexity 9 equation, the complexity 13 equation adds a threshold subtree, a constant multiplied by x2, and the constant multipliers of x0 and x1 are adjusted. Therefore, the complexity 13 equation can also be chosen as an SR-undetermined model. Moreover, the complexity 20 equation has the highest accuracy in this round of SR. Nonetheless, its x0×x1 term has no counterpart in the standard semi-empirical FCG rate models and is difficult to justify. The complexity 18 equation also lacks interpretability for the term. The complexity 16 equation likewise contains an x1×x2 term that cannot be explained from the traditional semi-empirical FCG rate models. Thus, the complexity 13 and complexity 14 equations can be regarded as SR-undetermined models.

Table 5 Best equation obtained by PySR using (1-R), where x0, x1, and x2 represent lnΔK, (1-R), and ln(1-ΔKth/ΔK), respectively

4.2 Equation Selection and Extension

Following the description provided in Section 4.1, there are currently three SR-undetermined models. The best equation is selected as the final model in this section. Because the equations of complexity 14 by R and (1-R) could be converted to each other numerically, they have been treated as complexity 14 equations. The complexity 14 equations had a higher SCORE than the complexity 13 equations, indicating that the former conserved more equation space. Moreover, the complexity 14 equation was more precise than the complexity 13 equation. The two equations primarily differed in the processing of threshold terms. In the threshold term of the complexity 13 equation, the constant term multiplied by x2 was added as compensation, followed by adding the constant term multiplied by ln(x2).

Figure 6 shows the effect of the x2 coefficient on the FCG process in the two SR-undetermined models. The influence of the threshold value on FCG is known to be mainly concentrated in the threshold region, and it diminishes rapidly after the FCG enters the Paris region. Therefore, the x2 term should approach zero in the second half of crack growth. As illustrated in Figure 6a, the x2 correction factor of the complexity 13 equation approaches the zero baseline as ΔK increases. However, the x2 correction factor of the complexity 14 equation maintains an increasing trend. After the x2 term is removed from the two SR-undetermined models, the complexity 13 equation in Figure 6b still correlates well with the test data in the Paris region, whereas the results of the complexity 14 equation deviate significantly from the test data. Thus, considering the equation fitness, complexity, and interpretability, the complexity 13 equation was selected as the final model derived from the domain knowledge-guided SR, and is defined as the SR model.

Figure 6 Effect of the x2 term coefficient on the Al-7055-T7511 (R = −1) FCG process in two SR-undetermined models: (a) the x2 term correction factor versus ΔK, (b) predicted curves and test data after removing the x2 term

The complexity 13 equation retains the foundation of Zhan's model and adds a term containing the threshold value. This threshold term resembles that of the NASGRO model and can be regarded as compensating for the absence of a threshold value in Zhan’s model. Replacing the numerical constants in the model with constant coefficients, the SR model is expressed as Eq. (26).

$$\begin{aligned} \ln \left( {\frac{\text{d}a}{{\text{d}N}}} \right) &= \ln C + m\ln ({\Delta }K) + q\ln \left( {1 - \frac{{{\Delta }K_{{{\text{th}}}} }}{{{\Delta }K}}} \right) + \alpha (1 - R) \hfill \\ &= \ln \left[ {C \cdot e^{\alpha (1 - R)} \cdot ({\Delta }K)^{m} \cdot \left( {1 - \frac{{{\Delta }K_{{{\text{th}}}} }}{{{\Delta }K}}} \right)^{q} } \right], \hfill \\ \end{aligned}$$
(26)

Exponentiating both sides of the equation, we obtain Eq. (27).

$$\frac{\text{d}a}{{\text{d}N}} = C \cdot e^{\alpha (1 - R)} \cdot ({\Delta }K)^{m} \cdot \left( {1 - \frac{{{\Delta }K_{{{\text{th}}}} }}{{{\Delta }K}}} \right)^{q} ,$$
(27)

Defining α′ = α/m, the SR model can be rewritten as Eq. (28).

$$\frac{\text{d}a}{{\text{d}N}} = C \cdot (e^{{\alpha^{\prime}(1 - R)}} {\Delta }K)^{m} \cdot \left( {1 - \frac{{{\Delta }K_{{{\text{th}}}} }}{{{\Delta }K}}} \right)^{q} ,$$
(28)

Therefore, an FCG rate model considering ΔK, R and ΔKth was obtained using domain knowledge-guided SR. The coloured markers in Figure 7 represent the FCG test rates of Al-7055-T7511 under different R-ratios, and the curves of the corresponding colours represent the predictions of the SR model. Figure 7 shows that the test results are consistent with those predicted by the SR model, and the MSE between ln(da/dN)pre and ln(da/dN)test is 0.13260567.
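The equivalence of Eqs. (27) and (28) under α′ = α/m can be confirmed numerically. In the sketch below the function names and all parameter values are hypothetical; they are not the fitted coefficients of Al-7055-T7511.

```python
import math


def sr_rate_eq27(delta_k, r, delta_k_th, c, m, q, alpha):
    """Eq. (27): C * exp(alpha (1 - R)) * dK^m * (1 - dK_th / dK)^q."""
    return (c * math.exp(alpha * (1.0 - r)) * delta_k ** m
            * (1.0 - delta_k_th / delta_k) ** q)


def sr_rate_eq28(delta_k, r, delta_k_th, c, m, q, alpha_prime):
    """Eq. (28): C * (exp(alpha' (1 - R)) * dK)^m * (1 - dK_th / dK)^q."""
    return (c * (math.exp(alpha_prime * (1.0 - r)) * delta_k) ** m
            * (1.0 - delta_k_th / delta_k) ** q)


# Hypothetical parameters, for illustration only.
params = {"c": 1.0e-10, "m": 3.0, "q": 0.5}
alpha = -1.5

rate_27 = sr_rate_eq27(10.0, 0.1, 3.0, alpha=alpha, **params)
rate_28 = sr_rate_eq28(10.0, 0.1, 3.0, alpha_prime=alpha / params["m"], **params)
```

Written as Eq. (28), the factor exp(α′(1 − R)) acts on ΔK in the same way as Φ in Zhan's model, while the last factor vanishes as ΔK approaches ΔKth.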

Figure 7

SR model prediction curves and test data for Al-7055-T7511

In logarithmic form, Eq. (28) shows that the response variable ln(da/dN) is a linear function of the explanatory variables (ln ΔK, 1 − R, ln(1 − ΔKth/ΔK)); that is, the SR model is a multivariate linear function of these variables. The SR model has four undetermined parameters, which are divided into the partial regression coefficients K = (K1, K2, K3) and the constant term B. Hence, multiple linear regression (MLR), as shown in Eqs. (29) and (30), can be used to obtain the fitting parameters.

$$Y = B + {\varvec{K}}X,$$
(29)
$$\ln \left( {\frac{\text{d}a}{{\text{d}N}}} \right) = \ln C + m\ln ({\Delta }K) + \alpha (1 - R) + q\ln \left( {1 - \frac{{{\Delta }K_{{{\text{th}}}} }}{{{\Delta }K}}} \right),$$
(30)

where the partial regression coefficients are K = (K1, K2, K3) = (m, α, q) and the constant term is B = ln C. The MLR method allows the SR model to be extended to other materials; the next section applies the SR model to other materials and evaluates it against the semi-empirical models.
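The MLR fit of Eq. (30) can be sketched with ordinary least squares. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `fit_sr_model`, the per-R-ratio thresholds, and the "true" parameters are all hypothetical, and the data are synthetic and noise-free rather than test data.

```python
import numpy as np

def fit_sr_model(delta_k, r_ratio, delta_k_th, dadn):
    """Fit ln C, m, alpha, q of Eq. (30) by ordinary least squares."""
    delta_k = np.asarray(delta_k, dtype=float)
    r_ratio = np.asarray(r_ratio, dtype=float)
    delta_k_th = np.asarray(delta_k_th, dtype=float)
    y = np.log(np.asarray(dadn, dtype=float))
    # Design matrix columns: 1 (for B = ln C), ln(dK), 1 - R, ln(1 - dK_th/dK).
    X = np.column_stack([
        np.ones_like(delta_k),
        np.log(delta_k),
        1.0 - r_ratio,
        np.log(1.0 - delta_k_th / delta_k),
    ])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ln_c, m, alpha, q = coef
    return {"C": float(np.exp(ln_c)), "m": float(m),
            "alpha": float(alpha), "q": float(q)}

# Synthetic, noise-free data generated from made-up parameters (not test data).
true = {"C": 1e-10, "m": 3.0, "alpha": 0.6, "q": 0.5}
thresholds = {-1.0: 3.0, 0.1: 2.0, 0.5: 1.5}  # hypothetical dK_th per R-ratio
dk, rr, dkth = [], [], []
for r_val, th in thresholds.items():
    grid = np.linspace(1.05 * th, 30.0, 20)  # keep dK strictly above dK_th
    dk.extend(grid)
    rr.extend([r_val] * grid.size)
    dkth.extend([th] * grid.size)
dk, rr, dkth = map(np.array, (dk, rr, dkth))
rate = (true["C"] * np.exp(true["alpha"] * (1 - rr))
        * dk ** true["m"] * (1 - dkth / dk) ** true["q"])

params = fit_sr_model(dk, rr, dkth, rate)
```

Every data point must satisfy ΔK > ΔKth, since the regressor ln(1 − ΔKth/ΔK) is undefined otherwise. How points lying exactly at the threshold are handled is not specified here, so the sketch simply avoids them.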

4.3 Performance Evaluation and Model Comparison

To demonstrate the effectiveness of the SR model, the following examples use FCG rate test data from the literature, with the material fitting coefficients obtained using MLR. The test data cover the titanium alloys Ti-10V-2Fe-3Al [55] and Ti-6Al-4V [56], Cr-Mo-V steel [57], and the aluminium alloys LC9cs [58], Al-2324-T3 [43] and Al-6013-T651 [59]. In this section, the ΔK corresponding to an FCG rate of the order of 10⁻¹⁰ m·cycle⁻¹ was taken as ΔKth. Because the minimum crack growth rate of some test data was lower than 10⁻¹⁰ m·cycle⁻¹, the ΔK corresponding to the minimum FCG rate was defined as ΔKth for those data to ensure the integrity of the test data. The fitting parameters and correlation coefficients of the different materials obtained by MLR are listed in Table 6. The value of α in Table 6 is the weight term of the R-ratio effect. In Zhan's model, αZ is close to 0.75 for some high-strength metallic materials such as titanium alloys and is set to 0.65 for other metallic materials [15]. In the SR model, however, α does not take a fixed value. Zhan's model ignores the threshold value and arbitrarily selects the test data under a constant R-ratio as the basic crack growth rate curve for fitting, whereas the MLR method adopted by the SR model considers the influence of the variables more comprehensively.

Table 6 SR model fitting parameters for various materials

The coloured data points in Figure 8 represent the FCG test rates at different R-ratios, and the curves of the corresponding colour represent the predictions of the SR model. According to Figure 8, the majority of the test data lie close to the SR model's prediction curves, and the MSE between prediction and test is less than 0.1. Most of the test data points for the titanium alloys Ti-6Al-4V and Ti-10V-2Fe-3Al are consistent with the prediction curves, as shown in Figure 8a and b, indicating that the SR model is suitable for titanium alloys. The agreement between the test data and the prediction curves is less satisfactory as the FCG process approaches the fast-fracture region, because the SR model considers only the threshold and Paris regions and ignores the fast-fracture region of crack growth. Figure 8c shows that the prediction curves agree well with the test data for Cr-Mo-V steel, which indicates that the SR model is appropriate for steel. According to Figure 8d, e, and f, the prediction curves for the aluminium alloys LC9cs, Al-2324-T3 and Al-6013-T651 also agree well with the test data, indicating the suitability of the SR model for aluminium alloys.

Figure 8

SR model prediction curves and test data for six materials: a Ti-6Al-4V, b Ti-10V-2Fe-3Al, c Cr-Mo-V steel, d LC9cs, e Al-2324-T3, f Al-6013-T651

The crack growth rates predicted by the SR model were satisfactory for all the above materials and cases. Nevertheless, the R-ratios of the FCG test data used in the present work lay in the range −1 ≤ R < 1, so the good prediction performance of the model is demonstrated only within this range. The prediction of the crack growth rate for R < −1 requires further verification.

Furthermore, three FCG models, namely Kujawski's model, Huang's model, and Zhan's model, were chosen for comparison with the SR model. The other FCG models introduced above were not evaluated in the present work, because the crack closure and KC test data were insufficient and because Paris' model cannot predict the crack growth rate at different R-ratios. Table 7 summarises the MSE values of the four models for the various materials and R-ratios, and Figure 9 compares the test and predicted values of the four FCG rate prediction models.

Table 7 MSE of the four models with different materials
Figure 9

Predicted versus test FCG rates for the four models: a SR model, b Kujawski’s model, c Huang’s model, d Zhan’s model

As shown in Table 7, the MSE of the SR model's predictions is smaller than those of the other three models for the various materials, which demonstrates the accuracy of the SR model in FCG rate prediction. Figure 9 shows that the other three models predict well in the Paris region but not in the threshold region, whereas the SR model also predicts the FCG rates in the threshold region well. In general, the closer the evaluation parameter r is to 1, the better the global prediction performance of a model. For the 850 groups of FCG test data with different R-ratios, the values of r for the four prediction models are 0.9921 (SR model), 0.9771 (Kujawski's model), 0.9775 (Huang's model), and 0.9781 (Zhan's model). The SR model thus exhibits the highest global prediction precision.
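The two evaluation measures can be sketched as follows. The MSE is taken between ln(da/dN)pre and ln(da/dN)test, as stated for Figure 7. Treating r as the Pearson correlation coefficient between the same two log-rate vectors is an assumption made here for illustration, and the function names and sample rates are hypothetical.

```python
import numpy as np

def log_mse(rate_pred, rate_test):
    """Mean squared error between ln(da/dN) predicted and ln(da/dN) tested."""
    diff = np.log(np.asarray(rate_pred, float)) - np.log(np.asarray(rate_test, float))
    return float(np.mean(diff ** 2))

def log_pearson_r(rate_pred, rate_test):
    """Pearson correlation of the log rates (assumed definition of r)."""
    lp = np.log(np.asarray(rate_pred, float))
    lt = np.log(np.asarray(rate_test, float))
    return float(np.corrcoef(lp, lt)[0, 1])

# Made-up rates: every prediction is off by a constant factor of e^0.1,
# so each log residual is 0.1 while the log rates stay perfectly correlated.
rate_test = np.array([1e-9, 1e-8, 1e-7])
rate_pred = rate_test * np.exp(0.1)

mse = log_mse(rate_pred, rate_test)
r = log_pearson_r(rate_pred, rate_test)
```

The example illustrates why both measures are reported: a constant multiplicative bias leaves r at 1 but still produces a non-zero MSE.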

5 Discussion

The evaluation and the comparison with previous models indicate that the proposed FCG rate prediction model based on domain knowledge-guided symbolic regression is suitable for predicting the threshold and Paris regions at different R-ratios. The SR model does not condense the FCG test data into a narrow band around a constant-R-ratio crack growth rate curve. Instead, the FCG rate prediction model is built directly on the test data. As demonstrated above, the SR model provides more accurate predictions in the threshold region than the three traditional semi-empirical FCG rate models.

In addition, the domain knowledge-guided symbolic regression proposed in this study can serve as a general model construction method in research on crack growth prediction. The resulting FCG rate prediction model has the advantage of involving fewer subjective factors. Unlike semi-empirical FCG rate prediction models, which researchers develop from observed test phenomena, the SR model is driven primarily by test data under domain knowledge guidance. This reduces human influence on the SR model and ensures the interpretability of the model within the framework of the traditional semi-empirical FCG rate models. The successful implementation of the SR model demonstrates the feasibility of domain knowledge-guided SR in the construction of FCG rate models. Owing to its data-driven adaptability, domain knowledge-guided SR can develop more accurate models than the traditional FCG rate-modelling strategy based on experience and inspiration. Furthermore, because it considers the substructure of the traditional semi-empirical models, domain knowledge-guided SR can establish more interpretable models than pure numerical regression modelling. In contrast, traditional ML methods are not only less interpretable, but their training results also apply only within the data space of the training set. In this study, the SR model was constructed by training on the Al-7055-T7511 FCG test data, and MLR was used to extend the SR model accurately to other materials. The R-ratios of the FCG test data used to train and evaluate the SR model lay in the range −1 ≤ R < 1, and its prediction performance for R < −1 requires further investigation.

Note that, despite being built on domain knowledge, the SR model is still data-driven. Therefore, more research into the physical meaning of each subtree structure is required for a better understanding of the crack growth process. Furthermore, although the present study considers only ΔK, R, and ΔKth, other useful domain knowledge, such as the crack closure factor f, may provide guidance for extending the application scope of the SR model. As the number of features increases, other methods, such as XGBoost, can be used to assess feature importance and select features, thereby reducing the dimensionality of the model.

6 Conclusions

(1) The proposed domain knowledge-guided SR obtained the variable subtree required for SR construction by analysing traditional semi-empirical FCG rate models. SR based on the variable subtree could balance the accuracy and interpretability of the data-driven model. This method provides a new direction for research on FCG.

(2) The SR model established in this work considered the comprehensive relationship between ΔK, R, and ΔKth, and the prediction equation had a concise mathematical structure. The model was acquired based on the Al-7055-T7511 FCG test data and could be extended to other materials using MLR. The prediction curve of the SR model had a good correlation with the test results.

(3) In comparison with the three traditional semi-empirical FCG models, the SR model exhibited more accurate prediction performance in both the threshold and Paris regions. Overall, for the seven materials in the study, the average MSE of the three conventional models was about 0.5, whereas the average MSE of the SR model was only 0.171, a reduction of more than 60%. These results highlight the reliability of the SR model for predicting the FCG rate.