Training deep material networks to reproduce creep loading of short fiber-reinforced thermoplastics with an inelastically-informed strategy

Deep material networks (DMNs) are a recent multiscale technology which enable running concurrent multiscale simulations on industrial scale with the help of powerful surrogate models for the micromechanical problem. Classically, the parameters of the DMNs are identified based on linear elastic precomputations. Once the parameters are identified, DMNs may process inelastic material models and were shown to reproduce micromechanical full-field simulations with the original microstructure to high accuracy. The work at hand was motivated by creep loading of thermoplastic components with fiber reinforcement. In this context, multiple scales appear, both in space (due to the reinforcements) and in time (short- and long-term effects). We demonstrate by computational examples that the classical training strategy based on linear elastic precomputations is not guaranteed to produce DMNs whose long-term creep response accurately matches high-fidelity computations. As a remedy, we propose an inelastically informed early stopping strategy for the offline training of the DMNs. Moreover, we introduce a novel strategy based on a surrogate material model, which shares the principal nonlinear effects with the true model but is significantly less expensive to evaluate. For the problem at hand, this strategy enables saving significant time during the parameter identification process. We demonstrate that the novel strategy provides DMNs which reliably generalize to creep loading.

Section 2 outlines the basic construction principles of DMNs, together with the classical elasticity-based training and the novel inelastically informed training strategy. Section 3 reports on the considered setup (Eglass fiber-reinforced PBT) and examines the various training procedures and their inelastic generalizations with care. Particularly powerful turns out to be a hybrid strategy, where the early stopping is itself based on an inelastic surrogate material model which shares the basic phenomena with the "real" material model but may be evaluated with less expense.

Basic concepts
A deep material network (DMN) [44,46,47] with N phases in three spatial dimensions consists of the following data. The weights may be collected in the diagonal matrix W ∈ R 6m×6m with W ii = w i/6 (floor division). (2.1) To express the compatibility conditions, we introduce the averaging matrix A ∈ R 6×6m , an m-fold copy of the 6 × 6-identity matrix. Then, the symmetrized gradient operator D and the weights w i should satisfy the condition AW D = 0, (2.2) which states that the weighted average of the compatible strains vanishes. Suppose a nonlinear stress-strain relationship f i : R 6 → R 6 at small strains is given for each material i = 1, . . . , N . Then, for prescribed strain ε, which we consider as an element of R 6 , we seek a displacement field u ∈ R 3n , s.t. the balance of linear momentum holds, where the function f applies the nonlinearity f i at the ith integration point and the vector A T ε contains m identical copies of the average strain ε. The effective stress is subsequently computed via Deep material networks serve as a high-level abstraction for a finite element discretization of an N -phase microstructure. Indeed, for such a microstructure Y , decomposed into N subdomains Y i , any finite element discretization gives rise to a node set N , a set of integration points G with associated quadrature weights w i . The gradient operator D arises by evaluating the FE B-matrix at the integration points, and a nodal decomposition G i emerges by associating any integration point y j ∈ Y to the material domain which it lies in, y j ∈ G i ⇐⇒ y j ∈ Y i . Then, the classical weak form of the finite element problem is equivalent to the equilibrium equation (2.3) for any type of nonlinearity f i . Deep material networks give rise to an effective material behavior which automatically satisfies the Hill-Mandel condition, inherits thermodynamical consistency from its phases, preserves elementary micromechanical bounds and gives rise to a uniquely solvable nonlinear system of equations (2.3) (provided the kernel of D is factored out and the nonlinearities f i are strictly monotone) [46,47].
Deep material networks serve as an abstraction of finite element discretizations of micromechanical problems. They inherit the positive characteristics, but dispense with the constraints of a physical realization of the finite element discretization. In particular, the node sets N and G have no intrinsic physical meaning. Still, DMNs may be regarded as statistically similar representative volume elements [55,56], which reflect-on an abstract level-the topology of the microstructure which they should serve as a surrogate model for.
To start with, DMNs typically fix the node set N and the integration points G (with the decomposition G i ) a priori. Then, a model for the gradient and the weights D ≡ D( p), W ≡ W ( p) (2.5) in terms of a suitable (hyper)parameter vector p is postulated. In the literature, different possibilities to choose the parametrized matrices D( p) and W ( p) were proposed. Liu and coworkers [44,45,50,51] considered a hierarchy of rotated laminates. Gajek et al. [46] introduced a hierarchical construction with laminates of variable direction of lamination, see Sect. 2.2. This construction has fewer parameters than rotated laminates, but was shown to provide similar accuracy. Nguyen and Noels [49] introduced polytopal DMNs which permit more general rank-one jumps than pure laminates to deal with porous microstructures. All these approaches have in common that the matrix D is sparse, reflecting a further characteristic of finite element discretizations. In any case, once the construction plan (2.5) is fixed, DMNs are operated in two stages. In the first stage, the parameters p are fitted to a suitable objective with contributions J s that measure data fidelity to n obs observations and, possibly, a regularization term ψ. This so-called offline training is typically based on linear elastic precomputations obtained from full-field simulations on a high-fidelity representation of the microstructure [44,45], giving rise to the contributions J s . Once the parameters p are identified, the deep material network may be used as a high-fidelity fullfield surrogate model for inverse parameter identification or for concurrent multiscale simulations [48,52,53]. For this purpose, equation (2.3) is solved for nonlinearities that arise from a time discretization of inelastic constitutive laws.
As an alternative to a parameter identification based on linear elastic data alone, the nonlinear effective behavior may be taken into consideration [49]. However, such a procedure may be delicate, as modern gradientbased learning techniques are based on automatic differentiation, which may prove rather computationally expensive for sophisticated material models. The work at hand proposed an alternative strategy based on early stopping [54, § 7.8], see Sect. 2.3.

Direct deep material networks
As a construction strategy for the symmetrized gradient operator D( p) and the weight matrix W ( p), we use a hierarchy of N -phase laminates where a scale separation is assumed at each level of the hierarchy. For K − 1 such scale separations, we obtain a perfect, ordered N -ary tree of depth K comprising individual N -phase laminates. We call such an N -ary tree a direct DMN [46,48,52]. Figure 1 shows an example of a direct DMN with K = 3 layers and N = 2 phases.
To build the DMN, we first consider a single N -phase laminate with lamination direction n ∈ R 3 and a vector c ∈ R N of volume fractions, which are positive and sum to one. The elastic behavior of such a laminate may be expressed in terms of N − 1 displacement jump vectors a = a 1 , . . . , a N −1 T ∈ R 3(N −1) , see Ospald et al. [57]. Classically, a N -phase laminate has N displacement jump vectors whose weighted average vanishes. In our representation, this constraint has been explicitly resolved by expressing the last displacement jump vector in terms of the previous N − 1 vectors. For fixed kinematics, the N strain fluctuations, one for each phase, emerge by applying the symmetrized gradient operator L(c, n) ∈ R 6N ×3(N −1) , given by the formula (2.8) to the vector of displacement jump vectors. Here, for any vector a, the symmetrized tensor product n⊗ s a may be expressed by the matrix-vector product B(n)a in Mandel's notation. Explicitly, the strain-displacement-jump matrix B : With these expressions for a single laminate at hand, let us consider an N -ary tree of such N -phase laminates.
To each laminate, a unit direction of lamination n k j is associated. We collect these normals in a large vector n = n 1 1 , n 2 1 , . . . , n 2 N , . . . , n K 1 , . . . , n K with an ordering which traverses the tree from the root to the leafs through each level from left to right (like a breadth-first search of the corresponding tree). Moreover, to each laminate, N volume fractions are associated. We parametrize those in terms of a collection w = w 1 , . . . , w N K ∈ R N K of positive weights (formally) residing on level K + 1. The volume fractions are then computed by a weighted average [46, eq. (3.8)-(3.9)].
The set of m = N K integration points G corresponds to the individual phases of the laminates on level K , and is partitioned into N subsets G i (i = 1, . . . , N ) in a alternating manner (2.11) see Fig. 1 for an illustration for the special case of N = 2. Thus, each laminate on the lowest level receives one of the N phases as input.
The kinematics of each of the laminates is governed by N 1 (local) displacement jumps Here, the upper index k = 1, . . . , K labels the depth and the lower index j = 1, . . . , N k−1 corresponds the horizontal position in the N -ary tree. We collect these individual displacement jumps in a (global) displacement vector with the same ordering as for the lamination directions. These serves as the displacement-type degrees of freedom of the deep material network. Thus, the direct DMN comprises n = N K − 1 nodes, one for each displacement jump. The (global) symmetrized gradient operator D ∈ R 6N K ×3(N K −1) may be assembled from the local symmetrized gradient operators (2.8) in a similar fashion as finite elements [58]. More precisely, we obtain a representation in terms of suitable extraction E k j and prolongation matrices J k j ∈ R 6N K ×6N . These matrices contain only zeros and ones and establish the connection between local quantities and global quantities. Moreover, the matrices E k j and J k j depend only on the topology of the tree and are independent of the DMN parameters p. The extraction matrix E k j ∈ R 3(N −1)×3(N K −1) picks out the (local) displacement jumps a k j from the global vector u, i.e., a k j = E k j u holds, and may be expressed in terms of the Kronecker product [59] (2.15) and the Cartesian basis vector e j−1+N k−1 ∈ R Similarly, the injection matrix J k j admits the Kronecker product representation the Cartesian basis vector e j ∈ R N k−1 and the column vector of ones To illustrate the procedure, we record these matrices for the two-phase DMN of depth K = 3 shown in Fig. 1. The extraction matrices read whereas the prolongation operators take the form We obtain the symmetrized gradient operator Returning to the general case, we observe that the symmetrized gradient operator D is uniquely parameterized by the individual volume fractions c k j as well as the directions of lamination n k j of the collection of laminates. Thus, the vectors n and w of lamination directions and weights could serve to parametrize such a direct DMN.
In the implementation [44,46], it is convenient to ensure non-negativity of the weights w ∈ R N K by expressing them in terms of unconstrained weights v ∈ R N K via applying the Macauley bracket (or ReLU activation function in machine learning) in a component-wise manner Thus, the parameter vector which uniquely defines both the gradient operator D( p) and the weight matrix W ( p) of a direct DMN is given by p = ( n, v).

An inelastically informed training strategy
Classically, DMNs are trained on objective functions of the form (2.6) where the functions J s ( p) measure the proximity of the current DMNs predictions to precomputed effective elastic properties. The term ψ( p) serves as a regularizing term and is explicitly defined in Sect. 3.2.
Although DMNs, which are identified based on elastic data, show accurate predictions for large classes on nonlinear and inelastic constitutive laws for the microstructural phases, the developed theory [46] hints that such a close agreement may be lost if the constitutive laws show a high degree of nonlinearity. Therefore, it appears a good idea to add nonlinear constitutive laws to the training, encoded by an additional term H ( p). Then, instead of the original problem (2.22), an augmented problem may be used to identify the parameters. Unfortunately, such an approach is restricted to rather simple inelastic material models [49], as the computational effort in evaluating complex constitutive laws, including computing the derivatives necessary for the gradient descent for a multitude of different loading conditions and time steps, quickly becomes prohibitive.
For this reason, we propose to use an alternative strategy which is similar in spirit to the commonly used early stopping [54, § 7.8] technique in machine learning. More precisely, in our strategy, the iterates p k of a conventional solver for the original problem (2.22) are stored in a set P. Then, the additional term H ( p), which encodes the performance of our model on the nonlinear inelastic constitutive laws is evaluated using the stored iterates p k during training. Finally, the best parameter vector p best = argmin p∈P H ( p) (2.24) is selected. It is common to store only each 50th or 100th iterate p k of the learning method, so that the runtime of such an early stopping strategy will be reduced by one or two orders of magnitude compared to a coupled strategy (2.23) if the effort of evaluating the inelasticity-aware objective H is significantly larger than for the original objective (2.22).  Table 1 Identified material parameters for PBT-GF30

Setup
We consider a polybutylene terephthalate (PBT), reinforced by 30% E-glass fibers (PBT-GF30). To characterize the matrix material, experiments were performed on a dog-bone shaped Becker sample [60], shown in Fig. 2, at the three load levels 23.5 MPa, 32.8 MPa and 37.5 MPa. The load was ramped up to the respective stress levels in 8.5 s and held constant for different time intervals depending on the load level. The results are shown in Fig. 3a. For all three load levels, we observe the typical three different creep phases [61]: primary, secondary and tertiary creep. The creep-strain rate starts with a high value at the beginning of the primary creep stage and reaches an approximately constant value during the secondary creep stage. The tertiary state is characterized by an increase in the creep strain rate, leading to fracture of the specimen, eventually [62]. The first phase ends at roughly the same time for all three considered load levels. The onset of the third phase and the inclination of the strain at this onset, however, differ significantly for the three load levels.
For the matrix, we use a coupled plasticity-creep model, where the elastoplastic part accounts for shortterm effects, whereas a viscoplastic augmentation accounts for long-term creep effects. We refer to Appendix A for details. To identify the free parameters, we used a two-step procedure. In a first step, the evolution of the creep strain was deactivated, and the elastic constants as well as the plasticity parameters were determined from uniaxial, monotone, tensile experiments up to 5% strain, performed for four different samples. For the parameter identification using OptiSlang [63], we set up a single-element unit-hexahedron model in Abaqus [64], and fixed the Poisson's ration to ν = 0.4 and the viscosity to η = 0.001 MPa s, emulating strain rate independence.
Once the elasticity and plasticity coefficients were determined, the remaining creep parameters were identified from long-term creep tests carried out at different load levels, see Fig. 3a, based on OptiSlang [63], wrapping an Abaqus [64] simulation. The model predictions with the identified parameters are compared to the experimental results in Fig. 3a. The model captures the three creep phases rather well and represents the three experimental results to a sufficient accuracy. The final list of identified parameters is given in Table 1.
For the E-glass fibers, we use standard parameters [65].
It is well known that the mechanical properties of fiber-reinforced composites strongly depend on the microstructure characteristics like fiber orientation and fiber length distribution [66,67]. To quantify the microstructure characteristics, we rely upon microcomputed tomography [68][69][70]. by the method discussed in Hessman et al. [71] for a specimen milled out from a 120 mm × 80 mm × 2 mm injection-molded plate, see Fig. 2. We used these data to reconstruct a digital microstructure with the SAM algorithm [72], see Fig. 3b. We generated a cubic unit cell with an edge length of 625 μm containing 2056 cylindrical fibers with a length of 269 μm and a diameter of 10 μm with a minimum distance of 3 μm between the fibers. The second-order fiber orientation tensor (3.1) was prescribed, combined with the exact closure approximation [73,74]. The microstructure was discretized by 512 3 voxels. For the subsequent FFT computations, we used the composite-voxel technique [75][76][77] with a resampling factor of two. Thus, after resampling, the unit cell consists of 256 3 voxels. The computation time when performing a linear elastic FFT simulation using a phase contrast of 10,000 and a unit cell discretized by 512 3 voxels was 19.9 h. In comparison, the resampled microstructure with 256 3 voxels and composite voxels required just 2.38 h. The resampling error between the stiffness tensors is just 1.12% in the Frobenius norm in Mandel notation. As, in the case of a lower phase contrast, an even smaller error is expected, the composite-voxel method serves as a reasonable way to speed up the computations.
Last but not least, let us discuss the used software and hardware. The micromechanical computations were performed with the software FeelMath [78] on a HPC cluster with nodes that have 16 CPUs each and 4 GB of memory per CPU. FeelMath [78] implements FFT-based computational homogenization methods. More precisely, we used the staggered grid discretization [79,80] and linear as well as nonlinear conjugate gradient methods [81][82][83] for linear and nonlinear constitutive behavior, respectively.
The DMN was trained on a single CPU with 60GB of memory. The DMN online phase code was run on a single computing node. Both the material model and the DMN were implemented as user-defined material subroutines (UMATs) in Abaqus [64], which may also be used in FeelMath [78]. We use the LAPACK [84] libraries for the linear algebra operations.

Performance of classical sampling
We consider a microstructure with N = 2 phases, i.e., the PBT matrix and the E-glass fibers. We use the direct DMN described in Sect. 2.2, and initialize the parameters ( n, v) as proposed in Gajek et al. [46].
For the training, we generated training and validation data based on full-field FFT-based computational homogenization. These data consist of triples C s 1 , C s 2 ,C s FFT , where C s 1 and C s 2 serve as the linear elasticity tensors of the two materials, andC s FFT refers to the effective elasticity tensor. The tuples C s 1 , C s 2 were sampled by the method proposed by Liu and Wu [45] utilizing orthotropic elasticity tensors. The effective properties C s FFT were computed with FeelMath [78]. We generated 1 000 samples as mentioned, split into 800 training and 200 validation points.
For fixed DMN parameters ( n, v), we denote byC s DMN ( n, v) the effective stiffness computed by the DMN with phase stiffnesses C s 1 and C s 2 . We refer to Gajek et al. [46] for details how to evaluate the effective stiffness C s  DMN ( n, v) efficiently.
For the offline training with linear elastic data, we minimize the objective function (2.6), i.e., with the batch size n b = 40, the contributions where c 1 = 0.822 and c 2 = 0.178 denote the volume fractions of the respective phases, and a 1 is a vector that has ones at all odd indices and is zero; otherwise, a 2 is a vector that has ones at all even indices and is zero otherwise, and λ refers to a penalty factor which we set to 100. The term ψ enforces the respective volume fractions of both phases [85]. This approach differs from previous works [44][45][46], where the total volume fraction of the DMN is enforced to unity. Instead, we enforce the respective volume fractions of each of the phases. We found empirically that such a strategy increases the reliability of the offline training. We use the 1 -norm in Eq. (3.3) since an independent study with p -norms and variable exponents p revealed that the subsequently trained DMNs capture the inelastic creep response most closely for p = 1. For reasons of conciseness, we chose not to report this study here.
The training, i.e., the minimization of the objective function (3.2), proceeds as for (more general) neural networks, i.e., via a stochastic batch-gradient-descent-type algorithm based on automatic differentiation and a grouping into batches. (We use batches of size 40.) We implemented the procedure in PyTorch [86] and used the AMSGrad method [87] for training the network along with a learning rate modulation using cosine annealing [88], where β and m refer to the learning rate and epoch, respectively. We use a maximum learning rate β max = 0.0007 and a minimum learning rate β min = 0. The parameter M is set to 4000. The training of the DMN was performed up to 25,000 epochs. The typical progress during training is shown in Fig. 4a. For the first 100 epochs, the objective function decreases monotonically. Thereafter, the loss function shows some fluctuations. These are triggered by the learning rate modulation which enables escaping local minima of the objective function. Indeed, on a large scale, a further decrease of the objective function up to around epoch 20,000 is observed. For the considered example, the smallest loss occurred at epoch 19,962. The training ran for 18.8 h.
After training, we used the DMN with the smallest realized loss for further use.
We investigated the effect of depth on the online evaluation of creep and found that the performance of the DMN in the online phase increases up to a depth of seven. For higher depth, no further improvement in accuracy for the evaluation of creep in the online phase was observed. We trained our seven-layer DMN on   ( n, v).
To assess the generalization capabilities of the DMN, we introduce the sample-wise error where N s is the number of samples in the training or the validation set, respectively. After the training, we studied the generalization capabilities of the model with the help of linear elastic validation data comprising 200 data points. The results of the training and the validation errors are detailed in Table 2.
We observe that the reached loss value is comparable for all six considered DMNs. Also, the mean training and validation error is rather close for all considered networks. We find that the mean training error is slightly smaller than the mean validation error for all the networks. The maximum training error for all the networks is around 13%. The maximum validation error ranges from slightly above 4% to almost 7%, i.e., we find a larger variation (roughly by a factor of two). Still, the elastic training and validation errors are on a reasonable level, in particular in view of the mean errors.
The mean training and validation errors for DMN #1 over the course of epochs is shown in Fig. 4b. We observe that the mean training and validation errors follow a similar trend as the loss function. Moreover, the training and validation error decrease simultaneously during training indicating no overfitting occurs in the offline training phase with respect to the linear elastic training data. We use our six trained DMNs to evaluate the online phase. We briefly comment on the elastoplastic response of the DMNs. The online phase of the DMN, implemented as an UMAT, is evaluated on a single four-node unit tetrahedron element in Abaqus [64]. We use the load increment control of Abaqus [64] for applying the load, and compare the output to full-field simulations obtained by an FFT-based solver [83]. We applied hysteretic, uniaxial strain loadingsε i j for a period of 4 s with a strain amplitude of 2.5% for 40 load steps in all six loading directions. For each instance of time t, the relative error in each stress component is considered, together with the maximum relative error over the entire simulation window. In addition to the elastoplastic response, we also investigate the creep behavior. To account for the anisotropy of the microstructure, we investigate the performance of the DMN in three loading directions with angles 0 • , 30 • , and 90 • . Here, the 0 • direction indicates the flow direction and the 90 • direction is perpendicular to the flow direction during the manufacturing of the test plate. The experimental analysis of the creep loading was performed using the samples cut out from the injection-molded plate in these specific angles as shown in Fig. 2. The creep response of the DMN was evaluated on a single voxel microstructure using an FFT-based solver [83]. The implemented DMN online phase UMAT can be flexibly integrated into the FFT-based solver FeelMath [78]. We consider 32 time steps, equally spaced in the logarithmic timescale. The DMN response was compared with the effective strain computed by full-field simulations performed with the same FFT-based solver [83].
In the 0 • direction, we applied a uniaxial tensile stressσ 11 in the flow direction which is ramped up to 65.4 MPa in 8.5 s and subsequently held constant for 3.78 × 10 6 s. We evaluated the creep strain component ε 11 in the flow direction over the entire simulation window. In the 0 • direction, we refer to the evaluated strain by the nameε 0 • . We apply a lower uniaxial stressσ 22   The trained DMNs whose elastic validation results are shown in Table 2 are used for the online evaluation of plasticity and creep. Table 3 contains the results for the inelastic evaluation. For the maximum error during the elastoplastic loading, we observe a variation of the results by a factor of three. DMN #1 leads to a low relative error slightly below 3%. Three DMNs give rise to about 5% error, and two DMNs are characterized by a relative error slightly below 10%. Taking a glance at the elastic validation errors, see Table 2, we observe that the DMN with the best elastoplastic online error does not correspond to the DMN with the best elastic validation error.
Even more striking are the variations in the creep response. Whereas DMNs #1, #4 and #5 provide a rather small creep error, two DMNs give rise to relative creep errors on the order of 20%, which is certainly not acceptable for engineering accuracy. Moreover, we notice that the maxima are realized in different directions. Indeed, for DMN #2, the 0 • -direction is worst, but the 30 • -direction is matched rather accurately. In contrast, for DMN #3, both the 0 • and the 30 • -direction are inaccurate. Thus, there appears to be no systematic cause for this phenomenon, e.g., related to the fiber orientation. We observe that the DMN #1, #4 and #5 can represent both the elastoplastic loading and creep accurately. In contrast, the DMNs #2 and #3 are incapable of representing the creep loading with sufficient fidelity.
To sum up, the results of Table 3 reveal that good results achieved during the elasticity-based training results do not necessarily lead to accurate predictions for the creep response. We will study this phenomenon more thoroughly in the next section.

Accurate creep predictions with early stopping and a surrogate material model
In the previous section, we observed that being a minimizer of the elasticity-based loss function (3.2)-(3.3) is not necessarily related to providing an accurate creep prediction for the DMN under consideration. To get a deeper insight into this phenomenon, we took a closer look at DMN #6 from Table 3. More precisely, we stored the DMN parameters ( n k , v k ) every 50 epochs during the training. As a post-processing step, we evaluated the elastoplastic (3.9) and the creep errors (3.11) at these parameter states ( n k , v k ). The results, alongside the loss function, are shown in Fig. 5. In Fig. 5a, we observe that the loss function decreases in a similar way as for DMN #1, see Fig. 4a. The decrease is monotonic on large epoch scales, but shows variations up to half an order of magnitude during training as a result of the learning rate modulation. Taking a look at the plasticity errors (3.9) for the six considered loading directions, see Fig. 5b, we observe that the errors decrease consistently up to epoch 300. Then, only the error of the 12-component decreases further, whereas the other five errors more or less remain on the same level. At epoch 6000, the 12-error starts to increase significantly, whereas the 22-error-the largest up to this epoch-starts to decrease. For epochs exceeding 10,000, the maximum error max i, j e p i j starts to increase again. Similar observations may be made for the creep error (3.11), shown in Fig. 5c. For the first 300 epochs, the error decreases. For higher epochs, the errors in the considered directions show a non-monotonic behavior. However, the creep error is much larger than the plasticity error, in a range up to 15%. Interestingly, there is a range of epochs, approximately at 10,000, where the creep error is low, below 5%. This smallness is neither reflected in the loss function nor the plasticity error. In the terminology of deep learning, the observed phenomenon may be interpreted as "overfitting," i.e., producing parameters that correspond too closely to a specific set of data, but fail to represent additional data or to predict future observations in a reliable way.
Classically, there are (at least) two ways to rectify this shortcoming. The first strategy consists of adding further data points to the considered set. In our case, we could augment the classical elasticity-based offline training by more cleverly chosen stiffness triples C s 1 , C s 2 ,C s FFT or by including inelastic data into the loss function [49]. Unfortunately, such a strategy does not protect us from overfitting relative to other cases which were not considered before. In the case of inelastic training for reproducing creep loading, training on inelastic data for the 0 • direction does not guarantee us a good fit in the 30 • and 90 • directions. Therefore, an alternative way to circumvent overfitting is to stop early [54, § 7.8], i.e., to stop at a stage where the fitting quality of the considered data in the linear elastic regime (2.6) and an additional independent nonlinear validation data set are both good. In the former case, the training and validation error can be monitored so that there is no overfitting during the offline elastic training. In the latter case, the performance of the DMN in the inelastic regime should be within engineering accuracy when compared with inelastic full-field simulations. Indeed, by monitoring the additional data, we may detect when overfitting happens in the inelastic regime.
The analysis at the beginning of this section, in particular given in Fig. 5c, shows that early stopping based on the additional term (3.11) does indeed work for DMN #6. The DMN #6 trained on linear elastic data can accurately predict the online inelastic response of highly nonlinear material models, providing a maximum relative creep error below 5%. Thus, the strategy proposed in Sect. 2.3 works and, in theory, we can perform the evaluation of the creep model during training for identification of the best parameter vector.
However, we are interested in a further improvement of the strategy. Due to the complexity of the considered creep model, the post-processing for evaluating the creep error e c ( n, v) ran for 24.4 h. In particular, the postprocessing took 30% longer than the previous elasticity-based training! Hence, performing the evaluation of the creep model during the elasticity-based training will lead to a cumbersome and time-intensive training procedure. Clearly, the long-term creep behavior demands certain characteristics of the DMNs to be accurate, and isolating the proper elastic training data appears difficult. To reduce the computational effort of evaluating the full creep model, we came up with the following idea. Instead of the full creep model, it might be sufficient to consider another creep model, which is, on the one hand, less expensive to evaluate, but, on the other hand, identifies the same weaknesses in generalization to long-term loading as the full-field model. For this purpose, we consider Hooke's law in combination with a Norton-type creep law [62] (3.13) with creep parameters A and n. We chose the parameters as detailed in Table 4. The elastic parameters coincide with the full creep model, as does the parameter A ≡ A 1 , see Table 1. The creep exponent n and the parameter A were identified with the help of the long-term creep experiments carried out at different load levels and an OptiSlang [63] procedure based on an Abaqus [64] simulation. We conducted full-field simulations on the considered microstructure, see Fig. 3b, with the Norton model (3.13) for the matrix material. For the three considered directions, see Fig. 6, we observe that both the elastic response and the first creep phase are captured rather accurately by the simplified model. Moreover, the onset of the secondary creep phase is also represented quite well. However, the effective strains are underestimated by the Norton-type model for the creep phase two and beyond. To assess the predictive quality of the Norton error e n , we trained two additional DMNs with a maximum learning rate β max = 0.0005. During the offline training of the DMN based on the linear elastic data, we introduce an inelastic validation data set based on the Norton errors for 0 • , 30 • and 90 • directions. We consider this validation data set in addition to the linear elastic validation data set which is monitored during the training to prevent the overfitting in the elastic regime, see Fig. 4b. The linear elasticity-based training proceeds as in Sect. 3.2, via a stochastic batch-gradient-descent-type algorithm. We evaluate the loss function J ( p) 3.2 involving the parameters p = ( n, v), determine the gradients ∂ J /∂ n and ∂ J /∂ v via automatic differentiation and finally update the fitting parameters. The parameter vector p is stored every 50 epochs to evaluate the inelastic validation data set, wherein the Norton error e n ( p) (3.15) is calculated. This serves as the basis for identifying the best parameter vector p best .
In a classical early stopping approach, the validation data set error is monitored and the training is stopped when the error does not improve for a predefined number of states. A copy of the model parameters are stored every time the error on the validation set improves. The parameters with the best validation error rather than the most recent parameters are returned once the training algorithm is terminated [54, § 7.8].
In our case, the inelastic validation error does not show a monotonic decrease and has persistent fluctuations, see Fig. 5b, c, which is a consequence of the learning rate modulation (3.5) used during training. Thus, terminating the training once the inelastic validation error has not improved for a predefined number of epochs (which results in an additional hyperparameter introduced into the procedure) is not appropriate. As an alternative, we monitor the Norton error e n ( p) on the fly every 50 epochs and store the parameters each time the Norton error improves, for the complete training. More precisely, after the end of 10,000 epochs, we return the parameter vector p best from the state (evaluated every 50 epochs) having the best inelastic validation error, i.e., the least Norton error. This quasi-early stopping approach for the inelastically informed training methodology is outlined in Algorithm 1.
The results of the evaluation of the inelastic validation data set over the course of training is shown in Fig. 7a. We also performed the creep and plasticity evaluation using the fully coupled plasticity-creep law as a post-processing step, see Fig. 7b, c respectively. The creep and plasticity errors are not used for identifying the best DMN and the plots are presented in Fig. 7 to highlight the correlation between evolution of the Norton error during training with the post-processed results.
Calculate error measure for Norton model (3.15) for on-the-fly inelastic validation e n ( p) ← max θ e n θ ( p) if e n ( p) < e n best then e n best ← e n ( p) Update best parameter vector to be returned at end of training with the current parameters p best ← ( n, v) end if end if end for Again we observe differences in the error measures between the considered DMNs. Indeed, taking a look at the plasticity error shown in Fig. 7c, we observe e p 22 is largest for DMN #1 and e p 11 is largest for DMN #2 at the end of the training. We also observe differences between the plasticity error, see Fig. 7c, and the creep error, see Fig. 7b, for both DMNs. For instance, the large dip of e c 90 • for DMN #1 is not reflected by particularly low values of the plasticity error in any component. Comparing the Norton and the creep error, see Fig. 7a, b, respectively, leads to a better resemblance. Indeed, after epoch 1000 there is a good agreement between the trends of the Norton errors and the creep errors for the individual loading directions. This correspondence is also underlined by the identified epochs where the respective error is lowest during training. For DMN #1, the best epoch is very similar for all three considered error measures. For DMN #2, the best plasticity error is reached a few thousand epochs before the best creep-error epoch, which is, in turn, closely matched by the best Norton error epoch. For both DMNs, the errors are reported in Table 5. We observe that the DMNs identified by early stopping based on the Norton error e n provide both a good creep and plasticity error, slightly above 3%. Actually, the minimum reached Norton error exceeds both other considered error measures.
The minimum creep error e c for DMN #1 over 10,000 epochs is 4.49% captured at epoch 7950. This is rather close to the creep error e c of 5.32% captured using the Norton-type model as shown in Table 5. Similarly, the minimum creep error e c for DMN #2 using the plasticity-creep coupled model is 2.33%. Moreover, just the evaluation of the coupled plasticity-creep law lasts for more than 8 h as described in Table 5, whereas the training time along with the evaluation of the Norton model requires only slightly more than 12 h. Thus, we save considerable computation time with only a slight loss of performance when using the Norton model.
To assess the generalization capabilities of the DMNs identified using the Norton error, we a posteriori evaluate their performance on an inelastic test data set. The test data sets consist of the creep response evaluated using the fully coupled plasticity-creep law at three different directions of 15 • , 45 • and 60 • with respect to the flow direction. Uniaxial stresses of 60.4 MPa, 60.0 MPa and 42.7 MPa were used for the 15 • , 45 • and 60 • directions, respectively, for obtaining the creep strain response. The loading is performed in a similar manner as for the 30 • direction outlined in Sect. 3.2. We extend the true creep error evaluated for 0 • , 30 • and 90 • to these directions and the maximum relative errors for both DMNs are recorded in Table 6. The relative errors for all the test directions for DMN #1 are rather close with a maximum error of 3.79% in the 60 • direction. The relative errors for DMN#2 vary by a factor of roughly two with a maximum of 7.61% in the 45 • direction. It is interesting to note that a smaller inelastic Norton validation error does not necessarily lead to a smaller test creep error, as evident from Tables 5 and 6. Although the errors in the test directions are not monitored during the inelastic validation, both DMNs are able to represent the a posteriori creep loading in the test data with sufficient fidelity.
Thus, we conclude that the novel early stopping-based training strategy serves as a low-cost methodology for reliably identifying the parameters of deep material networks suitable for creep loading.
Last but not least, we report on the performance of the DMNs, identified via the novel method, for predicting the creep response compared to high-fidelity simulations. For this purpose, we worked with DMN #2 from Table 5, and considered two different stress levels for each of the three directions from the inelastic validation data set and from the inelastic test data set. The results, shown in Fig. 8, reveal a close agreement between the DMN predictions and the full-field computations for the stress levels (shown in red) in the directions that the network was trained on. This comes as no surprise. By the way, it becomes apparent that for more than 10 4 s, the 30 • direction leads to inaccurate predictions. For the additional stress loading and the directions, which had not been monitored, the DMN closely matches the full-field predictions for strain levels below 1.5%. Actually, the agreement is quite remarkable. For higher effective strains, the agreement between DMN predictions and full-field simulation is worse. It remains to be studied whether this is caused by a lack of suitable training or results from insufficient material modeling. Indeed, for effective strains exceeding 1.5%, the local strains in the matrix exceed 5%. In particular, a material model at small strains appears questionable.
The runtimes of DMN and full-field simulations are compared in Table 7. The full-field simulations operate on a microstructure with 256 3 voxels. Moreover, the composite-voxel technique [76] is used. In contrast, the  DMN is integrated on a single voxel. The full-field simulations take between 3.2 and 7.0 h, whereas the DMN finishes after half a minute. Thus, we observe speedup factors between slightly less than 400 up to almost 850. Thus, we confirm the startling speedup factors reported by other authors [45,48,49].

Conclusion
Scientific progress consists of several phases. Initially, new territory gets explored and claimed. More often than not, these new islands of knowledge are established independently, lacking a proper connection to more established grounds. Moreover, the new territory needs to be charted and guarded. We consider the introduction of deep material networks by Liu and coworkers [44,45] as such a groundbreaking moment, which opened the eyes of the micromechanical community to the possibilities of the deep learning technology. Indeed, finding accurate yet computationally tractable surrogate models of microstructures materials on component scale for wide classes of industrially relevant materials has been an objective of micromechanics for a long time, and different strategies emerged, as discussed in Introduction. Several years have gone by, and the basic inner workings of DMNs were uncovered, and new variants of DMNs were proposed. The purpose of the work at hand was to critically review the concept of elasticity-based training of DMNs when a high degree of nonlinearity and rather larger timescales are involved. The available theory [46] is based on a perturbation argument around linear viscoelastic constitutive laws. In particular, it is conceivable that training based on linear elastic data alone might not be sufficient to prepare the DMN for the harsh conditions of creep loading. In this work, we provided (more or less explicit) examples of DMNs that fail to generalize from linear elasticity to creep. Due to a lack of a mathematical convergence theory for DMNs, it is not clear whether this is a defect of the linear elasticity-based training per se or a clever choice of sampled stiffnesses or appropriately designed loss functions may resolve this issue. Notwithstanding this lack of understanding from our side, we worked out a remedy for this shortcoming based on the early stopping strategy which is classical in deep learning. We demonstrated that such an approach leads to reproducible and reliable training results for DMNs that may be used with confidence. The introduced technology serves as a low-cost alternative to inelastic training [49] and is expected to enter the standard toolbox of the deep material networker.
As already indicated, we consider the state of affairs surrounding the DMNs in the second, contemplating, phase following the initial exploration phase. The underpinnings of DMNs need to be understood more thoroughly, and guidelines for its use, including preferred tools, need to be identified.

A Details on the plasticity-creep model
We consider a material model at small strains with an additive decomposition of the strain tensor ε = ∇ s u ε = ε e + ε p + ε c (A.1) into an elastic, plastic and a creep contribution. We consider the plastic strain and the creep strain as the internal variables, together with an isotropic hardening variable α. It is assumed that plastic and creep deformations are volume preserving. We consider an isotropic material model with stress tensor where κ and μ denote the compression and shear modulus, respectively. The evolution of the internal variables is governed by the equationṡ The material parameters encompass a yield stress y 0 , a limiting stress y ∞ , a plastic hardening modulus h, an exponential hardening factor ω, a viscosity η, the creep constants C and k, creep prefactors A 1 as well as A 2 , a reference creep rateε 0 , and the creep exponent n. Details for the derivation of the material model [90,91], which may be cast in the framework of generalized standard material (GSM) [92], will be discussed elsewhere. The material model is discretized with an implicit Euler method in time. The discretized evolution equations are solved by a classical return mapping algorithm based on a creep-plastic predictor and a plastic corrector step [93].