AutoMat -- Automatic Differentiation for Generalized Standard Materials on GPUs

We propose a universal method for the evaluation of generalized standard materials that greatly simplifies the material law implementation process. By means of automatic differentiation and a numerical integration scheme, AutoMat reduces the implementation effort to two potential functions. By moving AutoMat to the GPU, we close the performance gap to conventional evaluation routines and demonstrate in detail that the expression level reverse mode of automatic differentiation as well as its extension to second order derivatives can be applied inside CUDA kernels. We underline the effectiveness and the applicability of AutoMat by integrating it into the FFT-based homogenization scheme of Moulinec and Suquet and discuss the benefits of using AutoMat with respect to runtime and solution accuracy for an elasto-viscoplastic example.


Introduction
In recent years, the improving quality of micro x-ray computed tomography (CT) images led to a digitalization of the material characterization process for composites. Nowadays, standard CT-devices have a maximum resolution below one µm and produce 3D images of up to 4096 3 voxels. This permits a detailed view of the microstructure's geometry of composite materials up to the point where continuum approaches are still reasonable. In the context of material characterization, the physical description of the body leads to a partial differential equation (PDE) in which the behavior of the material itself is modeled in terms of a material law. Traditionally, a finite element (FEM) discretization is applied, and during the solution procedure, the material law is evaluated locally at quadrature points. To solve problems of this size with conventional FEM, large computing clusters are required to handle the global stiffness matrices [1,2].
In the last two decades, the FFT-based homogenization scheme of Moulinec and Suquet [33,34] emerged as a memory efficient matrix-free alternative that was adapted to operate on structured finite element meshes [53,44,45,28]. Besides the small memory footprint, the most favorable property of the so-called basic scheme is a tangent-free treatment of nonlinear material behavior. However, its required iteration count is proportional to the material contrast, i. e. the maximum of the quotient of the largest and the smallest eigenvalue of the algorithmic tangential stiffness field. Thus, for certain practical applications such as the homogenization of plastifying materials, the convergence behavior can be exceedingly slow [43].
To accelerate the solution process, Zeman et al. [55] and Brisard and Dormieux [6,7] applied Krylov-subspace solvers to FFT-based homogenization. These methods are extremely fast, but they are restricted to linear problems. By combination with inexact Newtonmethods, they were extended to the physically [14] and geometrically [23] nonlinear case and exhibited excellent performance [29,30]. The drawback of this approach consists in either loosing the small memory footprint or the need to calculate the tangential stiffness of the material laws in every iteration of the linear solver. Furthermore, the analytic derivation of the tangent can be tedious and its implementation may require considerable programming effort, and is thus prone to errors. This gave rise to applying Quasi-Newton methods in FFTbased micromechanics [47,42,8,9,52,43]. There, material tangents are replaced by suitable approximations. To sum up, the choice of the solver is driven by compromises between runtime efficiency, memory efficiency and the implementational effort of an accurate material tangent.
Especially during prototyping and modeling, it might be necessary to assess different material laws. Clearly, it is impractical to derive the material tangent from scratch for every material law under consideration. However, it is also undesirable to be restricted to tangent-free solvers during this phase. Motivated by the work of Rothe and Hartmann [37], we started the development of AutoMat, which leverages automatic differentiation and GPU computing to simultaneously address issues of flexibility, accuracy and performance.
Automatic differentiation (AD) refers to techniques for the automatic acquisition of machine accurate derivatives of computer codes [15]. These have applications in, e. g., the setup of adjoint solvers [40], parameter identification [4], shape optimization [13], and machine learning [17]. There, AD is applied to a full simulation. Here, we use AD locally for the automatic setup of solvers and eliminate the inconvenience of hand-computed derivatives. For classical CPU architectures, several mature AD tools are available as of now, for example ADOL-C [51], dco/c++ [26] and CoDi-Pack [39]. Advances in the direction of AD for GPU codes are more recent, examples include dco/map with applications in computational finance [27]. In [37], Rothe and Hartmann use the source transformation tool OpenAD [49] for the automatic computation of material tangents and the assembly of Jacobians for implicit solvers in the context of a multi-level Newton algorithm. In this work, the automatic differentiation ansatz is advanced in several directions.
We focus on the class of generalized standard materials (GSM) [20], which we introduce in Section 2. There, AD enables us to recover the constitutive equations of the material law automatically from given implementations of two potentials, resulting in a fully automatic solver setup. This allows for a highly usable and convenient integration of GSMs into mechanical solvers. We demonstrate this by integrating AutoMat into the FFT-based homogenization scheme of Moulinec and Suquet [34] as implemented in FeelMath 1 . As our benchmark example for AutoMat, we use an elastoviscoplastic material model with material parameters adjusted to measurements of a metalmatrix composite. The precise setup is taken from Michel and Suquet [31] and summarized in Section 3.
The consistent tangent operator is the algorithmic derivative of the stress as it is computed from the strain according to the material law. Its computation requires a differentiation through an integration scheme for ordinary differential equations (ODEs). The conventional backward Euler step is differentiated in [48] by hand. We show in Section 4.1 that this procedure can be fully automatized. To understand the numerical properties of the tangent computation, we interpret it in Section 4.2 as a single implicit Euler step applied to an ODE for the derivative. Since this ODE depends on the chosen loading step size, convergence of the tangent for decreasing step size is not guaranteed. This motivated us to explore schemes with adaptive time steps instead. The differentiation of ODE integration schemes in a blackbox manner, that is, without consideration of the structure of the integration algorithm and its approximative nature, usually leads to incorrect derivatives [11]. Therefore, we refine the strategy of solving simultaneously an ODE for the derivatives [11,36] in the presence of step size control for Rosenbrock methods and both explicit and implicit Runge-Kutta schemes. Particularly, the relation to blackbox differentiation is explored. With the results, we can guarantee that the tangent is as accurate as the primal solution. The overall robustness and accuracy of the proposed scheme is assessed in Section 4.3. We achieve further robustness with respect to the choice of the ODE solver by employing a stress-driven error control, which we present in Section 4.4.
In FFT-based homogenization, the computationally costly simulation components are the Fourier transform and material law evaluation [16]. For nonlinear materials, the latter 1 https://www.itwm.fraunhofer.de/feelmath tends to dominate the overall run time [25] and is hence performance critical. The spatial independence of material law evaluations allows for parallelization, which is typically used in an efficient implementation. Throughout Section 4, we compare parallelized material law evaluations on the CPU with GPU accelerated material law evaluation. We achieve a notable speedup for conventional material law evaluation, but particularly for the computationally more involved automatic evaluation strategies presented in this paper, there are significant performance gains. In our example and setup, we were able to close the performance gap between conventional material law evaluation on the CPU and automatic material law evaluation on the GPU. The good performance would not be possible without an efficient implementation of automatic differentiation on the GPU. Therefore, we developed an operator overloading AD tool specifically for the application presented in this paper. It is based on expression template techniques; previously in AD, these were successfully applied for the treatment of right hand sides in the forward mode [35] and in Jacobi taping [22] as well as primal value taping [38] in the reverse mode. The details of the implementation and its further optimizations are presented in Section 5. In Section 6, remaining influence factors on the performance are discussed. We analyze the performance limiters of AutoMat, present design choices and optimizations of the GPU implementation and discuss overlap of CPU workloads, GPU workloads, and data exchange as well as reductions of the memory footprint.
Finally, we summarize and conclude our work in Section 7.

Generalized Standard Materials
The notion of generalized standard materials is originally introduced in [20]; a compact introduction to the subject can be found in [31]. Let ε denote the right Cauchy-Green strain tensor, σ the Cauchy stress tensor and a ∈ R m the vector of internal variables, all depending on time and space. The constitutive equations of the material law are given in terms of a Helmholtz free energy density (ε, a) → ω(ε, a) and a force potential A → Ψ (A) and read A is referred to as generalized stresses and if both ω and Ψ are convex functions of their arguments, we speak of a generalized standard material. The dissipation potential which is the convex dual of Ψ is not used in the present study.
After space discretization, evaluations of above stress-strain relationship and evolution of internal variables are required in the quadrature points. We drop the x dependency in the notation as the specific location does not change throughout a single material law evaluation. After time discretization, the material law inputs at a quadrature point consist of a strain tensor ε n and internal variables a n at time t n as well as a strain tensor ε n+1 which is usually only a prediction of the actual strain tensor at time t n+1 in the context of the surrounding elasticity solver. Then, in each quadrature point, the material law can be evaluated as follows.

Solve the ODE for the internal variables
(2) with initial data (t n , a n ) on the time interval [t n , t n+1 ]. Recover ε(t) by means of linear interpolation between ε n and ε n+1 . This way, obtain a n+1 . 2. Compute σ n+1 via (1) from ε n+1 and a n+1 .
Additionally, the consistent tangent operator C n+1 which is the algorithmic derivative of σ n+1 with respect to ε n+1 is usually computed along with the material law. It is used in the FFT-based homogenization scheme to determine the optimal reference material.
In view of the decision for an integration scheme for (2), negative eigenvalues of the Jacobian of the ODE's right hand side indicate that explicit solvers might display unstable behaviour [18], that is, require extremely small steps. The following theorem states that evolution equations arising from GSMs are subject to this issue. Following [54], . This definition implies that each positive semi-definite complex matrix is Hermitean.
Theorem 1 Let a GSM be specified by ω and Ψ and assume that both are C 2 . If λ is an eigenvalue of the Jacobian with respect to a of the right hand side of (2), then λ ∈ R ≤0 .
Proof The Jacobian with respect to a of the right hand side of (2) reads d da As Hessians of C 2 functions, both ∂ 2 Ψ ∂A 2 and ∂ 2 ω ∂a 2 are symmetric. Since both ω and Ψ are convex and C 2 , ∂ 2 Ψ ∂A 2 and ∂ 2 ω ∂a 2 are also positive semidefinite as real matrices, that is, where M denotes any of both Hessians. Symmetric and positive semi-definite real matrices are also positive semi-definite as complex matrices. By Theorem 2.2 in [54], the product of positive semi-definite complex matrices is similar to a positive semi-definite complex matrix, that is, there exists an invertible complex matrix T such that T −1 ∂ 2 Ψ ∂A 2 ∂ 2 ω ∂a 2 T is a positive semi-definite complex matrix. All eigenvalues of a positive semi-definite complex matrix lie in R ≥0 . As similarity preserves eigenvalues, all eigenvalues of the product ∂ 2 Ψ ∂A 2 ∂ 2 ω ∂a 2 are contained in R ≥0 ; hence all eigenvalues of (3) are contained in R ≤0 .
Another example for an eigenvalue proof based on definiteness and convexity in the context of material simulation can be found in [10]. There, a time-marching scheme for the solution of a viscoplastic problem is identified as a system of ODEs for the stresses at integration points and the eigenvalues of the Jacobian of the right hand side are used to assess stability properties.
Back to Theorem 1, whether explicit solvers (with adaptive step size control) or implicit solvers are faster depends on the specific material law, internal variable values, applied strain and integration interval length. To give a short example, we ignore the elasticy solver and focus on the material law evaluation at a single voxel. Consider the ODE arising from the example (4), (5) that is introduced in the next section with parameters from Table 1. We set ε n = 0, a n = 0, ε xx, n+1 = 0.4 % and integrate over time intervals of varying length ∆t. Figure 1 displays the numbers of intermediate steps and run times observed with the ODE solvers available in MATLAB 2 . The performance of explicit solvers is competitive up to rather large integration interval lengths. It is clearly linked to the number of intermediate steps taken by adaptive step size control and only for large ∆t, the number of adaptive steps taken by explicit solvers is driven by stability rather than accuracy and increases with ∆t. In Section 4, we refine both explicit and implicit solution strategies.

Example
Throughout the paper at hand, we perform our numerical studies for a uni-axial tensioncompression test of a short fiber reinforced metal-matrix composite (MMC) taken from [31].
Microstructure The MMC consists of 10.2 vol% Al203 fibers embedded in an aluminum matrix. In our periodically generated microstructure (see Figure 2), the planar isotropic distributed fibers have a diameter of 9 µm and a length of 135 µm. This volume element 2 https://de.mathworks.com/products/matlab. html of 150 × 150 × 150 µm 3 was discretized by 150 × 150 × 150 voxels.

Material Model
The Al203 fibers are modeled linear elastic with Young's modulus E and Poisson's ratio ν and the aluminum matrix as the elasto-viscoplastic GSM given by the potentials and ε e = ε − ε vp , and with viscoplastic strain ε vp and equivalent plastic strain α as internal variables. C e is an elastic stiffness matrix given in terms of a second (E, ν) pair and K(α) describes the Includes also a lower order scheme "ode12" and for comparison an analogous custom implementation "ode23". Details on the schemes are included in Section 4.2.

Parameter Unit Aluminum Al203
E GPa 55 300 isotropic hardening and H the (linear) kinematic hardening, whereas the viscous effects are given by the drag stress σ d , the rate sensitivity n and the reference strain rateε 0 . For computational efficiency, the Voigt notation [50] is used for strain and stiffness tensors. For the studied example, the nonlinear parameters of the aluminum matrix were calibrated without isotropic hardening, i. e. K(α) was assumed to be equal to the initial yield stress σ Y , K(α) ≡ σ Y . The complete set of material parameters is reproduced in Table 1.
Boundary Conditions As described in detail by Michel and Suquet, the volume element is submitted to a uni-axial tension-compression test at constant strain rate with alternating sign in loading direction (see Figure 3), The loading path is discretized in an equidistant manner with a granularity between 20 and 320 steps. If not mentioned otherwise, 80 loading steps are used.

Material Law Evaluations
In each loading step, a stationary elastic problem is solved by FFT-based homogenization [34]. This method is relying on an FFT-based preconditioner [23] defined by the constant coefficient linear elastic problem div C ref ε = 0, where C ref is called the reference stiffness and has to be chosen depending on the locally varying tangential stiffness of the material laws [23]. The reference stiffness can be either fixed at the beginning of the time dependent simulation by using only the initial elastic stiffness of the material laws or it can be adjusted in each loading step to the current tangential stiffness to reduce the number of iterations necessary for convergence. In the first case, this involves one material law evaluation per voxel with tangent at the beginning of the initial loading step and in the latter case at the beginning of each loading step. The (matrix-free) FFT-based solver itself only performs one material law evaluation without tangent per iteration and voxel. The performance impact of the reference material setup prior to the first loading step is negligible; therefore, whenever we display time spent on material law evaluations with tangent, the configuration at hand updates the reference material. Then, material law evaluations with and without tangent are timed separately. We use the types of error control explained in Section 4.4 throughout.
Parallelization We perform our tests on a dual-socket workstation with two Intel Xeon E5-2687Wv4 processors at 3 GHz (2×12 cores) and an Nvidia Quadro GV100 graphics card. As this card has uncapped double precision performance, we keep the elasticity solver's double precision also for material law evaluations on the GPU. Nonetheless, single precision seems to work well for the material law presented above. This is of importance on GPUs without good double precision performance, and can also speed up computations in general; especially material law evaluations with tangent seem to benefit performancewise from single precision. We use OpenMP 5 for CPU parallelization; on the graphics card, 5 https://www.openmp.org/ CUDA 6 is used. Details on the computational layout can be found in Section 6.

Automatic Evaluation
Conventionally, efficient methods for the evaluation of specific material laws are derived by hand. For example, GSMs such as (4), (5) are discretized in Chapter 3 of [48] by means of a single backward Euler step. With the help of an explicit formula for the flow direction, the resulting nonlinear system of equations is reduced to a scalar equation that is then solved by Newton's method. For the computation of C n+1 , the derivative of the corresponding nonlinear equation solve is recovered in an implicit function theorem fashion. Numerical integration and algorithmic differentiation are both carried out by hand. We refer to this approach as conventional evaluation strategy -it is material law specific. For our performance studies, it serves as a baseline. In this work, we explore several flavours of the automatic evaluation strategy depicted in Figure 4 that relies on AD to evaluate the various partials of ω and Ψ , to assemble Jacobians as required for ODE integration schemes and finally, to compute the material tangent C n+1 , which involves a differentiation of the whole algorithm depicted in Figure 4. The strategy can easily be adapted to other material laws by exchanging the implementations of the potentials. We also explore the performance benefits of providing handderived implementations of the partials of ω and Ψ for an otherwise automatic evaluation; we refer to this as semi-automatic evaluation strategy.

Single Implicit Euler
Step AD allows us to turn the conventional scheme from [48] into an automatic evaluation strategy that is not specific to a certain GSM and  requires only implementations of ω and Ψ . Let h = t n+1 − t n be the loading step size and that is, the right hand side of the ODE (2). An application of a single implicit Euler step yields that is, the nonlinear system of equations for a n+1 , which we solve with Newton's method. We initialize a (0) n+1 = a n and iterate a The application of AD is twofold. Each evaluation of f (or F ) involves evaluations of the partials ∂Ψ ∂A and ∂ω ∂a . This can be automatized by first order automatic differentiation. Second, the evaluations of the Jacobian ∂F ∂a can be realized likewise by AD but require -due to the already involved partials -an additional derivative order. Note ∂F ∂a = I − h · ∂f ∂a , so it suffices to apply AD to f . The material tangent requires the derivative of the evolved internal variables with respect to the predicted strain.
Assuming -similar to the derivation of the scheme in [48] -that the primary system of equations was solved exactly, it holds by differentiating (6) with respect to a single strain component that is, Hence, the required derivative values can be obtained in a postprocessing step by six additional linear system solves, one for each Voigt component of the strain and with the same coefficient matrix the next Newton iteration would use. ∂f ∂εn+1, i can be evaluated with AD analogously to ∂f ∂a . Since dεn+1 dεn+1 = I, it is then straightforward to propagate the derivatives with respect to the strain with AD through an evaluation of the stress relationship (1) to obtain both σ and C n+1 . As before, this involves also a partial of ω and requires second order AD capabilities. Table 2 provides an overview over time spent with the implicit Euler variants on material law evaluations with and without tangent in our running example. Here, all displayed configurations perform the exact same number of both types of material law evaluations, and the timings are immediately comparable. We should also mention that here, all simulation results obtained are identical up to machine precision. The timings reveal two important trends. First, the automatization on the CPU is costly. Given the significant runtime improvements from switching to the semiautomatic evaluation strategy, part of this cost is due to AD and the automatic computation of the partials of ω and Ψ . Another part of the cost is due to the generality. Other than in the conventional implementation, we lack additional knowledge about the roles of internal variables. We have no formula for the flow direction and solve a full system of nonlinear equations with Newton's method instead. All evaluation strategies are notably accelerated by the GPU, and here, most important, even keeping the full automatization does not incur visible performance costs. This is due to overlap of CPU and GPU workloads as detailed in Section 6.

Rosenbrock and Runge-Kutta Schemes with Adaptive Step Size
With a single implicit Euler step, there is no direct form of error control for the involved material law evaluations. The surrounding elasticity solver cannot compensate this lack of accuracy and will therefore solve the time discretized elasticity problem with potentially wrong stress (and stiffness) input. This regards nonlinear effects in particular. Since we cannot know in advance if and when these take place, we have to discretize the whole loading path with small loading steps. As we detail in the following, while this helps with the accuracy of stresses, the accuracy of tangents can not necessarily be guaranteed this way.
To that end, we first establish an interpretation of the algorithmic derivative of a single implicit Euler step as introduced in the previous section as a single implicit Euler step applied to an ODE for the derivative. Let a parameter dependent ODE systeṁ be given. We differentiate both sides of (8) with respect to p and formally interchange the order of derivatives on the left hand side to obtain d dt Assuming sufficient smoothness [36], the derivative of y with respect to p is the unique solution to (9) together with an initial value. The implicit Euler scheme with step size h applied to the coupled system formed by (8) and (9) yields dy n+1 dp = dy n dp + h ∂f ∂y (y n+1 , p) dy n+1 dp + h ∂f ∂p (y n+1 , p). (11) Clearly, this can be solved in two stages. After a solve of the nonlinear equation (10) for y n+1 , one linear solve of (11) is sufficient to recover the derivative dyn+1 dp . However, (11) can equivalently be obtained in an algorithmic manner by differentiating (10) with respect to p as long as dh dp = 0. Hence, the algorithmic derivative of a single implicit Euler step has an interpretation as a single implicit Euler step applied to the ODE for the derivative. This holds likewise for the single implicit Euler step applied in the schemes in Section 4.1 where we have already seen the two-step solution procedure. Now we deduce properties of the numerical tangent approximation via the ODE it approximates. Let f (ε, a) = ∂Ψ ∂A − ∂ω ∂a (ε, a) denote the right hand side of the evolution equation (2). In the setting of Section 4.1, we have to consider the numerical ODE solve in the context of a single material law evaluation with initial data a n and dan dεn+1 = 0 with step size h = t n+1 −t n over the time interval [t n , t n+1 ]. Here, ε n+1 plays the role of the parameters. The evolution equatioṅ for the derivative. By the properties of the implicit Euler scheme, the numerical solve of the undifferentiated evolution equation is guaranteed to converge with order one to the exact solution as h → 0. Here, the user can influence accuracy by choosing smaller loading steps. For the ODE for the derivative (12), the situation is different. Independent of the loading step size, ε n+1 always refers to the strain value at time t n+1 . The linear interpolation between the known strain values leads to which is the linear interpolation between 0 and 1 over the integration interval [t n , t n+1 ]. Therefore, the ODE for this particular derivative changes its shape with h. As the ODE is not invariant with respect to the integration interval, we cannot expect convergence to the exact solution with h → 0 if only a single implicit Euler step is applied.
The following example illustrates that the relative error in the differentiated internal variables might even increase for h → 0. We compare the results obtained by single implicit Euler steps to the results obtained by implicit Euler with a simple step size control mechanism. Consider a single voxel of the elastoviscoplastic material (4), (5) with the parameters from Table 1. We use the mixed boundary conditions from Figure 3. This loading path is discretized by varying numbers of equidistant loading steps. For each loading step, a material law evaluation with or without substeps is performed. The relative errors observed in the derivative dεvp, n+1, xx dεn+1, xx can be seen in Figure 5. Clearly, the relative error increases for h → 0.
This shows that an accurate tangent evaluation cannot be performed without further discretization of the integration interval [t n , t n+1 ] and serves as an additional motivation for adaptive substeps that are otherwise studied e. g. in [3] in the context of material law evaluation. Specifically, the material law inputs and outputs still follow the global time discretization, but locally, each material law evaluation uses a further discretization of [t n , t n+1 ] to meet specified tolerances. In this section, we analyze well-known integration schemes with respect to automatic differentiation in the presence of step size control. Note that implicit Euler with adaptive steps is not used in the remaining parts of this paper; instead, schemes with step size control via an embedded method are considered.
For adaptive time step sizes, the computation of the material tangent still requires the derivative of the evolved internal variables with respect to the predicted strain. Even if it is in principle possible to propagate those derivatives by AD through multiple steps of an ODE integration scheme in a blackbox manner, this corresponds to an algorithmic differentiation of an approximation and comprises a risk of inaccurate derivatives. The issues of blackbox differentiation of ODE integration schemes and possible solutions are discussed in [11]. Particularly, two problems are mentioned. First, the step size is solely determined by the integration of the primal equation. Hence, there are no guarantees for the accuracy of the derivatives. Second, the differentiation of the step size control mechanism spoils the result with discretization dependent components. In [11], the focus is on an aposteriori error correction that recovers the desired derivatives from quantities obtained by blackbox differentiation. Here, we study the continuous approach to the problem in greater detail and refine the strategy of solving simultaneously an ODE for the derivative for the case of Rosenbrock methods and both explicit and implicit Runge-Kutta schemes in the presence of step size control. In Theorem 2 and Corollary 1, we show that the ansatz is equivalent to suitably modified blackbox differentiation. Particularly, we guarantee that the derivatives are as accurate as the primal solutions.
For the sake of notational simplicity, we develop the following theory for autonomous ODEs and require implicitly that the used integration schemes satisfy the consistency condition that they yield the same numerical solution before and after transformation of the ODE to autonomous form.
Assuming sufficient smoothness [36], the derivative of y with respect to p is the unique solution to (9). The combined system (8) and (9) inherits the stability properties of (8) in the sense that the Jacobian of the right hand side with respect to the unknowns is of block type ∂ ∂ y dy dp f (y, p) ∂f ∂y (y, p) dy dp + ∂f ∂p (y, p) = ∂f ∂y (y, p) 0 * ∂f ∂y (y, p) (13) and has the same eigenvalues as ∂f ∂y (y, p). For some classes of integration schemes, the simultaneous solve of (8) and (9) can be realized by means of automatically differentiating the numerical solve of (8) with respect to p in a blackbox manner with some additional adaptions. Let an integration scheme with s stages and both linear and nonlinear implicit terms be specified by the update relations where J = ∂f ∂y (y n , p) is the Jacobian of the right hand side and Theorem 2 Let initial data y n and dyn dp be given. The algorithmic derivative of a single step of the scheme (14), (15) with step size h applied to (8) yields the same value dyn+1 dp as an application of the same integration step to the combined system (8) and (9)  Proof For notational simplicity, let p be scalar. Let y denote the solution to (8), dy dp its algorithmic derivative with respect to p and yỹ the solution to the combined system (8) and (9). Likewise, we refer to the stage vectors for the solution step of (8) as k i and to the stage vectors for the solution step of the combined system as k iki . By the linearity of (15) and the initial value relationỹ n = dyn dp , it is sufficient to ensure thatk i = dki dp , i = 1, . . . , s. If we apply the integration step to the combined ODEs (8) and (9), the equations for the stage vector componentsk i read wherẽ is the lower left block of (13) evaluated at y n , y n and p. However, as long as dh dp = 0, the same system of equations is obtained if we differentiate both sides of (14) with respect to p and identifyk i = dki dp . To that end, noteJ = dJ dp . Hence, if we recover the algorithmic derivative of the k i from solves of the equations obtained by implicit differentiation, we obtain the same result as by solving an ODE for the derivative.
In the case of prescribed step sizes, Theorem 2 extends inductively to multiple subsequent integration steps. In the case of automatic step size control, for example via an embedded method according to [19], the same holds true after small additional modifications.
1. To meet the assumption dh dp = 0 of Theorem 2 in terms of AD, the step size control mechanism must remain undifferentiated. 2. To achieve the same accuracy for the solution components y and dy dp , all of them must be regarded in the step size control error measure.
These additional modifications can also be found among the general suggestions in [11]. Here, we have shown that they are -together with the appropriate treatment of equation solves -sufficient to turn blackbox differentiation of an ODE integration scheme of the type (14), (15) into an algorithm that is equivalent to solving an ODE for the derivative.

Corollary 1 Theorem 2 generalizes to subsequent integration steps also in the presence of automatic step size control as long as step size control is excluded from differentiation and derivative components are regarded in the error measure. The obtained derivative is as accurate as the primal solution.
Theorem 2 and Corollary 1 cover various classes of well-known integration schemes. If we choose a ij = 0 for j ≥ i and γ ij = 0 for j > i, (14) and (15) turn into a Rosenbrock scheme [18]. There, only linear implicit terms are used and (16) can be simplified to s linear solves The solve fork i can be performed immediately after the solve for k i . For the choice γ ij = 0 for all i and j, we obtain an implicit Runge-Kutta scheme [19]. The implicit Euler step discussed at the beginning of this section is an example for this and hence a special instance of Theorem 2. If additionally a ij = 0 for j ≥ i, we obtain an explicit Runge-Kutta scheme [19]. There, no equation solves are required and Theorem 2 simplifies to a straightforward application of forward AD to the stage vector updates. Otherwise, AD can be used to compute the derivatives required in the setup of (16). In the GSM context, the components of ε n+1 play the role of the parameter p, a n+1 corresponds to y and f is the right hand side of (2). We apply Corollary 1 for the computation of dan+1 dεn+1 . For each class of integration schemes, the AD tool must be capable of computing various higher order derivatives. For explicit Runge-Kutta schemes, as before, we need one derivative order for the computation of the material tangent and one for the evaluation of the partials. For Rosenbrock methods, however, the computation of the Jacobian of the right hand side requires an additional derivative order. This is due to the termJ = dJ dp = d dp ∂ ∂a ∂Ψ ∂A (. . . ) in (17). It is in principle possible to extend the AD tool presented Section 5 to third order derivatives. However, additional derivative orders incur an exponential increase in memory and/or runtime [15] and we do not expect reasonable performance. Thus, to recover one derivative order, the user has to implement the partials of ω and Ψ explicitly in this case, i. e. only the semiautomatic evaluation strategy is available.
We consider the pair of explicit Runge-Kutta schemes from [5] that is known from MATLAB's routine ode23 and a lower-order Runge-Kutta pair formed by the explicit Euler scheme and Heun's method. This pair is also used for DAE integration in the context of material law evaluation in [21] and we refer to it as ode12. Finally, we include the Rosenbrock scheme from [46] that is behind MATLAB's ode23s. We implement all three with automatic step size control according to [19] and keep the MATLAB default tolerances a tol = 10 −6 and r tol = 10 −3 . If we solve additionally for the derivatives, the solutions for the derivative of a with respect to ε n+1 enter the error measure in the same way as primal solution components. Table 3 displays the timings for Runge-Kutta and Rosenbrock evaluation strategies. Compared to the previous timings in Table  2 without adaptive step size control, we take notice that on the CPU, semi-automatic evaluations without tangent with ode12 and especially ode23 can be performed even faster than the conventional evaluation strategy. Often, one or a few adaptive steps are sufficient, and Runge-Kutta steps are computationally cheaper than those of implicit schemes since no equation solves are involved. Note that semi-automatic evaluation without tangent does not require AD. Material law evaluations with adaptive step size and tangent are quite expensive. This is attributed to the effort of solving an ODE coupled with one for the derivative components. Again, the GPU improves the performance significantly, especially for evaluations with tangents. While there are no significant performance differences without tangent, ode23 is fastest with tangent, and is also competitive to the implicit Euler scheme on the CPU. Also, semi-automatic evaluation improves performance mostly on the CPU, and automatic evaluation has insignificant performance drawbacks on the GPU. The bad tangent performance of ode23s is related to register usage; this is explained in Section 6.
As can be seen in Table 4 for the case of 80 loading steps, adaptive substeps tend to reduce the overall number of elasticity solver iterations so that there are less material law evaluations without tangent in total. Figure 6, however, reveals that the loading step size remains -consistently across all ODE solvers -the key influence factor on the number of iterations per loading step.
In Figure 7, the average number of substeps per loading steps are visualized for the four different ODE solvers. By design, implicit Euler always uses one substep per loading step. For the other three solvers, the average number of substeps varies. It is strongly increasing when nonlinear effects occur in the composite. As expected, the first/second order solver ode12 needs the most substeps to reach the prescribed accuracy. The second/third order solvers ode23 and ode23s need a comparable number of substeps. Consequently, the semiimplicit and computationally more expensive ode23s cannot outperform the explicit ode23.

Solution Accuracy
FFT-based homogenization of Moulinec-Suquet [34] applied to materials with nonlinear behaviour is subject to a spatial discretization error of the partial differential equation div σ = 0 investigated in detail by Schneider [41] and furthermore two types of time discretization errors. First, the interaction between different regions of the material (quadrature points) over time is neglected on the material law evaluation level. Second, each integration of the ordinary differential equations (2), that is, each material law evaluation, introduces a local error in the internal variables.
For our example presented in Section 3, we study the influence of the adaptive time step size control on the overall error by comparing the ODE solvers presented above.
The stress response in loading direction is shown in Figure 8. As expected, due to the error control, all ODE solvers with adaptive time steps predict the same effective stress response within the given tolerances. Moreover, as can be seen in Figure 9, the error for coarse loading steps is reduced to approximately 30% of the error of the implicit Euler solver. Thus, the error of the material law evolution, that is, the accuracy of the ODE solver, is dominating the overall error of the FFT-based based homogenization for this example.
For the tangential stiffness shown in Figures 10 and 11, the results depend on the time discretization as explained in detail in Section 4.2. Therefore, we cannot perform a convergence analysis with respect to the loading step size. We observe that all ODE solvers with adaptive time step size control predict almost the same algorithmic tangent due to the error control. The differences observed between single implicit Euler steps and schemes with adaptive substeps are in accordance with the example on the relative error amplification in Section 4. Note that the tangent formula (7) reads for the potentials (4) and (5) that is, linear combinations of errors as depicted in Figure 5 are substracted from the components of the elastic stiffness matrix. This effect regards voxels that follow the Michel Suquet law and can still be seen in the effective stiffness.

Stress-Driven Error Control
Internal variables do not always have a physical meaning, and the material law outputs   that are of immediate relevance to the elasticity solver are σ n+1 and C n+1 . Its convergence test, for example, amounts to an equilibrium check of the stress field [34], and the material tangents are used to determine a linear elastic reference material [32,12,23]. In the material law evaluations, however, the tolerances specified for the ODE solver relate to an error in the internal variables. We control the error in a n+1 and -if we apply Corollary 1 -as well the error in dan+1 dεn+1 .
In the GSM given by Equations (4) and (5), for example, the stress relationship (1) turns into σ = C e (ε − ε vp ), that is, any error in ε vp enters σ multiplied by the elastic stiffness tensor. Depending on the specific instance of C e , it might be necessary to adapt the tolerances Time of loading step [s] Average number of substeps [1] Impl. Euler with 20 loading steps Impl. Euler with 40 loading steps Impl. Euler with 80 loading steps Impl. Euler with 160 loading steps Impl. Euler with 320 loading steps ode12 with 20 loading steps ode12 with 40 loading steps ode12 with 80 loading steps ode12 with 160 loading steps ode12 with 320 loading steps ode23 with 20 loading steps ode23 with 40 loading steps ode23 with 80 loading steps ode23 with 160 loading steps ode23 with 320 loading steps ode23s with 20 loading steps ode23s with 40 loading steps ode23s with 80 loading steps ode23s with 160 loading steps ode23s with 320 loading steps   of the ODE solver to end up with stress values that are sufficiently accurate for the PDE solver. This is avoided by an error control on the ODE level that is directly tied to the accuracy of the stresses.
The step size control mechanism from [19] captures the deviation between two ODE solutions of different order of convergence in an error measure. Depending on the error, steps are accepted or rejected and the step size is adapted accordingly. Instead of using the internal variable approximations directly in the error measure, we transform them together with the adequate linear interpolation between ε n and ε n+1 for the substep of interest via the relationship (1) into a pair of stresses. If σ de- Impl. Euler with 20 loading steps Impl. Euler with 40 loading steps Impl. Euler with 80 loading steps Impl. Euler with 160 loading steps ode12 with 20 loading steps ode12 with 40 loading steps ode12 with 80 loading steps ode12 with 160 loading steps ode23 with 20 loading steps ode23 with 40 loading steps ode23 with 80 loading steps ode23 with 160 loading steps ode23s with 20 loading steps ode23s with 40 loading steps ode23s with 80 loading steps ode23s with 160 loading steps

C11 [GPa]
Impl. Euler with 20 loading steps Impl. Euler with 40 loading steps Impl. Euler with 80 loading steps Impl. Euler with 160 loading steps Impl. Euler with 320 loading steps ode12 with 20 loading steps ode12 with 40 loading steps ode12 with 80 loading steps ode12 with 160 loading steps ode12 with 320 loading steps ode23 with 20 loading steps ode23 with 40 loading steps ode23 with 80 loading steps ode23 with 160 loading steps ode23 with 320 loading steps ode23s with 20 loading steps ode23s with 40 loading steps ode23s with 80 loading steps ode23s with 160 loading steps ode23s with 320 loading steps pends -as above -linearly or, more generally, Lipschitz on the internal variables, this yields a pair of stresses with the analogous order relations. The rationale of the step size control carries over, and we evaluate the error measure on the stresses instead. If we solve additionally for the derivative dan+1 dεn+1 , the same evaluation of (1) (performed on forward AD types instead) transforms additionally the approximations of the internal variable deriva-tives into a corresponding pair of material tangents that may then enter the error measure in the same way the derivative components did before. This way, we control the error in σ n+1 and C n+1 .
As can be seen in Figure 12, stress-driven error control also reduces the impact of the ODE solver choice on the effective stress response for all numbers loading steps. Impl. Euler with 20 loading steps Impl. Euler with 40 loading steps Impl. Euler with 80 loading steps Impl. Euler with 160 loading steps Impl. Euler with 320 loading steps ode12 with 20 loading steps ode12 with 40 loading steps ode12 with 80 loading steps ode12 with 160 loading steps ode12 with 320 loading steps ode23 with 20 loading steps ode23 with 40 loading steps ode23 with 80 loading steps ode23 with 160 loading steps ode23 with 320 loading steps ode23s with 20 loading steps ode23s with 40 loading steps ode23s with 80 loading steps ode23s with 160 loading steps ode23s with 320 loading steps  Similar ideas can be employed for the convergence criterion of Newton's method in the schemes from Section 4.1. Instead of iterating until convergence in a, we may compute the stress resulting from the current iterate via (1) in each Newton iteration and converge σ instead.

Automatic Differentiation on GPUs
To summarize the basic ideas of automatic differentiation, we view a floating point computation with fully evaluated control flow as a function x → y that is composed of elementary mathematical operations like +, · or standard math library functions like sin. If we differenti-ate the composed operations according to the chain rule, we obtain the algorithmic derivative of the computer program. Automatic differentiation deals with techniques that obtain algorithmic derivatives in an automatic fashion. A comprehensive introduction is given in [15].
As both ω and Ψ are scalar valued functions and have -with respect to both ε and a -more inputs than outputs, it seems appropriate to use the reverse mode of automatic differentiation to evaluate the partial derivatives on the right hand sides of the GSM constitutive equations (1) and (2). C n+1 , on the other hand, arises as the derivative of σ n+1 with respect to ε n+1 , that is, six Voigt components with respect to six Voigt components. We compute it with the forward mode of automatic differentiation, possibly the forward vector mode. To compute both the partials and C n+1 with AD at the same time, we combine the forward and reverse mode in an adjoints of tangents fashion [15]. While the computation is generally executed on a forward AD data type, all local evaluations of partials are obtained by additional applications of the reverse mode. In the context of semi-automatic ode23s, we use the second order forward (vector) mode for the Jacobians and tangents.
The implementation of the first and second order forward (vector) mode follows the same principles as CPU implementations like [39]. The reverse mode of AD, however, is subject to a global information problem that is typically solved by taping. The sequence of operations is first executed in forward direction and remembered together with all intermediate results.
Then, the corresponding sequence of derivatives is evaluated according to the chain rule in reverse order. On the GPU, this memoryintensive approach is prohibitive. Since the reverse mode of AD is only needed in a very local manner, we may replace taping by recomputations: If an intermediate value is required during reverse evaluation, the sequence of operations is partially re-evaluated in for-ward direction up to the required point. Similar approaches are pursued in [27]. This can be realized by an operator overloading ansatz at low computational overhead on the expression level. We employ expression template techniques that have previously been shown to perform well for the treatment of right hand sides in the forward mode of AD [35] and in Jacobi taping [22] as well as primal value taping [38] in the reverse mode of AD. Here, we use expression templates to convert a composite operation into a structured data type that represents the computational graph and allows for its traversal in forward and reverse direction. This way, the structure of the computation is fully exposed to the compiler and can be optimized during compilation. The curiously recurring template pattern is used to shift overhead due to the interface in the inheritance tree in Figure 13 from run time to compile time. Figure 13 showcases the reverse mode without additional tangents using the example of a unary elementary operation f(). The interface ReverseExpression defines a routine v() for forward evaluation and a routine back() for backpropagation of derivatives. On the one hand, it is implemented as a type ReverseBasic that contains actual data, that is, a primal value _v and an adjoint value _bv. On the other hand, there are derived types that stand for applied elementary operations such as ReverseOpF. They are created by operation overloads such as ReverseOpF f(const ReverseExpression &expr) { return ReverseOpF(expr); } that do not immediately apply f() but store a reference to the arguments in the returned object. Types such as ReverseOpF implement the interface in a way that allows for the forward and reverse evaluation of the computational graph. A call to v() causes the forward evaluation of _arg and subsequent application of f(). A call to back() propagates derivative values in reverse direction where df() stands for the derivative of f() and must be implemented ex- where CompositeExpression stands for a composition of multiple elementary operations. Each elementary operation must be implemented according to Figure 13. The operation overloads are used to build up the computational graph of this right hand side and in the course of the assignment to result, ReverseBasic:: operator=() is used to trigger its forward and subsequent reverse evaluation. In the end, argn ._bv carries the machine accurate derivative of result._v with respect to argn._v where n = 1, 2, ....
The proposed AD tool can be implemented in C++ using C++11 features that are sup-ported both by standard compilers such as g++ and by Nvidia's CUDA compiler driver nvcc. Particularly, the AD tool can be applied both inside OpenMP threads and CUDA kernels.
We improve the performance of the AD tool by some adaptions that are specific to our problem and setting.

During expression tree forward traversal,
it is possible to evaluate primal values only once and store them in the nodes of the tree [35]. However, to consume as little memory as possible, we use recomputations instead. This is especially important for the GPU on which memory operations are costly and the number of registers used per thread can limit parallel execution. 2. Instead of a recursive ansatz for higher order derivatives, we implement second order expressions explicitly. This helps the compiler with the identification and elimination of common subexpressions, which it cannot always do automatically. 3. In the computation of the partials of ω, we are always only interested in the derivative with respect to either ε or a but never both. If we compute the derivative with respect to one, there is no need to propagate derivative values back to the other. There-fore, we provide mixed order expressions that actively avoid reverse propagation of derivative values to lower order type arguments.
The AD tool can only differentiate single expressions in reverse order and is overall limited to first and second order derivatives. The first and second order forward (vector) mode, however, are not restricted to single expressions and can be applied to general codes, like the ODE solvers in the case of AutoMat. In the presented design, automatic differentiation takes exclusively place in GPU registers (sometimes spilled but mostly actual, see Table 6).

Computational Layout, Profiling and Performance Limiters
The fields for the internal variables, the current strain field and the predicted strain field, that is, the material law inputs for all voxels, reside in host memory. In general, GPU memory is not large enough to hold all of them at the same time and the elasticity solver still runs on the CPU. Furthermore, data might reside in host memory in an array-of-struct layout that does not suite GPU computing and due to the heterogenity of the material, data for all voxels of a specific material law might be arranged in memory in a non-contiguous manner. Therefore, we divide the workload into multiple chunks of fixed size, in a way that GPU memory can at least hold the material law inputs and outputs of one or few chunks. In host memory, we allocate at least one staging area of chunk size and page-locked type that allows for fast CPU-GPU data exchange. On the host side, we copy the material law inputs of a chunk into the staging area in an OpenMP parallel manner. In doing so, we arrange them in a contiguous manner in a structof-arrays layout, and might convert from double to single precision. Then, we process the staging area with multiple CUDA streams. Each stream copies part of the inputs to the GPU and issues the corresponding material law evaluations. We use one CUDA thread per material law evaluation and a small multiple of 32 as block size for the computational grid. Once the evaluations are done, the stream copies the material law outputs back to the staging area. The purpose of multiple streams is an overlap of CPU-GPU data exchange with GPU computations. Once the entire staging area is processed, the material law outputs are collected from the staging area, transformed back to the original layout and otherwise postprocessed as required by the elasticity solver in an OpenMP parallel manner. By means of multiple staging areas, an overlap of CPU and GPU workloads can be achieved: During GPU computations, transformations of inputs and outputs involving other staging areas can already take place on the host side. We observed no benefits for more than two staging areas.
The CPU-GPU overlap becomes evident in Table 5. For material law evaluations without tangent, the time spent on material law evaluation is determined by the time it takes to stage and collect the data. CPU-GPU data exchange and GPU computations overlap almost completely with the CPU workloads. The minimum time needed for exchange of the combined data over the PCI Express bus (assuming full bandwidth and perfect overlap of both transfer directions) gives an impression of the amount of time that is at least hidden behind CPU workloads. The exemplary profilings presented in Figure 14 show that the GPU compute time is in turn dominated by CPU-GPU data exchange, and due to overlap mostly hidden behind it.
For material law evaluations with tangent, the observations are different. Here, staging and collecting cannot hide all GPU workloads, in particular the GPU computations which are also more expensive than the CPU-GPU data exchange. This has two reasons. First, the the postprocessing step for the tangent or solving the coupled ODE system, respectively, is in itself computationally more expensive. The derivative components, however, also increase the memory footprint of the GPU kernels, in  Tables 2 and 3 into CPU workloads and nonoverlapped GPU workloads. Includes also lower time bound for the PCIe data exchange. Some variations between material law evaluations without tangent are due to differences in the number of elasticity solver iterations. The variations in the PCIe bounds indicate this extent. (1) (4)  Table 6. This limits the overall number of threads that can run in parallel, and it is important to keep that number small. To that end, all ODE solvers with adaptive step size among the GPU configurations with tangent are subject to another performance optimization. Instead of propagating all six tangent directions simultaneously through one material law evaluation with the forward vector mode, we re-evaluate each material law six times, each with the standard forward mode and one tangent direction, i. e. we compute C n+1 column by column. This can also be seen in Figure 15. Note that this has no influence on staging, collecting or the amount of CPU-GPU data exchange. We trade memory for computations on the GPU, and the performance benefits of kernels with smaller (1) (2) (4) Fig. 15 Profilings performed for automatic material law evaluations with tangent -implicit Euler (1), ode12 (2), ode23 (3) -and semi-automatic ode23s (4). Time scales vary between (1)-(4). Note the six evaluation steps between data transfer in (2)  memory footprint outweigh the additional effort incurred by the re-evaluations.
Interestingly, the CPU-GPU data exchange is -due to overlap and the cost of staging and collecting -in none of the configurations discussed above a key limiting factor. Nonetheless, our implementation of the material law from Section 3 reduces that data. Material law evaluations with tangent, for example, copy back C n+1 but neither stresses nor internal variables. Specifically for the GSM given by (4), (5), we exploit ε vp ∈ range(dev), i. e. one component of the viscoplastic strain can be eliminated and is computed on the fly in the implementations of ω and Ψ from the others instead.
The effect of using AutoMat on the total runtime of FFT-based homogenization is summarized in Table 7. On the CPU, ode23 is the best choice. It needs approximately the same time as our conventional implementation and gives more precise results according to Section 4.3. On the GPU, the choice of the ODE solver does not influence the total runtime significantly with the notable exception of ode23s. For all other ODE solvers, AutoMat accelerates the FFT-based homogenization method by a factor of more than two on the GPU. For ode23s, this holds only true if the reference material is not updated. In all other cases, the ODE solver can be chosen without performance considerations on the GPU.

Conclusion
In this article, we have introduced and studied a universal method for evaluating GSMs. With automatic differentiation, the material law setup is reduced to the implementation of two potentials. This eliminates the inconvenience of hand-computed derivatives and greatly simplifies the material law implementation process.
In a first step, we automatized the conventional implicit Euler approach and were able to reproduce the solution of the elasticity problem up to machine accuracy. However, we also demonstrated that its tangent computation is subject to general accuracy issues. As these can be resolved by an integration of the evolution equation for the state variables with adaptive time step sizes, we detailed how blackbox automatic differentiation of Rosenbrock and Runge-Kutta methods must be modified in the presence of time step size control to obtain derivatives that are as accurate as the primal solution. Material law evaluations with adaptive time steps improved the solution accuracy of the elasticity problem significantly for large loading steps, especially when stress and stiffnes error measures are used for time step size control. Thus, we have a method at hand to assess the time discretization error disregarding contributions from solving the evolution equation.
To make the method applicable to CTscale problems, we finally moved the material law evaluation to the GPU. Various kinds of overlap resulted in run times for the stress response that are independent of the chosen integration scheme and are moreover much faster than our conventional implementation on the CPU. Especially automatic evaluation strategies are accelerated significantly, which would not be possible without our efficient implementation of automatic differentiation on the GPU.
We conclude that the framework for integrating GSMs into mechanical solvers presented in this article is unmatched in its simultaneous flexibility, accuracy and performance. It is particularly well suited to improve and accelerate matrix-free solvers like FFT-based homogenization. With the resulting user-friendly and fast method, it becomes feasible to investigate the non-linear material behavior, like viscoelasticity and viscoplasticity, of composites on a single workstation.  Table 7 Total runtime of FFT-based homogenization with and without updated reference material for different settings of AutoMat on the CPU and GPU.