openQ*D code: a versatile tool for QCD+QED simulations

We present the open-source package openQ*D-1.0, which has been primarily, but not uniquely, designed to perform lattice simulations of QCD+QED and QCD, with and without C* boundary conditions, and O(a) improved Wilson fermions. The use of C* boundary conditions in the spatial direction allows for a local and gauge-invariant formulation of QCD+QED in finite volume, and provides a theoretically clean setup to calculate isospin-breaking and radiative corrections to hadronic observables from first principles. The openQ*D code is based on openQCD-1.6 and NSPT-1.4. In particular it inherits from openQCD-1.6 several core features, e.g. the highly optimized Dirac operator, the locally deflated solver, the frequency splitting for the RHMC, or the 4th order OMF integrator.


Introduction
QED radiative corrections to hadronic observables are generally rather small but they become phenomenologically relevant when the target precision is at the percent level. For example, the leptonic and semileptonic decay rates of light pseudoscalar mesons are measured with a very high accuracy and, on the theoretical side, have been calculated with the required non-perturbative accuracy by many lattice collaborations. Most of these calculations have been performed by simulations of lattice QCD without taking into account QED radiative corrections. A recent review [4] of the results obtained by the different lattice groups shows that leptonic and semileptonic decay rates of π and K mesons are presently known at the sub-percent level of accuracy. At the same time, QED radiative corrections to these quantities are estimated to be of the order of a few percent, by means of chiral perturbation theory [5]. These estimates have recently been confirmed in the case of the leptonic decay rates of π and K by a first-principle lattice calculation of the QED radiative corrections at O(α) in refs. [6,7].
Other remarkable examples of observables for which QED radiative corrections are phenomenologically relevant are the so-called lepton flavour universality ratios. For example R(D ( * ) ) is defined as the branching ratio for B → D ( * ) ν with = e, µ divided by the branching ratio for B → D ( * ) τν τ . Most of the hadronic uncertainties cancel in these ratios that are built in such a way that they are trivial in the Standard Model, in the limit in which the two leptons have the same mass. Presently, a combined analysis [8] of the R(D) and R(D * ) ratios shows a deviation of the experimental measurements from the theoretical predictions of the order of 3 standard deviations. On the other hand, QED radiative corrections are different for the two leptons because of the different masses and an improved theoretical treatment of these effects (see for example refs. [9,10] for a discussion of this point) can possibly enhance or reconcile the observed discrepancy between the experimental measurements and the theoretical expectations.
QED radiative corrections to hadronic observables can be computed from first principles by performing lattice simulations of QCD coupled to QED, treating the photon field on an equal footing as the gluon field. This approach, pioneered in refs. [11][12][13], is highly non-trivial from both the numerical and theoretical point of view, because of the peculiarities of QED. Numerically, lattice calculations are unavoidably affected by statistical and systematic uncertainties and it can be challenging to resolve QED radiative corrections from the leading QCD contributions within the errors of a simulation. Theoretically, a big issue arises because lattice calculations have necessarily to be done on a finite volume. QED is a long-range interaction and, consequently, finite-volume effects are the key issue in presence of electromagnetic interactions.
In fact, as a consequence of Gauss' law, it is impossible to have a net electric charge on a periodic torus. Because of this strong theoretical constraint, it is particularly challenging to calculate from first principles physical observables associated with electrically charged external states, such as the phenomenologically relevant quantities discussed above. Several approaches have been proposed over the years to cope with this problem, see ref. [14] for a recent review. The most popular approaches to the problem of charged particles on the torus solve the Gauss' law constraint by introducing non-local terms in the finitevolume action of the theory. 1 The effects induced by the non-locality of the action are 1 A different approach is based on the idea that one can write QCD+QED observables at first order in expected to disappear once the infinite-volume limit is properly taken and, as far as O(α) QED radiative corrections are concerned, it is generally possible to show that this is indeed the case.
On the one hand, the non-local formulations of the theory are particularly appealing because of their formal simplicity. On the other hand, it has been shown in ref. [18] that it is possible to probe electrically charged states on a finite volume by starting from a local formulation of the theory and, remarkably, in a fully gauge-invariant way. This is possible by using C-parity (or C * ) boundary conditions for all the fields and by using a certain class of interpolating operators originally introduced by Dirac in a seminal work [19] on the canonical quantization of QED.
The formulation of ref. [18] has also been studied numerically. The results for the meson masses extracted in a fully gauge-invariant way from lattice simulations of QCD+QED with C * boundary conditions obtained in ref. [20] provide a convincing numerical evidence that, beside being an attractive theoretical formulation, the proposal of ref. [18] is also a valid numerical alternative for the calculation of QED radiative corrections on the lattice. This motivated the present work.
In this paper we present the open-source package openQ*D, which can be used to simulate QCD+QED, QCD, the pure SU(3) and U(1) gauge theories. 2 The code allows to choose a wide variety of temporal and spatial boundary conditions. In particular, it allows to perform dynamical simulations of QCD+QED with C * but also with periodic boundary conditions along the spatial directions. Simulations of QCD with C * boundary conditions can be a valuable starting point for the application of the RM123 method [21], in which observables are calculated order-by-order in the electromagnetic coupling. A fully tested and stable relase of openQ*D can be downloaded from [1].
The openQ*D package is based on the openQCD [2] package from which it inherits the core features, most notably the implementation of the Dirac operator, of the solvers and the possibility of simulating open and Schrödinger functional boundary conditions in the time direction. One of the inherited solvers implements the inexact deflation algorithm of ref. [22]. An added value of the openQ*D package is the possibility of using more deflation subspaces in a single simulation. This is particularly important in the case of QCD+QED simulations because different deflation subspaces have to be generated for quarks having different electric charges.
Another important feature present in the openQ*D package is the possibility to use Fourier Acceleration [23,24] for the molecular dynamics evolution of the U(1) field. The used implementation of the Fast Fourier Transform (FFT) is an adaptation of the corresponding module in the NSPT [3,25] package.
The remaining of this paper is organised as follows. In section 2 we give an overview of the theoretical background needed to understand the actions simulated by openQ*D, and we describe some peculiar aspects of the simulation algorithm. In particular, the specific implementation of C * boundary conditions and of the Fourier Acceleration for the U(1) field are discussed. In section 3 we provide instructions on how to compile the code, construct a sample input file, and run the program that generates QCD+QED configurations. Section 4 is a collection of tests and performance studies. In particular, we present α as QCD observables with analytic (possibly infinite-volume) QED kernels, e.g. [15][16][17]. 2 The code allows also for (inefficient) simulations of QED in isolation, even though a main program for this purpose is not provided in the 1.0 version.  scalability tests, and studies of the performance of solvers for the Dirac equation for electrically charged fields. We also illustrate the outcome of some sample runs performed for testing purposes. In figure 1, we provide a schematic view of the openQ*D functionalities.

Theoretical background
An overview of the main algorithmic choices made in the code will be given in this section. The fundamental fields are the SU(3) link variable U µ (x) and the real photon field A µ (x). Since only the compact formulation of QED is implemented at present, all observables are written in terms of the U(1) link variable which implies that the real photon field can be restricted to −π ≤ A µ (x) ≤ π with no loss of generality. Various boundary conditions can be chosen for the gauge fields: periodic, open [26], Schrödinger Functional (SF) [27,28] and open-SF boundary conditions [29] in the Euclidean time direction µ = 0, periodic and C * boundary conditions [30][31][32][33] in the spatial directions. The implementation of C * boundary conditions is discussed in section 2.1. After integrating out the fermion fields in a usual way, the target distribution of QCD+QED if no C * boundary conditions are used is where the gauge actions S g,SU(3) (U ) and S g,U(1) (A) are briefly discussed in section 2.2, the product runs over the simulated fermion flavours indicized by f , and the Dirac operator D is introduced in section 2.3. If C * boundary conditions are used, the determinant is replaced by a Pfaffian, i.e.
where C is the charge conjugation matrix and T is a field-independent matrix satisfying T 2 = 1, whose detailed definition can be found in section 2.1. While in the continuum limit the determinant and the Pfaffian are positive, this is not the case with Wilson fermions. The absolute value is considered in both cases, which amounts to replacing The sign should be separately calculated and included in the evaluation of observables as a reweighting factor [34,35]. It is important to stress that this is a mild sign problem [18], which becomes irrelevant sufficiently close to the continuum limit, and which is also present in standard QCD simulations for the strange quark. The presented strategy is in line with state-of-the-art QCD and QCD+QED simulations, in which the sign of the determinant is simply ignored. Future work will be planned to investigate the importance of the sign especially at lighter quark masses. After introducing the standard even-odd preconditioned operatorD [36], one rewrites the quark part of the distribution as where α f is either 1/2 or 1/4. The definitions ofD f and S sdet can be found in section 2.3. Instead of this target distribution, the openQ*D code simulates a slightly different distribution written in terms of a rational approximation R f [37] where µ f is a tunable parameter introduced to suppress configurations with exceptionally small eigenvalues ofD † fD f (twisted-mass reweighting [38,39]). If µ f is small enough and the rational approximation is accurate enough, the simulated distribution ρ sim (U, A) is very close to the target one ρ tar (U, A). The difference is corrected by means of reweighting which have to be separately calculated and included in the expectation values of observables as follows The detailed discussion of the supported reweighting factors can be found in appendix A.
The rational function R f can be decomposed in a product of positive factors R f, (frequency splitting [39]). More details on frequency splitting are provided in section A.2. The determinant of the rational functions is finally represented by means of a pseudofermion quadratic action as in The distribution is generated by means of a Hybrid Monte Carlo (HMC) algorithm with Fourier acceleration for the U(1) field. The molecular dynamics (MD) Hamiltonian is given by where Π µ (x) and π µ (x) denote the momentum fields associated to the SU(3) and U(1) fields, the operator (−∆) is a discretization of the Laplace operator, and the action is given by Details on the implementation of the Fourier acceleration are presented in appendix B. The HMC consists of three steps.
1. The momentum and pseudofermion fields are randomly generated with probability distribution given by e −H ; 2. The gauge fields are evolved with a discretized version of the MD equations, i.e.  The top diagram represents a section of the extended lattice along a (1, k) plane where k = 2, 3 is a direction with C * boundary conditions. All fields are periodic along the extended direction 1. C * boundary conditions in the direction k = 2, 3 are replaced by shifted boundary conditions in the extended lattice. Shifted boundary conditions are imposed by properly defining the nearest neighbours of boundary sites. Empty circles in the red (resp. green, blue) rectangle have to be identified with the corresponding solid circles in the red (resp. green, blue) rectangle. The bottom diagram represents a section of the extended lattice along a (1, k) plane where k = 2, 3 is a periodic direction. In both diagrams, the black circles represent the sites of the physical lattice, and the grey circles represent the sites of the mirror lattice.

C * boundary conditions
Other than the variety of boundary conditions in the temporal direction inherited from openQCD-1.6, the openQ*D code allows for periodic or C * boundary conditions to be chosen in the spatial directions. If the gauge fields satisfy periodic boundary conditions in all spatial directions k, the fermion fields ψ f (x) andψ f (x) satisfy general phase-periodic boundary conditions (f is the flavour index), i.e. 14) Phase-periodic boundary conditions are incompatible with C * boundary conditions. If the gauge fields satisfy C * boundary conditions in at least one direction, say k, then θ f,j = 0 for all f and j, and The charge-conjugation matrix C satisfies C * boundary conditions are implemented by means of an orbifold construction. Assume that k = 1 is a direction with C * boundary conditions, 3 in order to simulate a physical lattice with size V = L 0 × L 1 × L 2 × L 3 the openQ*D code allocates a lattice with size V C * = L 0 × (2L 1 ) × L 2 × L 3 , which we will refer to as the extended lattice. Points in the physical lattice are assumed to have coordinates which satisfy 0 ≤ x µ < L µ . The extended lattice can be interpreted as a double-covering of the physical lattice, with coordinates satisfying 0 ≤ x µ < L µ for µ = 1 and 0 ≤ x 1 < 2L 1 . Points outside the physical lattice constitute the mirror lattice. On the extended lattice, points x and x + L kêk do not coincide, so eqs. (2.16) and (2.17) have to be interpreted as constraints which define the admissible gauge and fermion fields. These are referred to as the orbifold constraints. While the admissible gauge fields in the mirror lattice are completely determined by the value of the gauge field in the physical lattice via (2.16), the orbifold constraint has a different meaning for fermion fields, providing a relation between ψ in the physical lattice andψ in the mirror lattice, and vice versa. Given that the fermion fields ψ andψ are independent Grassmanian variables on the physical lattice, then one can equivalently choose the value of ψ in each point of the extended lattice as a complete set of independent variables. The integration of the Grassmanian variables yields the Pfaffian of the operator CT D [18], where T is the translation operator defined by (2.19) One easily proves that which justifies the need for α f = 1/4 in eq. (2.5). Since the square of the charge-conjugation operation is the identity, all fields must obey periodic boundary conditions along the extended direction k = 1, i.e.
C * boundary conditions in directions k = 2, 3 are implemented by modifying the global topology of the extended lattice (see fig. 2). In fact in these directions, C * boundary conditions in the physical lattice imply shifted boundary conditions in the extended lattice, i.e.
When the determinant of the Dirac operator is stochastically estimated by means of a pseudofermion action as in eq. (2.12), the pseudofermion field Φ f, is natively defined on the extended lattice, i.e. Φ f, (x) are truly independent variables for each x in the extended lattice. Moreover it satisfies the same boundary conditions as ψ f in eqs. (2.22) and (2.24). It is worth noticing that C * boundary conditions can be implemented in different ways. For instance, the implementation proposed in appendix D of ref. [18] does not double the lattice, but the number of pseudofermion fields. Roughly speaking one needs to represent quarks and antiquarks by means of independent pseudofermion fields which are mixed by the boundary conditions. The openQ*D implementation simply maps each pair of pseudofermion fields in the geometry of the extended lattice. The cost of the application of the Dirac operator implemented as in openQ*D and as in [18] is exactly identical. Therefore, as far as the application and inversion of the Dirac operator, the orbifold contruction does not introduce any overhead with respect to more standard implementations of C * boundary conditions. On the other hand, the gauge field is evolved twice. In principle one could evolve the gauge field only on the physical lattice and then copy its value to the mirror lattice. This strategy will be considered in the future. However, simulations close to the physical point are dominated by the inversion of the Dirac operator and the overhead due to the evolution of the gauge field is expected to be negligible. Evidence of this fact has been presented in [42].

Gauge actions
The SU(3) and compact U(1) gauge actions that can be simulated with openQ*D are where U (C) and z(C) denote the SU(3) and U(1) parallel transports along a path C on the lattice. S 0 and S 1 are the sets of all oriented plaquettes and all oriented 1 × 2 planar loops respectively and the overall weight ω C * is 1 if no C * boundary conditions are used. With C * boundary conditions ω C * = 1/2 corrects for the double counting introduced by summing over all plaquette and double-plaquette loops in the extended lattice instead of the physical lattice (c.f. section 2.1). The coefficients c 0,1 satisfy the relation c 0 + 8c 1 = 1. For SU (3), the Wilson action is obtained by choosing c 0 = 1, the tree-level improved Symanzik (or Lüscher-Weisz) action is obtained by choosing c 0 = 5 3 , and the Iwasaki action is obtained by choosing c 0 = 3.648. The parameters g 0 and e 0 are the bare SU(3) and U(1) gauge couplings respectively, which are related to the β parameter and the bare fine-structure constant α 0 by β = 6 In the compact formulation of QED, all electric charges must be integer multiples of some elementary charge q el which is defined in units of the charge of the positron. As discussed in ref. [18], q el appears as an overall factor in the gauge action and essentially sets the normalization of the U(1) gauge field in the continuum limit. Even though in infinite volume q el = 1/3 would be an appropriate choice in order to simulate quarks, in finite volume with C * boundary conditions one needs to choose q el = 1/6 in order to construct gauge-invariant interpolating operators for charged hadrons [18,20]. Note that by using a compact formulation of QED, no gauge fixing is added to the action, and furthermore the user is free to choose simulating (QCD+)QED without C * boundary conditions. The actions in eqs. (2.25) and (2.26) assume periodic boundary conditions in time. In the more general case, the actions are modified at the time boundary in order to allow for O(a) improvement. The general form of the gauge actions can be found in [43].

Dirac operator
The Dirac operator implemented in openQ*D is given by a sum of terms where D w is the (unimproved) Wilson-Dirac operator, δD sw is the Sheikholeslami-Wohlert (SW) term, and δD b is the time boundary O(a)-improvement term. For simplicity, periodic boundary conditions in the time direction will be assumed, which means δD b = 0. The definition of δD b for other boundary conditions can be found in [44]. The Wilson-Dirac operator of eq. (2.28) can be written as where the covariant derivatives are defined as The SW term is given by The SU(3) field tensorF µν (x) and the U(1) field tensor A µν (x) are constructed in terms of the clover plaquette. The explicit expression of the SU(3) field tensor used in openQ*D can be found in ref. [45], while the U(1) field tensor is given here, The normalization is chosen in such a way that −ie 0Âµν (x) is the canonically-normalized field tensor in the naive continuum limit. Notice that the field tensors are anti-hermitian.
In presence of electromagnetism, the Dirac operator depends on the electric charge of the quark field. Let q be the physical electric charge in units of e (i.e. q = 2/3 for the up quark, and q = −1/3 for the down quark). In the compact formulation of QED, all electric charges must be integer multiples of an elementary charge q el , which appears as a parameter in the U(1) gauge action (2.26). The integer parameter is the one appearing in the hopping term in eqs. (2.30) and (2.31). On the other hand, notice that the SW term (2.32) is written in terms of the physical charge q. This normalization corresponds to a definition of c sw which is equal to 1 at tree level. The definition of the even-odd preconditioned Dirac operatorD is standard [36] and so is the definition of the small-determinant action S sdet appearing in eq. (2.5) 3 Simulating QCD+QED with openQ*D

Structure of the openQ*D program package
The openQ*D code includes several main programs, roughly divided in three categories: programs to generate configurations, programs to measure observables, and utility programs. The following programs (in the main directory) can be used to generate gauge configurations for various theories: • iso1: SU(3)×U(1) gauge theory with dynamical fermions; • qcd1: SU(3) gauge theory with dynamical fermions; • ym1: SU(3) pure gauge theory; • mxw1: U(1) pure gauge theory.
Finally, the following utility programs are also included: • minmax/minmax: it generates the rational approximations needed for the RHMC algorithm; • devel/nompi/read*: they can be used to read the binary *.dat files generated by the other programs.

Compiling and running the main program
A complete guide to the usage of all programs listed in section 3.1 can be found in the headers of the source-code files, and in the README files in the corresponding directories.
Often the user will be referred to other sources of documentation (e.g. README files in some of the modules subdirectories, or the headers of other source-code files, and some of the PDF files in the doc directory). This section is intended to be neither a replacement nor a duplicate of these sources of documentation, but rather an overview of the main steps that are needed to use the iso1 program to generate QCD+QED configurations. 2. Set the environment variables. The Makefile in the main directory assumes that the C compiler can be called by using $(GCC), the MPI header file is found at $(MPI_INCLUDE)/mpi.h, the MPI compiled library is found in the $(MPI_HOME)/lib/ directory, and the mpicc command is available. The needed environment variables can be defined in the appropriate shell initialization files, e.g.

# !/ bin / bash # [ Stuff ]
export GCC = " gcc " export MPI_INCLUDE = " / usr / local / include / " export MPI_HOME = " / usr / local / " 3. Choose the intrinsics acceleration options. Some pieces of code exist in several versions: plain C, inline-assembly with SSE instructions, and inline-assembly with AVX instructions. corresponds to an 8 4 local lattice, replicated on an 8 2 × 4 2 MPI process grid (the code will need to be run with 1024 MPI processes), which yields a 64 2 × 32 2 global lattice. As explained in section 2.1, this choice of simulation parameters corresponds to a 64 2 × 32 2 physical global lattice if no C * boundary conditions are used, or to a 64 × 32 3 physical global lattice if C * boundary conditions are used in at least one spatial direction. In our implementation, NPROCn has to be a multiple of 2 if C * boundary conditions are used in the direction n = 1, 2, 3.

5.
Compile the iso1 program and prepare for running. At this point, the code is ready to be compiled. Assuming that the root directory of the code is $HOME/openQxD, this is done by executing the following commands in a bash shell.
cd $ { HOME }/ openQxD / main make iso1 One can set up the directories and files to run the code by executing the following commands in a bash shell.
cd $ { HOME }/ openQxD mkdir test cd test mkdir cnfg dat log input cp ../ main / iso1 iso1 > input / pedro01 . in > runtest . sh chmod a + x runtest . sh 6. Edit the input file. The input file input/pedro01.in must contain all adjustable parameters of the simulation (except the few ones that have been set at compile time). A rough guide on how to construct an input file for the iso1 program is found in section 3.2.2. Alternatively, a sample input file can be cut and paste from appendix C.
7. Start the simulation. Edit the runtest.sh script as follows: The runtest.sh script contains the command that invokes the iso1 program. It can be launched via a standard mpirun command, or incorporated in a script for a 14 job scheduler. Recall that the number of needed MPI processes has been decided at compile time, and it is equal to 1024 in this case. The iso1 program takes a number of command-line options: the input file is specified with the -i option, the -noloc option specifies that the configuration files must be saved by a single MPI process, the -rmold specifies that only the most recent configuration must be kept and all previous ones must be deleted. The program will start the simulation from a randomly generated configuration. More details about the command-line options can be found in the main/README.iso1 file.
8. Interrupt the simulation. Assuming that no error is produced, the simulation code will end naturally when all the configurations requested in the input file are generated. If the simulation needs to be interrupted earlier, one can just execute the following commands in a bash shell.
The simulation code will stop gracefully right after the next configuration is saved.
9. Resume the simulation. Assuming that the last generated configuration was pedro01n42, edit the input file and set the nth variable in the [MD trajectories] section to 0 (see below for a description of the input file), and edit the runtest.sh script as follows: Once this is executed, the simulation will continue from where it was interrupted.

Constructing the input file for iso1
Most of the parameters needed to generate configurations are passed to the iso1 program by means of a human-readable input file, in this case pedro01.in in the test/input directory. For a full description of the various parameters, the reader is referred to the main/README.iso1 and doc/parms.pdf files (and references therein). A rough guide to the various sections that compose the input file is provided here, with no ambition of completeness.
For every file in the log and dat directories, a backup file identified by a tilde at the end of its name is created and updated every time a configuration is saved.
[ The program iso1 will print one entry in the log file every 5 MD trajectories, will measure and print Wilson flow observables every 10 MD trajectories, will save a configuration every 50 MD trajectories. The first 100 trajectories are considered of thermalization (no observables are measured), a total of 800 MD trajectories will be generated and 15 configurations will be saved. In this case periodic boundary conditions are chosen in time, and C * boundary conditions in all 3 spatial directions. The implementation of C * boundary conditions in openQ*D is described in section 2.1.

Ranlux [46] initialization.
If SF or open-SF boundary conditions are chosen in time, the number of parameters in this section increases, as one needs to specify the value of the fields on the SF boundaries. For a full description of these parameters, refer to doc/parms.pdf.

Gauge actions.
[ SU (3)  If different boundary conditions in time are chosen, the number of parameters in these sections increases, as one needs to specify the O(a)-improvement boundary coefficients. Also, if no C * boundary conditions are used, one can choose phase-periodic boundary conditions for fermions in space. Refer to doc/dirac.pdf, doc/parms.pdf for a detailed explanation of all these parameters.
7. Rational approximation. With C * boundary conditions, the Pfaffian of the evenodd preconditioned Dirac operatorD is needed, whose absolute value can be generated by a pseudofermion effective action of the type ψ † (D †D ) −1/4 ψ. The fractional power ofD †D is replaced by a rational approximation, which must be generated by means of the minmax program [47,48]. We sketch here how to use this program, see minmax/README for more details.

MPLIBPATH = / usr / local
The minmax program is compiled and executed with the following commands in a bash shell.
cd $ { HOME }/ openQxD / minmax make ./ minmax -p -1 -q 4 -ra 1.98 e -03 -rb 7.62 -goal 6e -05 A rational approximation for (D †D ) α is requested, with α = (−1)/(4) (-p and -q options), assuming that the eigenvalues of (D †D ) 1/2 are in the interval [1.98 × 10 −3 , 7.62] (-ra and -rb options), with a target relative precision of 6 × 10 −5 (-goal option). The spectral range of (D †D ) 1/2 must be guessed at first, but after some configurations have been generated it can be calculated with the program main/ms2. The minmax program creates a directory with a very long name, in this case p-1q4mu0.00000000e+00ra1.98000000e-03rb7.62000000e+00 which contains several files named n*.in. The integer in the file name corresponds to the order of the generated rational approximation. Only the highest order rational approximation, n10.in in this case, meets the requested precision. The full content of the n10.in must be pasted in the input file in a section of the following type, Notice that more than one rational approximation can be used in the same input file (e.g. one may want to use different rational approximations for the up, down and strange quarks). Each rational approximation is identified by the integer in the section title.

MD Hamiltonian and integrator.
[ The MD Hamiltonian is given by the canonical kinetic term of the SU(3) gauge field, the kinetic term of the U(1) gauge field, and a sum of terms which do not depend on the MD momenta and are referred to as actions. The kinetic term of the U(1) gauge field can be chosen to be of two types: the canonical one (facc=0), or the Fourieraccelerated one (facc=1). Refer to doc/fourier.pdf and section 2 for details on Fourier acceleration. The MD equations are solved by means of an approximate symplectic multilevel integrator, built in terms of standard elementary integrators.
For each level, one needs to specify how many times the elementary integrator needs to be applied and which forces need to be integrated. Refer to doc/parms.pdf and module/update/README.mdint for details on the integrator. The actions and forces are uniquely identified by an ID. Obviously there is a one-toone correspondence between actions and forces. Corresponding actions and forces must share the same ID. The gauge actions and forces must be included, i.e.
[ Notice that openQ*D allows for frequency splitting (not used in this example): the poles and zeroes of the rational approximations can be separated in different pseudofermion actions. This is convenient because one may want to integrate different poles and zeroes in different levels of the integrator, and also one may want to use different solvers for different poles. For details on the pseudofermion actions and forces, and on the frequency splitting, one should refer to doc/rhmc.pdf and section 2.
9. Solvers. Two multi-shift CG solvers are used in this example, with different residue for the actions and the forces.

Code performance on parallel machines
For future reference and comparison, benchmark measurements have been performed for the timing of the application of the double precision Wilson-Dirac operator and the SAP (Schwartz-Alternating-Procedure) preconditioner. The HPC cluster at CERN has been used, which features 72 nodes, each of them with two 8-core Intel R Xeon processors (E5-2630 v3, Haswell) running at about 2.4 GHz base frequency (3.6 GHz max.). Nodes are connected with Mellanox R Infiniband FDR (56 Gb/s). The timings are obtained with the time2 programs located in the subdirectories devel/dirac and devel/sap. All measured times have been normalised to the smallest partition (one node or 16 cores). The results of these scaling tests are shown in fig. 3. A QCD+QED setup with open boundary conditions in time and C * boundary conditions in one spatial direction has been used. The weak scaling test has been performed with a local lattice size of 8 × 16 × 8 × 8, giving an extended lattice with total volume V C * = 2N proc 8 4 . Because of the C * boundary conditions this corresponds to a physical lattice with volume V = N proc 8 4 , cf. section 2.1. While for the Dirac operator, parameters similar to the Quark flavours example (point 6) in section 3.2 have been used, the SAP preconditioner specifically employs a block size of 4 4 with five SAP cycles (ncy 5) and five iterations (nmr 5) of the even-odd preconditioned Minimal Residue (MinRes) block solver. The setup is similar for the strong scaling study but with a constant total volume of V C * = 2 · 64 × 32 3 and varying local lattice sizes. In case of the double precision Wilson-Dirac operator, a much larger lattice volume with V C * = 2 · 64 4 total lattice points was probed as well. As it can be seen in the left panel of fig. 3 the larger lattice is performing even better than the smaller one.
In summary, the overall scaling studied here is close to optimal and small deviations may partly result from remaining indigestions of the underlying network. Similar studies have to be done on other machines but the overall behaviour is expected to be similar to the original openQCD code.

Low-level tests
The openQ*D code has been tested by means of an extensive battery of check programs, which can be found in the subdirectories of devel. 4 These programs have been taken over from openQCD-1.6 and NSPT-1.4, and extended in order to test the specific feature of the openQ*D code. Roughly speaking, the check programs in each devel subdirectory test features of the corresponding module subdirectory. Many check programs test also interactions between different modules. These programs are meant to be used by developers only and contain very limited documentation. Providing a description of the check programs is outside of the scope of this paper, and a short description can be found in the INDEX files in each devel subdirectory. However, it is worth to point out a few facts. All check programs have been run with all possible combinations of boundary conditions in the space and temporal directions. Whenever possible, all check programs have been run in a pure QCD setup (i.e. only the SU(3) gauge field is allocated), a pure QED setup (i.e. only the U(1) gauge field is allocated), and a QCD+QED setup (i.e. both gauge fields are allocated). All check programs have been run with various geometric configurations, i.e. lattice size and processor grid. Besides a plethora of minor details, specific check programs have been written to test: • the implementation of C * boundary conditions for both gauge fields and for the Dirac operator; • general properties of the Dirac operator with generic electric charge (e.g. gauge convariance, translational covariance, γ 5 -hermiticity, comparison to analytic expression in case of zero gauge field); • the rational approximation of generic powers, and the associated reweighting factors; • the forces for the U(1) field, the QED action, the U(1) Wilson flow, the U(1) observables (e.g. clover field tensor, electromagnetic fluxes); • the MD with the U(1) field, with and without Fourier acceleration.

Conservation of the Hamiltonian with Fourier Acceleration
The use of Fourier Acceleration in QCD+QED simulations modifies the MD Hamiltonian and, consequently, the MD equations. In order to test the consistency between the two, one can look at the violation ∆H of Hamiltonian conservation as a function of the MD integration step-size ∆τ . The violation should vanish as a positive power of the integration step-size in the ∆τ → 0 limit. The power depends on the chosen integrator. When the total trajectory length is kept constant, the leap-frog integrator (LF) and 2nd or-     Figure 4 shows the violation ∆H as a function of ∆τ for all integrators, with and without Fourier Acceleration. A two parameter function ∆H = a ∆τ b has been fitted to the data points. In all cases the obtained exponent is reasonably close to the expected one. This test has been performed on a single thermalized configuration taken from the Q*D1 ensemble (table 1).
As expected there is a clear hierarchy among the three integrators. More interestingly, Fourier Acceleration has the effect of reducing significantly ∆H. While no definite conclusion can be drawn from a single-configuration experiment in this regard, the experience collected so far suggests that this is quite generally the case: when Fourier Acceleration is turned on, if one wants to keep the acceptance rate the same, larger values of ∆τ can be typically chosen. Obviously this does not mean that it is always convenient to use Fourier Acceleration. In order to understand whether this is the case, one should take into account the computational overhead and the variation in autocorrelations. Fourier acceleration is known to reduce significantly autocorrelations in the case of the free scalar theory, but also in the case of non-compact pure U(1) theory [11], which is a theory of free photons. Nevertheless, as soon as a full QCD+QED simulation is done, our experience suggests that autocorrelation times are quite insensitive to the use of Fourier acceleration for the U(1) fields. These findings need further investigations.

Performance of locally deflated solver in QCD+QED
The use of efficient solvers is a key factor in enabling simulations at quark masses close to the physical point. The openQ*D code inherits all the solvers of the openQCD-1.6 pack-  Table 1: Details of test runs employing C * boundary conditions in 3 spatial directions. Note that due to the C * boundary conditions, the global (simulated) lattice V C * is two times larger than the physical lattice because of the orbifold contruction. of ref. [50], and the lattice spacing was determined in ref. [51]. All runs have degenerate quarks with hopping parameter κ. Values for the neutral pseudoscalar mass m PS are given, as well as the flow time t 0 /a 2 from which we naively derive the approximate lattice spacing of Q*D1 using results of ref. [52].
age: Conjugate Gradient (CG), Multi-Shift Conjugate Gradient (MSCG), Generalized Conjugate Residual algorithm with Schwartz-Alternating-Procedure as preconditioning (SAP+GCR), and a deflated version of it (DFL+SAP+GCR). The deflated solver implements the idea of inexact deflation introduced in [22,53] and an improvement involving inaccurate projection in the deflation preconditioner proposed in [54].
As the Dirac operator is passed as an argument to these solvers, their implementation is blind to the coupling to the U(1) field and to C * boundary conditions. The efficiency of these solvers may be affected in principle by the coupling to the U(1) field, i.e. may depend on the electric charge of the Dirac operator. However this turns out not to be the case. The goal of this section is to describe two tests in support of this statement. These tests have been run on Altamira HPC at IFCA-CSIC, which consists of 158 computing nodes, each of them with two Intel R Xeon processors (E5-2670) at 2.6 GHz. Nodes are connected with Mellanox R Infiniband FDR (56 Gb/s).
An electroquenched (QCD+qQED) setup has been considered for both tests, with SU(3) configurations from the QCD1 ensemble (table 1) and pure U(1) configurations generated with α 0 = 0.05 and q el = 1/6. Two degenerate valence quarks Q and Q have been considered, with electric charge q and bare mass m 0 . The mass m PS of theQ γ 5 Q valence pseudoscalar neutral meson has been calculated as a function of q and m 0 and is shown in fig. 5. Notice that the critical bare mass depends very heavily on the electric charge, as expected. For this reason it makes sense to compare the solver performance for different electric charges keeping fixed the value of m PS (rather than the bare mass).
In the first test, the time needed to invert the even-odd preconditioned Dirac operator (with a representative QCD+qQED configuration) on 15 random sources has been measured, using the CG, SAP+GCR, and DFL+SAP+GCR solvers. The shortest time has been plotted in fig. 6 for electric charges q = 0, −1/3, 2/3 and a range of values of m PS . It is evident that the performance of all solvers is insensitive to the electric charge.
One important caveat needs to be pointed out for the DFL+SAP+GCR solver. Before applying this solver, one needs to generate the deflation subspace, which is constructed from approximate eigenvectors of the Dirac operator. The code allows the possibility to choose different parameters for the Dirac operator used in the solver and the one used to generate the deflation subspace. This is very useful in practice since having a slightly heavier bare mass or even a twisted mass for the generation of the deflation subspace generally speeds up the calculation without affecting the performance of the solver. On the other hand, it is crucial to generate the deflation subspace with the same electric charge of the Dirac operator that needs to be inverted. If this is not done, the DFL+SAP+GCR solver loses efficiency dramatically. For this reason, in contrast to openQCD-1.6, the openQ*D code can handle simultaneously several deflation subspaces. These deflation subspaces can be generated with different parameters and will all be updated during the MD evolution. The user can specify in the input file which deflation subspace should be used for each DFL+SAP+GCR solver independently. In practice, in a realistic QCD+QED simulation, one would need to generate only two deflation subspaces, one for up-type quarks and one for down-type quarks. It has been checked also that the time needed to generate the deflation subspace is insensitive to the electric charge as long as m PS is kept fixed. In the second test, a single value of m PS 354 MeV has been chosen, and the time needed to invert (D †D + µ 2 ) has been measured for various values of the twisted mass µ, using the CG and DFL+SAP+GCR solvers. One representative QCD+qQED configuration and 48 random sources have been used. The shortest time has been plotted in fig. 7 for electric charges q = 0, −1/3, 2/3 and a range of values of µ. The inversion of (D †D + µ 2 ) is relevant to calulate the rational approximation of non-integer powers of D †D (see section 2). Also in this case, the performance of the two solvers is seen to be CG (q = 0) CG (q = −1/3) Figure 6: Comparison of performance of various solvers and various electric charges as a function of the mass m PS of the valence neutral pion. In all cases, the inverse of the even-odd preconditioned Dirac operator has been calculated on random sources. One representative QCD+qQED configuration has been used (SU(3) configuration from the QCD1 ensemble, table 1, and pure U(1) configuration generated with α 0 = 0.05 and q el = 1/6). The same residue of 10 −10 has been chosen for the three solvers. The solver performance is insensitive to the electric charge.
insensitive to the electric charge as long as m PS is kept fixed.

Key observables for HMC simulations of QCD+QED
Beside the electroquenched tests in the previous section, a new set of tests is done using dynamical QCD+QED simulations with Wilson fermions and C * boundary conditions. The dynamical degrees of freedom of the U(1) gauge field are included in the simulation labeled Q*D1 in table 1. Q*D1 takes over the parameters from the H200 ensemble of the N f = 2+1 CLS [56] effort, except that the lattice extent is halved in each of the space-time directions. As the dynamical U(1) degrees of freedom contribute to the renormalization of the bare parameters, the estimate for the lattice spacing and pion mass cannot be taken over from the CLS ensembles, 5 but rather need to be estimated independently. Such an endeavour is beyond the scope of this paper. However, an estimate for t 0 /a 2 is given in table 1 for future reference. The reference flow time t 0 is implicitly given by [t 2 0 E(t 0 ) ] = 0.3 using the Wilson flow and clover discretisation of the SU(3) field strength tensor in the definition of the energy density E(t) [57]. A rough estimate of a is given after naively matching t 0 /a 2 to the data provided in table III of ref. [52].
Although openQ*D allows for twisted-mass reweighting, that option is not required for Q*D1 (µ = 0.0). All three bare sea quark masses, am 0,i = 1/(2κ i ) − 4, are taken to be degenerate. As demonstrated in the previous section and shown in fig. 5, this necessarily leads to a large difference in the neutral pseudoscalar masses due to the differences in quark charges. One thus ends up with a degenerate pair of down-type quarks (q = −1/3), and a single but significantly heavier up-type quark (q = 2/3). Hence, the simulations are essentially probing a somewhat unphysical version of the N f = 2 + 1 theory, but are sufficient to probe standard observables and performance of the code. In fig. 8 a summary of selected observables is given for simulation Q*D1. The run was stable and did not show any particular issue during the course of the simulation. Most of the observables presented in the following include the thermalisation part. Starting from a random configuration, the HMC energy violations, measured every trajectory (τ = 0.  after rapid changes during the thermalisation phase of the run. The smallest eigenvalues of |γ 5Du | and |γ 5Dd/s | follow, confirming that the lower end of the spectral ranges of the rational approximations have been chosen correctly. No exceptionally small values are present, which is not surprising considering the heavy pseudoscalar mass simulated here. The Q*D1 run has been produced with a rational approximation with relative precision δ = O(10 −11 ). A second run has been performed with the same parameters as Q*D1 except for the rational approximation, which has been chosen with relative precision δ = O(10 −9 ). The logarithms of the reweighing factors for both runs are shown in the last two panes of fig. 8. As expected, the reweighting factor for the run with a better rational appriximation is closer to 1 (and its logarithm is closer to 0).

Summary and outlook
We presented openQ*D [1], the first open source package which allows to perform full lattice simulations of QCD+QED, QCD or QED. The code implements the proposal of ref. [18] and allows to choose C * boundary conditions along the spatial directions but also periodic boundary conditions can be simulated efficiently. Moreover, the chosen theory can be simulated by choosing either periodic, Schrödinger Functional or open boundary conditions along the time direction.
The new code is based on the openQCD [2] package from which it inherits the highly optimized implementation of the Dirac operator, of the solvers, of the HMC and of the RHMC algorithms. The openQ*D package extends the algorithmic functionalities of the openQCD code by giving the possibility of using multiple deflation subspaces in a single simulation, of implementing rational approximations of generic powers of the Dirac operator (with and without twisted-mass preconditioning) and by implementing Fourier Acceleration for the evolution of the U(1) field.
We presented the main functionalities of the code and discussed the theoretical motivations behind the algorithmic choices and their specific implementations. We also presented a guide to instruct the user to run a full QCD+QED simulation with openQ*D and discussed the results of some tests. These include low-level tests aiming at assessing the correctness of the implementation of the different algorithms but also some benchmarks to measure the performance of the code.
Given the good performance and high scalability on modern supercomputing cluster architectures, openQ*D can profitably be used to generate QCD+QED gauge configurations with C * boundary conditions (but not only) in a realistic setup with the aim of computing QED radiative corrections to phenomenologically relevant observables.
Acknowledgements. The simulations were performed on the following HPC systems: Altamira, provided by IFCA at the University of Cantabria; FinisTerrae II, provided by CESGA (Galicia Supercomputing Centre); the Lonsdale cluster maintained by the Trinity Centre for High Performance Computing; and the Lattice-HPC cluster at CERN. FinisTerrae II was funded by the Xunta de Galicia and the Spanish MINECO under the 2007-2013 Spanish ERDF. Lonsdale was funded through grants from the Science Foundation Ireland. We thankfully acknowledge the computer resources offered and the technical support provided by the staff of these computing centers. We thank the Theoretical Physics Department at CERN for hospitality during the workshop Advances in Lattice Gauge Theory 2019, allowing us to jointly finalise the present work.

A.1 Rational approximation
It is convenient to introduce the hermitian operatorQ = γ 5D , in terms of whichD †D = Q 2 . Assume that the spectrum of |Q| is contained in the interval [r a , r b ], and choose an integer n. A rational function of order [n, n] in q 2 has the form (A.1) Without loss of generality one can assume ρ(q 2 ) is chosen to be the optimal rational approximation of order [n, n] of the function (q 2 +μ 2 ) −α in the domain q ∈ [r a , r b ], i.e. the rational function of the form (A.1) which minimizes the uniform relative error As explained in sec. 3.2.2, the optimal rational approximation can be calculated with the minmax code which implements the minmax approximation algorithm in multiple precision. If ρ(q 2 ) is the desired optimal rational approximation, the operator R which appears in eq. (2.6) is defined simply as (A.4) Eq. (A.3) implies the following norm bound

A.2 Frequency splitting and pseudofermion action
openQ*D inherits from openQCD the frequency splitting of the rational approximation: the factors of the rational approximation can be split in different pseudofermion actions; the corresponding forces can be included in different levels of the MD integrator, providing a useful handle to optimize the algorithm. This procedure is similar to the Hasenbusch decomposition for the HMC algorithm [58]. The rational approximation constructed in section A.1 is broken up in factors of the form (A.6) For example, if n = 12 a possible factorization is R = AP 1,5 P 6,9 P 10,12 . (A.7) The contribution of R to the quark determinant is det R −1 = constant × det P −1 1,5 det P −1 6,9 det P −1 10,12 . (A.8) Each P −1 k,l determinant is simulated as usual by adding a pseudofermion action of the form S pf,k,l = (φ k,l e , P k,l φ k,l e ) , (A.9) where the fields φ k,l e are independent pseudofermions that live on the even sites of the lattice. By using a partial fraction decomposition (A.10) the pseudofermion action in eq. (A.9) is cast into a sum of terms of the type

A.3 Reweighting factors
LetR and R be the optimal rational approximations of order [n, n] for (D †D ) −α and (D †D +μ 2 ) −α respectively. It is assumed that the relative errors of the two rational approximations are not greater than δ in the common spectral range [r a , r b ]. The reweighting factor W defined in eq. (2.8) is decomposed in two factors which are calculated separately, i.e.

A.3.1 Reweighting factor W rat
In the calculation of the reweighting factor W rat in eq. (A.14), it is assumed that the exponent α is a positive rational number of the form 16) where u and v are natural numbers. The reweighting factor can be represented as where the operator Z is defined as The determinant in eq. (A.17) is estimated stochastically where the fields η j e are N independent normally-distributed pseudofermions that live on the even sites of the lattice. From the norm bound in eq. (A.5) forμ = 0, and the positivity ofR (which is guaranteed if the relative error δ is small enough), it follows that which yields the norm bound Therefore the Taylor series converges rapidly in operator norm. The exponent in eq. (A.19) can be estimated from the first few terms of It is possible to estimate the size of these terms by noting that η j e 2 is very nearly equal to 12 times the number N e of even lattice points. Taking the bound (A.21) into account, the following estimate is obtained The statistical fluctuations of the exponents in eq. (A.19) derive from those of the gauge field and those of the random sources η j e . For a given gauge field, the variance of the exponent is equal to These fluctuations are guaranteed to be small if, for instance, 12N e δ 2 ≤ 10 −4 . One can then just as well set N = 1 in eq. (A.19), i.e. a sufficiently accurate stochastic estimate of W rat is obtained in this case with a single random source. When the stronger constraint 12N e δ ≤ 10 −2 is satisfied, the reweighting factor W rat deviates from 1 by at most 1%. Larger approximation errors can however be tolerated in practice as long as the fluctuations of W rat remain small.

A.3.2 Reweighting factor W rtm
Let us choose a rational approximation R of order [n, n] for (D †D +μ 2 ) −α of the form (A.26) where the fields η j e are N independent normally-distributed pseudofermions that live on the even sites of the lattice. It is useful to consider the partial fraction decompositioñ . (A.36) Typically σ j andσ j are found to have opposite signs. Also, for small values of j, |σ j | and |σ j | are of the same order of magnitude, therefore it is convenient for numerical stability to use the following representatioñ The basis functions e µ (p 0 , x 0 ) (for fixed µ) are orthogonal with respect to a weighted scalar product The set P is given by all spatial momenta p = (p 1 , p 2 , p 3 ) of the form where c k = 0 if k is a periodic direction and c k = 1 if k is a C * direction. The sets E µ and the eigenfunctions e µ (p 0 , x 0 ) depend on the boundary conditions in time. In the following k = 1, 2, 3.