Quark Mass Models and Reinforcement Learning

In this paper, we apply reinforcement learning to the problem of constructing models in particle physics. As an example environment, we use the space of Froggatt-Nielsen type models for quark masses. Using a basic policy-based algorithm we show that neural networks can be successfully trained to construct Froggatt-Nielsen models which are consistent with the observed quark masses and mixing. The trained policy networks lead from random to phenomenologically acceptable models for over 90% of episodes and after an average episode length of about 20 steps. We also show that the networks are capable of finding models proposed in the literature when starting at nearby configurations.


Introduction
Machine learning in particle and string theory has developed into a fruitful and growing area of interdisciplinary research, triggered by the work in refs. [1,2]. (For a review and a comprehensive list of references see ref. [3].) Much of the activity to date has been in the context of supervised learning (see, for example, refs. [4-11]), where data sets which arise in physics or related areas of mathematics have been used to train neural networks. However, there has also been some interesting work using reinforcement learning (RL), particularly in relation to string model building [12,13].
In the present paper, we are interested in reinforcement learning with environments which consist of classes of particle physics models. More precisely, we would like to address the following question. Can techniques of reinforcement learning be used to train a neural network to construct particle physics models with certain prescribed properties? At its most ambitious, such a network might be used to explore large classes of quantum field theories in view of their consistency with experimental data, thereby facilitating the search for physical theories beyond the standard model of particle physics. However, such a wide-ranging approach would require considerable conceptual work as well as computing resources and does not seem feasible for a first exploration. (For a different approach to quantum field theory via methods of machine learning see ref. [14].) For this reason, we will focus on a much more limited arena of particle physics models which can be relatively easily described and where extracting relevant physics properties is straightforward. Specifically, we will consider Froggatt-Nielsen (FN) models of fermion masses [15, 21-25], focusing on the quark sector. (For related early work on mass model building with horizontal U(1) symmetries see also refs. [16-20].) The standard model of particle physics contains the up and down quark Yukawa couplings Y^u_ij and Y^d_ij, where i, j, ... = 1, 2, 3 label the three families. Within the standard model, these couplings are mere parameters inserted "by hand". Upon diagonalisation, they determine the masses (m_u,i) = (m_u, m_c, m_t) and (m_d,i) = (m_d, m_s, m_b) of the up and down type quarks as well as the CKM mixing matrix V_CKM.
FN models attempt to explain the values of Y^u_ij and Y^d_ij by introducing U_a(1) symmetries, where a = 1, ..., r, and singlet fields φ_α, where α = 1, ..., ν, in addition to the structure present in the standard model. The idea is that the Yukawa couplings are either zero, if forbidden by the U_a(1) symmetries, or given in terms of the vacuum expectation values (VEVs) ⟨φ⟩ of the scalar fields, such that Y^u_ij ~ ⟨φ⟩^n_ij and Y^d_ij ~ ⟨φ⟩^m_ij. Here, n_ij and m_ij are (non-negative) integers whose values are determined by U_a(1) invariance of the associated operator. A FN model is easily described by its charge matrix (Q_aI) = (q_a(Q_i), q_a(u_i), q_a(d_i), q_a(H), q_a(φ)), where q_a denotes the charge with respect to U_a(1), Q_i are the left-handed quark doublets, u_i and d_i are the right-handed up and down quarks and H is the Higgs doublet. (As we will discuss, the VEVs ⟨φ_α⟩, which may also be considered as part of the definition of a FN model, will be fixed to certain optimal values for a given charge assignment.) We can, therefore, think of the space of FN models as the space of charge matrices Q. For practical reasons, we will impose limits, q_min ≤ Q_aI ≤ q_max, on the entries of this matrix, so that the space of models becomes finite. However, note that, even for one U(1) symmetry (r = 1), one singlet (ν = 1) and a modest charge range -q_min = q_max = 9, we have of the order of 10^13 models. For two U(1) symmetries, two singlets and the same charge range this number rises to roughly 10^26. This is quite sizeable, even though it is small compared to typical model numbers which arise in string theory. At any rate, given these numbers, systematic scanning of all or a significant fraction of the state space is clearly not practical or even feasible. Exploring such large environments requires different methods and this is where RL comes into play.
The idea of RL is to train a neural network with data obtained by exploring an environment, subject to a goal defined by a reward function. (See, for example, ref. [26] for an introduction.) It has been shown that RL can lead to impressive performance, even for very large environments, where systematic scanning is impossible [27]. It is, therefore, natural to ask whether RL can help explore the large model environments realised by quantum field theory and string theory. In the present paper, we will use RL to explore the space of FN models for the quark sector. More specifically, our environment consists of the set {Q} of all FN charge matrices for a given number, r, of U(1) symmetries, a given number, ν, of singlets φ_α and charges constrained by q_min ≤ Q_aI ≤ q_max. An action within this environment simply amounts to increasing or decreasing one of the charges Q_aI by one, and a reward is computed based on how well the model reproduces the experimental quark masses and mixings. A terminal state is one that reproduces the experimental masses and mixing to a given degree of accuracy. We use a simple policy-based RL algorithm, with a single policy network whose input is, essentially, the charge matrix Q and whose output is an action. The hope is that a successfully trained policy network of this kind will produce episodes which start from arbitrary (and typically physically unacceptable) FN models and lead efficiently to phenomenologically viable FN models.
The plan of the paper is as follows. In the next section, we briefly review the theoretical background of this work, namely RL and FN model building, mainly to set the scene and fix notation. In section 3 we describe our RL set-up and section 4 presents the results we obtained for the cases of one singlet and one U(1) symmetry and two singlets and two U(1) symmetries. The appendices contain a number of interesting FN models found by the neural network.

Reinforcement learning
We start with a quick overview of RL, focusing on the aspects needed for this paper. For a comprehensive review see, for example, refs. [26] and [3].
The main components of an RL system are the environment, the agent(s) and the neural network(s). The latter are set up to learn certain properties of the environment, based on data delivered as the agent explores the environment. The mathematical underpinning of RL is provided by a Markov decision process (MDP), defined as a tuple (S, A, P, γ, R). Here S is a set which contains the environment's states, A is a set of maps α : S → S which represent the actions, P provides a probability P(S' = s' | S = s, A = α) for a transition from state s to state s' via the action α, γ ∈ [0, 1] is called the discount factor and R : S × A → ℝ is the reward function. Among the states in S a subset of so-called terminal states is singled out which may, for example, consist of states with certain desirable properties. Within this set-up we can consider a sequence of states s_t and actions α_t, producing rewards r_t, where t = 0, 1, 2, ..., which is referred to as an episode. In principle, an episode can have infinite length, although in practice a finite maximal episode length, N_ep, is imposed. If an episode arrives at a terminal state before it reaches its maximal number of steps it is stopped. The return, G_t, of a state s_t in such an episode is defined as the discounted sum of subsequent rewards,

G_t = Σ_{k≥0} γ^k r_{t+k} .

The discount factor γ can be dialled to small values in order to favour short-term rewards dominating the return, or to values close to one so that longer-term rewards affect the return as well. The choice of action in a MDP is guided by a policy π, which provides probabilities π(α|s) = P(A_t = α | S_t = s) for applying a certain action α to a state s. Relative to such a policy, two important value functions, namely the state value function V_π and the state-action value function Q_π, can be defined as expectation values of the return.
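The return defined above can be computed with a single backward pass over an episode's rewards. A minimal sketch in Python for concreteness (the paper's own implementation is in MATHEMATICA; this is an illustration only):

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = sum_k gamma^k * r_{t+k} for every step t of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # walk the episode backwards, accumulating the discounted sum
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# with gamma = 0.5 and rewards [1, 1, 1] the returns are [1.75, 1.5, 1.0]
print(discounted_returns([1.0, 1.0, 1.0], 0.5))
```

For γ close to one (such as the γ = 0.98 used later), rewards many steps ahead still contribute appreciably to G_t, which is what allows the terminal bonus to guide early moves of an episode.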
The purpose of an RL system is to maximise a value function (state or state-action) over the set of possible policies. In practice, this can be realised in a number of ways which differ by which of the functions π, V_π and Q_π are represented by neural networks and how precisely these neural networks are trained via exploration of the environment. Common to all algorithms is an iterative approach, where a batch of data, in the form of triplets (s_t, a_t, G_t), is collected from episodes which are guided by the neural network(s) in their present state. This data is then used to update the neural network(s), followed by a further round of exploration and so on. For our purposes, we will consider what is probably the simplest approach, a basic policy-based algorithm referred to as REINFORCE. This set-up contains a single neural network π_θ with weights θ which represents the policy π. Its inputs are states and its outputs are probabilities for actions. Exploration of the environment is guided by the policy, meaning the steps in an episode (2.3) are selected based on π_θ. Data is collected by performing such episodes successively, so we can say that the system contains only one agent. According to the policy-gradient theorem, the neural network π_θ should be trained on the loss function

L(θ) = Q_π(s, a) ln(π_θ(s, a)) ,    (2.4)

where Q_π(s, a) can, in practice, be replaced by the return G of the state s. Schematically, the algorithm then proceeds as follows.
(1) Initialise the weights θ of the policy network π_θ, so that the initial policy is essentially random.

(2) Collect a batch of data triplets (s_t, a_t, G_t) from as many episodes (2.3) as required.
New episodes start at random states s_0.
(3) Use this batch to update the weights θ of the policy network π_θ, based on the loss (2.4).
(4) Repeat from (2) until the loss is sufficiently small so that the policy has converged.
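The steps above can be sketched end-to-end on a toy problem. The following Python illustration (not the paper's MATHEMATICA code) runs tabular REINFORCE on a small chain environment: the state is an integer 0..5, the two actions move left or right, state 5 is terminal with a large bonus, and each step costs a small penalty. The chain, rewards and learning rate are all assumptions chosen for the demonstration:

```python
import math
import random

random.seed(0)

N_STATES, GAMMA, LR = 6, 0.98, 0.05
# tabular logits: one pair of action preferences (left, right) per state
theta = [[0.0, 0.0] for _ in range(N_STATES)]

def policy(s):
    """Softmax over the two actions, as in a policy network's output layer."""
    m = max(theta[s])
    exps = [math.exp(t - m) for t in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def run_episode(max_len=20):
    """Walk the chain from state 0; state 5 is terminal and pays a bonus."""
    s, traj = 0, []
    for _ in range(max_len):
        p = policy(s)
        a = 0 if random.random() < p[0] else 1
        s_next = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
        r = 100.0 if s_next == N_STATES - 1 else -1.0
        traj.append((s, a, r))
        s = s_next
        if s == N_STATES - 1:
            break
    return traj

def train(rounds=1000):
    for _ in range(rounds):
        traj = run_episode()                       # step (2): collect data
        G, returns = 0.0, []
        for (_, _, r) in reversed(traj):           # discounted returns G_t
            G = r + GAMMA * G
            returns.append(G)
        returns.reverse()
        for (s, a, _), G_t in zip(traj, returns):  # step (3): gradient ascent
            p = policy(s)
            for b in range(2):
                # d/dtheta[s][b] log pi(a|s) = (1 if b==a else 0) - pi(b|s)
                theta[s][b] += LR * G_t * ((1.0 if b == a else 0.0) - p[b])

train()
print(policy(0))  # action probabilities at the start state after training
```

The update implements the loss (2.4) directly: actions taken in episodes with large returns have their probabilities increased, those in low-return episodes decreased.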

Froggatt-Nielsen models
Before we discuss Froggatt-Nielsen models, we quickly review fermion masses in the standard model of particle physics, in order to set up notation and present the experimental data.

Table 1. Experimentally measured masses in GeV and mixing angles of quarks from ref. [28].
The standard model contains Yukawa interactions, which are responsible for generating the masses and mixing of quarks and leptons. In this paper, we focus on the quark sector for simplicity, although we expect that our work can be generalised to include the lepton sector. The quark Yukawa couplings in the standard model take the form

-L_Yuk = Y^u_ij Q̄_i H^c u_j + Y^d_ij Q̄_i H d_j + h.c. ,    (2.5)

where Q_i are the left-handed quarks, u_i, d_i are the right-handed up and down type quarks and H is the Higgs doublet. We use indices i, j, ... = 1, 2, 3 to label the three families. Within the standard model, the Yukawa matrices Y^u and Y^d are not subject to any theoretical constraints: their (generally complex) values are inserted "by hand" in order to fit the experimental results for masses and mixing.
Once the charge-neutral component H^0 in the Higgs doublet develops a VEV, v = ⟨H^0⟩, the above Yukawa terms lead to Dirac mass terms with associated mass matrices

M_u = v Y^u ,   M_d = v Y^d .    (2.6)

These matrices need to be diagonalised,

M_u = U_u diag(m_u, m_c, m_t) W_u† ,   M_d = U_d diag(m_d, m_s, m_b) W_d† ,    (2.7)

with unitary matrices U_u,d and W_u,d, and the CKM mixing matrix is given by

V_CKM = U_u† U_d .    (2.8)

The CKM matrix is unitary and can, hence, be written in terms of three angles θ_12, θ_13, θ_23 and a phase δ, in the standard parametrisation (2.9), where the abbreviations s_ij = sin(θ_ij) and c_ij = cos(θ_ij) are used. The experimentally measured values for these quantities are given in table 1.
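For concreteness, the standard (PDG) parametrisation of the CKM matrix in terms of the three angles and the phase can be written down and its unitarity checked numerically. A short Python sketch (the numerical angles below are illustrative round figures, not the fitted values of table 1):

```python
import cmath
import math

def ckm(t12, t13, t23, delta):
    """CKM matrix in the standard (PDG) parametrisation."""
    s12, s13, s23 = math.sin(t12), math.sin(t13), math.sin(t23)
    c12, c13, c23 = math.cos(t12), math.cos(t13), math.cos(t23)
    e = cmath.exp(1j * delta)
    return [
        [c12 * c13, s12 * c13, s13 / e],
        [-s12 * c23 - c12 * s23 * s13 * e, c12 * c23 - s12 * s23 * s13 * e, s23 * c13],
        [s12 * s23 - c12 * c23 * s13 * e, -c12 * s23 - s12 * c23 * s13 * e, c23 * c13],
    ]

def is_unitary(V, tol=1e-12):
    """Check V V^dagger = 1, which holds for any choice of angles and phase."""
    for i in range(3):
        for j in range(3):
            entry = sum(V[i][k] * V[j][k].conjugate() for k in range(3))
            if abs(entry - (1.0 if i == j else 0.0)) > tol:
                return False
    return True

# illustrative round-figure angles in radians, not a fit
V = ckm(0.227, 0.0037, 0.042, 1.14)
print(is_unitary(V), round(abs(V[0][1]), 3))  # True 0.225
```

Note that |V_us| ≈ s12, so the (12) mixing angle directly sets the Cabibbo-sized entries that a FN model must reproduce.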

JHEP08(2021)161
In the context of the standard model, the Yukawa matrices Y^u and Y^d in eq. (2.5) have to be chosen to fit these experimental values for masses and mixing, but this still leaves considerable freedom: only 10 real constraints are imposed on the 36 real parameters which determine Y^u and Y^d. Froggatt-Nielsen (FN) models provide a framework for adding more structure to the Yukawa sector of the standard model, in an attempt to remove some of this ambiguity and provide a theoretical explanation for the observed masses and mixing. Two main ingredients are added to the picture: a number of global U(1) symmetries U_a(1), where a = 1, ..., r, and a number of complex scalar fields φ_α, where α = 1, ..., ν, which are singlets under the standard model gauge group. The standard model fields as well as the scalar singlets are assigned U_a(1) charges which we denote by q_a(Q_i), q_a(u_i), q_a(d_i), q_a(H) and q_a(φ_α). In fact, to simplify matters, we assume that we have the same number of U(1) symmetries and singlet fields, ν = r, and that the a-th singlet φ_a is only charged under U_a(1). The resulting singlet charges are then denoted by q_a(φ).
Given this set-up, the standard model Yukawa couplings (2.5) are no longer in general consistent with the U_a(1) symmetries and should be replaced by

-L_Yuk = a_ij Q̄_i H^c u_j Π_a φ_a^{n_a,ij} + b_ij Q̄_i H d_j Π_a φ_a^{m_a,ij} + h.c. ,    (2.10)

where n_a,ij and m_a,ij are non-negative integers. For a term (ij) in the up-quark sector to be invariant under U_a(1) we require that the total U_a(1) charge of the associated operator vanishes, which fixes

n_a,ij q_a(φ) = q_a(Q_i) + q_a(H) - q_a(u_j) .    (2.11)

Hence, the term (ij) in the u-quark sector is allowed if the n_a,ij given by eq. (2.11) are non-negative integers for all a = 1, ..., r. In this case, the coefficient a_ij is of order one, otherwise it is set to zero. An analogous rule applies to the terms for the down-type quarks.
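The rule just stated is mechanical and easy to automate. The following Python sketch computes the up-sector powers n_ij for a single U(1); the sign convention for the operator's total charge is an assumption of this sketch (conventions in the literature differ), and the example charges are illustrative, not taken from the text:

```python
from fractions import Fraction

def up_sector_powers(qQ, qu, qH, qphi):
    """Singlet powers n_ij in the up-sector Yukawa texture for one U(1).

    Convention assumed here for illustration: invariance requires
    n_ij * q(phi) = q(Q_i) + q(H) - q(u_j). The (ij) entry is allowed only if
    n_ij is a non-negative integer; otherwise Y^u_ij = 0 (marked None below).
    """
    n = [[None] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            x = Fraction(qQ[i] + qH - qu[j], qphi)
            if x.denominator == 1 and x >= 0:
                n[i][j] = int(x)
    return n

# illustrative charges with q(H) chosen so that the top entry needs no singlet
powers = up_sector_powers([-3, -2, 0], [4, 2, 0], 0, -1)
print(powers)  # [[7, 5, 3], [6, 4, 2], [4, 2, 0]]
```

With a singlet VEV v ~ 0.2, the texture Y^u_ij ~ v^{n_ij} produced by these powers spans roughly five orders of magnitude, which is the mechanism behind the quark mass hierarchy; note n_33 = 0, so the top Yukawa coupling is unsuppressed.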
Once the scalars φ_a develop VEVs, v_a = ⟨φ_a⟩, Yukawa couplings

Y^u_ij = a_ij Π_a v_a^{n_a,ij} ,   Y^d_ij = b_ij Π_a v_a^{m_a,ij} ,    (2.12)

are generated. The main model building idea in this setting is that moderately small singlet VEVs v_a can generate the required large hierarchies in masses, in a way that is controlled by the integers n_a,ij and m_a,ij and, hence, ultimately, by the choices of U_a(1) charges. At this stage the environment of FN models consists of the U_a(1) charges for all fields, the singlet VEVs v_a and the coefficients a_ij, b_ij. In principle, the singlet VEVs are meant to be fixed by a scalar potential, but implementing this in detail adds another layer of model building. Instead, for a given choice of charges and coefficients a_ij, b_ij, we will fix the VEVs v_a such that the model provides an optimal fit to the experimental masses and mixing. Note this does not imply that the VEVs are inserted "by hand". Rather, for each state, that is, for each set of charges, the system determines the best choices for these VEVs in view of matching the data. This means the RL system returns both the charges as well as the VEVs of a model. The non-zero coefficients a_ij, b_ij might be considered as part of the environment definition but, to keep things simple, we will fix those to specific numerical values of order one. While, in general, a_ij and b_ij can be complex, we simplify this scenario by only allowing them to take real values. Consequently, we will not attempt to fit the CP violating phase δ in the CKM matrix. As a further simplification, we require that the top Yukawa term Q̄_3 H^c u_3 is present without any singlet insertions, a condition which seems reasonable given the size of the top Yukawa coupling. This requirement can be used to fix the U_a(1) charges of the Higgs multiplet as

q_a(H) = q_a(u_3) - q_a(Q_3) .    (2.13)

Altogether, this means a FN model within our set-up is specified by the charge choices

Q = (q_a(Q_i), q_a(u_i), q_a(d_i), q_a(φ)) ,    (2.14)

which we have assembled into the r × 10 integer charge matrix Q.
In practice, the charges in Q will be restricted to a certain range,

q_min ≤ Q_aI ≤ q_max ,    (2.15)

with q_min and q_max to be specified later. While this leads to a finite space of charge matrices and associated FN models, the numbers can be considerable. For example, for -q_min = q_max = 9 we have ~10^13 models in the case of a single U(1) symmetry and ~10^26 models for the case of two U(1) symmetries. The environment (2.14) of FN models has a number of permutation degeneracies, since the assignment of charges to families and the order of the U_a(1) symmetries does not carry physical meaning, although part of this symmetry is broken by designating Y^u_33 the top Yukawa coupling. This means there is a permutation degeneracy, isomorphic to a product of symmetric groups (2.16), in the environment (2.14). For the purpose of RL we will not attempt to remove this redundancy, as this would complicate the constraints on the charges in Q. From the viewpoint of particle physics the task is now to investigate the model landscape defined by eq. (2.14) and extract the phenomenologically promising cases. Considerable effort has been invested into this since the original proposal of Froggatt and Nielsen [15]. It is precisely this task we wish to carry out using reinforcement learning.
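The quoted model counts follow from simple combinatorics: each of the 10r entries of Q independently takes one of q_max - q_min + 1 values. A quick check:

```python
def n_models(q_min, q_max, r):
    """Number of r x 10 integer charge matrices with entries in [q_min, q_max]."""
    return (q_max - q_min + 1) ** (10 * r)

print(f"{n_models(-9, 9, 1):.1e}")  # 6.1e+12, i.e. ~10^13 (one U(1), range 9)
print(f"{n_models(-9, 9, 2):.1e}")  # 3.8e+25, i.e. ~10^26 (two U(1)s, range 9)
print(f"{n_models(-5, 5, 2):.1e}")  # 6.7e+20, i.e. ~10^21 (two U(1)s, range 5)
```

The last line corresponds to the reduced charge range used for the two-U(1) runs in section 4.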

Mass models and reinforcement learning
We now explain how we propose to map the problem of FN model building onto the structure of reinforcement learning. We begin by describing the set-up of the RL environment.

The environment
We need to identify how the various ingredients of a MDP are realised in our context. We take the set S of states to consist of all FN models for a fixed number, r, of U(1) symmetries and the same number of singlet fields. These models are represented by the r × 10 integer charge matrices Q in eq. (2.14), with entries restricted as in eq. (2.15). The set A of actions α consists of the basic operations

Q_aI → Q_aI ± 1 ,    (3.1)

that is, increasing or decreasing a single charge Q_aI by one while keeping all other charges unchanged. These are deterministic actions, so we do not need to introduce transition probabilities P. The number of different actions is 2 × r × 10 = 20r. For the discount factor γ we choose the value γ = 0.98.
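The action space is small and explicit. A Python sketch enumerating the 20r actions and applying one to a state (the boundary convention, discarding moves that would leave the allowed charge range, is an assumption of this sketch):

```python
import copy

def all_actions(r):
    """The 2 x r x 10 = 20r actions: shift one charge Q_aI up or down by one."""
    return [(a, I, s) for a in range(r) for I in range(10) for s in (+1, -1)]

def apply_action(Q, action, q_min=-9, q_max=9):
    """Deterministically apply an action to a charge matrix Q (list of rows).
    A shift that would leave [q_min, q_max] leaves the state unchanged
    (one possible boundary convention, assumed here for illustration)."""
    a, I, s = action
    Q_new = copy.deepcopy(Q)
    shifted = Q_new[a][I] + s
    if q_min <= shifted <= q_max:
        Q_new[a][I] = shifted
    return Q_new

Q = [[0] * 10]                             # trivial starting state for one U(1)
print(len(all_actions(1)))                 # 20
print(apply_action(Q, (0, 3, +1))[0][3])   # 1
```

Since each action changes a single charge by one unit, any two states are connected by a short sequence of actions, which is what makes episode-based exploration of this environment natural.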
Defining the reward function R requires a bit more effort. We start by defining the intrinsic value of a state Q as

V(Q) = - min_{v_a ∈ I} Σ_µ | log_10( |µ_{Q,v_a}| / |µ_exp| ) | .    (3.2)

Here, µ runs over the six quark masses as well as the entries of the CKM matrix, µ_{Q,v_a} is the value for one of these quantities predicted by the model with charge matrix Q and scalar field VEVs v_a, computed from eqs. (2.12), (2.6), (2.7), (2.8) (using fixed random values of the order-one coefficients a_ij, b_ij), and µ_exp is its experimental value as given in table 1 and eq. (2.9). The minimisation is carried out over the scalar field VEVs v_a in a certain range I = [v_min, v_max], with typical values v_min = 0.01 and v_max = 0.3. From this definition, the intrinsic value of a state Q is simply the (negative) total order of magnitude by which the predicted masses and mixings deviate from the experimental ones, for optimal choices of the scalar field VEVs.
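The structure of this intrinsic value, an optimisation over VEVs of a summed order-of-magnitude deviation, can be sketched generically. In the Python illustration below, `predict` stands in for the full Yukawa computation and is an assumption of the sketch; the real quantities would be the masses and CKM entries computed from the charge matrix:

```python
import math

def intrinsic_value(predict, experimental, vev_grid):
    """V(Q) = minus the minimal summed order-of-magnitude deviation over VEVs.

    `predict(v)` should return the model's observables at singlet VEV v; it is
    a stand-in for the full mass and mixing computation (an assumption here).
    """
    def total_deviation(v):
        return sum(abs(math.log10(abs(p) / abs(e)))
                   for p, e in zip(predict(v), experimental))
    return -min(total_deviation(v) for v in vev_grid)

# toy check: one "observable" v**2 against an "experimental" value of 0.01;
# the grid point v = 0.1 matches exactly, so the intrinsic value is zero
toy = intrinsic_value(lambda v: [v ** 2], [0.01], [0.05, 0.1, 0.2])
print(abs(toy))  # 0.0
```

A model is then judged by the total number of decades its predictions miss by, rather than by a chi-squared against experimental errors, in line with the discussion below.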
We have deliberately chosen a value function which checks order-of-magnitude agreement, rather than one which measures the quality of a state relative to the experimental errors of the masses and mixings. This is because the U(1) charges and resulting powers of VEVs which make up our environment are only expected to reproduce the correct orders of magnitude. Finer adjustments can be made by choosing the order-one coefficients a_ij and b_ij, which are not fixed by the U(1) symmetries. However, for simplicity we have opted to fix these coefficients, rather than make them part of the environment. A check based on experimental errors would, therefore, be too sensitive and miss many models which may become acceptable after a suitable adjustment of these order-one coefficients.
A terminal state Q in our environment is one which is phenomenologically promising, that is, a state which gives rise to (roughly) the correct masses and mixings. More specifically, we call a state terminal if its intrinsic value V(Q) is larger than a certain threshold value V_0 and if each individual deviation -|log_10(|µ_Q|/|µ_exp|)| (computed for the scalar field VEVs which minimise eq. (3.2)) is larger than a threshold value V_1. Since we have fixed our order-one parameters a_ij, b_ij, these threshold values are chosen relatively generously, so as not to miss any promising models. For our computations, we have used V_0 = -10 and V_1 = -1.

Based on this intrinsic value, the reward R(Q, α) for an action Q → Q' of the form (3.1), connecting two states Q and Q', is defined in eq. (3.3) as the change in intrinsic value, V(Q') - V(Q), to which a fixed (negative) offset R_offset, typically chosen as R_offset = -10, is added whenever the intrinsic value decreases. In addition, if the new state Q' is terminal, a terminal bonus R_term, typically chosen as R_term = 100, is added to the reward (3.3).
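Spelled out in Python, a reward of this kind looks as follows; the exact functional form of eq. (3.3) is reconstructed here from the verbal description and should be read as an illustrative assumption:

```python
R_OFFSET = -10.0   # penalty added when the intrinsic value decreases
R_TERM = 100.0     # bonus added when the new state is terminal

def reward(V_old, V_new, new_state_is_terminal):
    """Reward for one action: the change in intrinsic value, a penalty for
    decreases, and a terminal bonus (a sketch of the verbal definition)."""
    r = V_new - V_old
    if V_new < V_old:
        r += R_OFFSET
    if new_state_is_terminal:
        r += R_TERM
    return r

print(reward(-20.0, -15.0, False))  # 5.0
print(reward(-15.0, -20.0, False))  # -15.0
print(reward(-12.0, -8.0, True))    # 104.0
```

The asymmetry between improvement and deterioration, together with the large terminal bonus, biases the returns towards short episodes that end in terminal states.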

Neural network
To represent the policy π, we use a fully connected network f_θ built from affine, SELU and softmax layers. Here, "affine" refers to an affine layer performing the transformation x → Wx + b with weight W and bias b, "SELU" is the standard scaled exponential linear unit activation function and "softmax" is a softmax layer which ensures that the output can be interpreted as a vector of probabilities which sum to one. The input of this network is the charge matrix Q, in line with the input dimension of 10r, while the output is a probability vector whose dimension, 20r, equals the number of different actions (3.1). Training data is provided in batches which consist of triplets (Q_t, α_t, G_t), where the actions α_t are represented by standard unit vectors in ℝ^{20r}. The probability of an action can then be written as π_θ(Q_t, α_t) = α_t · f_θ(Q_t) and the loss (2.4) takes the form

L(θ) = G_t ln( α_t · f_θ(Q_t) ) .    (3.4)
Based on this loss function, the above network is trained with the ADAM optimiser, using batch sizes of 32 and a typical learning rate of λ = 1/4000.
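For illustration, the forward pass of such an affine, SELU, affine, softmax stack can be sketched in plain Python. The hidden width of 16 and the random initialisation below are arbitrary assumptions, not taken from the text:

```python
import math
import random

def selu(x):
    """Scaled exponential linear unit (standard constants)."""
    alpha, scale = 1.6732632423543772, 1.0507009873554805
    return scale * (x if x > 0 else alpha * (math.exp(x) - 1.0))

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    z = sum(exps)
    return [e / z for e in exps]

def affine(layer, x):
    """x -> W x + b for one affine layer given as a (W, b) pair."""
    W, b = layer
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def random_layer(n_out, n_in, rng):
    return ([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def policy_forward(layers, q_flat):
    """Flattened charge matrix (length 10r) -> action probabilities (20r)."""
    hidden = [selu(x) for x in affine(layers[0], q_flat)]
    return softmax(affine(layers[1], hidden))

r = 1                                   # one U(1) symmetry
rng = random.Random(1)
layers = [random_layer(16, 10 * r, rng), random_layer(20 * r, 16, rng)]
p = policy_forward(layers, [1, -3, 0, 2, 0, 0, -1, 4, 0, 1])
print(len(p), round(sum(p), 6))  # 20 1.0
```

The final softmax guarantees that sampling an action from the output vector is well defined at every stage of training, even for an untrained network.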

Agent
The FN environment will be explored by a single agent, following episodes (2.3) of maximal length N_ep = 32, and guided by the policy network π_θ. Each new episode is started from a random state, to improve exploration of the environment. Terminal states which are encountered during training are stored for later analysis. The FN environment and the REINFORCE algorithm are realised as MATHEMATICA [29] packages, the latter based on the MATHEMATICA suite of machine learning modules. For terminal states found during training or by applying the trained network, we perform a further Monte Carlo analysis in the space of order-one coefficients a_ij, b_ij (which were held fixed during training) in order to optimise their intrinsic value V(Q).
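The post-processing Monte Carlo step can be sketched as a simple random search. In the Python illustration below, the sampling range for the order-one coefficients (magnitude in [1/3, 3] with random sign) and the `value_of` callback, which should map coefficient matrices to the intrinsic value V(Q), are assumptions of the sketch:

```python
import random

def optimise_coefficients(value_of, n_trials=200, seed=0):
    """Random search over order-one coefficients a_ij, b_ij (a sketch).

    Coefficients are drawn with magnitude in [1/3, 3] and random sign; this
    range is an assumption made here for illustration.
    """
    rng = random.Random(seed)

    def order_one():
        magnitude = rng.uniform(1.0 / 3.0, 3.0)
        return magnitude if rng.random() < 0.5 else -magnitude

    best, best_value = None, -float("inf")
    for _ in range(n_trials):
        a = [[order_one() for _ in range(3)] for _ in range(3)]
        b = [[order_one() for _ in range(3)] for _ in range(3)]
        value = value_of(a, b)
        if value > best_value:
            best, best_value = (a, b), value
    return best, best_value

# toy stand-in: the "intrinsic value" peaks when a_11 is close to one
(best_a, best_b), best_value = optimise_coefficients(lambda a, b: -abs(a[0][0] - 1.0))
print(best_value)
```

Because the U(1) charges only fix orders of magnitude, this coefficient scan is what turns an order-of-magnitude candidate into a sharper fit of the measured masses and mixings.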

Results
In this section, we present the results we have obtained by applying the REINFORCE algorithm to the FN environment, as described in the previous section. We focus on the two cases of one U(1) symmetry with one singlet scalar and two U(1) symmetries with two singlet scalars, starting with the former.

One U(1) symmetry
The entries of the 1×10 charge matrix Q are restricted as in eq. (2.15), with -q_min = q_max = 9, so the environment contains 19^10 ~ 10^13 states. Training of the network in section 3.2 takes about an hour on a single CPU and the measurements made during training are shown in figure 1. After an initial phase of exploration, lasting for about 15000 rounds, the network learns rapidly and the fraction of episodes which end in terminal states (plot (c) in figure 1) rises to > 90% within 10000 rounds or so. This pattern is quite characteristic and persists under variation of the various pieces of metadata, including the depth and width of the network, the constants which enter the definition (3.3) of the reward and the definition of a terminal state. The result is also stable under modest variations of the learning rate λ = 1/4000, although too large learning rates (λ > 1/1000) suppress exploration and lead to convergence to the "wrong" policy. The residual positive loss in figure 1(a) can be attributed to the fact that frequently more than one efficient path to a terminal state exists. In other words, there are several very similar optimal policies. During training, 4924 terminal states are found, which reduce to 4630 after the redundancies due to the permutations (2.16) are removed. Episodes guided by the trained network, starting at a random state and with maximal length 32, lead to terminal states in 93% of cases, and the average episode length is 16.4.
As figure 1 shows, training lasted for about 50000 episodes, each with a maximal length of 32 (and an actual average episode length decreasing to about 16 during training). This means that the network has explored of the order of 10^6 states during training. We emphasise that this is a tiny fraction, ~10^-7, of the size of the environment. Hence, we are not performing a systematic scan; rather, the network learns based on a relatively small sample. It is instructive to compare the efficiency of this learning process with random sampling. If we randomly generate 10^6 states from the environment, it turns out that about 40 of them are terminal states. This should be compared with the 4924 terminal states the network has found based on sampling a similar number of states.
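The quoted counts translate into an efficiency gain of roughly two orders of magnitude, as a quick back-of-the-envelope check shows:

```python
# terminal states found per ~1e6 states visited, from the counts quoted above
random_hits = 40       # random sampling of the environment
trained_hits = 4924    # states visited while training the policy network
print(trained_hits / random_hits)  # 123.1, roughly two orders of magnitude
```

This ratio is the basis for the efficiency factors quoted in the conclusion.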
The intrinsic values of the terminal states found during training are optimised by performing a Monte-Carlo search over the order-one coefficients a_ij, b_ij. In this way, we find 89 models Q with an intrinsic value V(Q) > -1. From these, the model with the highest intrinsic value arises for a scalar VEV v_1 ≈ 0.224 and suitable order-one coefficients. Of course, the trained network can be used to find new models. For example, consider starting with the initial state (4.5). The optimal intrinsic value for this state, achieved for a singlet VEV v_1 ≈ 0.112, is V(Q) ≈ -15, so this is definitely not a phenomenologically viable model. Using (4.5) as the initial state of an episode, guided by the trained network, it takes 18 steps to reach a terminal state with intrinsic value V(Q) ≈ -3.94 for a singlet VEV v_1 ≈ 0.056. The intrinsic value and the reward along this episode, as well as a two-dimensional projection of the path mapped out by the episode, are shown in figure 2. We can also test the trained network by checking whether it can guide us towards a model known in the literature, starting at a nearby state. For example, consider the model from ref. [22], given by the charge matrix (4.7), which has an intrinsic value of V(Q) ≈ -4.3 for a singlet VEV v_1 ≈ 0.159. Suppose we use an initial state which is a perturbation of the literature model (4.7) but, as it stands, does not amount to a potentially viable model. Generating an episode starting at this state, the trained network indeed guides us back to the literature model (4.7).

Two U(1) symmetries
Next, we present results for an environment with two U(1) symmetries and two singlet scalar fields. The entries of the 2 × 10 charge matrix Q are constrained as in eq. (2.15), but we now consider a somewhat smaller range with -q_min = q_max = 5. This still leads to a considerably larger environment than previously, with a total of 11^20 ~ 10^21 states. Training for this environment on a single CPU takes about 25 hours and leads to the measurements shown in figure 4. The network finds 60686 terminal states, which reduce to 57807 once the permutation redundancies (2.16) are removed. Episodes guided by the trained network and with maximal length 32 lead to terminal states in 95% of cases and the average episode length is 19.9 steps.
As with the single U(1) case, the network has sampled of the order of 10^6 states during training, a tiny fraction of about 10^-14 of the total. Generating 10^6 states randomly produces only a few terminal states, while the network finds over 60000 based on a similar sample size.
After a Monte-Carlo optimisation of the order-one coefficients a_ij, b_ij, we find that 2019 of the 57807 models found during training have an intrinsic value V(Q) > -1. The best of these is included among the example models listed in appendix B. We can also demonstrate that the trained network is capable of finding models which have been constructed in the literature. Consider the model from ref. [22], described by the charge matrix (4.13). For singlet VEVs v_1 ≈ 0.158 and v_2 ≈ 0.028 it is a terminal state with intrinsic value V(Q) ≈ -4.1 which, however, has not been found during training. To see that this model can be obtained, we start an episode at a nearby state with charge matrix (4.14). The trained network then takes us from this state to the literature model (4.13) in three steps, as can be seen in figure 5.

Conclusion and outlook
In this paper, we have explored particle physics models with reinforcement learning (RL). We have focused on a simple framework, Froggatt-Nielsen (FN) models for quark masses and mixing, and the simplest policy-based RL algorithm. Our results show that the space of these models can be efficiently explored in this way. For both cases we consider, that is, for FN models with one U(1) symmetry and with two U(1) symmetries, the network can be trained to settle on a highly efficient policy which leads to terminal states in > 90% of all cases and in an average number of < 20 steps. Training is accomplished based on sampling about 10^6 states, which is a tiny fraction, of the order of 10^-7 and 10^-14 for the two cases, of the total number of states. Therefore, training does not amount to systematic scanning but rather to a guided exploration of the environment. At the same time, the network is significantly more efficient at finding terminal states than simple random sampling, by factors of the order of 10^2 and 10^4 for the two cases. This shows that reinforcement learning is a powerful method to explore large environments of particle physics models which defy systematic scanning. The trained networks can be used to find promising models from random initial states and are capable of finding literature models, provided they are started at a nearby state.
There are numerous extensions of this work. At a basic level, there are various ways to extend the system within the context of fermion mass models, by enlarging the environment to cover more general classes of theories. (i) The lepton sector can be included, that is, the lepton charges become part of the environment. (ii) The order-one coefficients, suitably discretised, are included in the environment. (iii) A class of scalar field potentials is added to the environment. The scalar field VEVs, which are determined by an optimal fit to the data in our present system, would then be fixed by minimising these potentials. Adding all three components to our environment is feasible and would only require modest computing resources, such as a single machine with a GPU. Our present results strongly suggest that this is likely to produce a successful RL system which finds suitable charge assignments for all fermions as well as scalar potentials which produce the required VEVs. Getting all these elements right simultaneously is not necessarily an easy task for a model builder and we believe such an RL system could provide valuable assistance in finding promising models of fermion masses.
Looking further ahead, we can ask if other classes of particle physics models, such as, for example, supersymmetric or dark matter extensions of the standard model, can be explored in this way. At its most ambitious, this line of thought suggests an RL environment which consists of large classes of quantum field theories extending the standard model of particle physics. The actions available to the agent would allow for changes of the symmetry, the particle content and the interaction terms in the Lagrangian. The intrinsic value of such models might be determined by comparing their predictions with a wide range of experimental data. Realising such an environment would require significantly more theoretical preparation than was necessary for the FN environment: all required observables have to be readily computable for the entire class of quantum field theories considered. With rapid progress in amplitude computations over the past years, this may well be in reach. Of course, substantially more computing power will also be required in order to facilitate a fast evaluation of each model against the data. It is conceivable this could be achieved by a small cluster where the computation of a large number of observables can be parallelised. The benefits of such a system might be considerable: it would allow exploring large classes of standard model extensions and their consistency with experimental data, and might help to find the correct path for physics beyond the standard model.

A Example models for one U(1) symmetry
In this appendix we list some models with a single U(1) symmetry and a high intrinsic value V(Q), found during training.

Table 2. Models with high intrinsic value for a single U(1) symmetry.

B Example models for two U(1) symmetries
In this appendix we list some models with two U(1) symmetries and a high intrinsic value V(Q), found during training.

Table 3. Models with high intrinsic value for two U(1) symmetries.