Exploring the flavor structure of quarks and leptons with reinforcement learning

We propose a method to explore the flavor structure of quarks and leptons with reinforcement learning. As a concrete model, we utilize a basic value-based algorithm for models with U(1) flavor symmetry. By training neural networks on the U(1) charges of quarks and leptons, the agent finds 21 models consistent with the experimentally measured masses and mixing angles of quarks and leptons. In particular, the intrinsic value of normal ordering tends to be larger than that of inverted ordering, and the normal ordering is well fitted to the current experimental data, in contrast to the inverted ordering. A specific value of the effective mass for the neutrinoless double beta decay and a sizable leptonic CP violation induced by an angular component of the flavon field are predicted by the autonomous behavior of the agent. Our findings indicate that reinforcement learning can be a new method for understanding the flavor structure.


Introduction
The origin of the flavor structure of quarks and leptons is one of the unsolved problems in particle physics. To understand the peculiar pattern of fermion masses and mixings, flavor symmetries have been utilized to address this flavor puzzle. An attractive feature of flavor symmetry is that it may connect the bottom-up approach of flavor model building with the top-down approach based on an ultra-violet completion of the Standard Model, such as string theory. In this paper, we adopt a bottom-up approach to explore the flavor structure of quarks and leptons.
In most traditional approaches to the flavor structure of quarks and leptons, one assumes a certain representation of quarks and leptons under the flavor symmetry among all possible configurations. Indeed, it is difficult to exhaust all possible realistic flavor patterns from a broad theoretical landscape. For instance, in a global U(1) flavor symmetric model using the Froggatt-Nielsen (FN) mechanism [11], we have the freedom of the U(1) charge assignment of each quark and lepton. When we consider flavor-dependent U(1) charges of the quarks $q_i$ within the range $-9 \leq q_i \leq 9$, there are $O(10^{14})$ patterns of U(1) charges even for the quark sector. When we combine this with the lepton sector, we face a brute-force search over a higher-dimensional parameter space. This is a simple flavor model using a continuous flavor symmetry, but in general it is difficult to find a realistic flavor pattern among the huge number of possibilities in flavor models with discrete symmetries. This motivates us to apply recent machine-learning techniques for an exhaustive search of flavor models.
In order to explore such a huge landscape of flavor models, in this paper we employ reinforcement learning (RL), one of the machine-learning techniques. In the framework of RL, an agent autonomously discovers desirable behavior to solve given problems where a systematic search is impossible. So far, such a technique has been utilized to find the parameter space of FN models with an emphasis on the quark sector [12], where only the experimental values of quark masses and mixing angles are efficiently reproduced. However, it is quite important to see whether one can reproduce the flavor structure of all the fermion masses and mixings. Throughout this paper, we assume the Type-I seesaw mechanism to realize active neutrino masses and large mixing angles in the lepton sector. We utilize a basic value-based algorithm, where the neural network is trained on data given by an environment. To find the flavor structure of quarks and leptons efficiently, we set up the environment such that the inputs consist of the U(1) charges of quarks and leptons under the U(1) flavor symmetry, and the coefficients appearing in the Yukawa couplings are randomly fixed as $O(1)$ real values. The outputs of the neural network are probabilities for the action determined by a policy. Here, the action of the agent is given by increasing or decreasing one of the U(1) charges by one, and the agent receives a reward (punishment) for this action when the fermion masses and mixings determined by the U(1) charges approach (deviate from) the experimental values. Specifically, the reward function is defined through the intrinsic value, built from the fermion masses and the elements of the CKM and PMNS matrices, which is optimized with respect to the vacuum expectation value (VEV) of the complex flavon field.
In addition to reproducing the experimental values, the parameter search with RL provides new insights into the neutrino mass ordering and the CP phase in the lepton sector. Note that the source of CP violation is assumed to originate from the phase of the complex flavon field. By training neural networks without specifying the neutrino mass ordering, RL can help determine whether the neutrinos are in the normal ordering or in the inverted ordering. From the results of the trained network, we find that the normal ordering is statistically favored by the agent. Furthermore, sizable Majorana CP phases and the effective mass for the neutrinoless double beta decay are predicted around specific values. This paper is organized as follows. After briefly reviewing RL with an emphasis on the Deep Q-Network in Sec. 2, we establish the FN model with RL in Sec. 3. We begin the model building with RL by focusing on the quark sector in Sec. 4, and the training of the lepton sector is performed in Sec. 5. In particular, we analyze two scenarios for the neutrino sector. In Sec. 5.1, we implement the FN model with fixed neutrino mass ordering in the neural network, whereas the neutrino mass ordering is not specified in the analysis of Sec. 5.2. Sec. 6 is devoted to the conclusion and discussion. In Appendix B, we list the U(1) charge assignments of quarks and leptons found in our analysis.

Reinforcement learning with deep Q-network
In this section, we briefly review RL with the Deep Q-Network (DQN) used in the analysis of this paper. For more details, see, e.g., Ref. [13]. RL consists of an agent and an environment. At a certain time, the agent observes the environment and takes some action. Depending on the change of the environment caused by the action, the agent receives rewards or penalties. By repeating these processes and searching for actions that maximize the total reward, the agent is designed to exhibit autonomous behavior in the environment.
To determine the action, we utilize a neural network model. In multi-layer perceptrons, the $n$-th layer transforms an $N_{n-1}$-dimensional vector $\vec{x}_{n-1} = (x_{n-1,1}, x_{n-1,2}, \cdots, x_{n-1,N_{n-1}})$ into an $N_n$-dimensional vector $\vec{x}_n = (x_{n,1}, x_{n,2}, \cdots, x_{n,N_n})$:

$$\vec{x}_n = h\left( w_n \vec{x}_{n-1} + b_n \right), \qquad (2.1)$$

with $h$, $w$, $b$ being the activation function, the weights, and the biases, respectively. In the analysis of this paper, we employ fully-connected layers. The DQN, one of the RL methods, is characterized by a Q network, a target network, and experience replay. In this paper, we consider neural networks whose output is constructed by a softmax layer. Note that the Q network and the target network have the same structure, but their weights and biases are generically different from each other.
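The layer transformation (2.1) with the activations used later (SELU for the hidden layers, softmax for the output) can be written in a few lines of NumPy. This is a minimal sketch; the weights and biases are placeholders, not the trained parameters of this paper:

```python
import numpy as np

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # SELU activation (used for the hidden layers in Sec. 3)
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(x):
    # Softmax of Eq. (2.2), with the max subtracted for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

def forward(x, weights, biases):
    """Eq. (2.1): x_n = h(w_n x_{n-1} + b_n), applied layer by layer."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = selu(w @ x + b)
    return softmax(weights[-1] @ x + biases[-1])   # output probabilities
```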
RL using the DQN proceeds through the following 5 steps:

1. The agent observes the environment state $s$, which is given as an input to the target neural network, as shown in Fig. 1. The target network (TN) gives the probabilities $p$ as an output. Since we adopt the softmax layer defined by

$$f(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}, \qquad (2.2)$$

the output can be regarded as probabilities.

2. In the second step, the agent determines the action $a$, taking into account the probabilities $p$ given by the first step. At the initial stage, the neural network cannot judge whether an action is appropriate. Let us denote the action with the highest probability by $b$. To acquire the ability of autonomous behavior for the agent, we adopt the $\epsilon$-greedy method, where the greedy action $b$ is selected with probability $1 - \epsilon$ and a random action $c$ is selected with probability $\epsilon$ (see Fig. 2), that is,

$$a = \begin{cases} b & \text{with probability } 1 - \epsilon \\ c & \text{with probability } \epsilon \end{cases}$$

By repeating this process, a sequence of states

$$s_1 \rightarrow s_2 \rightarrow \cdots \rightarrow s_{N_{step}+1}$$

is generated, and this chain is called an episode. The initial environment state $s_1$ is chosen randomly. The number of actions per episode is bounded by $N_{step}$, and the agent repeats this procedure $N_{ep}$ times, as shown in Table 1. Note that the greedy action is determined by taking into account the probabilities $p$ in the first step. The value of $\epsilon$ is chosen to ensure that the agent gradually takes the greedy action; its explicit form is

$$\epsilon_k = \max\left(\epsilon_{\min},\, \epsilon_0\, r^k\right),$$

with $k = 1, 2, \dots, N_{ep}$. In the following analysis, we adopt $\epsilon_0 = 1$, $r = 0.99999$ and $\epsilon_{\min} = 0.01$. This definition means that the agent gains various experiences while $\epsilon$ is large, using which it gradually takes a plausible action (a minimal sketch of this selection rule is given below Table 1).

Table 1. The environment states $s$ are changed by the actions. The agent performs at most $N_{step}$ steps in one episode.
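A minimal sketch of the $\epsilon$-greedy selection of step 2 follows, assuming the exponential decay $\epsilon_k = \max(\epsilon_{\min}, \epsilon_0 r^k)$ reconstructed above:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon(k, eps0=1.0, r=0.99999, eps_min=0.01):
    # Decaying exploration rate; eps0 = 1, r = 0.99999, eps_min = 0.01
    return max(eps_min, eps0 * r**k)

def select_action(probs, k):
    """Greedy action b with probability 1 - eps, random action c otherwise."""
    if rng.random() < epsilon(k):
        return int(rng.integers(len(probs)))   # random action c
    return int(np.argmax(probs))               # greedy action b
```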
3. The state $s$ is updated to $s'$ through the action $a$. Depending on the state $s'$, the agent receives a reward $R$. In this third step, the transition $e = \langle s, a, s', R \rangle$, corresponding to a trajectory of experience, is stored in the replay buffer, as seen in Fig. 3.

4. The fourth step consists of "experience replay" and the "stochastic gradient method". Experience replay extracts a mini-batch of transitions randomly sampled from the replay buffer; the Q network is optimized using at most batch-size times epoch-number transitions. The advantage of this experience replay is twofold. First, the transitions in a batch are uncorrelated due to the random selection of past experiences. Second, one can reuse each transition in the training, because all the experience is stored in the replay buffer.
In the framework of DQN, there are two neural networks: the Q network and the target network. The Q network is updated by the stochastic gradient method, where the mini-batch of transitions is used as the training data (see Fig. 4). When we denote the outputs of the Q network and the target network by $y(s)$ and $y'(s')$, respectively, the weights and biases are updated by minimizing a loss function $L(y, y')$. In this paper, we adopt the Huber function:

$$L\left(y, y'(R)\right) = \begin{cases} \frac{1}{2}\left(y - y'(R)\right)^2 & \left|y - y'(R)\right| \leq \delta \\ \delta\left(\left|y - y'(R)\right| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases} \qquad (2.6)$$

with $y'(R) = R + \gamma y'$, $\gamma = 0.99$ and $\delta = 1$, which combines a mean squared error and a mean absolute error. Note that the training of the Q network is carried out at the end of each episode, which includes at most $N_{step}$ steps, as shown in Table 1.

5. The Q network and the target network have different parameters $\Theta = \{w, b\}$ and $\Theta' = \{w', b'\}$, respectively. Lastly, the parameters $\Theta$ of the Q network are slightly reflected in $\Theta'$ of the target network (see Fig. 5). Specifically, in the case of a soft update, this reflection proceeds as

$$\Theta' \leftarrow \alpha \Theta + (1 - \alpha)\, \Theta',$$

where $\alpha$ is called the "learning rate". This procedure suppresses rapid parameter changes and updates the target network while maintaining learning stability. When $\alpha$ is large, the stability of the learning is lost, whereas a small $\alpha$ leads to slow learning. In this paper, we adopt $\alpha = 2.5 \times 10^{-4}$. Since the stochastic gradient method is not used in updating the parameters of the target network, no loss function is defined for this network.
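Steps 4 and 5 can be summarized in a short TensorFlow sketch. This is a generic DQN update consistent with the hyperparameters above ($\gamma = 0.99$, $\delta = 1$, $\alpha = 2.5 \times 10^{-4}$); how the softmax outputs of the paper's networks enter the loss is not spelled out in the text, so the usual max-Q convention is assumed here:

```python
import tensorflow as tf

GAMMA, ALPHA = 0.99, 2.5e-4
huber = tf.keras.losses.Huber(delta=1.0)          # Eq. (2.6)
optimizer = tf.keras.optimizers.Adam()

def train_step(q_net, target_net, states, actions, next_states, rewards):
    """One stochastic-gradient step on a mini-batch of transitions <s, a, s', R>."""
    # Target y'(R) = R + gamma * y'(s') from the (slowly updated) target network
    y_target = rewards + GAMMA * tf.reduce_max(target_net(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_all = q_net(states)
        idx = tf.stack([tf.range(tf.shape(q_all)[0]),
                        tf.cast(actions, tf.int32)], axis=1)
        y_pred = tf.gather_nd(q_all, idx)         # value of the action taken
        loss = huber(y_target, y_pred)
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

def soft_update(q_net, target_net, alpha=ALPHA):
    # Theta' <- alpha * Theta + (1 - alpha) * Theta'
    for w, w_t in zip(q_net.weights, target_net.weights):
        w_t.assign(alpha * w + (1.0 - alpha) * w_t)
```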

Froggatt-Nielsen model with reinforcement learning

The environment
In FN models, the hierarchical structure of fermion masses and the flavor structure are addressed by a global U(1) symmetry. For simplicity, we introduce only one complex scalar field (the so-called flavon field), charged under U(1). The relevant Yukawa terms of quarks and leptons are given by

$$-\mathcal{L}_Y = y^u_{ij} \left(\frac{\phi}{M}\right)^{n^u_{ij}} \bar{Q}_i H^c u_j + y^d_{ij} \left(\frac{\phi}{M}\right)^{n^d_{ij}} \bar{Q}_i H d_j + y^l_{ij} \left(\frac{\phi}{M}\right)^{n^l_{ij}} \bar{L}_i H l_j + y^\nu_{ij} \left(\frac{\phi}{M}\right)^{n^\nu_{ij}} \bar{L}_i H^c N_j + y^N_{ij}\, M \left(\frac{\phi}{M}\right)^{n^N_{ij}} \bar{N}^c_i N_j + \mathrm{h.c.}, \qquad (3.1)$$

where $\{Q_i, u_i, d_i, L_i, l_i, N_i, H\}$ denote the left-handed quarks, the right-handed up-type quarks, the right-handed down-type quarks, the left-handed leptons, the right-handed charged leptons, the right-handed neutrinos, and the SM Higgs doublet with $H^c = i\sigma_2 H^*$, respectively. Here, we assume three right-handed neutrinos, and tiny neutrino masses are generated by the Type-I seesaw mechanism, where the parameter $M$ is chosen as $M = 10^{15}$ GeV throughout the analysis of this paper, and the Yukawa couplings $\{y^u_{ij}, y^d_{ij}, y^l_{ij}, y^\nu_{ij}, y^N_{ij}\}$ are $O(1)$ real coefficients. Since the SM fields and the flavon field are also charged under U(1), let us denote their U(1) charges by

$$\{q(Q_i), q(u_i), q(d_i), q(L_i), q(l_i), q(N_i), q(H), q(\phi)\}. \qquad (3.2)$$

To be invariant under the U(1) symmetry, the integers $n_{ij}$ must cancel the total U(1) charge of each operator; schematically (up to sign conventions for the conjugated fields), in the quark sector

$$n^u_{ij}\, q(\phi) = q(Q_i) + q(H) - q(u_j), \qquad n^d_{ij}\, q(\phi) = q(Q_i) - q(H) - q(d_j), \qquad (3.3)$$

and analogously for the charged-lepton, neutrino Yukawa, and Majorana terms in the lepton sector (3.4), where the $n_{ij}$ are considered positive integers throughout this paper. Furthermore, we require the presence of the Yukawa term $\bar{Q}_3 H^c u_3$, irrelevant to $q(\phi)$,

$$q(H) = q(u_3) - q(Q_3), \qquad (3.5)$$

since otherwise one cannot realize the value of the top-quark mass. Once $\phi$ and $H$ develop VEVs, $\langle \phi \rangle = v_\phi$ and $\langle H \rangle = v_{EW} = 174$ GeV, the Dirac mass matrices of quarks and leptons as well as the Majorana mass matrix are given by

$$(M_f)_{ij} = y^f_{ij}\, \epsilon^{n^f_{ij}}\, v_{EW} \quad (f = u, d, l, \nu), \qquad (M_N)_{ij} = y^N_{ij}\, \epsilon^{n^N_{ij}}\, M, \qquad (3.6)$$

with $\epsilon \equiv v_\phi / M$. The light neutrino mass matrix is obtained by integrating out the heavy right-handed neutrinos:

$$m_\nu = - M_\nu M_N^{-1} M_\nu^T.$$

The quark and lepton mass matrices are diagonalized by unitary matrices $U_f$, e.g. $U_f^\dagger M_f M_f^\dagger U_f = \mathrm{diag}(m_{f_1}^2, m_{f_2}^2, m_{f_3}^2)$, and the flavor mixings are given by the misalignment between mass eigenstates and flavor eigenstates:

$$U_{PMNS} = (U_l)^\dagger U_\nu = \begin{pmatrix} c_{12} c_{13} & s_{12} c_{13} & s_{13} e^{-i\delta_{CP}} \\ -s_{12} c_{23} - c_{12} s_{23} s_{13} e^{i\delta_{CP}} & c_{12} c_{23} - s_{12} s_{23} s_{13} e^{i\delta_{CP}} & s_{23} c_{13} \\ s_{12} s_{23} - c_{12} c_{23} s_{13} e^{i\delta_{CP}} & -c_{12} s_{23} - s_{12} c_{23} s_{13} e^{i\delta_{CP}} & c_{23} c_{13} \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & e^{i\alpha_{21}/2} & 0 \\ 0 & 0 & e^{i\alpha_{31}/2} \end{pmatrix},$$

with $c_{ij} = \cos\theta_{ij}$ and $s_{ij} = \sin\theta_{ij}$, which also holds for the CKM matrix $V_{CKM} = (U_u)^\dagger U_d$ in the quark sector, except for the Majorana phases $\{\alpha_{21}, \alpha_{31}\}$. Since the quarks and the leptons are charged under U(1), the flavon VEV $\langle \phi \rangle = v_\phi$ leads to the flavor structure due to the smallness of $|\epsilon|$. Recalling that the U(1) charge of the Higgs doublet is determined by Eq. (3.5), the flavor structure of quarks and leptons is specified by the following charge vector:

$$\mathcal{Q} = \left( q(Q_i), q(u_i), q(d_i), q(L_i), q(l_i), q(N_i), q(\phi) \right), \qquad i = 1, 2, 3,$$

consisting of 19 elements. This is the input for the target network and the Q network as the state $s$. In the following analysis using RL, we restrict ourselves to the range of U(1) charges

$$-9 \leq \mathcal{Q} \leq 9,$$

corresponding in total to $19^{19} \sim 10^{24}$ possibilities for the charge assignment. It is a challenging issue to find a realistic flavor pattern by a brute-force approach. Furthermore, it is generally difficult to use the gradient descent method, because the U(1) charges are discrete and even a small difference in the charges results in exponential differences in calculated values such as masses. Against this background, the necessity of applying reinforcement learning arises. Note that a generic U(1) charge of the flavon leads to non-integer $n_{ij}$; we therefore focus on $q(\phi) = 1$ or $-1$, each with 50% probability, in the following analysis.
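To make the construction concrete, the following NumPy sketch builds an FN mass matrix from a charge assignment and extracts masses and left rotations by singular value decomposition. The sign conventions in n_matrix and the toy charges are our own illustrative assumptions, not a terminal state from this paper:

```python
import numpy as np

def n_matrix(q_left, q_right, q_higgs, q_phi):
    # Powers n_ij fixed by U(1) invariance (sign conventions assumed here):
    # n_ij * q(phi) = q(left)_i + q(H) - q(right)_j
    return (np.add.outer(q_left, -q_right) + q_higgs) / q_phi

def mass_matrix(y, n, eps, overall):
    # Dirac mass matrix (M_f)_ij = y_ij * eps**n_ij * overall, cf. Eq. (3.6)
    return y * eps ** n * overall

def left_rotation(M):
    # M = U diag(s) W^dagger; reorder so the masses come out ascending
    U, s, _ = np.linalg.svd(M)
    return U[:, ::-1], s[::-1]

# Toy example with illustrative charges and couplings
q_Q = np.array([3, 2, 0]); q_u = np.array([-5, -1, 0])
q_H = q_u[2] - q_Q[2]                  # Eq. (3.5): unsuppressed top Yukawa
eps = 0.2 * np.exp(0.3j)               # epsilon = v_phi / M, complex in general
rng = np.random.default_rng(0)
y_u = rng.normal(1.0, 0.25, (3, 3))    # O(1) coefficients
M_u = mass_matrix(y_u, n_matrix(q_Q, q_u, q_H, q_phi=1), eps, overall=174.0)
U_u, m_u = left_rotation(M_u)          # up-type masses and left rotation
# V_CKM = (U_u)^dagger U_d once M_d is built analogously
```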

Neural Network
A state $s$ given by the charge assignment $\mathcal{Q}$ is updated to $s'$ through the action $a$. To determine the action, we utilize the neural network shown in Table 2. The activation function $h$ in Eq. (2.1) is chosen as the SELU function for hidden layers 1, 2, 3 and the softmax function (2.2) for the output layer. We employ the ADAM optimizer in TensorFlow [15], where the weights and biases are chosen to minimize the loss function given by the Huber function (2.6).
In the FN model, the flavor structure of quarks and leptons is determined by the charge vector $\mathcal{Q}$, comprising in total $10^{24}$ possibilities for the charge assignment, as pointed out before. When we focus only on the quark sector, the parameter space of U(1) charges reduces to $19^{10} \sim 10^{12}$ possibilities. To achieve highly efficient learning in a short time, it is better to perform a separate training for the U(1) charge assignments of quarks and leptons. Note that only the flavon U(1) charge connects the quark sector with the lepton sector, since the U(1) charge of the Higgs is determined by the charges of the third-generation quarks (3.5). As mentioned before, we focus on $q(\phi) = 1$ or $-1$, each with 50% probability, in the following analysis. Thus, we first analyze the parameter space of quark U(1) charges by RL, as discussed in detail in Sec. 4, and then move to the lepton sector with fixed U(1) charge of the Higgs field, as discussed in detail in Sec. 5.
The hyperparameters are set as $N_{ep} = 10^5$ and $N_{ep} = 6 \times 10^4$ for the episode number in the quark and lepton sectors, respectively, $N_{step} = 32$ for the step number, a batch size of 32, an epoch number of 32, and the learning rate $\alpha = 2.5 \times 10^{-4}$. The hyperparameters of the $\epsilon$-greedy method are described in the previous section. Regarding the step number $N_{step} = 32$, the same value was used in the previous research focusing only on the quark sector [12], where it was shown that terminal states can be reached after a sufficient amount of learning. Therefore, it is expected that $N_{step} = 32$ is enough to achieve terminal states in the current situation, where the quark sector and the lepton sector are searched separately.
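For reference, the hyperparameters quoted above can be gathered into a single configuration; the dictionary below is merely a convenient summary (the key names are ours):

```python
HYPERPARAMS = {
    "N_ep": {"quark": 100_000, "lepton": 60_000},  # episodes per sector
    "N_step": 32,            # actions per episode
    "batch_size": 32,
    "epochs": 32,
    "alpha": 2.5e-4,         # soft-update learning rate
    "eps0": 1.0, "r": 0.99999, "eps_min": 0.01,    # epsilon-greedy schedule
    "gamma": 0.99, "delta": 1.0,                   # discount factor, Huber delta
}
```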

Table 2. The neural network consists of an input layer, three hidden layers, and an output layer. The input is the charge assignment $\mathcal{Q}$ with dimension $A$, and the activation functions are the SELU function for hidden layers 1, 2, 3. Since we use the softmax function (2.2) for the output layer, the output with dimension $2A$ is interpreted as probabilities. The dimension of the output layer is twice that of the input layer due to the action of the agent (3.13).
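The architecture of Table 2 can be sketched in TensorFlow as follows; the hidden-layer width is an assumption, since only the input dimension $A$ and output dimension $2A$ are fixed by the text. The helper decode_action illustrates how an output index maps to the action (3.13):

```python
import tensorflow as tf

def build_network(A, width=64):
    # Input: charge vector of dimension A; three SELU hidden layers;
    # softmax output of dimension 2A interpreted as action probabilities
    return tf.keras.Sequential([
        tf.keras.layers.Dense(width, activation="selu", input_shape=(A,)),
        tf.keras.layers.Dense(width, activation="selu"),
        tf.keras.layers.Dense(width, activation="selu"),
        tf.keras.layers.Dense(2 * A, activation="softmax"),
    ])

def decode_action(a):
    # Index a encodes which charge to shift and in which direction
    return a // 2, (+1 if a % 2 == 0 else -1)
```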

Agent
To implement the FN model in the context of RL with DQN, we specify the following action $a$ of the agent at each step:

$$a \in \{\, q \to q + 1,\ q \to q - 1 \mid q \in \mathcal{Q}_A \,\}, \qquad (3.13)$$

i.e., increasing or decreasing one of the charges by one, where $A$ corresponds to $\{Q_i, u_i, d_i, \phi\}$ in the analysis of Sec. 4 and $\{L_i, l_i, N_i, \phi\}$ in the analysis of Sec. 5. These two candidate actions per charge make the dimension of the output layer $2A$ in Table 2. At the initial stage, the $O(1)$ coefficients in the Yukawa terms (3.1) are drawn from two Gaussian distributions with averages $\pm 1$ and standard deviation 0.25 (see Fig. 6); after the training by the neural network introduced in the previous section, they are optimized to proper values by a Monte-Carlo simulation. Thus, once the charges are fixed, one can compare the masses and mixings of quarks or leptons given by the action $a$ with the experimental values. Specifically, we define the intrinsic value $V$ (3.14), schematically a sum of the mass and mixing contributions $M_{quark}$, $M_{lepton}$, $M_\nu$, $C$, and $P$ defined below. Note that the flavon VEV is chosen to maximize the intrinsic value relevant for the quark sector in Sec. 4; thereby there is no flavon dependence in the intrinsic value of the lepton sector.
1. Quark and lepton masses: $M_{quark}$ ($M_{lepton}$) is built from the ratios of the quark (lepton) masses predicted by the agent to the experimental values, penalizing deviations of these ratios from unity on a logarithmic scale. The experimental values are listed in Tables 3 and 4 for quarks and leptons, respectively.

2. Neutrino masses:
Since the ordering of neutrino masses has not been confirmed yet, we search the neutrino structure in two cases: RL with fixed neutrino mass ordering in Sec. 5.1 and RL without specifying the neutrino mass ordering in Sec. 5.2. In each case, the intrinsic value relevant to the neutrino masses, $M_\nu$, is defined analogously from the ratios of the predicted neutrino mass observables to the experimental values listed in Table 4.

3. Mixing angles:
In addition, the intrinsic value includes the information of quark mixings and lepton mixings in $C$ and $P$, built from $E^{ij}_C$ and $E^{ij}_P$, the ratios of the quark and lepton mixings predicted by the agent to the experimental values of Tables 3 and 4, respectively. Here, we use the data from Ref. [16] for the charged-lepton masses and the NuFIT v5.2 results with Super-Kamiokande atmospheric data for the lepton mixing angles and CP phase [17].
The flavon VEV is chosen to maximize the intrinsic value, and we search for the VEV over a finite range of its absolute value and phase, where the angular component of the flavon VEV determines the CP phase. A large intrinsic value indicates that the obtained charge assignment reproduces the experimental values well. Such a charge assignment is called a terminal state. Specifically, the terminal state is defined by a requirement on the intrinsic value, expressed through the thresholds $V_0$, $V_1$, and $V_2$ for the total value, the masses, and the mixings, respectively. In this paper, we adopt $V_0 = 10.0$, $V_1 = 1.0$ and $V_2 = 0.2$. Here, $V_1 = 1.0$ ($V_2 = 0.2$) means that the ratio of the predicted fermion masses (mixings) to the observed masses (mixings) lies within $0.1 \leq r_{mass} \leq 10$ ($0.63 \leq r_{mixing} \leq 1.58$).
Let us denote by $\mathcal{Q}$ the charge assignment observed by the agent and by $\mathcal{Q}'$ the one after the action $a$. For the action of the agent $(\mathcal{Q}, a)$, we give the reward $R$ by the following prescription:

1. Give the basic point $R_{base}$, which depends on the intrinsic value (a larger intrinsic value gives a larger basic point), where $R_{offset}$ corresponds to a penalty, chosen as $R_{offset} = -10$.
2. When $\mathcal{Q}'$ lies outside $-9 \leq \mathcal{Q}' \leq 9$ or the flavon charge satisfies $q(\phi) = 0$, we give the penalty $R_{offset}$ and the environment returns to the original charge assignment $\mathcal{Q}$.
3. When $\mathcal{Q}'$ turns out to be a terminal state, we give the bonus point $R_{term}$, chosen as $R_{term} = 100$.
4. Summing up the above points, we define the reward $R(\mathcal{Q}, a)$; a schematic implementation is sketched below.
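The following sketch follows steps 1 to 4 above; the precise map from the intrinsic value $V$ to the basic point $R_{base}$ is not given in the text, so the clipping used here is an assumption:

```python
# Hypothetical reward prescription; the form of R_base is an assumption.
R_OFFSET, R_TERM = -10.0, 100.0

def reward(V, charges_in_range, flavon_nonzero, is_terminal):
    R = max(V, R_OFFSET)          # step 1: basic point from V (assumed form)
    if not (charges_in_range and flavon_nonzero):
        R += R_OFFSET             # step 2: penalty; the state is also reverted
    if is_terminal:
        R += R_TERM               # step 3: bonus for reaching a terminal state
    return R                      # step 4: total reward R(Q, a)
```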
The structure of the neural network and the design of the reward largely determine the behavior of RL. Therefore, to facilitate comparison with Ref. [12], we use the same architecture.

Learning the quark sector
In this section, we analyze the charge assignment of the quark sector, following the RL with DQN introduced in the previous section. Even for the quark sector within the range of U(1) charges $-9 \leq q \leq 9$, there exist $19^{10} \sim 6.1 \times 10^{12}$ possible states in the environment. By training the neural network for about 15 hours on a single CPU, it turned out that terminal states are found after $O(20{,}000)$ episodes, as shown in Fig. 7. The loss function tends to be minimized as in Fig. 7, where the small positive loss corresponds to the existence of various paths to terminal states, as commented in Ref. [12]. We also check that the reward increases when the loss function decreases. The network leads to terminal states in more than 6% of all cases for the total of $N_{ep} = 10^5$ episodes. Then, after removing the charge assignments with negative integers $n_{ij}$ in Eq. (3.3), we end up with 21 independent terminal states. When we focus only on the quark masses in the training of the neural network, we obtain terminal states in 90% of all cases. This implies that implementing both masses and mixings makes it more difficult for the agent to find a realistic flavor pattern.
By performing the Monte-Carlo search with the Gaussian distribution shown in Fig. 6, the $O(1)$ coefficients $y_{ij}$ are optimized to more realistic values, according to which the intrinsic value is also optimized. We show the benchmark point of charge assignment with the highest intrinsic value in Table 5, where the masses and mixing angles fit the observed values up to $O(0.1)$%. This can be improved by a further brute-force search over the parameter space of $O(1)$ coefficients. Furthermore, from Eq. (3.14), the averaged intrinsic value of the terminal states in [12] is calculated as $V \simeq -0.646$. Thus, we argue that the reinforcement learning constructed in this work is able to search for U(1) charges with accuracy equivalent to the previous research, even when $v_\phi$ is extended to a complex number. Note that there is no CP phase in the quark sector: even when the angular component of the flavon is non-zero, the CP phase can be rotated away by phase redefinitions of the quark fields. A nonvanishing CP phase in the quark sector can be realized by introducing multiple flavon fields [18], but this is left for future work.
The fact that realistic models are very sparsely distributed means that the learning results can change with the random number seed. The RL algorithm developed in this work involves many sources of randomness (such as the initial Yukawa couplings, the initial U(1) charges, the choice of greedy actions, and the behavior in random actions). While the agent aims to maximize the reward, it does not always reach the terminal states due to the sparseness of such states. Therefore, if the random seed is changed, there is no guarantee that the same terminal states described in this paper will be discovered. Nevertheless, even in that case, the discovery of different terminal states would be expected.

Learning the neutrino structure
In this section, we move to the numerical analysis of the lepton sector, following the RL with DQN introduced in Secs. 2 and 3. Based on the analysis in Sec. 4, we fix the Higgs U(1) charge and the VEV $v_\phi$ to realize the 21 realistic FN models in the quark sector. However, there still exist $19^{9} \sim 3.2 \times 10^{11}$ possible states within the range of U(1) charges $-9 \leq q \leq 9$ in the environment. We first analyze the lepton sector with fixed neutrino mass ordering (normal ordering or inverted ordering) in Sec. 5.1. In the subsequent analysis of Sec. 5.2, the neutrino mass ordering is not fixed; thus, one can ask whether plausible FN models prefer the normal ordering or the inverted ordering of neutrino masses.

Fixed ordering of neutrino masses
By training the neural network for about 8 hours on a single CPU, it turned out that terminal states are found after $O(5{,}000)$ episodes, as shown in Fig. 8 for the normal ordering. The loss function tends to be minimized as in Fig. 8 until $O(50{,}000)$ episodes. It is notable that the reward increases when the loss function decreases. After this critical number of episodes, the loss function increases, indicating a sign of overtraining. This is because the lepton sector leads to the terminal states more rapidly than the quark sector. Indeed, the network leads to terminal states in more than 0.06% of all cases for the total of $N_{ep} = 6 \times 10^4$ episodes. Therefore, for the lepton sector, we obtain an upper bound on the computational cost: $O(50{,}000)$ episodes are sufficient for the agent to acquire the optimal behavior. After removing the charge assignments with negative integers $n_{ij}$ in Eq. (3.4) and picking the flavon U(1) charge consistent with the quark sector, we arrive at 63 and 121 terminal states with normal ordering and inverted ordering, respectively. By performing the Monte-Carlo search over the $O(1)$ coefficients $y_{ij}$ with the Gaussian distribution shown in Fig. 6, the lepton masses and mixings are further optimized to more realistic values, according to which the intrinsic value is also optimized. Specifically, we performed the Monte-Carlo search 10 times to find realistic values within $3\sigma$. In the first 10,000 trials, the $O(1)$ coefficients $y_{ij}$ are optimized using the Gaussian distribution shown in Fig. 6. Then, for the $O(1)$ coefficients with the highest intrinsic value among them, we performed a second 10,000 trials with a Gaussian distribution whose average is the coefficients obtained by the first Monte-Carlo search and whose standard deviation is 0.25. After carrying out the same procedure 10 times in total, we find that 6 models with normal ordering are in agreement with the experimental values within $3\sigma$. We show the benchmark point with the highest intrinsic value in Table 6 for the normal ordering.

Here, we list the effective Majorana neutrino mass $m_{\beta\beta}$ for the neutrinoless double beta decay,

$$m_{\beta\beta} = \Bigl| \sum_i m_i U_{ei}^2 \Bigr| = \bigl| m_1 c_{12}^2 c_{13}^2 + m_2 s_{12}^2 c_{13}^2 e^{i\alpha_{21}} + m_3 s_{13}^2 e^{i(\alpha_{31} - 2\delta_{CP})} \bigr|,$$
which would be measured by the KamLAND-Zen experiment [19]. In this analysis, we assume the parameter $M = 10^{15}$ GeV to realize the tiny neutrino masses with $O(1)$ Yukawa coefficients, but we leave a detailed study with different values of $M$ for future work. Note that the angular component of the flavon leads to nonvanishing Majorana CP phases, in contrast to the quark sector. Thus, one can analyze the correlation between mixing angles and CP phases, as shown in Figs. 9 and 10 for the normal ordering, in which all the terminal states within $3\sigma$ are shown. The CP phase $\alpha_{21}$ is predicted at 0. Note that the information about the CP phase has not been implemented in the training of the neural network.
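The multi-stage Monte-Carlo refinement of the $O(1)$ coefficients described above can be sketched as follows; intrinsic_value is a user-supplied function implementing Eq. (3.14), and while the loop counts and distributions follow the text, the implementation details are our own:

```python
import numpy as np

rng = np.random.default_rng()

def refine_couplings(intrinsic_value, shape, n_trials=10_000, n_rounds=10):
    """Gaussian Monte-Carlo search over the O(1) coefficients y_ij."""
    # Round 1: draw from the bimodal distribution of Fig. 6 (means +-1, sigma 0.25)
    best_y, best_V = None, -np.inf
    for _ in range(n_trials):
        signs = rng.choice([-1.0, 1.0], size=shape)
        y = rng.normal(loc=signs, scale=0.25)
        V = intrinsic_value(y)
        if V > best_V:
            best_y, best_V = y, V
    # Subsequent rounds: Gaussians centred on the current best coefficients
    for _ in range(n_rounds - 1):
        for _ in range(n_trials):
            y = rng.normal(loc=best_y, scale=0.25)
            V = intrinsic_value(y)
            if V > best_V:
                best_y, best_V = y, V
    return best_y, best_V
```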
Remarkably, one cannot find neutrino masses of inverted ordering consistent with the experimental values within $3\sigma$, although we performed the Monte-Carlo search 10 times over the $O(1)$ coefficients $y_{ij}$ of the 121 terminal states. This indicates that the normal ordering is favored by the autonomous behavior of the agent. Indeed, the intrinsic value of normal ordering after the Monte-Carlo search tends to be larger than that of inverted ordering, as shown in Fig. 11.

Unfixed ordering of neutrino masses
In this subsection, we train the neural network without specifying the neutrino mass ordering. For each of the 21 realistic FN models in the quark sector, we performed the training twice to obtain a sufficient number of realistic models. As in the previous analyses, the neural network is trained for about 12 hours on a single CPU. It turned out that terminal states are found after $O(2{,}000)$ episodes, as shown in Fig. 12, where the loss function tends to decrease until $O(8{,}000)$ episodes. It is notable that the reward increases when the loss function decreases, and the lepton sector leads to the terminal states more rapidly than the quark sector. The network leads to terminal states in more than 60% of all cases for the total of $N_{ep} = 6 \times 10^4$ episodes. In contrast to the previous analysis, the trained network efficiently leads to terminal states. After removing the charge assignments with negative integers $n_{ij}$ in Eq. (3.4), we arrive at 13,733 (13,432) and 22,430 (20,357) terminal states with normal ordering and inverted ordering in the first (second) learning, respectively. By performing the Monte-Carlo search over the $O(1)$ coefficients $y_{ij}$ with the Gaussian distribution shown in Fig. 6, the lepton masses and mixings are optimized to more realistic values, according to which the intrinsic value is also optimized. Specifically, we performed the Monte-Carlo search twice to find realistic values within $3\sigma$. In the first Monte-Carlo search, we ran 10,000 trials with the Gaussian distribution shown in Fig. 6. Then, for the $O(1)$ coefficients with the highest intrinsic value among them, we performed a second 10,000 trials with a Gaussian distribution whose average is the coefficients obtained by the first Monte-Carlo search and whose standard deviation is 0.25. After carrying out this Monte-Carlo analysis, we find that 15 models with normal ordering are in agreement with the experimental values within $3\sigma$. Two best-fit points with the highest intrinsic values are shown in Tables 7 and 8 for the normal ordering. As presented in the previous section, one can analyze the correlation between mixing angles and the other observables, as shown in Figs. 13 and 14 for the normal ordering, in which all the terminal states within $3\sigma$ are shown. It turned out that the Majorana CP phases are typically nonzero, and the sum of neutrino masses and the effective mass are not widely distributed but tend to be localized at $\sum_i m_{\nu_i} \sim 60$ meV and $2\ \mathrm{meV} \leq m_{\beta\beta} \leq 6\ \mathrm{meV}$, respectively.

Table 7. Benchmark point for the lepton sector with NO (corresponding to the diamond in Figs. 13 and 14), where the neutrino mass ordering is not specified in the learning of the network.

Table 8. Benchmark point for the lepton sector with NO (corresponding to the square in Figs. 13 and 14), where the neutrino mass ordering is not specified in the learning of the network.

Remarkably, one cannot obtain the experimental values of neutrino masses and mixings within $3\sigma$ for the inverted ordering, although we performed the Monte-Carlo searches over the $O(1)$ coefficients $y_{ij}$ of all the terminal states. Thus, the normal ordering of neutrino masses is also favored by the trained neural network, although the network itself was trained without any knowledge of the neutrino mass ordering. Indeed, the intrinsic value of normal ordering after the Monte-Carlo search tends to be larger than that of inverted ordering, as shown in Fig. 15. This conspicuous feature can also be seen in the intrinsic value including both the quark and lepton sectors.

Conclusion
Flavor symmetries are one of the attractive tools to understand the flavor structure of quarks and leptons. To address the flavor puzzle in the Standard Model, we have applied the reinforcement learning technique to flavor models with U(1) horizontal symmetry. RL sheds new light on the phenomenological approach of scanning the parameter space of flavor models, in contrast to the brute-force approach.
In this paper, we have extended the analysis of Ref. [12] to explore the flavor structure of quarks and leptons by employing RL with DQN. Based on the neural network architecture in the framework of the U(1) flavor model with RL established in Secs. 2 and 3, the agent is designed to exhibit autonomous behavior in the environment (the parameter space of U(1) charges). Since the parameter space of U(1) charges is huge, we have performed a separate search for the U(1) charge assignments of quarks and leptons. The trained neural network leads to phenomenologically promising terminal states in more than 6% of all cases for the quark sector and more than 60% for the lepton sector in the case of unfixed ordering of neutrino masses. In the analysis of Sec. 5.2, we have not specified the neutrino mass ordering in the evaluation of the intrinsic value, meaning that the agent does not have any knowledge of the neutrino mass ordering. Nevertheless, the autonomous behavior of the agent shows that the intrinsic value of normal ordering tends to be larger than that of inverted ordering, as shown in Fig. 15, and the normal ordering fits the current experimental data well, in contrast to the inverted ordering. Remarkably, the effective mass for the neutrinoless double beta decay is predicted around specific values, and the Majorana CP phases are nonzero in general.
Before closing our paper, it is worthwhile to mention possible applications of our analysis: • We have focused on the flavor structure of Yukawa couplings, but our method is easily applicable to revealing the flavor structure of higher-dimensional operators (see Refs. [21] and [22] for the Standard Model effective field theory (SMEFT) with U(1) flavor symmetry and discrete symmetry, respectively). Since the trained neural network predicts plausible charge assignments of quarks and leptons, one can also determine the flavor structure of higher-dimensional operators. It would be interesting to clarify whether the RL technique we propose can explore the flavor structure of the SMEFT.
• On top of that, the CP-odd fluctuation of the complex flavon field (flaxion) can be regarded as the QCD axion, as discussed in Refs. [23-26], where cosmological problems (such as the origin of dark matter, the baryon asymmetry of the Universe, and inflation) are simultaneously solved by the dynamics of the flavon field. Since the flavon field has flavor-changing neutral current (FCNC) interactions with quarks and leptons controlled by the U(1) flavor symmetry, the charge assignment of quarks and leptons plays an important role in determining the FCNC processes. It is fascinating to apply the charge assignments found here to such axion physics, which is left for future work.
• We have focused on the U(1) horizontal symmetry, but our method is easily applicable to other flavor symmetries, such as discrete flavor symmetries. We hope to elucidate a comprehensive study of the global structure of flavor models in an upcoming paper.
• By analyzing the underlying factors that characterize reasonable flavor models from neural networks, we aim to figure out general patterns behind flavor models. The search for the flavor structure by RL is expected not only to explore new physics beyond the Standard Model by validating flavor models, but also to help unravel the black box of machine learning itself.

Figure 1 .
Figure 1. As a first step, the state s observed by the agent is given as input to the target network.

Figure 2 .
Figure 2. In the second step, the agent selects the action a through the ϵ-greedy policy.

Figure 3 .
Figure 3. In the third step, the transition e = ⟨s, a, s′, R⟩ is stored in the replay buffer.

Figure 4 .
Figure 4. In the fourth step, we randomly pick transitions of batch size from the replay buffer, and the weights and biases of the Q network are updated by the stochastic gradient method in terms of these transitions.

Figure 5 .
Figure 5. In the last step, the weights and biases of the target network are updated, following the soft update with the learning rate α.

Figure 7 .
Figure 7. Learning results for the quark sector. The results are the output of the neural network leading to the best-fit model shown in Table 5. From left to right, the three panels show (a) the loss function vs episode number, (b) the fraction of terminal episodes vs episode number, and (c) the number of terminal states vs episode number, respectively.

Figure 8 .
Figure 8. Learning results for the lepton sector with fixed NO of neutrino masses. The results are the output of the neural network leading to the best-fit model (the square in Figs. 9 and 10). From left to right, the three panels show (a) the loss function vs episode number, (b) the fraction of terminal episodes vs episode number, and (c) the number of terminal states vs episode number, respectively.

Figure 9 .
Figure 9. Neutrino masses vs mixing angle θ23, where the dotted line represents the global best-fit value of the NuFIT v5.2 results with Super-Kamiokande atmospheric data [17], and the inside regions of the dashed and dot-dashed lines represent the ≤1σ and ≤3σ CL, respectively. The sum of neutrino masses is constrained to 0.15 eV (95% CL) in the case of the ΛCDM model, corresponding to the black solid line [20]. We denote the best-fit point within 3σ by a square, and the intrinsic value (3.14) is written in the legend. Note that the neutrino mass ordering is fixed as NO in the training of the neural network.

Figure 10 .
Figure 10. Majorana phases α21, α31 and effective Majorana neutrino mass mββ vs mixing angle θ23, where the dotted line represents the global best-fit value of the NuFIT v5.2 results with Super-Kamiokande atmospheric data [17], and the inside regions of the dashed and dot-dashed lines represent the ≤1σ and ≤3σ CL, respectively. The effective Majorana neutrino mass is bounded from above by 0.036 eV (90% CL), corresponding to the black solid line [19]. We denote the best-fit point within 3σ by a square, and the intrinsic value (3.14) is written in the legend. Note that the neutrino mass ordering is fixed as NO in the training of the neural network.

Figure 11 .
Figure 11. Boxplots of intrinsic values for the lepton sector, where the neutrino mass ordering is fixed in the learning of the neural network. In the left panel, we show the intrinsic values obtained by the RL of the lepton sector; in the right panel, we incorporate the values of the quark sector analyzed in Sec. 4.

Figure 12 .
Figure 12. Learning results for the lepton sector by RL without specifying the neutrino mass ordering. The results are the output of the neural network leading to the best-fit model (the diamond in Figs. 13 and 14). We observe a similar behavior for other outputs. From left to right, the three panels show (a) the loss function vs episode number, (b) the fraction of terminal episodes vs episode number, and (c) the number of terminal states vs episode number, respectively.

Figure 13 .
Figure 13. Neutrino masses vs mixing angle θ23, where the dotted line represents the global best-fit value of the NuFIT v5.2 results with Super-Kamiokande atmospheric data [17], and the inside regions of the dashed and dot-dashed lines represent the ≤1σ and ≤3σ CL, respectively. The sum of neutrino masses is constrained to 0.15 eV (95% CL) in the case of the ΛCDM model, corresponding to the black solid line [20]. We denote the two best-fit points within 3σ by a diamond and a square, and the intrinsic value (3.14) is written in the legend. Since the neutrino mass ordering is unfixed in the training of the neural network, we show only the NO results from the terminal states.

Figure 14 .
Figure 14. Majorana phases α21, α31 and effective Majorana neutrino mass mββ vs mixing angle θ23, where the dotted line represents the global best-fit value of the NuFIT v5.2 results with Super-Kamiokande atmospheric data [17], and the inside regions of the dashed and dot-dashed lines represent the ≤1σ and ≤3σ CL, respectively. The effective Majorana neutrino mass is bounded from above by 0.036 eV (90% CL), corresponding to the black solid line [19]. We denote the two best-fit points within 3σ by a diamond and a square, and the intrinsic value (3.14) is written in the legend.

Figure 15 .
Figure 15. Boxplots of intrinsic values for the lepton sector from the result of the second learning, where the neutrino mass ordering is unspecified in the learning of the neural network. In the left panel, we show the intrinsic values obtained by the RL of the lepton sector; in the right panel, we incorporate the values of the quark sector analyzed in Sec. 4. We obtain similar results in the first learning.

Table 3 .
Experimental values of quark masses, mixing angles, and the CP phase in the quark sector [16], where we show the top-quark mass from direct measurements.

Table 4 .
Experimental values for the lepton sector obtained from a global analysis of the data, where $\Delta m^2_{3l} \equiv \Delta m^2_{31}$.

Table 5 .
Benchmark point for the quark sector.

Table 6 .
Benchmark point for the lepton sector with NO, where the neutrino mass ordering is specified in the learning of the network.