Deep learning and k-means clustering in heterotic string vacua with line bundles

We apply deep-learning techniques to the string landscape, in particular to SO(32) heterotic string theory on simply-connected Calabi-Yau threefolds with line bundles. It turns out that three-generation models cluster in particular islands specified by deep autoencoder networks and k-means++ clustering. In particular, we explore the mutual relations between the model parameters and the cluster with the densest three-generation models (called the “3-generation island”). We find that the 3-generation island has a strong correlation with the topological data of the Calabi-Yau threefolds, in particular the second Chern class of their tangent bundle. Our results also predict a large number of Higgs pairs in the 3-generation island.


Introduction
Deep learning is an attractive method not only for image recognition but also for applications to string theory and particle physics. So far, neural networks have been applied to explore the vacuum structure of string theory [1][2][3][4]: for instance, a conjecture for the gauge group rank in F-theory compactifications [5], the identification of fertile islands in the toroidal orbifold landscape [6], the prediction of the Hodge numbers of Calabi-Yau (CY) manifolds [7], the exploration of Type IIA compactifications with intersecting D6-branes [8], of the landscape of Type IIB flux vacua [9] and of $E_8 \times E_8$ heterotic line bundle models [10], and the search for numerical CY metrics [11]. (For more details, see ref. [12].) Utilizing machine learning techniques to explore the Standard Model (SM) vacua of string theory will give new insights into the string landscape.
Among the huge number of CY manifolds, there exists a restricted class of non-simply connected CYs where Wilson lines can be introduced to obtain the SM gauge group. To search for the Minimal Supersymmetric SM (MSSM) on a vast class of CYs, a hypercharge flux breaking scenario is proposed in refs. [13][14][15][16][17][18][19]. Such a direct flux breaking scenario does not require an existence of Wilson lines and is applicable to a vast class of CYs. However, in the usual random scan in the string landscape, it is difficult to reveal the origin of the MSSM-like models from a huge number of compactification parameters. To resolve this issue, we apply the machine learning technique for an exhaustive search of four-dimensional JHEP05(2020)047 (4D) string models based on the direct flux breaking scenario. It opens up a new possibility of searching for the three-generation SM. Especially, we follow the systematic approach proposed in ref. [19] on the basis of SO (32) heterotic string theory with line bundles. The direct flux breaking scenario in SO (32) heterotic string theory is motivated by its S-and T-dual intersecting D6-brane models in Type IIA string theory, where several stacks of D-branes directly lead to the MSSM-like gauge group.
The purpose of this paper is to reveal the nature of SM vacua in SO(32) heterotic line bundle models by employing the neural network technique, in particular the origin of three generations of quarks and leptons. So far, the machine learning techniques have been utilized to reproduce known physical quantities, like topological data and numerical metric of CYs, but our approach attempts to not only reproduce the known physical data, but also find a characteristic feature of the CY compactifications.
In this paper, we adopt a deep autoencoder [20] to find the parameter region leading to SM-like spectra on CY threefolds, in particular Complete Intersection CYs (CICYs) [21,22]. The autoencoder is useful for reducing the higher-dimensional parameter space to a 2D one by extracting characteristic features of the data, although the neural network itself has no knowledge of the SM spectra. To classify the characteristic features obtained from the autoencoder, we employ k-means++ clustering [24]. After applying the autoencoder and the k-means++ clustering to the vacua, we find that three-generation models are clustered in particular islands, in a manner similar to the analysis of the toroidal orbifold models [6], and we call the cluster with the densest three-generation models the "3-generation island". By introducing the Kullback-Leibler (KL) divergence [25], we also find that the 3-generation island is strongly correlated with the topological data of the CY, in particular the second Chern number of the CY threefolds. Our approach enables us to capture the nature of not only the known toroidal orbifold landscape but also a large class of CY compactifications.
This paper is organized as follows. In section 2, we briefly review the phenomenological and theoretical constraints in SO(32) heterotic line bundle models that are needed to implement the deep autoencoder. In section 3, we present the algorithms of the autoencoder and the k-means++ clustering applied to the dataset of heterotic line bundle models. As discussed in section 4, we find that n-generation models are clustered in the 2D space derived by the autoencoder. In particular, we focus on the 3-generation island and extract its phenomenological consequences. Section 5 is devoted to the conclusions and discussions.

Line bundle models
Before applying the deep autoencoder to the dataset of heterotic line bundle models, we briefly review the heterotic line bundle models developed in refs. [15,16,19,26,27,28,29] with an emphasis on the phenomenological and theoretical constraints.

Consistency conditions
We start from the low-energy effective action of SO(32) heterotic string theory on smooth CY manifolds with multiple line bundles. The total internal gauge bundle consists of the multiple internal line bundles $L_a$ with structure group U(1), namely

$$W = \bigoplus_a L_a,$$

where each U(1) is supposed to descend from $U(N) \subset SO(2N) \subset SO(32)$ and the concrete embedding into SO(32) is shown later. In contrast to the standard embedding scenario, such a non-standard embedding has a rich phenomenological and theoretical structure. For instance, the SO(32) gauge group is broken down to the 4D gauge group $G$ rather than $E_6$ because of the non-vanishing background field values of the SO(32) gauge field strength.
The introduction of the internal line bundles, each with structure group U(1), corresponds to the existence of internal $U(1)_a$ gauge fluxes $F_a$, with $\mathrm{tr}(F_a)$ expanded in the basis of the Kähler forms $w_i$ ($i = 1, 2, \cdots, h^{1,1}$) with flux quanta $m_a^i$, where $h^{1,1}$ is the Hodge number of the CY $\mathcal{M}$. Here $T_a$ are the $U(1)_a$ generators and "tr" stands for the trace in the fundamental representation. We remark that the internal gauge fluxes are turned on only in $H^{1,1}(\mathcal{M})$; otherwise supersymmetry is broken by the F-term in the 4D effective action. Even when the fluxes are restricted to $H^{1,1}(\mathcal{M})$, one must check the D-term condition (zero-slope poly-stability condition) for each $U(1)_a$. Since this condition depends on the structure of the Kähler cone and the D-term conditions give rise to Diophantine equations, we do not take the D-term condition into account in the neural network. It is possible to check the D-term conditions for each model after obtaining the three-generation models.

As a consequence of the above gauge fluxes, we obtain not only the SM-like gauge group but also chiral fermions. Before discussing the phenomenological models, we enumerate the theoretical constraints on the gauge fluxes. We will take these constraints into account when we implement the line bundle models in the deep neural networks.
1. Tadpole cancellation condition. Internal gauge fluxes cause gauge and gravitational anomalies in the 4D effective action. To cancel these anomalies, we require the cancellation of the tadpole for the Neveu-Schwarz (NS) sector. Here we expand the second Chern character of $W$ and the second Chern class of the tangent bundle of the CY $\mathcal{M}$ in a basis $\hat{w}^k$ of $H^{2,2}(\mathcal{M})$.

We denote the triple intersection number of the CY by $d_{ijk} = \int_{\mathcal{M}} w_i \wedge w_j \wedge w_k$ and we introduce NS5-branes wrapping holomorphic two-cycles. To ensure the stability of our system, we prohibit the existence of anti-NS5-branes.

2. K-theory condition. Next, we require the existence of spinor bundles on the CY manifolds, corresponding to the requirement of a trivial Stiefel-Whitney class, for $i = 1, 2, \cdots, h^{1,1}$. This condition is related to the K-theory condition developed in the S-dual Type I string theory [30,31]. In some cases, we replace (2.6) with

$$\mathrm{tr}\,T_a\, m_a^i = 0 \qquad (2.7)$$

for simplicity because of the limitation of computing power.
3. $U(1)_Y$ masslessness conditions. A non-trivial gauge background induces 4D Stückelberg couplings between the string axions associated with the Kalb-Ramond B-field and the U(1) gauge bosons through Green-Schwarz terms. Indeed, it causes mass terms for some anomalous U(1)s [15,16] (eq. (2.8)). Hence, when the $U(1)_Y$ gauge boson is defined as a linear combination of multiple U(1)s, this combination has to satisfy the hypercharge masslessness condition (2.10).

In the following, $R_p$ and $C_p$ denote the representations of the 4D gauge group $G$ and of $W$, respectively. From the Hirzebruch-Riemann-Roch theorem, the net number of chiral zero-modes with $U(1)_a$ charges $Y_a$ is counted by the index of the line bundle $L = \otimes_a L_a^{Y_a}$. When we include the contribution from the freely-acting discrete symmetry group $\Gamma$ of the CY, the index is divided by its order $|\Gamma|$ (eq. (2.13)), where $n$ represents the net number of chiral zero-modes, regarded as the number of generations of quarks, leptons/Higgs and exotic particles. To prohibit chiral exotics in the low-energy effective action, we impose $\chi_{\rm exotic} = 0$ for exotic particles. To clarify the difference between 3-generation models and n-generation models in the autoencoder, we search for n-generation models, as discussed in detail in the next section.
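The index counting above can be illustrated with a short calculation. The following is a minimal sketch, assuming that the index takes the standard Hirzebruch-Riemann-Roch form for a line bundle on a CY threefold, $\chi(L) = \frac{1}{6} d_{ijk} m^i m^j m^k + \frac{1}{12} c_{2,i} m^i$ with $m^i = \sum_a Y_a m_a^i$, divided by $|\Gamma|$. The function name `chiral_index` and the data layout are illustrative choices and are not taken from the original analysis.

```python
from fractions import Fraction
from itertools import product


def chiral_index(d, c2, fluxes, charges, gamma_order=1):
    """Index of L = (x)_a L_a^{Y_a} on a CY threefold, divided by |Gamma|.

    d[i][j][k]   : triple intersection numbers d_ijk
    c2[i]        : second Chern numbers c_{2,i}
    fluxes[a][i] : flux quanta m_a^i
    charges[a]   : U(1)_a charges Y_a
    gamma_order  : order |Gamma| of the freely-acting discrete group
    """
    h11 = len(c2)
    # First Chern class of L in the basis w_i: m^i = sum_a Y_a m_a^i
    m = [sum(y * f[i] for y, f in zip(charges, fluxes)) for i in range(h11)]
    cubic = sum(d[i][j][k] * m[i] * m[j] * m[k]
                for i, j, k in product(range(h11), repeat=3))
    linear = sum(c2[i] * m[i] for i in range(h11))
    # Exact rational arithmetic avoids rounding issues in the 1/6 and 1/12 terms.
    return (Fraction(cubic, 6) + Fraction(linear, 12)) / gamma_order


# Check on the quintic threefold (h^{1,1} = 1, d_111 = 5, c_{2,1} = 50)
# with a single line bundle of unit flux and unit charge.
quintic_d = [[[5]]]
quintic_c2 = [50]
print(chiral_index(quintic_d, quintic_c2, fluxes=[[1]], charges=[1]))  # 5
```

For the quintic with unit flux the sketch returns 5, the familiar number of sections of the hyperplane bundle, which serves as a sanity check of the normalization.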

n-generation models
For definiteness, we focus on the decomposition of SO(32) by the existence of multiple line bundles following ref. [19], where the internal gauge fluxes are inserted in $U(1)_a$ generators, for instance $T_1 = \mathrm{diag}(1, 1, 1, 0, 0, 0, 0, 0, 0, \cdots, 0)$, in the basis of the Cartan directions $H_r$ of SO(32) with $r = 1, 2, \cdots, 16$, and the SO(32) roots are chosen as $(\underline{\pm 1, \pm 1, 0, \cdots, 0})$ under $H_r$. Here, the underline represents the possible permutations. Note that $U(1)_Y$ is a linear combination of the above five $U(1)_a$. The subscript indices correspond to the $U(1)_a$ charges $Y_a$. On the other hand, the adjoint representation of SO(16) includes the candidates for SM particles. We then define the quarks and leptons/Higgs in $\{q, l, u^c, e^c\}$ such that they have the proper hypercharge, where $Y_a(\phi)$ stands for the $U(1)_a$ charge of the field $\phi$. When the particles in $\{q, l, u^c, e^c\}$ satisfy these conditions, they are identified with the quarks and leptons/Higgs; otherwise they do not belong to the SM spectrum and are regarded as exotic particles. From the viewpoint of the representations of the SM gauge group, we cannot distinguish between the Higgsino fields and the charged leptons, but they become distinguishable once we specify the SO(32) gauge invariant Yukawa couplings among the elementary particles.

In the neural network implemented in the next section, we impose the phenomenological constraints of eq. (2.21) in addition to the consistency conditions in section 2.1, where each $n_*$ is evaluated by employing eq. (2.13).

Classification methods for SO(32) line bundle models
In this section, we describe in detail how we apply the machine learning technique to the heterotic line bundle models satisfying the conditions discussed in the previous section. In particular, we restrict ourselves to CICY threefolds with Hodge number $h^{1,1} \leq 5$. There exist 5, 36, 155, 425 and 856 CICY threefolds with $h^{1,1} = 1, 2, 3, 4$ and 5, respectively. Before going into the detailed description, we outline the processing flow in the following four key steps:
1. Make a dataset of n-generation models on a large number of CICYs, satisfying the conditions in section 2.
2. Reduce the dimension of the input parameters to 2D charts by the autoencoder.
3. Classify the model data based on the results of the dimension reduction using the k-means++ algorithm. Then, calculate the percentage of 3-generation models in each cluster in the 2D space and identify the "3-generation island".
4. Find the differences between the 3-generation island and the other regions. These correspond to the features of 3-generation models.
In the following subsections, we give a detailed description of each step.

Collect data
The first step of our method is to obtain a dataset of line bundle models satisfying the constraints discussed in section 2. Solving the constraints is mathematically difficult because they are Diophantine equations, i.e. equations in many integer variables. It has been proved that there is no general algorithm for solving such equations, even though all of the constraints are polynomial [33]. We therefore adopt a brute force approach, as detailed in algorithm 1, in which we employ the following simplification [19]. Since there is no summation over $i$ in the K-theory condition (2.7) and the hypercharge masslessness condition (2.10),

both can be rewritten in a simplified form. When the K-theory condition is given by (2.6), only the hypercharge masslessness condition simplifies in this way. We carry out the above brute force approach three times, named search (I), (II) and (III).

    for $R^1_{\max}$ times
13: Construct $m_a^i$ from $\{\mu\}$ at random.
14: Find models satisfying (2.5) and (2.13).
15: else
16: for all possible patterns of constructing $m_a^i$ from $\{\mu\}$
17: Find models satisfying (2.5) and (2.13).
18: As a result, we obtain n-generation models satisfying the phenomenological and theoretical conditions.
19: However, the obtained models typically have 0 generations of quarks and leptons. Hence, let us extract the $f_a$ leading to nonzero-generation models from the list of possible $f_a$.
20: Repeat step 3 to step 18 for the specific $f_a$, replacing the number of random attacks $R^1_{\max}$ by $R^2_{\max}$.
21: Finally we obtain many $n \neq 0$-generation models.
Algorithm 1. Brute force search.
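The benefit of a condition without a sum over $i$ can be illustrated with a toy enumeration. The sketch below is not the authors' algorithm 1: it only assumes a linear condition of the type (2.7), $\sum_a \mathrm{tr}(T_a)\, m_a^i = 0$ for each $i$ separately, and a hypothetical cutoff `bound` on the flux quanta. The function name and the trace values in the usage line are illustrative assumptions.

```python
from itertools import product


def brute_force_fluxes(traces, h11, bound):
    """Enumerate integer flux quanta m_a^i with |m_a^i| <= bound that satisfy
    sum_a traces[a] * m_a^i = 0 separately for every i (no sum over i)."""
    n_u1 = len(traces)
    values = range(-bound, bound + 1)
    # Because the condition does not mix different i, it is solved once for a
    # single column (m_1, ..., m_A); full solutions are products of columns.
    columns = [col for col in product(values, repeat=n_u1)
               if sum(t * m for t, m in zip(traces, col)) == 0]
    for choice in product(columns, repeat=h11):
        # Transpose so that the result is indexed as fluxes[a][i].
        yield [[choice[i][a] for i in range(h11)] for a in range(n_u1)]


# Hypothetical example: two U(1)s with equal traces on a CY with h^{1,1} = 2.
solutions = list(brute_force_fluxes(traces=[1, 1], h11=2, bound=1))
print(len(solutions))
```

Solving the condition per column reduces the scan from $(2\,\mathrm{bound}+1)^{A h^{1,1}}$ candidates to $(2\,\mathrm{bound}+1)^{A}$ candidates followed by a product over columns; the remaining nonlinear conditions would then be checked on this reduced list.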

Autoencoder
After collecting the data of n-generation line bundle models, we apply the autoencoder, which is a kind of multi-layer perceptron (MLP). The advantage of the autoencoder is that it reduces the higher-dimensional parameter space of the input data to compressed data in 2D charts and, at the same time, extracts characteristic features of the data without being given any information on how to extract them. The fundamental component of the MLP is called a perceptron, which transforms an $N_0$-dimensional vector $x_0$ into a number $x_1$,

$$x_1 = h(w \cdot x_0 + b), \qquad (3.5)$$

where $h$ is a generally non-linear function, typically chosen as a sigmoid or ReLU function, $w$ is an $N_0$-dimensional vector called the weight and the number $b$ represents a bias. A layer consists of multiple perceptrons. When the layer consists of $N_1$ perceptrons, they transform an $N_0$-dimensional vector into an $N_1$-dimensional vector, and the weight and bias become an $N_1 \times N_0$ matrix and an $N_1$-dimensional vector, respectively. Let us suppose that the input data is an $N_0$-dimensional vector and the output is an $N_M$-dimensional vector. Then, the n-th layer in the MLP has the $N_{n-1}$-dimensional input vector $x_{n-1} = (x_{n-1,1}, \cdots, x_{n-1,N_{n-1}})$ and the $N_n$-dimensional output vector $x_n = (x_{n,1}, \cdots, x_{n,N_n})$. These two are related by

$$x_{n,i} = h_n(w^n_{ij} x_{n-1,j} + b^n_i), \qquad (3.6)$$

where the weight $w^n$ and the bias $b^n$ are described by an $N_n \times N_{n-1}$ matrix and $N_n$ components, respectively. In this way, the MLP is constructed by connecting $M$ layers in series, as drawn in figure 1. In the context of the MLP, learning corresponds to tuning the weights and biases to minimize an error function $E(x_M, y)$ representing the difference between the training data $y$ and the outputs $x_M$ of the MLP. The autoencoder consists of an MLP with $N_0 > N_1 > \cdots > N_b$ and $N_b < N_{b+1} < \cdots < N_M = N_0$ ($b = (M-1)/2$) as shown in figure 1, in which the first and latter halves are called the encoder and decoder, respectively.
Here, we take the training data to be the inputs $x_0$ themselves and train the network such that the outputs resemble the inputs as closely as possible. After learning, $x_b$ has a lower dimension than $x_0$, but it is still possible to reconstruct $x_M \simeq x_0$. This indicates that all information (features) of the inputs is compressed into the outputs of the b-th layer.
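The layer map (3.6) and the encoder/decoder structure can be made concrete with a small forward pass. This is a minimal sketch in plain Python, not the TensorFlow model used in the analysis: the 3 → 2 → 3 architecture and all weight values below are hypothetical and untrained, chosen only to show how the bottleneck output $x_b$ and the reconstruction $x_M$ are obtained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def layer(x_prev, weight, bias, activation):
    """One layer: x_{n,i} = h_n(sum_j w^n_{ij} x_{n-1,j} + b^n_i)."""
    return [activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x_prev)) + b_i)
            for row, b_i in zip(weight, bias)]


def mlp(x0, layers):
    """Apply the layers in series; each entry is (weight, bias, activation)."""
    x = x0
    for weight, bias, activation in layers:
        x = layer(x, weight, bias, activation)
    return x


# Hypothetical toy autoencoder: sigmoid encoder (3 -> 2), identity decoder (2 -> 3).
encoder = [([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], sigmoid)]
decoder = [([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0], identity)]

x0 = [1.0, 0.0, 2.0]
x_b = mlp(x0, encoder)   # compressed 2D representation at the bottleneck
x_M = mlp(x_b, decoder)  # reconstruction with the same dimension as x0
print(len(x_b), len(x_M))
```

Training would adjust the weights and biases so that `x_M` approaches `x0`; only the forward map is shown here.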
Algorithm 2. K-means++ clustering.

We take $N_b = 2$ to compare our results with the heterotic orbifold results [6] and to visualize the results easily. After trial and error, we arrive at 7 layers in the encoder and decoder. The activation functions are chosen as sigmoid functions for $h_1, \cdots, h_b$ and identity maps for $h_0, h_{b+1}, \cdots, h_M$. The learning is performed with the Adam optimizer in TensorFlow [34]. To avoid becoming trapped in a local minimum of the error function, we first decompose the autoencoder into partial autoencoders with three layers $(N_{q-1}, N_q, N_{M+1-q})$ ($q = 1, \cdots, 7$), and after that the whole autoencoder is trained to minimize the error function. The learning is repeated (40000, 20000, 20000, 20000, 18000, 16000, 14000) times for the respective partial autoencoders and 20000 times for the whole autoencoder. Since the two-dimensional scatter plots of the bottleneck layer have a cluster structure, as demonstrated later, we apply a clustering method to the result of the autoencoders.

K-means++ clustering
To classify the compressed information in the bottleneck layer, we adopt the well-known k-means++ method, which has the advantages that the algorithm itself is simple and its computational cost is modest. We employ the KMeans class in scikit-learn for the k-means++ clustering [35]. K-means++ classifies a given dataset $D$ with distance $d(\cdot, \cdot)$ into $N_{cl}$ clusters as explained in algorithm 2. In our case, $D$ corresponds to the encoded vectors $x_b$ and we take the distance to be the Euclidean norm.
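The clustering step can be sketched without external libraries. The analysis itself relies on the KMeans class of scikit-learn; the code below is instead a self-contained reimplementation of the generic technique (k-means++ seeding followed by Lloyd iterations with the Euclidean distance), written for illustration. Function names, the default `seed` and the iteration cap are assumptions, and the seeding assumes that the data contain at least `n_clusters` distinct points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, n_clusters, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to the squared distance to the nearest chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < n_clusters:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, n_clusters, n_iter=100, seed=0):
    """Return (labels, centroids) after Lloyd iterations from a k-means++ start."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, n_clusters, rng)
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [min(range(n_clusters), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy 2D data standing in for the encoded vectors x_b.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
labels, centroids = kmeans(data, n_clusters=2)
print(labels)
```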
To decide an appropriate $N_{cl}$, we employ the so-called elbow method explained in algorithm 3, from which a critical value $N^*_{cl}$ of $N_{cl}$ is estimated. Then, we tune the number of clusters around $N^*_{cl}$ by eye so that the 3-generation structure is extracted clearly.
6: This $N^*_{cl}$ is considered a suitable number of clusters indicated by the elbow method.
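One common way to locate the elbow can be sketched as follows. The criterion used here, the largest discrete second difference of the within-cluster sum of squared errors (SSE) as a function of $N_{cl}$, is an assumption made for illustration and need not coincide with the authors' algorithm 3. The function name and the sample SSE values are hypothetical, and the candidate $N_{cl}$ values are assumed to be equally spaced.

```python
def elbow(n_clusters_list, sse_list):
    """Return the N_cl at which the decrease of the SSE slows down most
    sharply, measured by the discrete second difference of the SSE curve."""
    if len(sse_list) != len(n_clusters_list):
        raise ValueError("both lists must have the same length")
    if len(sse_list) < 3:
        raise ValueError("need at least three (N_cl, SSE) pairs")
    best_index, best_bend = None, float("-inf")
    for i in range(1, len(sse_list) - 1):
        bend = sse_list[i - 1] - 2 * sse_list[i] + sse_list[i + 1]
        if bend > best_bend:
            best_index, best_bend = i, bend
    return n_clusters_list[best_index]


# Hypothetical SSE values: the curve flattens after three clusters.
print(elbow([1, 2, 3, 4, 5], [100.0, 60.0, 20.0, 18.0, 17.0]))  # 3
```

The value returned by such a criterion is only a starting point; as described above, the final number of clusters is tuned around it.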

Statistical analysis
To find the factors that distinguish the 3-generation island from the other regions, we introduce the KL divergence $KL(\rho_1, \rho_2)$, which represents the distance between two distributions $\rho_1$ and $\rho_2$. It is defined by

$$KL(\rho_1, \rho_2) = \sum_m \rho_1(m) \log \frac{\rho_1(m)}{\rho_2(m)},$$

where $\rho(m)$ denotes the probability of taking the value $m$ under $\rho$. Note that $KL(\rho_1, \rho_2) = 0$ for $\rho_1 = \rho_2$. In our case, $\rho_1$ and $\rho_2$ stand for the distributions of an input parameter $X \in (d_{ijk}, m_a^i, c_{2,i}, |\Gamma|)$ in the 3-generation island and in the whole region, respectively. In the following, we define $KL(X) \equiv KL(\rho_1, \rho_2)$. Note that $X$ with small $KL(X)$ does not contribute to the identification of the 3-generation island, whereas $X$ with large $KL(X)$ plays a crucial role in distinguishing between the 3-generation island and the other regions.
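As an illustration of how $KL(X)$ can be evaluated for a discrete parameter, the following sketch builds empirical distributions from samples and applies the standard definition of the KL divergence. The natural logarithm is an assumption, as are the function names and the sample values. Since the island is a subset of the whole dataset, every value occurring in $\rho_1$ also occurs in $\rho_2$, so no division by zero arises in that setting.

```python
import math
from collections import Counter


def empirical_distribution(samples):
    """Relative frequencies of the values appearing in samples."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}


def kl_divergence(rho1, rho2):
    """KL(rho1, rho2) = sum_m rho1(m) * log(rho1(m) / rho2(m)).

    Assumes that every m with rho1(m) > 0 also has rho2(m) > 0."""
    return sum(p * math.log(p / rho2[m]) for m, p in rho1.items() if p > 0)


# Hypothetical values of one parameter X (for example c_{2,i}) in the whole
# dataset and in the subset of models belonging to the island.
all_models = [24, 24, 24, 24, 36, 36, 54, 54]
island_models = [36, 36, 54]

rho_all = empirical_distribution(all_models)
rho_island = empirical_distribution(island_models)
print(kl_divergence(rho_island, rho_all))
```

A parameter whose distribution in the island matches that of the whole dataset gives zero, while a parameter concentrated on a few values in the island gives a large value.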

Results
In this section, we summarize the results of implementing the autoencoder and k-means++ clustering in SO(32) heterotic line bundle models.

First, we discuss the case of search (I) with $h^{1,1} = 3$ and $N_{cl} = 26$ as a concrete example. Figure 2 shows the result of the k-means++ clustering of $x_b$ at the bottleneck layer, where the black circles correspond to the centroids of the colored clusters. Figure 3 represents the ratio between $n = 3$ models and $n \neq 0$-generation models in each cluster. The density of 3-generation models in the deep blue region is higher than in the other regions. The cluster located around (0.54, 0.47) in figure 3 has the highest ratio ($\simeq 19.15\%$) among the 26 clusters, and this cluster is therefore identified with the 3-generation island. Recalling that this 3-generation island contains only 2.08% of all the line bundle models we consider, it is easy to find 3-generation models by focusing on this fertile island. Such a phenomenon is also discussed in the heterotic $Z_6$-II orbifold landscape [6]. It is remarkable that 19 of the 26 clusters do not contain any 3-generation model. In this respect, we argue that our clustering procedure extracts features of $n = 3$ models.
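The identification of the island from the cluster labels can be sketched in a few lines. The function below computes, for each cluster, the fraction of $n = 3$ models among its $n \neq 0$ models and returns the cluster or clusters with the highest fraction. The function name, the input layout and the convention of returning all tied clusters are illustrative assumptions, and at least one $n \neq 0$ model is assumed to be present.

```python
from collections import Counter
from fractions import Fraction


def three_generation_islands(labels, generations):
    """Return (sorted list of cluster labels, ratio) for the cluster(s) with
    the highest fraction of n = 3 models among their n != 0 models."""
    total = Counter()
    three = Counter()
    for label, n in zip(labels, generations):
        if n != 0:
            total[label] += 1
            if n == 3:
                three[label] += 1
    # Exact fractions make the comparison of equal ratios reliable.
    ratios = {label: Fraction(three[label], total[label]) for label in total}
    best = max(ratios.values())
    return sorted(label for label, r in ratios.items() if r == best), best


# Hypothetical cluster labels and generation numbers for seven models.
labels = [0, 0, 0, 1, 1, 1, 1]
generations = [3, 1, 0, 3, 3, 2, 0]
islands, ratio = three_generation_islands(labels, generations)
print(islands, float(ratio))
```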
In table 3, we list $KL(X)$ for all $X$, where $X = d_{ijk}$ are shown only for $i \leq j \leq k$ because of the permutation symmetry of the indices. As mentioned above, $X$ with large $KL(X)$ captures information on the 3-generation island. We find that 3-generation models have a strong correlation with the topological data of the CY rather than with the flux parameters. In particular, characteristic values of $c_{2,i}$, notably 36 or 54, are selected as shown in figures 4, 5 and 6. We conclude that 3-generation models prefer $c_{2,i} \in 18\mathbb{Z}$, although there is no bias in the topological data of the CICYs we employed. Other $d_{ijk}$ are typically 0, 3 or 9. Such specific values also represent a characteristic feature of the 3-generation island owing to their large KL divergences, but they take different values for different $h^{1,1}$ and/or searches.

Other cases
We perform a similar analysis for the other searches. In the case of search (III) with $h^{1,1} = 2$, there are two islands with the same percentage of $n = 3$ models, and we define both of them as the 3-generation island.
From table 4, which summarizes all our searches, particular values of the second Chern numbers of the CICYs are favored in the class of 3-generation models, although there is no bias in the topological data of the CICYs we employed. For the cases with $c_{2,i} = 24$, the KL divergences are smaller than the others, indicating that these cases are not important for the analysis of 3-generation models. For instance, in search (I) with $h^{1,1} = 4$, $KL(c_{2,1}) = 0.026$ and $KL(c_{2,2}) = 0.231$ are suppressed compared with $KL(c_{2,3}) = 0.558$ and $KL(c_{2,4}) = 1.363$. From the above discussion, the favored second Chern numbers in the 3-generation island are given by $c_{2,i} \in 18\mathbb{Z}$. We then conclude that 3-generation models have a strong correlation with $c_{2,i} \in 18\mathbb{Z}$.
Here we comment on whether the 3-generation island can be reconstructed by the decoder in the (at most) 161-dimensional parameter space. The region obtained by feeding points of the 3-generation island to the decoder is a two-dimensional plane embedded in the 161-dimensional parameter space because the activation functions of our decoder are identity maps, whereas the relations among the parameters in the 161-dimensional space are nonlinear. Hence the decoder is not suitable for specifying the 3-generation island in the 161-dimensional space.
Let us also comment on the geometrical interpretation of this second Chern number of the CICY. The instanton number of the tangent bundle $T\mathcal{M}$ of the CY with the curvature two-form $R$ is given by eq. (4.1), where we employ the self-dual condition of the curvature two-form,

$$* R = \Omega \wedge R \qquad (4.2)$$

with $\Omega$ satisfying the condition $d\Omega = 0$ [36]. Recalling that the Kähler form of CY manifolds is closed, it is possible to take $\Omega = w_i$. It is interesting to ask why these specific instanton numbers are favored in the class of 3-generation models. We hope to report on this relationship in the future.

Table 4. Number of clusters $N_{cl}$, number of $n \neq 0$ models and percentage of $n = 3$ models in the 3-generation island. The favored $c_{2,i}$ in the 3-generation island are listed.

Number of generations of Higgs
In this section, we count the number of generations of Higgs (Higgsino) fields by applying the analysis of the previous subsection. Note that we only take into account the Higgs pairs which are vector-like under the SM gauge group but chiral with respect to the other extra U(1)s. For definiteness, we restrict ourselves to the cases with $n_\phi \geq 0$ ($\forall \phi \in q \cup l \cup u^c \cup e^c$) and define the Higgs doublets from $l$ by checking the Yukawa couplings of quarks and leptons. In the obtained models, the generation number of up-type Higgs $n_{H_u}$ is the same as that of down-type Higgs $n_{H_d}$. The condition $n_\phi \geq 0$ is so tight that only 3 cases (searches (I), (II) and (III) with $h^{1,1} = 3$) can be analyzed in our numerical analysis. Figures 7, 8 and 9 show histograms of the number of Higgs pairs $n_H$ in the 3-generation island. From these figures, we find that the values of $n_H$ listed in table 5 (except for $n_H = 0$) are favored in the 3-generation island. Although there are not so many models in our limited search, it turns out that models with one Higgs pair are disfavored. We expect that the existence of a large number of Higgs pairs is a generic property of heterotic string vacua.

Conclusions and discussions
In this paper, we applied deep autoencoders and k-means++ clustering to the string landscape by employing the topological data of CY threefolds and internal gauge fluxes as input data. In particular, we investigated SO(32) heterotic string vacua on smooth CICY threefolds with line bundles, taking into account the phenomenological and theoretical consistency conditions. After training the autoencoder on input data with at most 161 components, which satisfy the consistency conditions and reproduce n generations of quarks and leptons without chiral exotics, we drew a 2D chart of the landscape of n-generation models by utilizing the k-means++ algorithm. It turned out that 3-generation models cluster in particular islands in the 2D chart, and we called the cluster with the densest three-generation models the "3-generation island". Such a structure has also been pointed out in the Mini-Landscape of heterotic $Z_6$-II orbifold models [6]. We expect that the presence of the 3-generation island will be a universal phenomenon in the string landscape, including $E_8 \times E_8$ heterotic line bundle models as well as intersecting/magnetized D-brane models. By estimating the KL divergences of the model parameters, we found that the 3-generation island has a strong correlation with the topological data of the CY threefolds, in particular the second Chern class of the CY threefolds, namely $c_2(T\mathcal{M}) \in H^{2,2}(\mathcal{M}, 18\mathbb{Z})$, although there is no bias in the second Chern class of the CICYs we employed. This indicates that the second Chern numbers of CYs provide a guideline for obtaining 3-generation MSSM-like models. We leave the clarification of the underlying reason for future work. It is interesting to apply our analysis to other regions of the string landscape and check the values of the second Chern numbers of CYs for 3-generation models.
We also counted the number of Higgs pairs which are vector-like under the SM gauge group but chiral with respect to the other extra U(1)s. Our results show that the 3-generation island contains a large number of Higgs pairs. This motivates the study of the phenomenology of multi-Higgs models discussed in the bottom-up approach. Finally, we comment on possible applications of our method to other regions of the string landscape. It is straightforward to extend our analysis to $E_8 \times E_8$ heterotic line bundle models by changing the gauge group decomposition. For D-brane models, it is enough to add input data such as the positions of the D-branes, the amount of magnetic flux (in Type IIB magnetized D-brane models) and the intersection angles (in Type IIA intersecting D-brane models), taking into account the proper tadpole cancellation conditions. We hope to report on this interesting work in the future.