Learning non-Higgsable gauge groups in 4D F-theory

We apply machine learning techniques to solve a specific classification problem in 4D F-theory. For a divisor D on a given complex threefold base, we want to read off the non-Higgsable gauge group on it using local geometric information near D. The input features are the triple intersection numbers among divisors near D and the output label is the non-Higgsable gauge group. We use decision trees to solve this problem and achieve 85%-98% out-of-sample accuracies for different classes of divisors, where the data sets are generated from toric threefold bases without (4,6) curves. We have explicitly generated a large number of analytic rules directly from the decision trees and proved a small number of them. As a crosscheck, we applied these decision trees to bases with (4,6) curves as well and achieved high accuracies. Additionally, we have trained a decision tree to distinguish toric (4,6) curves. Finally, we present an application of these analytic rules: the construction of local base configurations with interesting gauge groups such as SU(3).


Introduction
The existence of multiple vacuum solutions is a central feature of the string/M-theory paradigm of quantum gravity. This ensemble of string vacuum solutions is commonly denoted as the "landscape of string vacua". Specifically, one can choose a particular regime of string theory (such as type IIB, heterotic or M-theory) and a class of geometries to probe a part of the landscape.
In particular, F-theory [1][2][3] provides a geometric framework describing the largest finite number of string vacua to date. In this geometric description of strongly coupled type IIB superstring theory, we compactify on an elliptically fibered Calabi-Yau manifold X of (d+1) complex dimensions to get a low energy theory in the Minkowski space R^{9-2d,1}. The base manifold B of this elliptic fibration has d complex dimensions and is not Calabi-Yau. Hence F-theory can also be thought of as a compactification of IIB string theory on a non-Ricci-flat space B, where the non-zero curvature is balanced by the inclusion of 7-branes.
The classification of the F-theory landscape has three layers, the first of which is to classify all the d-dimensional base manifolds B up to isomorphism. For d = 2, the base surfaces have been almost completely classified [4][5][6][7]. For d = 3, there are partial classification and probing results for the subset of toric threefold bases [8][9][10][11], but we do not yet have a global picture of non-toric and non-rational threefold bases.
As one can see, the classification and characterization of the base manifolds is the foundation of this program. In this paper, we will mostly consider the generic fibration X_gen over B. For most base manifolds, it turns out that X_gen has singularities corresponding to stacks of 7-branes carrying a non-Abelian gauge group G_gen. For any other elliptic fibration X over B, the gauge group G always contains G_gen as a subgroup. Hence G_gen is minimal among all the elliptic fibrations over B and is called the non-Higgsable gauge group [12,13]; it provides a physical characterization of the base manifold B.
In 6D F-theory, the base B is a complex surface and the non-Higgsable gauge groups are carried by complex curves on B. Such curves form "non-Higgsable clusters" and they are well understood [12]. For example, a curve C with self-intersection (-3) always carries a non-Higgsable SU(3) gauge group if it is not connected to any other curve with self-intersection (-2) or lower. These non-Higgsable clusters are fundamental building blocks of the classification of compact 2D bases [4][5][6][7] and of the non-compact bases giving rise to 6D (1,0) SCFTs [27][28][29].
In 4D F-theory, the base is a complex threefold and the non-Higgsable gauge groups are located on complex surfaces (divisors). The triple intersection structure among divisors on a complex threefold is highly involved, and the topology of non-Higgsable clusters seems to be arbitrary [9,13]. In figure 1, we show a typical non-Higgsable cluster found in [9].
Figure 1. A typical non-Higgsable cluster with SU(2), SU(3) and G_2 gauge groups found on a generic base in the random walk approach [9].
In fact, there does not even exist a dictionary between the local geometric information on the threefold base B and the non-Higgsable gauge groups.
Nonetheless, we have generated a large number of compact toric threefolds with various non-Higgsable gauge groups, which can be easily computed with toric geometry techniques [9,11]. Thus we take a different approach: start from the geometric data generated by Monte Carlo methods and try to find patterns and rules in it. Because of the large volume of data and the potentially complex patterns, it is natural to use the recently flourishing machine learning techniques to simplify the task. Various machine learning tools have been applied to data sets in string theory [30][31][32][33][34][35]. A common feature of data from the string geometric landscape is that it mostly consists of tuples of integers with no error. Hence it provides a brand new playground for data science.
In this paper, we formulate a classification problem on the data and apply supervised machine learning techniques. Given the local triple intersection numbers near a divisor D as the input vector (the features), we train a classifier to predict the non-Higgsable gauge group on D (the label). There are 10 classes corresponding to the 10 possible non-Higgsable gauge groups: ∅, SU(2), SU(3), G_2, SO(7), SO(8), F_4, E_6, E_7 and E_8. We train the classifier with the set of divisors D on the toric threefold bases we have generated. Based on accuracy and model interpretability, we find that decision trees provide the best performance for this problem. Hence we mostly use decision trees, and generate a number of analytic rules which take the form of inequalities on the features.
To specify the features, we need to pick a set of local triple intersection numbers near a divisor D. Although one may expect the accuracy to increase as more features are included, the decision tree structure and analytic rules then become more complicated. In this work, we only use the triple intersection numbers among the divisor D and its neighboring divisors D_1, ..., D_n. On a toric threefold, the number of neighbors of D is n = h^{1,1}(D) + 2. Hence the input vector has different dimensions for different h^{1,1}(D), and we need to train a separate decision tree for each class of divisors with a given h^{1,1}(D).
To choose the data set, we need to use the notion of "resolvable bases" and "good bases" introduced in [11]. The resolvable bases have toric (4,6) curves, which give rise to a strongly coupled sector in the 4D low energy effective theory, while the good bases do not have such (4,6) curves. We choose our training data set to be the toric divisors on two classes of good bases, generated by the random walk approach [9] and the random blow-up approach [11] respectively. The out-of-sample accuracies we have achieved range from 85% to 98% depending on h^{1,1}(D), which is remarkably high considering that the task is a multiclass prediction problem. Since the decision trees typically have O(10^3 ~ 10^4) leaves, the number of analytic rules the algorithm generates is too large to present in full. To circumvent this, we have selected a number of rules which apply to the largest number of samples or have small depth in the decision tree.
The decision trees are tested on a set of resolvable bases as well. It turns out that the accuracies are usually slightly lower than those on the good bases. Nonetheless, for the Hirzebruch surfaces D = F_n, the accuracy is 98.04%, which is even higher than the out-of-sample accuracy on the good bases. This shows that the decision trees and analytic rules generated from the good bases apply to resolvable bases as well.
The structure of this paper is as follows. In section 2, we introduce the fundamentals of toric threefolds and a useful diagrammatic representation of triple intersection numbers. In section 3, we review the basic setup of 4D F-theory and the non-Higgsable gauge groups. In section 4, we show how the data sets for machine learning are generated: first we clarify a subtlety involving codimension-two (4,6) singularities and review the definition of resolvable and good bases from [11], then we show how to generate the good toric threefold bases and construct the input vectors (features) from them. In section 5, we briefly review the basic definitions of machine learning and the methods used in this paper. In section 6, we present the universal machine learning framework that is applied to various data sets in sections 7 and 8. Section 7 focuses on the classification of non-Higgsable gauge groups on divisors with different Picard ranks, and we explicitly list a number of analytic rules extracted from the decision trees. Section 8 focuses on distinguishing toric (4,6) curves. We then discuss two potential applications of the decision trees trained in section 7: applying them to the resolvable bases and using the analytic rules in reverse to construct local configurations. Finally, we summarize the results and discuss future directions in section 10.

Geometry of toric threefolds
Toric threefolds are the central geometric objects in this paper. A basic introduction to toric variety can be found in [36,37]. In this paper, we always assume that the toric threefold is smooth and compact, unless otherwise indicated.
A toric threefold B is characterized by a simplicial fan Σ with a set of rays Σ(1) and a set of 3D cones Σ(3). The intersections of the cones σ ∈ Σ(3) form the set of 2D cones Σ(2) in the fan. From the compactness and smoothness conditions, the 3D cones span the whole Z^3 and each of them has unit volume. Geometrically, the 1D rays v_i correspond to the toric divisors D_i, which generate the effective cone of B. The 2D cones v_i v_j correspond to the toric curves D_i D_j, which generate the Mori cone of B. The 3D cones v_i v_j v_k are the intersection points of the three toric divisors D_i, D_j and D_k.
In terms of the local coordinates z_i (i = 1, ..., n) on B, the toric divisors D_i are given by the hypersurface equations z_i = 0. An important fact is that a global holomorphic section of a line bundle O(Σ_i a_i D_i) can be easily written out as a linear combination of monomials:
s = Σ_{u∈L} c_u Π_{i=1}^n z_i^{⟨u,v_i⟩+a_i}. (2.6)
Here L is the lattice polytope defined by
L = {u ∈ Z^3 | ⟨u, v_i⟩ ≥ -a_i, ∀i = 1, ..., n},
and c_u is an arbitrary complex number. In contrast, we do not have an analogous expression for non-toric threefolds.
A very important class of line bundles on B consists of the multiples of the anticanonical line bundle, -mK_B (m ∈ Z^+), where
-K_B = Σ_i D_i
on a toric variety. On a toric threefold B, there are three linear relations among the toric divisors D_i:
Σ_i ⟨u, v_i⟩ D_i ~ 0, u = (1,0,0), (0,1,0), (0,0,1). (2.9)
Now we can compute the triple intersection numbers among the divisors using the information of the rays and 3D cones. First, the following equation holds for smooth toric threefolds: for distinct i, j, k,
D_i D_j D_k = 1 if v_i v_j v_k ∈ Σ(3), and D_i D_j D_k = 0 otherwise. (2.10)
The other triple intersection numbers can be computed using the linear relations (2.9). If v_i v_j is not a 2D cone, then all triple intersection numbers involving both D_i and D_j vanish. Otherwise, suppose that the two 3D cones sharing the 2D cone v_i v_j are v_i v_j v_a and v_i v_j v_b; then from (2.9) and (2.10) we can solve for D_i^2 D_j and D_i D_j^2. Finally, with all the triple intersection numbers of the form D_i^2 D_j at hand, we can solve for D_i^3 using the data D_i^2 D_j for all the neighbors of v_i, cf. (2.14). In the end, we are always able to solve for all the triple intersection numbers on B uniquely, and they are all integers.
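The first step of this computation, equation (2.10), is simple to sketch in code; the helper name and the P^3 fan data below are our own illustrative choices, not from the paper.

```python
def triple_int_distinct(cones3d, i, j, k):
    """For distinct toric divisors D_i, D_j, D_k on a smooth compact toric
    threefold: D_i.D_j.D_k = 1 if v_i v_j v_k spans a 3D cone, else 0."""
    cone_set = {frozenset(c) for c in cones3d}
    return 1 if frozenset((i, j, k)) in cone_set else 0

# Fan of P^3: four rays (indexed 0..3), and every triple of rays spans a 3D cone.
cones = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
assert triple_int_distinct(cones, 0, 1, 2) == 1
```

The remaining intersection numbers D_i^2 D_j, D_i D_j^2 and D_i^3 then follow from the linear relations (2.9), as described above.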
Next we introduce a diagrammatic way to present the triple intersection numbers of a toric variety, illustrated in figure 2. On each vertex v_i, we label the triple self-intersection number D_i^3. On each edge v_i v_j, we label the triple intersection numbers D_i^2 D_j and D_i D_j^2, where D_i^2 D_j lies closer to the vertex v_i. We do not need to label the D_i D_j D_k (i ≠ j ≠ k) since they are straightforward to read off from the triangulation structure and (2.10).
These numbers on the edges encode the geometric information of the toric divisors explicitly. For example, for the divisor D_1 in figure 4, the numbers in red squares are D_1 D_j^2 (j = 2, 3, 4, 5), which are the self-intersection numbers of the curves C_{1j} = D_1 D_j on the surface D_1. Hence we can directly read off that the divisor D_1 is a Hirzebruch surface F_n, since the self-intersections of the curves C_{1j} are (0, n, 0, -n). The numbers in the blue squares are a_j = D_1^2 D_j (j = 2, 3, 4, 5), which are the intersection numbers between the normal bundle N_{D_1} and the curves C_{1j}. With the intersection form on the surface D_1, we can use the a_j to solve for the normal bundle N_{D_1}; see section 4.3 for more detail.
The triple intersection numbers in the diagram are not entirely independent. In fact, the triple self-intersection number D^3 of a divisor D is uniquely fixed by the triple intersection numbers D^2 D_i and D D_i^2, where D_i runs over the toric divisors intersecting D. We present a derivation of D^3 in appendix A for small values of h^{1,1}(D).
The change in the triple intersection numbers after a blow-up can also be easily computed. If we blow up a 3D cone v_1 v_2 v_3 corresponding to a point D_1 D_2 D_3, we get a new divisor class: the exceptional divisor E. The divisors D_1, D_2 and D_3 are replaced by their proper transforms D̃_i = D_i - E. Along with the fact that D̃_i D̃_j E = 0 for all i ≠ j ∈ {1, 2, 3}, which follows from the fact that the proper transform of the curve D_i D_j does not intersect E, we can solve for all the triple intersection numbers. We show the change in the triple intersection numbers after the blow-up in figure 5. The exceptional divisor is a P^2 with normal bundle N_E = -D̃_i|_E = -H, where H is the hyperplane class on P^2.
We can perform a similar analysis for the case of blowing up a curve D_1 D_2; the change in the triple intersection numbers is shown in figure 6. As we can see, the exceptional divisor is F_n in this case, where n = |D_1 D_2 (D_1 - D_2)|.

F-theory on toric threefold bases and the non-Higgsable gauge groups
An introduction to F-theory can be found in [38], and we will only present the essential information for our setup.
In this paper, to get a 4D N = 1 effective field theory, we always consider an elliptically fibered Calabi-Yau fourfold X over the base manifold B with a global section. It is described by the Weierstrass equation
y^2 = x^3 + f x + g,
where f and g are sections of the line bundles -4K_B and -6K_B respectively; for such a model to exist, -4K_B and -6K_B need to be effective. When the discriminant ∆ = 4f^3 + 27g^2 vanishes over a subset L of B, the elliptic fiber is singular over L. If L is complex codimension-one, 7-branes are located on L and the attached open string modes give rise to gauge fields in the 4D low energy effective theory.
Hereafter we assume that f and g are generic sections, such that the order of vanishing of ∆ over the codimension-one locus L is minimal. Under this condition, the gauge groups from the codimension-one locus L are the minimal non-Higgsable gauge groups [12,13]. We list the possible non-Higgsable gauge groups with the order of vanishing of (f, g, ∆) in table 1.
For the fiber types IV, I*_0 and IV*, the gauge group is specified by additional information encoded in the "monodromy cover polynomial" µ(ψ) [17]. For these Kodaira types, the orders of vanishing ord(f), ord(g), ord(∆) listed in table 1 do not uniquely determine the gauge group; one needs additional monodromy information in the Weierstrass polynomials to fix the precise gauge group. Suppose that the divisor is given by a local equation w = 0. When (f, g, ∆) vanishes to order (4,6,12) or higher on a codimension-one locus, the geometry does not describe any supersymmetric vacua.
Expanding f = Σ_k f_k w^k and g = Σ_k g_k w^k near the divisor {w = 0}, the monodromy cover polynomial for the case of type IV is
µ(ψ) = ψ^2 - g_2.
The gauge group is SU(3) if and only if g_2 is a perfect square; otherwise it is SU(2). The case of type IV* is similar, where the monodromy cover polynomial is
µ(ψ) = ψ^2 - g_4.
When g_4 is a perfect square, the corresponding gauge group is E_6; otherwise it is F_4. For the case of type I*_0, the monodromy cover polynomial is
µ(ψ) = ψ^3 + f_2 ψ + g_3.
When µ(ψ) can be decomposed into three linear factors, the corresponding gauge group is SO(8). Otherwise, if it can be decomposed into two factors (a linear factor times an irreducible quadratic), the gauge group is SO(7). If µ(ψ) is irreducible, then the gauge group is G_2. On a general threefold base, suppose that the divisor D is given by the hypersurface equation w = 0 locally, and expand f and g as above. Then the line bundles containing the f_k and g_k can be written down using the normal bundle N_D and the canonical line bundle K_D of D [13]; see (3.9) and (3.10). Here O(·) denotes the holomorphic sections of a line bundle on the complex surface D; φ_i and γ_i denote the orders of vanishing of f and g on another divisor D_i which intersects D, and C_i = D D_i is the intersection of D and D_i, which has the topology of P^1. If f_{k,D} ∈ O(C_k) where C_k is not an effective divisor on D, for all k < k_0, then f vanishes to at least order k_0 on D. A similar statement holds for g. The limitation of this formula is that there may be non-local constraints that are not encoded in the neighboring divisors of D. In reality, the orders of vanishing of (f, g) may be higher than the values given by (3.9, 3.10). Hence we can only read off a subgroup of the actual non-Higgsable gauge group on D this way.
For toric threefold bases, the exact forms of f and g can be easily computed with the holomorphic section formula (2.6). The sets of monomials in f and g are given by the following lattice polytopes:
F = {u ∈ Z^3 | ⟨u, v_i⟩ ≥ -4, ∀v_i ∈ Σ(1)},
G = {u ∈ Z^3 | ⟨u, v_i⟩ ≥ -6, ∀v_i ∈ Σ(1)}, (3.12)
where the v_i are the 1D rays in the fan of the toric base. The orders of vanishing of f and g on a toric divisor D corresponding to the ray v ∈ Σ(1) are
ord_D(f) = min_{u∈F} (⟨u, v⟩ + 4), ord_D(g) = min_{u∈G} (⟨u, v⟩ + 6).
Now we are going to present the explicit monodromy criteria distinguishing the gauge groups for the type IV, IV* and I*_0 fibers in table 1. We denote the toric rays of the neighboring divisors of D by v_1, ..., v_p.
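These orders of vanishing are straightforward to evaluate by brute-force lattice enumeration. The sketch below is our own illustration, with a bounding box chosen by hand for the fan at hand and P^3 as a minimal example (on P^3 the generic Weierstrass model is smooth, so all orders vanish):

```python
from itertools import product

# Rays of the fan of P^3 (illustrative example; any smooth compact fan works).
rays = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (-1, -1, -1)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def monomial_lattice(rays, m):
    """Lattice points u with <u, v_i> >= -m for all rays: the polytope F
    for m = 4 (monomials of f) and G for m = 6 (monomials of g).
    The bounding box below is large enough for this fan; in general it
    must be chosen per fan."""
    box = range(-m, 3 * m + 1)
    return [u for u in product(box, repeat=3)
            if all(dot(u, v) >= -m for v in rays)]

def ord_on_divisor(rays, v, m):
    """Order of vanishing of f (m = 4) or g (m = 6) on the divisor of ray v."""
    return min(dot(u, v) + m for u in monomial_lattice(rays, m))

assert ord_on_divisor(rays, (1, 0, 0), 4) == 0   # no non-Higgsable group on P^3
assert ord_on_divisor(rays, (1, 0, 0), 6) == 0
```

On bases with non-Higgsable gauge groups, these minima become positive and can be matched against table 1.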
When ord_D(f) ≥ 2 and ord_D(g) = 2, the singularity type is IV. In this case, when g_2 contains only one monomial u and ⟨u, v_i⟩ is even for all i = 1, ..., p, the gauge group is SU(3). Otherwise the gauge group is SU(2).
When ord_D(f) ≥ 3 and ord_D(g) = 4, the singularity type is IV*. In this case, when g_4 contains only one monomial u and ⟨u, v_i⟩ is even for all i = 1, ..., p, the gauge group is E_6. Otherwise the gauge group is F_4.
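Both toric monodromy criteria above reduce to the same parity check on the single monomial of g_k; a minimal sketch (function name and sample inputs are our own, hypothetical choices):

```python
def monodromy_group(gk_monomials, neighbor_rays, split, nonsplit):
    """Toric monodromy test for type IV (g_2, SU(3) vs SU(2)) and type IV*
    (g_4, E6 vs F4): the split group occurs iff g_k consists of a single
    monomial u with <u, v_i> even for every neighboring ray v_i."""
    if len(gk_monomials) == 1:
        u = gk_monomials[0]
        if all(sum(a * b for a, b in zip(u, v)) % 2 == 0 for v in neighbor_rays):
            return split
    return nonsplit

# Hypothetical monomials/rays just to exercise both branches:
assert monodromy_group([(2, 0, 0)], [(0, 1, 0), (0, 0, 1)], "SU(3)", "SU(2)") == "SU(3)"
assert monodromy_group([(1, 0, 0)], [(1, 0, 1)], "E6", "F4") == "F4"
```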
We will always apply this method to compute the non-Higgsable non-Abelian gauge group on a toric divisor, which serves as the label of the data samples. It is worth pointing out that the determination of the gauge group involves several inequalities. Later we will see that this is naturally reflected in the selection of the machine learning algorithm.
For some particular classes of bases, there exist non-Higgsable Abelian gauge groups arising from the Mordell-Weil group of the elliptic fibration [6,39,40]. However, they do not appear on toric bases [40] and are not considered here.

Resolvable and good bases
In [11], we introduced the terminology of "resolvable bases" and "good bases" depending on the existence of complex codimension-two locus L ⊂ B where (f, g) vanishes to order (4, 6) or higher.
If such codimension-two (4,6) loci exist, then we can try to blow them up and lower the order of vanishing of (f, g) below (4,6). If this blow-up process can be carried out without introducing a codimension-one (4,6) locus, then we call the base B a "resolvable base".
If B is free of codimension-two (4,6) loci, then we call it a "good base". For a toric threefold base, we can write down the orders of vanishing of f and g on a toric curve D_i D_j corresponding to a 2D cone v_i v_j:
ord_{D_i D_j}(f) = min_{u∈F} (⟨u, v_i⟩ + ⟨u, v_j⟩ + 8),
ord_{D_i D_j}(g) = min_{u∈G} (⟨u, v_i⟩ + ⟨u, v_j⟩ + 12).
Hence, if such a curve is a (4,6) curve and we blow it up by adding the new ray ṽ = v_i + v_j, every u ∈ F already satisfies ⟨u, ṽ⟩ ≥ -4 and every u ∈ G satisfies ⟨u, ṽ⟩ ≥ -6, so the sets of Weierstrass monomials will not change.
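A fan can thus be tested for toric (4,6) curves directly from its rays and 2D cones. The self-contained sketch below (helper names and the P^3 example are our own) follows the formulas above:

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lattice(rays, m, box=20):
    """Lattice points of F (m = 4) or G (m = 6): <u, v_i> >= -m for all rays.
    The bounding box must be chosen large enough for the fan at hand."""
    return [u for u in product(range(-box, box + 1), repeat=3)
            if all(dot(u, v) >= -m for v in rays)]

def ord_on_curve(rays, vi, vj, m):
    """ord of f (m = 4) / g (m = 6) on the toric curve of the 2D cone v_i v_j."""
    return min(dot(u, vi) + dot(u, vj) + 2 * m for u in lattice(rays, m))

def is_good(rays, cones2d):
    """'Good' base: no toric curve where (f, g) vanishes to order (4,6) or higher."""
    return not any(ord_on_curve(rays, vi, vj, 4) >= 4 and
                   ord_on_curve(rays, vi, vj, 6) >= 6
                   for vi, vj in cones2d)

# P^3 has no (4,6) curves, so it is a good base.
rays = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (-1, -1, -1)]
cones2d = [(rays[i], rays[j]) for i in range(4) for j in range(i + 1, 4)]
assert is_good(rays, cones2d)
```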
To check whether a base is resolvable or not, one only needs to check whether the origin (0,0,0) lies on the boundary of G. If (0,0,0) does not lie on the boundary of the lattice polytope G, then after the resolution process in which all the (4,6) curves are blown up, there will be no codimension-one (4,6) locus on any divisor. The reason is that if there existed such a divisor corresponding to a ray v, then all points u ∈ G would satisfy ⟨u, v⟩ ≥ 0, and the origin (0,0,0) would lie on the boundary plane ⟨u, v⟩ = 0 of G. Since the polytope G does not change when we blow up (4,6) curves, this condition applies to the original resolvable base as well. Now we clarify the physical differences among the non-resolvable, resolvable and good bases.
• For the non-resolvable bases, they cannot support any elliptic Calabi-Yau manifold with only terminal singularities. For this reason, they do not describe any supersymmetric vacua in F-theory. Hence we never include these bases in the classification program of F-theory geometries.
• For the resolvable bases, there may be a strongly coupled superconformal sector on the codimension-two (4,6) locus.
In the 6D F-theory case, blowing up a codimension-two (4,6) point will give a non-zero v.e.v. to the scalar in the tensor multiplets, and the (1,0) SCFT will be deformed into the tensor branch. On the tensor branch, the low energy theory has a usual gauge theory description. If we shrink the exceptional divisors and go back to the superconformal point, then the gauge groups and matter on the exceptional divisors will become strongly coupled "superconformal matter" [28].
In a 4D N = 1 theory, the tensor multiplets are replaced by a number of chiral multiplets. The situation is more subtle, since instanton effects from Euclidean D3-branes [41] and G_4 flux may break the superconformal symmetry. However, one can generally expect a strongly coupled non-Lagrangian sector if there are (4,6) curves on the base threefold.
• For the good bases, the low energy effective theory should be free of these SCFT sectors, and we have a 4D N = 1 supergravity coupled with a number of vector and chiral multiplets.
In this paper, we generally accept all the resolvable bases and good bases. We will not consider other subtleties such as codimension-three (4,6) points [42,43] or terminal singularities in the Weierstrass model that cannot be resolved [44]. We will generally accept their appearance and leave their physical interpretation to future work.

Generation of toric threefold bases
We use the divisors on the good bases to train the classifier, and the bases are generated by two different methods. The first class of bases is the "end point bases" introduced in [11].
We start with P^3 and randomly blow up toric points or curves with equal probability. During the process, the base may contain toric curves where (f, g) vanishes to order (4, 6) or higher. However, it is always required to be resolvable, or equivalently the polytope G (3.12) should contain the origin (0, 0, 0) in its interior. Finally, we end up at a base without toric (4, 6) curves, which is called an end point base. It is impossible to blow up a toric curve or point on an end point base to get another resolvable base. The end point base may contain toric divisors with E_8 gauge group and non-toric (4, 6) curves on them, but we allow these since we can easily blow up such (4, 6) curves. They are analogous to the -9/-10/-11 curves on 2D bases. In total, we have 2,000 end point bases generated in [11].
The second class of bases is called "intermediate bases", which are distinguished from the end point bases. Our method to generate these intermediate bases is similar to the random walk approach in [9]. We start from P^3 and do a random toric blow-up or blow-down at each step with equal probability. In the whole process, it is required that no toric (4, 6) curve appears, but again toric divisors with E_8 gauge group and non-toric (4, 6) curves on them are allowed. Each random walk sequence from P^3 contains 10,000 bases b_1, b_2, ..., b_10000; however, we only pick out the first 20 bases b_1, b_2, ..., b_20 and the bases b_100n, n ∈ Z^+. The reason is that bases related by a few blow-ups/blow-downs have similar divisor structure, and we want to reduce repetitive samples in our data set. In total, we generate 1,500 of these random walk sequences and take 180,000 bases out of them.
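The subsampling step of a walk sequence can be sketched in a couple of lines (the list of integers below is only a stand-in for the actual bases b_1, ..., b_10000):

```python
def subsample_walk(bases):
    """Keep the first 20 bases and every 100th base of a random-walk
    sequence, to reduce near-duplicate samples (as described in the text)."""
    return bases[:20] + bases[99::100]  # b_1..b_20, then b_100, b_200, ...

walk = list(range(1, 10001))            # stand-in for b_1 ... b_10000
sel = subsample_walk(walk)
assert len(sel) == 120 and 100 in sel and 10000 in sel
```

With 1,500 sequences this yields 1,500 × 120 = 180,000 bases, matching the total quoted above.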
With the toric data, we can classify the toric divisors on these bases according to the h 1,1 (D) (equal to the number of their neighbor divisors minus 2) and compute the local triple intersection numbers and gauge groups.

Generation of the features
Now we can generate the features for the machine learning program, which are the local triple intersection numbers near a divisor D. For a divisor D with Picard rank rk(Pic(D)) ≡ h^{1,1}(D) = p - 2, there are exactly p toric divisors intersecting D. We relabel them as D_1, ..., D_p, and the toric curves on D are C_i = D_i D. The toric curves C_i (i = 1, ..., p) form a cycle, so that the intersection numbers between two different curves are C_i · C_{i±1} = 1 (with indices taken mod p) and C_i · C_j = 0 otherwise. Then we can define a 5p-dimensional feature vector V(D) from the triple intersection numbers near D. Its components are well-defined numbers for each toric divisor D, and they encode the information of the neighboring divisors of D in a subtle way.
These features are not entirely independent. For example, a Hirzebruch surface divisor F_n has h^{1,1}(D) = 2 and four neighboring divisors. The four toric curves on F_n have self-intersection numbers C_1^2 = 0, C_2^2 = n, C_3^2 = 0 and C_4^2 = -n. We label the four neighboring divisors of D by D_1, ..., D_4. We denote the (-n)-curve on F_n by S and the 0-curve by F, which have intersection products S · S = -n, S · F = 1, F · F = 0. Using these relations, we can shorten V(D) to a 15-dimensional vector Ṽ(D), whose components are denoted f_0, ..., f_14.
The other quantities such as D D_3^2, D D_4^2, D^2 D_3, D^2 D_4 and D^3 are all redundant, since there are linear relations among the curves C_1, C_2, C_3, C_4 on F_n; see appendix A.2. We plot the features in figure 7 using the diagrammatic presentation introduced in section 2.
The normal bundle of D can be written as N_D = aS + bF, where the coefficients a and b are determined by the intersection numbers of N_D with the toric curves on D. For divisors with other values of h^{1,1}(D), we will present the shortened input vector Ṽ(D) case by case in section 7.
We will take the set of Hirzebruch surfaces on the end point bases as the sample set in section 6, since the end point bases have similar structure [11] and this provides a better test ground for various machine learning techniques. This data set is denoted as S_end(F_n). In section 7, we provide more detailed results and include both the end point bases and intermediate bases in the training set.

A brief introduction of machine learning
In this section, we give a brief introduction to machine learning. First, we describe the typical setup and procedure of a machine learning problem, including notation, training, and testing. Then we discuss the details and properties of a few of the most commonly used machine learning algorithms: decision trees, feedforward neural networks, logistic regression, random forests and support vector machines (SVM).
There are two major categories of machine learning problems, namely supervised learning and unsupervised learning. In supervised learning, each input is associated with an output, and the objective of the algorithm is to predict the output for a given input. Furthermore, when the output variable is from a set of categories (e.g. "cat", "dog"), it is called a classification problem and the output is referred to as the label; otherwise, when the output takes continuous values, it is called a regression problem. As opposed to supervised learning, in unsupervised learning the input data have no associated outputs, and the algorithm is tasked with grouping the input data into different clusters. In this paper, to fit the tasks described in the previous sections, we focus on supervised learning, and more specifically, classification algorithms.

Training and testing
Following the description above, a machine learning classification problem involves a data set (X, Y), where X is an N × K matrix containing N K-dimensional input variables, and Y contains the N output variables. In general, training a machine learning algorithm can be summarized as the optimization problem of finding
f̃ = argmin_f Σ_{i=1}^N F(f(x_i), y_i), (5.1)
where F is the error function defined on the prediction f(x_i) and the actual output value y_i for every pair (x_i, y_i) ∈ (X, Y). An algorithm is specified by setting both f and F. Note that many machine learning algorithms are semi-/non-parametric, so sometimes f̃ is not a closed-form function but rather a combination of operations. The optimization procedure that determines f̃ is called training or fitting. Consequently, the data used in the optimization, (X, Y), is called the training data, and any other data that is not part of training can be treated as test data. After the algorithm is trained, one can make a prediction on any input data (x′, y′) via
ỹ′ = f̃(x′).
By comparing the prediction ỹ′ and the actual label y′, one can measure the performance of the algorithm. If (x′, y′) ∈ (X, Y), this is called in-sample performance, as the test is done on the training data. Otherwise, if (x′, y′) ∉ (X, Y), the performance is called out-of-sample. In-sample performance indicates the goodness of fit of the algorithm, while out-of-sample performance shows the algorithm's real prediction power. In the case of classification, the accuracy serves as a good performance measure, defined as the fraction of samples whose predicted label matches the actual label.
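The accuracy measure above can be written as a one-liner (our own sketch):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the actual label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Evaluated on training data this gives in-sample accuracy;
# on held-out data, out-of-sample accuracy.
assert accuracy([0, 1, 1, 2], [0, 1, 2, 2]) == 0.75
```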

Classification algorithms
Here we describe the details of the five major classification algorithms applied in this paper. More thorough descriptions can be found in [45]. Decision tree is a prototypical tree-based algorithm which consists of numerous sequential "splits". Each split divides the incoming data set into two non-overlapping subsets by partitioning on one feature. Succeeding splits take the previous split's output data sets as input, so the size of the data sets decreases with each split. Once the stopping criteria are met for an output subset, it is no longer split. We refer to the data sets as nodes if they are further split; otherwise they are referred to as leaves. An illustration is shown in figure 8.
In the case of classification, a split on a data set is a binary division that maximizes the information gain, defined as
IG = I - (N_left/N) I_left - (N_right/N) I_right, (5.2)
where (I, N), (I_left, N_left), (I_right, N_right) are respectively the (impurity, number of samples) of the input set, left output subset and right output subset. We choose the Gini index as the impurity measure, which is one of the most popular choices and is defined as
I = 1 - Σ_i p_i^2, (5.3)
where i runs through all the K classes and p_i is the ratio of the number of samples in class i to the number of all samples in the input set. The algorithm is recursive: it starts with the original data set as the parent node, computes the information gain of each possible split on every feature, and applies the split that gives the highest information gain (5.2). Then the left and right nodes are regarded as parent nodes and the algorithm is repeated on them.
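The two quantities driving the split selection can be computed directly (a minimal sketch, with made-up gauge-group labels as sample data):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2 over the classes present in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """IG = I(parent) - (N_l/N) I(left) - (N_r/N) I(right); at each node the
    split maximizing this over all features and thresholds is applied."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

labels = ["SU(2)", "SU(2)", "G2", "G2"]
assert gini(labels) == 0.5
# A split into two pure halves recovers the full impurity as gain:
assert information_gain(labels, labels[:2], labels[2:]) == 0.5
```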
Since most problems studied in this paper are multiclass problems, i.e. the number of classes is greater than two, it is worth mentioning that some classification algorithms treat multiclass problems differently from binary ones. For some classifiers, multiclass prediction takes longer runtime and more computation power than binary prediction. In the case of decision trees, as shown in (5.2), the classifier is suitable for both binary and multiclass problems.
Random forest is an ensemble algorithm consisting of multiple decision trees. The prediction is given by majority voting/averaging over the individual trees' predictions. Each tree is trained on a different sample, drawn from the original training set with replacement (bootstrap). This randomness reduces potential overfitting of each decision tree, and often results in a significant enhancement of out-of-sample performance.
Logistic regression is the classification analog of linear regression, and is a special case of generalized linear models. In the binary case, logistic regression assigns {0, 1} to the two classes and models the prediction probability for class 1 as
P(y = 1 | x) = 1 / (1 + e^{-w·x}), (5.5)
where x is the feature vector and w is the coefficient vector determined by maximizing the regularized log-likelihood function
Σ_i [y_i log P(y = 1 | x_i) + (1 - y_i) log(1 - P(y = 1 | x_i))] - λ ||w||_p, (5.6)
where λ is the regularization strength and ||w||_p is the L_p norm of w. The regularization strength is a hyperparameter that needs to be optimized by cross-validation. The objective (5.6) is only suited for binary classification. In the case of multiclass prediction, one can either modify (5.6) into multiple output functions, or apply generic multi-classification methods such as "one-vs-rest" (OVR). When there are K labels (K ≥ 3), OVR trains K binary classifiers, where the i-th classifier predicts whether the output is label i or not. We apply OVR for logistic regression in this paper.
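As a toy illustration of OVR prediction with logistic outputs (the weight matrix below is made up, not fitted):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: P(y = 1 | x) = sigmoid(w.x) in the binary model."""
    return 1.0 / (1.0 + np.exp(-z))

# One-vs-rest with K = 3 classes: one weight vector per binary classifier
# (hypothetical numbers); predict the class whose classifier is most confident.
W = np.array([[1.0, -1.0],
              [-1.0, 1.0],
              [0.5, 0.5]])
x = np.array([2.0, 0.0])
probs = sigmoid(W @ x)        # per-class "class i vs rest" probabilities
pred = int(probs.argmax())
assert pred == 0
```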
Support Vector Machine (SVM) is by nature a binary classifier. It is another generalization of linear models. Assigning {−1, 1} to the two classes, SVM makes a prediction based on a generalized linear function
y(x) = w · φ(x) + b, (5.7)
where w, b are determined by best dividing the two classes of samples in the feature space, and the prediction is 1/−1 when y is greater/smaller than 0. φ is a transformation function on the original feature vectors, defined by a kernel function
K(x, x') = φ(x) · φ(x'). (5.8)
For all SVM classifiers considered in this section, we apply the rbf kernel defined as
K(x, x') = e^{−γ ||x − x'||^2}, (5.9)
where γ is a hyperparameter that can be optimized. As SVM is a binary classifier by design, we also apply OVR in our multiclassification problems.
Feedforward Neural Network (FNN) is one of the simplest neural network models. The graphic representation of a neural network is composed of a number of layers, including an input layer corresponding to the features and an output layer corresponding to the labels. The rest are referred to as hidden layers. Each layer contains several neurons and different layers are connected by a certain topology. Mathematically, both neurons and connections between neurons correspond to variables of the prediction functions. A feedforward neural network with two hidden layers has the following functional form:
y_l = O( Σ_k w^{(3)}_{lk} h( Σ_j w^{(2)}_{kj} h( Σ_i w^{(1)}_{ji} x_i ) ) ), (5.11)
where y_l is the prediction for the l-th label, O is the output function, h is the activation function, and w = (w^{(3)}, w^{(2)}, w^{(1)}) are the coefficients determined by training. In addition to (5.11), it is often better to include certain regularizations, which can be non-parametric. Similar to decision trees, an FNN can be designed in a multiclassification setting by choosing a proper output activation function.

Machine learning algorithm comparison and selection
In machine learning, a central question is which algorithm is best for a specific problem. Since many ML algorithms are adaptive, in practice this question is often answered by empirically testing the performance of each algorithm and choosing the best one. In addition, the properties of a problem may call for a particular ML algorithm. For instance, neural networks are usually preferred for image recognition tasks for their ability to handle large data sets in a parallelized fashion. In this section we take the same approach and compare the five ML algorithms introduced above on the data set S end (F n ) introduced in section 4.3.
Besides the prediction performance, we consider the ML algorithms' interpretability as another selection criterion. We show that out of the five algorithms, decision tree provides both high algorithm performance and good model interpretability, indicating that there is an inequality-based pattern in the data set.

Class label imbalance and data resampling
The proportions of samples in each class of a data set are important to ML algorithms' training and evaluation. The disproportion between different classes, commonly referred to as class label imbalance, affects a classifier by making it biased towards the major classes over the minor ones. As a result, the regular classification measures are skewed towards the major class label. For instance, a useless binary classifier that predicts only the major label can give 95% accuracy if 95% of the data is of the major label, although it does not provide any insight. In the case of extremely unbalanced data sets, the minority classes are usually of higher interest. So it is important to analyze the imbalance before training a classifier and apply relevant techniques to deal with it. In our data sets, there tends to be a strong class label imbalance, as gauge groups such as SU(3) have considerably fewer samples than others (yet they are of particular significance). For instance, the numbers of samples and the class label imbalance of all gauge groups in the data set S end (F n ) are shown in table 2. It is evident that the percentages of different gauge groups in the entire data set are extremely disproportional. In order to train a classifier properly on an unbalanced data set, there are two common approaches: (1) resampling, which can be achieved by duplicating the minor class data and/or down-sampling the major class data; and (2) adding class weights to the training samples. One can define a class's weight as the normalized inverse of the class's sample percentage. When (2) is applied, it is important to note that the predictions will have a similar imbalance and using plain accuracy will still be biased. In this case, one can apply class label weighted accuracy instead of manipulating the data set. This measure is simply the original accuracy weighted by the rarity of each sample's class label.
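The class-label weighted accuracy described above can be sketched as follows, with each sample weighted by the inverse frequency of its true class (the function name and this particular normalization are our illustrative choices):

```python
from collections import Counter

def weighted_accuracy(y_true, y_pred):
    """Accuracy in which each sample carries weight n / (count of its true class),
    so mistakes on minority classes cost much more than on majority classes."""
    n = len(y_true)
    freq = Counter(y_true)
    w = [n / freq[t] for t in y_true]          # inverse class frequency
    total = sum(w)
    hit = sum(wi for wi, t, p in zip(w, y_true, y_pred) if t == p)
    return hit / total

# The 95/5 example from the text: a majority-only classifier gets plain
# accuracy 0.95 but weighted accuracy only 0.5.
y_true = ["major"] * 95 + ["minor"] * 5
y_pred = ["major"] * 100
print(round(weighted_accuracy(y_true, y_pred), 6))  # -> 0.5
```

With this weighting, the useless majority-only classifier no longer looks good, which is exactly the intended behavior.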
Intuitively this means that when a correct/wrong prediction is made on a sample with a major label, the total weighted accuracy increases/decreases by a small amount; and when a correct/wrong prediction is made on a sample with a minor label, the total weighted accuracy increases/decreases much more significantly.
As shown in table 2, the data set S end (F n ) is highly imbalanced. To overcome the imbalance and incorporate the train-test split properly, we resample the data set by first duplicating samples with gauge group SU(3), G 2 , SO(8), F 4 and E 8 to 10% of the size of the original data and then randomly drawing samples with gauge group ∅ and SU(2) down to 10% of the size of the original data. We call the resampled data set S' end (F n ) (1431375 samples in total). The imbalance is completely resolved, as shown in table 3 for the resampled data set. To check the classifiers' scalability, we further down-sample the data set to 10% and call this data set S'' end (F n ) (143138 samples in total).
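The duplicate-and-down-sample procedure can be sketched as a per-class target version (the `resample` function and its arguments are our illustrative choices, not the exact script used for S end (F n )):

```python
import random
random.seed(0)

def resample(samples, labels, target):
    """Balance a labeled data set to `target` samples per class:
    down-sample large classes without replacement, duplicate small ones."""
    by_class = {}
    for s, l in zip(samples, labels):
        by_class.setdefault(l, []).append(s)
    out = []
    for l, group in by_class.items():
        if len(group) >= target:
            chosen = random.sample(group, target)                    # down-sample
        else:
            chosen = [random.choice(group) for _ in range(target)]   # duplicate
        out += [(s, l) for s in chosen]
    return out

# A 900/100 imbalance balanced to 300 per class:
data = resample(range(1000), ["SU2"] * 900 + ["SU3"] * 100, target=300)
print(len(data))  # -> 600
```

After balancing, every class contributes equally to the split criterion during training.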

Prediction performance
We evaluate the performance of each classifier based on two measures: (1) weighted accuracy and (2) training time/efficiency. The core of most ML algorithms involves minimizing/maximizing an error/utility function, and this is typically done by numerical optimization methods. Differences in the algorithms' complexity result in different training times, and in some cases excessive complexity may lead to a model that is not fully optimized, in addition to a long training time.
For each data set, we split it into non-overlapping training and test sets by randomly selecting (without replacement) 75% of the original data as training and the rest as test. We train every ML model on the training set and use the trained model to make predictions on the test set. In the case of SVM, because the transformed feature space has a very high dimension, training on more than 10000 samples is in general computationally unfeasible. This is solved by further down-sampling the training set to 10000 samples for SVM in both data sets. Nonetheless, the size of the test set is the same for all classifiers to keep the performance comparison valid. The implementation and details for each classifier are listed below:
• LR: Scikit-learn [46]; hyperparameter C optimized, regularization optimized between L1 and L2
• DT: Scikit-learn; untrimmed, no hyperparameter tuning
• RF: Scikit-learn; 10 trees, no hyperparameter tuning
• SVM: Scikit-learn; rbf kernel, hyperparameters C, γ optimized
• FNN: Keras; 2 hidden layers (10, 10), output activation = softmax, dropout regularization added, epochs = 5. The structure and hyperparameters have not been fully fine-tuned/optimized, so it can be expected that the accuracy may increase slightly upon further tuning. Yet given the apparent excessive runtime/complexity, we decided not to apply full tuning.
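The split-and-benchmark protocol above can be sketched generically; the `MajorityClassifier` is a trivial stand-in model so the harness stays self-contained (the actual runs used the scikit-learn/Keras models listed above):

```python
import random, time
random.seed(1)

def train_test_split(data, test_frac=0.25):
    """Random 75/25 split without replacement, as described in the text."""
    idx = list(range(len(data)))
    random.shuffle(idx)
    cut = int(len(data) * (1 - test_frac))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

def benchmark(clf, train, test):
    """Fit one classifier, time the fit, and score it on the held-out set."""
    t0 = time.time()
    clf.fit([x for x, _ in train], [y for _, y in train])
    dt = time.time() - t0
    acc = sum(clf.predict(x) == y for x, y in test) / len(test)
    return acc, dt

class MajorityClassifier:
    """Stand-in model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
    def predict(self, x):
        return self.label

data = [((i,), "SU2") for i in range(75)] + [((i,), "G2") for i in range(25)]
train, test = train_test_split(data)
acc, dt = benchmark(MajorityClassifier(), train, test)
print(len(train), len(test))  # -> 75 25
```

Any model exposing fit/predict drops into the same harness, which is what makes the head-to-head comparison in table 4 straightforward.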
The performance of all the classification algorithms described in section 5.2 is presented in table 4. Comparing weighted accuracies, one finds that all non-linear algorithms (all but logistic regression) give considerably good results, and decision tree and random forest are the best performers, with only a slight difference between them. The second criterion to consider is runtime, and the table shows that decision tree is much faster than the other algorithms (which is clearly due to its simplicity) 5 . Combining both criteria, we tentatively conclude that decision tree is the best classification algorithm on our data set.

5 The runtime presented for logistic regression and SVM is based on the optimized hyperparameters. The optimization of hyperparameters involves multiple rounds of training/testing, so the total runtime for both of them needs to be multiplied accordingly.

Table 5. OOS weighted accuracy on the re/down-sampled S end (F n ).

Model interpretability
Compared to traditional statistical models, one advantage of ML is its strong adaptability to data sets with complex patterns. However, this often means that many ML algorithms lack interpretability, as one can hardly extract analytical results that explain the pattern in the data set. For instance, random forest can often provide high prediction performance, but its ensemble nature makes it impractical to understand how the prediction is made based on simple rules. For our purposes, it is of particular interest to extract analytic understanding in addition to making predictions from the data sets. So we take interpretability as another algorithm selection criterion. Among all the ML algorithms, decision tree is one of the most interpretable methods, given its simple algorithmic structure. Indeed, since the rule of each split is an inequality on a feature, the decision function can be summarized as a collection of all the inequalities on the tree. For each input x, the prediction rule is simply the set of box constraints L i ≤ x i < U i accumulated along the path to a leaf, where L i and U i are the lower/upper bounds of all split inequalities involving x i , i.e. the i-th feature of an input. Thus, by extracting the decision rules of certain samples (e.g. geometries that have an SU(3) gauge group), we may gain insight about how the features are related to the final prediction. Moreover, decision tree (and all the other tree-based algorithms) also has a built-in feature importance evaluation method, called the Mean-Decreased-Impurity (MDI) importance measure. MDI computes the impurity decrease (purity increase in our language) weighted by the number of samples on each node for every feature, and ranks the features' importance by their MDI values. This is helpful for interpreting how much each feature contributes to the whole prediction. With these considerations, we conclude that decision tree is the best ML algorithm to apply to our problem.
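The box-constraint reading of a decision tree can be made concrete with a toy traversal. The tree below is a hypothetical two-feature example (thresholds chosen to echo the a ≤ −13 ⇒ E 8 rule from section 7.1), not one of our trained trees:

```python
# Each internal node: (feature_index, threshold, left, right); each leaf: a label.
tree = (0, -12.5, "E8", (0, -8.5, (1, 2.5, "E7", "F4"), "SU2"))

def extract_rules(node, bounds=None, out=None):
    """Walk the tree and collect, per leaf, the box constraints
    L_i <= x_i < U_i accumulated along the root-to-leaf path."""
    if bounds is None:
        bounds, out = {}, []
    if isinstance(node, str):                  # leaf: record (constraints, label)
        out.append((dict(bounds), node))
        return out
    i, t, left, right = node
    old = bounds.get(i)
    lo, hi = old if old else (float("-inf"), float("inf"))
    bounds[i] = (lo, min(hi, t))               # x_i < t  -> go left
    extract_rules(left, bounds, out)
    bounds[i] = (max(lo, t), hi)               # x_i >= t -> go right
    extract_rules(right, bounds, out)
    if old is None:                            # restore state for the caller
        del bounds[i]
    else:
        bounds[i] = old
    return out

for rule, label in extract_rules(tree):
    print(label, rule)
# the first leaf reads: E8 when f0 < -12.5, i.e. a <= -13 for integer features
```

Sorting these extracted (constraints, label) pairs by the number of training samples they cover is exactly the leaf-selection strategy used in section 7.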
In the rest of this paper, we will focus only on decision tree and we will apply the rule extraction and MDI feature importance on our data sets.

Detailed analysis of gauge group on divisors
In this section, we present detailed results from the untrimmed decision tree and extract analytic rules for divisors with h 1,1 (D) = 1, 2, 3. For h 1,1 (D) > 3, we will only discuss the properties of the classifier and the accuracies.

Table 6. Total number of samples with each gauge group in the set S(P 2 ).

P 2
For h 1,1 (D) = 1, the only possible topology of D is P 2 . We can compress the vector V (D) in section 4.3 into a 10D vector whose components are denoted as f 0 , f 1 , . . . , f 9 in the following discussions. The information of the normal bundle N D = aH is explicitly given by D 2 D 1 = aH · H = a.
As introduced in section 4.2, we use a combination of P 2 divisors on the end point bases and intermediate bases as the initial data set S(P 2 ).
In total, there are 113,219 samples with one of the following gauge groups: ∅, SU(2), SU(3), G 2 , SO(8), F 4 , E 6 , E 7 and E 8 . The total number of samples with each gauge group is listed in table 6.
The decision tree is trained on the set of samples S'(P 2 ) obtained after up/down resampling, analogous to the resampling process in section 6.1. The number of samples in S'(P 2 ) with each label is balanced to ∼ 8200. The (train set : test set) ratio is still (3:1). After the training, the IS and OOS accuracies are 0.912774 and 0.900694 respectively when tested on the resampled set. As another way to test the algorithm's predictability, we can use this decision tree on the data before resampling. When the decision tree is tested on the original set S(P 2 ) with 113,219 samples, the accuracy is A = 0.949111. The maximal depth of the decision tree is d max = 29.
We plot the feature importance of f i in figure 9. The most important feature is f 0 = a, which is expected since the normal bundle is the most direct information in the formula (3.9, 3.10).
The decision tree contains 2563 nodes and 1282 leaves. The structure is too complicated to be fully drawn. An efficient way to read out analytic conjectures from the decision tree is to sort the leaves l according to the total number of samples |S(l)| in S(P 2 ) to which they apply. We list a number of leaves with large |S(l)| in table 7 (the tables are located at the end of this paper). We only list the leaves that predict the existence of a non-Higgsable gauge group G with probability higher than 80% on a divisor D, which means that more than 80% of the samples obeying the same rules belong to the same gauge group.
We can see that the analytic rules in table 7 mostly predict the common gauge groups SU(2), G 2 and F 4 , except for the following five rules: (1) If f 0 = a ≤ −13, the gauge group is E 8 .
Another way to select informative leaves is to list the leaves with small depth, as in table 8. The reason is that the rules with small depth are generally simpler. However, there are leaves with small depth that apply to only a few samples in S(P 2 ) or have low predictability, such as the rule giving 57% SU(2) and 43% ∅ in table 8. Hence we cannot state that the shallow leaves give the best set of rules.
Because the resampled training set in S (P 2 ) has balanced labels, we have derived a large number of rules predicting rarer gauge groups such as SU(3), SO(8), E 6 , E 7 and E 8 . We list a number of these rules in table 9.
It is hard to check these rules analytically with (3.9, 3.10), since the only directly relevant feature is the normal bundle coefficient a ≡ f 0 . We summarize the possible gauge groups for different normal bundle N D = aH in table 10 using the data set S(P 2 ).
For D = P 2 , −K D = 3H and N D = aH, hence the formula (3.9, 3.10) becomes
f_{k,D} ∈ O((12 + (4 − k)a − Σ_{D i ·D≠∅} φ i ) H), g_{k,D} ∈ O((18 + (6 − k)a − Σ_{D i ·D≠∅} γ i ) H), (7.3)
where φ i and γ i are the orders of vanishing of f and g on the divisors D i which intersect D.
If there is an E 8 on the divisor D, then f 3 and g 4 have to vanish, or equivalently
12 + a − Σ φ i < 0, 18 + 2a − Σ γ i < 0. (7.4)
If a ≤ −13, then the inequalities (7.4) are always satisfied. This condition exactly corresponds to the first rule in table 7. If −10 ≥ a ≥ −12, then the second inequality in (7.4) is automatically satisfied, but the first inequality may not be. When 12 + a − Σ_{D i ·D≠∅} φ i ≥ 0, the gauge group is expected to be E 7 instead of E 8 , since f 3 is non-vanishing now. However, as we mentioned after (3.10), there may be non-local effects from other non-neighboring divisors which increase the order of vanishing. For example, we see from table 10 that the gauge group can be E 8 even if a = −9. However, the second inequality in (7.4) then cannot be satisfied, since there cannot be another neighboring toric divisor D i with ord D i (g) = 1. Otherwise, this would lead to a toric (4,6) curve, which is not allowed in the set S(F n ) generated from good bases exclusively 6 . Thus we have found cases where the formula (3.9, 3.10) cannot give us the correct non-Higgsable gauge group. If we want an F 4 or E 6 gauge group on D, then f 2,D and g 3,D have to vanish but g 4,D should not vanish. We have the inequalities
12 + 2a − Σ φ i < 0, 18 + 3a − Σ γ i < 0, 18 + 2a − Σ γ i ≥ 0. (7.5)
Now the gauge group is E 6 if and only if g 4,D is a locally complete square. In the case of a generic fibration, this means g 4 is a single monomial which takes the form of a complete square locally. If a = −9 and the inequalities in (7.5) are satisfied, then we can see that g 4,D ∈ O(0) and γ i = 0 for any D i intersecting D. This exactly corresponds to the criterion of an E 6 gauge group, as g 4,D is a complex number in this case. Hence F 4 can only appear if a ≥ −8, which is consistent with our observation in table 10.
For the case of SO(8), the situation is similar. If we want an SO(8) or G 2 gauge group on D, f 1,D and g 2,D have to vanish but g 3,D should not, so we have
12 + 3a − Σ φ i < 0, 18 + 4a − Σ γ i < 0, 18 + 3a − Σ γ i ≥ 0. (7.6)
If a = −6, then the third inequality in (7.6) means that g 3,D is locally a complex number.
Similarly, f 2,D is also either locally a complex number or vanishes, since at a = −6 we have f 2,D ∈ O((−Σ φ i ) H). Then the gauge group should be SO(8) if a = −6 and the conditions (7.6) are satisfied. So we expect that G 2 can only appear if a ≥ −5, which is consistent with table 10.
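Assuming the degree bounds reconstructed in (7.3), the orders of vanishing forced purely by negative degrees can be tabulated with a short script (the function and its caps at orders 4 and 6 are our illustrative choices):

```python
def forced_orders_P2(a, phi, gamma):
    """Minimal orders of vanishing of (f, g) on D = P^2 forced by the degrees:
    f_{k,D} has degree 12 + (4-k)*a - sum(phi), g_{k,D} has degree
    18 + (6-k)*a - sum(gamma); a negative degree forces the coefficient to
    vanish.  a = normal-bundle coefficient, phi/gamma = neighbor orders."""
    sp, sg = sum(phi), sum(gamma)
    ord_f = next((k for k in range(5) if 12 + (4 - k) * a - sp >= 0), 5)
    ord_g = next((k for k in range(7) if 18 + (6 - k) * a - sg >= 0), 7)
    return ord_f, ord_g

# a <= -13 forces ord(f, g) = (4, 5), the type II* (E8) rule from table 7;
# a = -9 gives (3, 4) (the F4/E6 window) and a = -6 gives (2, 3) (G2/SO(8)).
print(forced_orders_P2(-13, [], []), forced_orders_P2(-9, [], []))  # -> (4, 5) (3, 4)
```

This reproduces the thresholds a ≤ −13, a ≥ −8 and a ≥ −5 discussed above, without the monodromy refinements that distinguish, e.g., E 6 from F 4 .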

F n
We apply the same method as in section 4.3 to the data set S(F n ), which is a combination of F n divisors on end point bases and intermediate bases. There are in total 6,300,170 samples in this set, and the total number of samples with each gauge group is listed in table 11. We generate the resampled data set S'(F n ) similarly to the procedure in section 6.1, and train the decision tree with 75% of the data in S'(F n ) to obtain the IS and OOS accuracies. This decision tree has 66441 nodes and 33221 leaves, which is much larger than the decision tree for P 2 . This is due to the large total number of samples in the training set. The maximal depth of the decision tree is d max = 49.
We plot the feature importance of f i in figure 10. The most important feature is f 1 = a, the coefficient of S in the normal bundle expression N D = aS + bF . The next most important features can be read off from figure 10. It is interesting that the feature f 0 ≡ n, which specifies the topology of D, has low importance. On the other hand, the canonical divisor of F n is −K(F n ) = 2S + (n + 2)F , which depends crucially on n. This counter-intuitive result may imply that the non-Higgsable gauge group on a divisor is not highly sensitive to its topology.
We make a similar selected list of rules in table 12. We also list the leaves with small depth in table 13. Note that the rules with small depth often apply to few samples, since they correspond to extremal cases. The rules in table 12 only give SU(2), G 2 , F 4 or E 8 gauge groups. We list a number of rules for SU(3), SO(8), E 6 , E 7 and E 8 in table 14 7 . One can see that the leaves giving SU(3) or SO(8) typically have large depth and the rules are highly complicated, except for two rules giving mainly SU(3). Now we use (3.9, 3.10) to analyze some of the rules. For D = F n , −K D = 2S + (n + 2)F and N D = aS + bF , hence the formula (3.9, 3.10) becomes
f_{k,D} ∈ O((8 + (4 − k)a) S + (4(n + 2) + (4 − k)b) F − Σ_{D i ·D≠∅} φ i D i D), (7.8)
g_{k,D} ∈ O((12 + (6 − k)a) S + (6(n + 2) + (6 − k)b) F − Σ_{D i ·D≠∅} γ i D i D), (7.9)
where φ i and γ i are the orders of vanishing of f and g on the divisors D i which intersect D.
If a ≤ −9, it is clear that f 3,D in (7.8) vanishes, since the coefficient 8 + (4 − k)a becomes negative. Similarly, g 4,D in (7.9) vanishes. Hence the gauge group has to be E 8 , as given by the first rule in table 12.
If a = −6, we can see that f 2,D in (7.8) vanishes since 8 + (4 − k)a = −4 < 0 for k = 2. Similarly g 3,D vanishes, hence (f, g) vanishes to at least order (3, 4) on D and the gauge group is minimally F 4 . The gauge group is E 6 if g 4,D ∈ O(0) and g 4,D is locally a complete square. This can happen when 6(n + 2) + 2b = 0 and Σ_{D i ·D≠∅} γ i D i D = 0. The rules in table 14 that predict an E 6 gauge group roughly all belong to this class. For example, the rule with |S(l)| = 277 states that if n = 0, a = b = −6, f 5 ≥ 4, f 6 ≤ 6, then the gauge group is E 6 . If n = 0, a = b = −6, then the gauge group cannot be F 4 since g 4,D ∈ O(0) already. The additional rules f 5 ≥ 4, f 6 ≤ 6 help to make sure that the gauge group is not larger than E 6 in a subtle way.
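A similar sketch applies to D = F n , using the two degree coefficients in the reconstructed (7.8)-(7.9) with the neighbor corrections φ i , γ i set to zero (so it gives only the lower bound forced by the degrees; the function is our illustrative choice):

```python
def forced_orders_Fn(n, a, b):
    """Orders of vanishing of (f, g) on D = F_n forced purely by the two degree
    coefficients: f_{k,D} needs 8 + (4-k)a >= 0 and 4(n+2) + (4-k)b >= 0 to
    survive; g_{k,D} needs 12 + (6-k)a >= 0 and 6(n+2) + (6-k)b >= 0."""
    def first_nonneg(c_s, c_f, m, cap):
        return next((k for k in range(cap)
                     if c_s + (m - k) * a >= 0 and c_f + (m - k) * b >= 0), cap)
    ord_f = first_nonneg(8, 4 * (n + 2), 4, 5)
    ord_g = first_nonneg(12, 6 * (n + 2), 6, 7)
    return ord_f, ord_g

# a <= -9 forces (4, 5), i.e. E8, as in the first rule of table 12;
# n = 0, a = b = -6 forces (3, 4) while leaving g_4 free, matching the
# F4/E6 discussion above.
print(forced_orders_Fn(0, -9, -9), forced_orders_Fn(0, -6, -6))  # -> (4, 5) (3, 4)
```

As for P 2 , this captures the degree thresholds only; the monodromy conditions (e.g. g 4,D being a complete square for E 6 ) must be checked separately.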

Toric surfaces with h 1,1 = 3
The toric surfaces with h 1,1 (D) = 3 are generated by blowing up F n at the intersection points of toric curves. They form a simple one-parameter family S 3,n (n ∈ Z, n ≥ 0). D has five neighboring divisors D 1 , · · · , D 5 , where D i D = C i gives a toric curve on the divisor D. The self-intersection numbers and linear equivalence conditions of the five toric curves C 1 , C 2 , . . . , C 5 on D follow from the toric data, and the input vector we use is 19-dimensional. We plot the feature importance in figure 11. It seems that f 1 is the most important feature for determining the gauge group; f 2 and f 5 also have significantly higher importance. We list a number of leaves with a large number of applied samples and small depth in table 16.
Most of the rules predict an SU(2), G 2 , F 4 or E 8 gauge group. However, there is also one rule predicting E 7 and another rule predicting E 6 . Similar to the cases of D = F n , we can see that the number f 0 ≡ n specifying the topology of D is not very important in these rules.

For divisors with general h 1,1 (D), the ordering of the p = h 1,1 (D) + 2 neighbor divisors is chosen such that the curve C 1 = D 1 D has the lowest self-intersection number among the C i (i = 1, . . . , p). Then the curves C 1 , . . . , C p form a cyclic toric diagram of D.
We only list some general information about the sample divisors and the decision tree. We list the number of divisors with each gauge group in the original data sets S(h 1,1 (D)) in table 17.
The decision tree is trained on 75% of the resampled data set S'(h 1,1 (D)), and we list the accuracy, total number of nodes and maximal depth of the decision tree in table 18, including the cases of h 1,1 (D) = 1, 2, 3 as well. As we can see from tables 17 and 18, for the cases of h 1,1 (D) > 4, the number of nodes and leaves in the decision tree is roughly proportional to the number of data samples in S(h 1,1 (D)). We plot the linear model and data points in figure 12. The linear relation is N nodes = 0.038218 N samples + 937.59, (7.15) with R 2 = 0.994635. This indicates that the decision tree approach on divisors with larger h 1,1 (D) has a universality. On the other hand, the maximal depth of the decision tree is not significantly correlated with the total number of nodes.
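The least-squares fit behind (7.15) is elementary; a self-contained version (pure Python, no library dependence assumed):

```python
def linear_fit(xs, ys):
    """Ordinary least squares y = m*x + c together with R^2, as used for the
    node-count vs. sample-count relation (7.15)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = my - m * mx
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return m, c, 1.0 - ss_res / ss_tot

# Exactly linear synthetic data recovers slope and intercept with R^2 = 1:
m, c, r2 = linear_fit([1, 2, 3, 4], [0.04 * x + 900 for x in [1, 2, 3, 4]])
print(round(m, 6), round(c, 2), round(r2, 3))  # -> 0.04 900.0 1.0
```

The numbers in (7.15) are the output of exactly this kind of fit applied to the (N samples , N nodes ) pairs of table 18.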
We can see that the in-sample accuracy roughly increases as h 1,1 (D) becomes bigger. For h 1,1 (D) > 7, the in-sample accuracy becomes very high (> 99.98%). In principle, if there is no case where two samples with different labels share identical features, then an untrimmed decision tree should give perfect in-sample accuracy, as samples with different labels can always be split into different nodes. The low in-sample accuracy for h 1,1 (D) = 1 implies that there are many samples whose features are not enough to distinguish the gauge group. On the other hand, for larger h 1,1 (D), there are more features and this problem is less significant, since it is less likely to find two samples with exactly the same features.
On the other hand, the out-of-sample accuracy and the actual accuracy on the original data set are not clearly correlated with h 1,1 (D). Nonetheless, the accuracies always lie between 85% and 99%.

Checking whether a curve is a (4,6)-curve
Besides determining the non-Higgsable gauge groups, we also attempt to use machine learning to decide whether a toric curve v i v j on a general resolvable base is a (4,6)-curve or not. We use the 14 local triple intersection numbers shown in figure 13 as the features, labeled f 0 ∼ f 13 . The output label is binary: 0 for curves without a (4,6) singularity and 1 for (4,6) curves.
The toric threefold bases are generated by a similar approach to the intermediate bases in section 4.2. We start from the base b 1 = P 3 and randomly blow up/down once in each step, generating 10,000 bases in each sequence. The difference is that we allow all the resolvable bases with (4,6) curves to appear in the sequence. To reduce repetition, we only pick b 1 , . . . , b 20 and b 100k (k ∈ Z). Then we use every toric curve on these bases to generate the training data set. In total, we have performed 25 random walk sequences and generated 3,000 bases.
In total there are 12,125,945 sample curves, among which 1,342,652 have a (4,6) singularity. After processing the original data set by resampling, the decision tree has 193,121 nodes and 96,561 leaves. The maximal depth is d max = 51. The in-sample and out-of-sample accuracies on the resampled data set are 0.997106 and 0.953865 respectively. The accuracy on the original data set is A = 0.957505.
The feature importance of f i is plotted in figure 14.
We can see that f 0 , f 1 , f 2 and f 3 are the most important features, which is expected since they sit closest to the curve D 1 D 2 . We list a number of leaves with large |S(l)| in table 19. From table 19, it seems that the curve is usually a (4,6)-curve whenever f 1 ≤ −2 and f 2 ≤ −2. Actually this rule always holds for any toric curve, and we will now derive it analytically.
Suppose that f 1 = −a, f 2 = −b; then the local toric geometry near the toric curve is always described by figure 15 up to an SL(3, Z) transformation on the toric rays 8 . The reason is that since the two 3D cones have unit volume, we can always transform v 1 , v 2 and v 3 to (0,0,1), (0,1,0) and (1,0,0). Then the relations D 2 1 D 2 = −a and D 1 D 2 2 = −b fix the toric ray v 4 to be (−1, b, a). With the toric rays in figure 15, any monomial (x, y, z) ∈ F satisfies
x, y, z ≥ −4, −x + by + az ≥ −4, (8.2)
which implies that by + az ≥ −8. Now if a, b ≥ 2, this means y + z ≥ −4 for any (x, y, z) ∈ F. Since the order of vanishing of f on this toric curve v 1 v 2 is given by
ord v 1 v 2 (f) = min_{(x,y,z)∈F} (y + z) + 8, (8.3)
f vanishes to order 4 or higher on v 1 v 2 . Similarly, any monomial (x, y, z) ∈ G satisfies
x, y, z ≥ −6, −x + by + az ≥ −6, (8.4)
which implies that by + az ≥ −12. If a, b ≥ 2, then y + z ≥ −6 and g vanishes to order 6 or higher on v 1 v 2 .
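The derivation can be brute-force checked by enumerating the lattice points allowed by the reconstructed inequalities (8.2) and (8.4) in the cone of figure 15 (the search `bound` is an arbitrary cutoff; the monomial sets F and G are represented directly by their defining inequalities):

```python
from itertools import product

def min_orders(a, b, bound=8):
    """Minimal orders of vanishing of (f, g) on the curve v1 v2 for the local
    geometry of figure 15, with curve self-intersections -a and -b:
    ord(f) = min(y + z) + 8 over F, ord(g) = min(y + z) + 12 over G."""
    def min_yz(c):
        best = None
        # monomials satisfy x, y, z >= -c and -x + b*y + a*z >= -c
        for x, y, z in product(range(-c, bound), repeat=3):
            if -x + b * y + a * z >= -c:
                best = y + z if best is None else min(best, y + z)
        return best
    return min_yz(4) + 8, min_yz(6) + 12

# Two curves of self-intersection <= -2 meeting: always a (4,6) curve,
# while a pair of (-1)-curves forces no vanishing at all.
print(min_orders(2, 2))  # -> (4, 6)
```

This confirms the extracted rule f 1 ≤ −2, f 2 ≤ −2 ⇒ (4,6)-curve in the simplest cases; the inequality argument above proves it for all a, b ≥ 2.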

Applications
9.1 Applying the rules on bases with toric (4,6) curves

In our training set, we do not use the data from resolvable bases with toric (4,6) curves. Now we want to know if the rules derived from the good bases apply to resolvable bases as well. We have applied the classifiers trained on the good bases in section 7 to the divisors on the resolvable bases generated in section 8, and we list the accuracies for each h 1,1 (D) in table 20. We plot the comparison of the accuracies on the resolvable bases and the good bases in figure 16.
As we can see, the accuracies on this set of resolvable bases are usually slightly lower than the accuracies on the set of good bases. Nonetheless, the accuracies are still always higher than 80%. For the case of Hirzebruch surfaces with h 1,1 (D) = 2, the accuracy on resolvable bases is 0.980387, which is even higher than the accuracy on good bases! This implies that the rules of non-Higgsable gauge groups we have derived in section 7 apply universally to both the good bases and the resolvable bases.
Another interesting feature in figure 16 is the peaks for both A res and A good , for example at h 1,1 (D) = 10, 16 and 17. For some reason, the rules of non-Higgsable gauge groups for these h 1,1 (D) are more organized, and we can get a high accuracy despite the lack of training samples. It may be interesting to investigate this phenomenon in future work.

An SU(3) chain
In this section, we present some local constructions of non-Higgsable clusters using the analytic rules we have derived in section 7.
In 6D F-theory, the only possible appearance of a non-Higgsable SU(3) gauge group is on an isolated (−3)-curve with no charged matter [12]. However, we will construct an infinite chain of non-Higgsable SU(3) gauge groups on a 3D base using the analytic rules we have discovered in table 14.
The rule in table 14 with d = 25, |S(l)| = 129 gives the conditions for an SU(3) gauge group on D. We can use it to construct a chain configuration as in figure 17. In this particular example, the divisors P n are F 3n−2 and the Q n are F 3n+2 . The non-vanishing triple intersection numbers between Q n and P n are shown in figure 17; notice that the sum of the two numbers on each edge between Q n and P n is always 3.
Using the formula (A.13), one can compute that the self-triple intersection numbers of P n and Q n are all 12. Hence the conditions D 3 2 = 9 ∼ 12 and D 3 4 = 10 ∼ 14 in the rule we derived from machine learning are satisfied. Now we check the gauge group analytically using the formulas (3.9) and (3.10), assuming f and g do not vanish on D 1 and D 2 . For Q n (n > 0), −K Qn = 2S + (3n + 4)F and N Qn = −2S − (3n + 5)F . Then f 1 on Q n can be computed from these data; if f vanishes to at least order 1 on Q n−1 and Q n+1 , f 1 always vanishes. Similarly, if g vanishes to order 2 on Q n−1 and Q n+1 , we get exactly g 2 ∈ O(0), which is the condition for the gauge group to be SU(3). The situation for P n and Q 0 is analogous.
We can also check this by assigning toric rays to each of these divisors: D 1 and D 2 are given by (1, 0, 0) and (−1, −2, −5), and the Q n are given by (0, n, n − 1); the toric rays of the P n follow similarly.

Similarly, we can slightly modify the SU(3) chain structure in the last section to get a local SU(2) × SU(3) × SU(2) configuration, as in figure 18.
For Q 1 and P 1 , since ord P 2 (g), ord Q 2 (g) ≤ 1, g Q 1 ,2 and g P 1 ,2 are no longer in O(0). Hence the gauge groups on Q 1 and P 1 are type IV SU(2) or type III SU(2), depending on the order of vanishing of f .
Since the constructions are all independent of the global structure of the compact base threefold, this can be applied to non-GUT type model building using non-Higgsable gauge groups [47] or 4D N = 1 SCFT [41].

Conclusion and future directions
In this paper, we have partially solved the problem of reading out the non-Higgsable gauge group on a toric divisor D in 4D F-theory. Using the decision tree classification algorithm, we achieved 85%-98% out-of-sample accuracies on divisors with different h 1,1 (D); see table 18 for details. For the divisors with h 1,1 (D) ≤ 3, this methodology is limited by the insufficiency of the features, because there exist many samples with the same features but different labels. In physical language, this means that the set of local triple intersection numbers near D cannot uniquely determine the non-Higgsable gauge group. This problem cannot be resolved by machine learning techniques; we can only add more local geometric information and increase the number of features. However, we then expect the decision tree's structure and rules to be more complicated, which is a trade-off. For the divisors with h 1,1 (D) ≥ 4, it turns out that the in-sample accuracy is significantly higher than the out-of-sample accuracy. Hence we should modify the machine learning method to improve the predictability.
Besides predictability, the machine learning algorithm's interpretability is also of crucial importance for our purpose. It would be useful if we could simplify the decision tree's structure. For example, if we have two features f i and f j , then a linear combination f i + f j may be a better variable than f i and f j separately, such that the decision tree has a smaller depth with this feature. In our decision trees for F n and S 3,n divisors, it turns out that the number n specifying the topology type of D is not very important. It is possible that a combination of n and the normal bundle coefficients may act as a better feature in the decision tree approach. We leave this exploration to future work.
We have generated various analytic rules from the decision trees throughout section 7. It is worth noting that these rules are derived empirically and are not necessarily rigorous. It is hard to prove these rules, apart from a small number of simple ones. However, we expect that a particular gauge group will appear most of the time (> 99%) on a generic base. Of course, it is useful to test these rules on other sets of bases as well.
We have applied the trained decision tree to divisors on resolvable bases, and the accuracies are 80%-98% for different h 1,1 (D), see table 20 and figure 16. We see that the rules trained from the good bases can be applied to resolvable bases as well.
In section 8, we presented a simple analysis of the criteria for a (4,6) curve. In the future, it is worth investigating the blow-up sequences of different (4,6) curves, which lead to a set of 4D conformal matter. Machine learning techniques may be useful in this problem as well, since there are many classes of these (4,6) curves. Similarly, it is interesting to study the blow-up of a point where (f, g) vanishes to order (8,12) or higher [41], since such points are common on the general resolvable bases constructed in section 9.1.
Of course, another interesting direction is to apply our results to toric divisors on non-toric threefolds. If the divisor D still has p = h 1,1 (D) + 2 neighboring divisors, then the local geometric structure is similar to that of our samples and the analytic rules should apply. Moreover, since many of the analytic rules are insensitive to the topology of the divisor D, as mentioned in section 7.2, they may be applicable to non-toric divisors as well. However, we currently lack a database of such non-toric threefolds, together with their non-Higgsable gauge group data, with which to check these rules.
This is due to the fact that D 2 i D equals the self-intersection number of C i on the complex surface D. The sum of these self-intersection numbers satisfies ∑ i C 2 i = 3(2 − h 1,1 (D)), since the sum equals 3 for P 2 , 0 for F n , and each toric blow up reduces it by 3. D 2 D i and D 3 are also subject to constraints. We analyze them explicitly for divisors D with h 1,1 (D) = 1, 2, 3 below.
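The counting behind the relation ∑ i C 2 i = 3(2 − h 1,1 (D)) can be checked numerically: start from the toric boundary curves of P 2 (three curves of self-intersection 1), and model a toric blow-up at the corner between two adjacent curves, which inserts a (−1)-curve and lowers each neighbor's self-intersection by 1 while raising h 1,1 (D) by 1. A quick sketch:

```python
def blow_up(curves, i):
    """Toric blow-up at the corner between boundary curves i and i+1
    (cyclic): insert a (-1)-curve between them and drop both
    neighbors' self-intersection numbers by 1."""
    n = len(curves)
    j = (i + 1) % n
    out = list(curves)
    out[i] -= 1
    out[j] -= 1
    out.insert(i + 1, -1)
    return out

curves = [1, 1, 1]  # P^2: h^{1,1}(D) = 1, sum of self-intersections = 3
h11 = 1
for step in range(4):  # perform a few blow-ups at varying corners
    assert sum(curves) == 3 * (2 - h11)
    curves = blow_up(curves, step % len(curves))
    h11 += 1           # each toric blow-up raises h^{1,1}(D) by 1
assert sum(curves) == 3 * (2 - h11)
print(curves, sum(curves), h11)
```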

A.1 P 2
For D = P 2 , there are three neighbor divisors D 1 , D 2 , D 3 . Denote the toric rays of (D, D 1 , D 2 , D 3 ) by (v, v 1 , v 2 , v 3 ). With an SL(3, Z) transformation on the set of toric rays, we can set v, v 1 and v 2 to (0, 0, −1), (1, 0, 0) and (0, 1, 0). Since the 3D cones vv 1 v 3 and vv 2 v 3 have unit volume, v 3 is constrained to the form v 3 = (−1, −1, a), (a ∈ Z). Now we can use the linear equivalence equations (2.13) to compute the local intersection numbers, and then (2.14) yields a relation among them. Denoting the curves on D by C i = D D i , we can also observe the linear equivalence relation C 1 ∼ C 2 ∼ C 3 on D, since all three toric curves on P 2 are hyperplane classes.

A.2 F n

For D = F n , there are four neighbor divisors D 1 , D 2 , D 3 , D 4 with toric rays v i (i = 1, . . . , 4). These toric rays can be chosen such that they project onto a 2D subspace in which v 1 , v 2 , v 3 , v 4 explicitly give the toric rays of F n . Any toric divisor F n and its four neighbors can be brought into this form by an SL(3, Z) transformation and a permutation. The triple intersection numbers D 2 i D equal the self-intersection numbers of the corresponding curves on D, from which we can read off the local geometric data. Then (2.14) yields the relations (A.11), which can be checked against the linear relations (A.14) of the curves on D.
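As a sanity check on the normalized rays of A.1: projecting v 1 , v 2 , v 3 to the 2D lattice transverse to v = (0, 0, −1) (i.e. dropping the third coordinate) should reproduce the fan of P 2 , namely the rays (1, 0), (0, 1), (−1, −1) summing to zero, independently of the integer a. A quick sketch:

```python
# Rays as normalized in A.1; the third component a of v3 is arbitrary.
def project(ray):
    """Project along v = (0, 0, -1): drop the third coordinate."""
    return ray[:2]

for a in (-2, 0, 5):
    v1, v2, v3 = (1, 0, 0), (0, 1, 0), (-1, -1, a)
    rays2d = [project(r) for r in (v1, v2, v3)]
    assert rays2d == [(1, 0), (0, 1), (-1, -1)]  # the fan of P^2
    # The defining relation of the P^2 fan: the three rays sum to zero.
    assert tuple(map(sum, zip(*rays2d))) == (0, 0)
print("P^2 fan recovered for all tested a")
```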

A.3 S 3,n
For D = S 3,n , there are five neighbor divisors D 1 , D 2 , D 3 , D 4 , D 5 with toric rays v i (i = 1, . . . , 5), chosen analogously to the previous cases, and the curves on D obey the corresponding linear relations. With (2.13), we can compute the local intersection numbers; then with (2.14), we can compute the triple intersection numbers.

Table 7. The inequalities that predict the appearance of a certain gauge group G on a P 2 divisor. d is the depth of the leaf in the tree; |S(l)| denotes the number of samples in S(P 2 ) to which this rule applies. The rules are sorted according to the normal bundle N D = f 0 H and |S(l)|.

Table 8. Leaves in the decision tree with depth d ≤ 7, applied to P 2 divisors.

Table 9. The inequalities that predict the appearance of the rarer gauge groups G = SU(3), SO(8), E 6 , E 7 and E 8 on a P 2 divisor. |S(l)| denotes the number of samples among the 113,219 total samples to which this rule applies. The rules are sorted according to the gauge group G.

Table 11. Total number of samples with each gauge group in the set S(F n ).

|S(l)| denotes the number of samples among the 6,300,170 total samples to which this rule applies. The rules are sorted according to the normal bundle coefficient a in N D = aS + bF and |S(l)|. We only list the rules that apply to more than 3,000 samples. The rules with G = SU(2)* predict the existence of SU(2) with at least 99.94% probability.