Estimating Calabi-Yau Hypersurface and Triangulation Counts with Equation Learners

We provide the first estimate of the number of fine, regular, star triangulations of the four-dimensional reflexive polytopes, as classified by Kreuzer and Skarke (KS). This provides an upper bound on the number of Calabi-Yau threefold hypersurfaces in toric varieties. The estimate is performed with deep learning, specifically the novel equation learner (EQL) architecture. We demonstrate that EQL networks accurately predict numbers of triangulations far beyond the $h^{1,1}$ training region, allowing for reliable extrapolation. We estimate that number of triangulations in the KS dataset is $10^{10,505}$, dominated by the polytope with the highest $h^{1,1}$ value.


Motivation
Triangulations and Calabi-Yau manifolds are objects of intrinsic mathematical interest in combinatorics and algebraic geometry, respectively. In the former case, elementary operations on triangulations such as flips may be used to determine when two triangulations are canonically related to one another, or to populate large ensembles of triangulations. Similarly, elementary topology changing operations on Calabi-Yau manifolds such as flop or conifold transitions may relate two such manifolds to one another by finite distance motion in metric moduli space; these may be utilized to generate large ensembles of geometries. In the case of Calabi-Yau hypersurfaces in toric varieties, Calabi-Yau data is encoded in the structure of a triangulation, and operations on the triangulation often induce topological transitions in the Calabi-Yau hypersurface. The transitions between these objects naturally generate a large network that encodes ensemble structure.
In string theory, Calabi-Yau manifolds are some of the simplest and best-studied backgrounds on which to compactify the extra dimensions of space while preserving supersymmetry in four dimensions [1]. There, flop [2] and conifold [3] transitions induce space-time topology change in a physically consistent manner and may be utilized to generate large ensembles of string compactifications; generalized fluxes [4,5] also give rise to large ensembles [6]. The associated collection of four-dimensional effective potentials and metastable vacua form the so-called string landscape, which is central to understanding the implications of string theory for particle physics and cosmology.
A natural direction is to understand cosmological dynamics on the landscape and associated mechanisms for vacuum selection. Such dynamics, together with statistical properties of vacua [7], could then lead to concrete statistical predictions for physics in four dimensions. Global structures in the landscape, such as large networks, may play a role in the dynamics. For example, dynamical vacuum selection [8] on a network of string geometries [9] in a well-studied bubble cosmology [10] selects models with large numbers of gauge groups and axions, as well as strong coupling. However, concrete studies of the landscape are difficult due to its enormity [11,12,13,14,9,15], computational complexity [16,17,18,19,20], and undecidability [17]. It is therefore natural to expect that, in addition to the formal progress that is clearly required, data science techniques such as supervised machine learning will be necessary to understand the landscape; see for initial works [21,22,8,23] in this directions and [24,25,26,27,28,29,30] for additional promising results.
One concrete goal is to understand the full ensemble of Calabi-Yau threefolds, networklike structures induced by transitions between them, and associated implications for cosmological dynamics and vacuum selection in compatifications of string theory. However, this is far out of reach currently for reasons of enormity and complexity.
Instead, in this paper we take a modest but necessary first step in this direction. We will estimate the number of Calabi-Yau threefold hypersurfaces in toric varieties, which are one of the most-studied ensembles in string theory. Each such manifold is naturally associated to a 4d reflexive polytope, which were classified in a seminal work [31] by Kreuzer and Skarke. Determining a Calabi-Yau from its corresponding polytope requires the specification of a fine, regular, star triangulation (FRST) of the polytope, and our main result is that n FRST 10 10,505.2±292. 6 , where n FRST is the number of FRSTs arising in the Kreuzer-Skarke ensemble. We obtain this estimate with deep learning, specifically a novel neural network architecture known as an equation learner (EQL) [32], which we demonstrate is significantly better at extrapolating triangulation predictions to large h 1,1 than a standard feed-forward neural network. In particular, by demonstrating accurate extrapolation to high h 1,1 in the analogous problem for 3d reflexive polytopes, we lend credibility to the 4d prediction, which requires extrapolation significantly beyond the regime in which training data is available. Predicting n FRST provides an estimated upper bound on the number of Calabi-Yau threefold hypersurfaces in toric varieties.
Given these accurate predictions of n FRST , it is interesting to study the interpretability of the predictions made by the neural network. Sometimes refered to as intelligible artificial intelligence, interpretability is a major current goal of machine learning research, and it is one of the reasons for developing the EQL architecture; see [33] for the related idea of conjecture generation, by which interpretable numerical decisions may be turned into rigorous results. In the EQL context, the idea is to mimic what happens in natural sciences such as physics, where a physical phenomenon is often described in terms of an interpretable function that allows for understanding and generalization. Accordingly, by utilizing simpler functions than standard architectures, EQLs in principle increase the likelihood of intepretability.
In the case of n FRST , we found that an EQL that utilizes quadratic functions makes accurate predictions; see Sections 4 and 5 for quadratic functions associated to trained EQLs. By studying an associated heat map of coefficients, we demonstrate that some variables and cross-correlations are clearly of more importance than others, but the existence of a large number of warm spots suggest that many variables matter, which makes interpretability difficult. This could be an artifact of having used a quadratic function, which may be suboptimal, but it could also be the case that there is no simple interpretation of n FRST predictions. That is, sufficiently complex phenomena in complex systems may not admit descriptions in terms of simple equations in human-understandable variables. This paper is organized as follows: In Section 2, we give an overview of our approach for estimating the number of FRSTs in the 4d Kreuzer-Skarke database. In Section 3 we discuss the method by which we classified the 3d facets of the 4d reflexive polytopes, the results of this classification, and our success in obtaining FRTs of these facets. In Section 4 we discuss our unsuccessful initial machine learning attempts, our input features, and our successful EQL model. In Section 5, we perform a similar analysis for the 3d reflexive polytopes and show that a model of a similar architecture is able to extrapolate far outside its training range.

Approach
Batyrev has shown [34] that a hypersurface in a toric variety can be chosen to be Calabi-Yau if the object underlying the construction of the variety, a lattice polytope, obeys the condition of reflexivity.
A reflexive polytope ∆ is defined as the convex hull of a set of points {v} ∈ Z n whose dual polytope is itself a lattice polytope. In four dimensions, there are 473, 800, 776 reflexive polytopes, as classified by Kreuzer and Skarke [31]. The ambient 4d space described by these polytopes is generally a singular toric variety. A proper Calabi-Yau manifold will therefore be a hypersurface in a suitably de-singularized ambient space. The process of de-singularization is equivalent to obtaining a fine, regular, star triangulation (FRST) of the 4d reflexive polytope.
Crucially, there are typically very many inequivalent FRSTs for a given polytope. While software packages such as PALP [35] and TOPCOM [36] can, in principle, produce all possible triangulations of a given polytope; in practice, such a request becomes increasingly problematic computationally as the overall size of the polytope grows. A good proxy for this computational load is the value of the topological quantity h 1,1 associated with the polytope. For example, in [37], all triangulations of all 4d reflexive polytopes with h 1,1 ≤ 6 were obtained. The nearly 652,000 unique triangulations required approximately 120,000 corehours of processing time to obtain. The computational burden can be mitigated somewhat by exploiting the reflexivity property of these polytopes [38]. Even with this assistance, however, obtaining more than a single, canonical triangulation for a given polytope (in a reasonable computational time) becomes difficult for h 1,1 > ∼ 10 [39]. It would appear, therefore, that an enumeration of the unique triangulations in the KS 4d reflexive polytope dataset must remain unobtainable, barring dramatic advances in computational power or a vastly superior triangulation algorithm. It is this impasse which forms the motivation for the current work.
There is reason to believe that application of machine learning techniques may allow for an estimate of the total number of triangulations in the KS dataset to be achieved. A proof of principle already exists in the KS set of 3d reflexive polytopes, of which there are 4319. In [40], the number of FRSTs for many of the 3d reflexive polytopes was computed, and an estimation of the total number was obtained, using standard triangulation techniques. Soon thereafter, another estimate was obtained using supervised machine learning. Specifically, a decision-tree regression model was trained on known results, and used to estimate the number of FRSTs for the remaining cases [33]. The results were in good agreement with the original estimate of [40]. The objective of the current work was to obtain such an estimate for the 4d reflexive polytopes via similar techniques.
As was done for the 3d case, the method employed here is to estimate the number of FRSTs of a given polytope as the product of the number of fine, regular triangulations (FRTs) of each of its facets. However, counting the number of vertices in the 4d reflexive dataset (which equals the number of facets by duality) shows that there are over 7.5 billion facets. In order to better get a handle on this set, as well as avoid unnecessary computational repetition, the first step was a classification of these 3d facets. The procedure for doing so is the subject of Section 3.
With this classification in hand, we then explicitly computed the number of FRTs for as many of the facets as possible. Due to computational constraints, this consisted of only 1.03% of the total -primarily consisting of facets that first appear in dual polytopes at low values of h 1,1 . The data associated with these explicitly-triangulated facets then became the training data for supervised machine learning. Ultimately, a neural network was employed to construct a model which predicts the natural logarithm of the number of FRTs of each facet. From this we arrive at an estimate of the total number of FRSTs of the 4d reflexive polytopes. The machine learning techniques, and FRST estimate, are the subject of Section 4.
Any such machine learning estimate will obviously have limitations, and ours come from the two main issues that we faced. First, as noted above, we were only able to obtain known results for 1.03% of cases, meaning that it was necessary to estimate the value for nearly 99%. Second, as the facets that were able to be triangulated were necessarily ones with fewer triangulations, this problem involved extrapolation to output values beyond the range seen in the training set. However, as we will discuss below, we believe that our estimate is sound. The equation learning (EQL) neural network architecture that we ultimately employed was developed especially for extrapolation outside of a given training set. By withholding the subset of our known results that first appeared at h 1,1 ≥ 12 when training, we were able to see that our model extrapolated with stable results over a large number of cases as the number of triangulations increased. We comment further on these limitations, and the potential overcounting associated with our approach, in Section 6.

Facet Classification
In this section we review methods that easily determine when two facets are equivalent, and then apply those methods to the classification of facets in 4d reflexive polytopes.

Distinguishing unique facets
We must distinguish between three-dimensional (3d) facets of the four-dimensional (4d) reflexive polytopes, since individual facets may appear in multiple polytopes. In particular, we must determine under what conditions two facets should be considered to be equivalent, and then find a method to determine whether the conditions are satisfied.
For their classification of the 3d and 4d reflexive polyhedra [31], Kreuzer and Skarke defined a normal form, with the property that two reflexive polyhedra have the same normal form if and only if they are related by a GL(n, Z) transformation. While powerful, this normal form has the restriction that it can only be used on full-dimensional polytopes.
To circumvent this issue, and thereby be able to use the normal form, we employ the method of Grinis and Kasprzyk, as described in Section 3.2 of [41]. The origin is interior to every reflexive polytope, and thus we know that none of the facets contain the origin. Thus, for each facet F , we constructed the associated subcone defined by C F = conv(F ∪ {0}). The subcone is a full-dimensional polytope, and so its normal form can be computed. Additionally, as the origin is the sole interior lattice point of a reflexive polytope, the origin is the only lattice point in C F \ F . This also means that we need not worry about lattice translations when comparing subcones. As the origin is fixed under any GL(n, Z) transformation, two subcones C F 1 and C F 2 have the same normal form if and only if their associated facets F 1 and F 2 are related by a GL(n, Z) transformation.
As an example, consider the following two facets F 1 and F 2 , both of which appear as dual facets to the same h 1,1 = 2 polytope (given by POLYID 21 in the ToricCY database [37]). Each of the facets are the convex hulls of four vertices, as below: Adding the origin to each facet, we obtain the associated subcones Computing the normal form for each subcone, we find that We see that the two subcones have the same normal form, and thus the facets are equivalent.

Performing the classification
The primary practical challenge of the facet classification was the volume of the 4d polytope dataset itself. In order to have a consistent direction for our computation, we worked through the polytopes in order of increasing h 1,1 (X), where X is an associated Calabi-Yau hypersurface. As it is the dual polytope that is triangulated when constructing this Calabi-Yau, we identified the facets of the dual polytope in each case. Hence, for each h 1,1 value, we identified the facets that appeared in each dual polytope, keeping both a master list of unique facets as well as recording which facets appeared in each dual polytope (and with which multiplicity).
Computation was done on Northeastern University's Discovery cluster using the SLURM workload manager. As identifying the dual facets is an independent computation for each polytope, we were able to use distributed computing to decrease our real-world running time by several orders of magnitude. Filtering and removal of duplicate facets was performed after identification had been completed for each h 1,1 value. The dual and normal form computations were done using a C++ lattice polytope implementation of our own creation which used the PALP source code for many of its underlying calculations. This offered significantly faster performance than the LatticePolytope class in Sage, allowing the classification to be finished on the timespan of a few weeks.

Classification results
Using the above method for the facets of every 4d reflexive polytope, we found that there are a total of 45, 990, 557 unique 3d facets, which is an order of magnitude less than the number of polytopes themselves. This is approximately 0.6% of the total number of facets, which is 7, 471, 985, 487. We found that a relatively small number of facets accounted for the majority of the facets that appear across the polytopes, with the most common facet accounting for just over 20% of the total. As we performed the classification procedure, we recorded the number of facets which appeared for the first time at each h 1,1 value; i.e., the same facet may appear in many different polytopes with different values of h 1,1 , and we recorded the smallest such value. The distribution of these new facets per h 1,1 value is shown in the left panel of Figure 1, which should be compared to the distribution of the number of polytopes of a given h 1,1 value, which is given in the right panel of Figure 1. The shape of the two distributions are very similar, albeit decreased by around an order of magnitude. We note that the two distributions have different peaks: the number of reflexive polytopes peaks at h 1,1 = 27, while the number of new facets peaks at h 1,1 = 33.
Along with the cumulative distribution, we also show these distributions as fractions of the relevant numbers of polytopes in Figure 2. In the left panel, we show the number of new facets that appear at a given h 1,1 value, as a fraction of the total number of polytopes at that h 1,1 value. The scatter that emerges for h 1,1 > ∼ 250 represents the relatively small number of polytopes at these rather large h 1,1 values. Coupled with the information contained in Figure 1, it is clear that the overwhelming majority of facets are encountered well before h 1,1 250, though (as we will see), the number of triangulations are dominated by the outliers in the very large h 1,1 bins.
Finally, the right panel of Figure 2 shows the number of new facets at a given h 1,1 value, normalized by the cumulative number of polytopes to that particular h 1,1 value. We see that after the peak in the right panel of Figure 1, the ratio quickly saturates to a value of approximately 0.1. This is nothing more than the ratio of total unique facets found (47 × 10 6 ) to the number of 4d reflexive polytopes in the KS database (470 × 10 6 ). In understanding the reliability of our estimation methods, it is important to ask how often the unique facets that we have enumerated actually appear in the 4d polytopes in the KS database. This is the subject of Figure 3, in which we show a histogram of the number of appearances of a given facet.
One can see from the left panel of Figure 3 that a small number of facets dominate the database, with the frequency counts decreasing linearly on a logarithmic scale (the right panel of Figure 3). The 100 most common facets account for 74% of the total number. As mentioned in Section 1, we were able to achieve all possible FRTs for a very small set of the total number of 3d facets, but these cases represent over 88% of facets by appearance. This is indicated by the blue shading in both panels of Figure 3. The same graph, truncated at h 1,1 ≤ 120. We note that the erratic portion of the Figure 4 for h 1,1 > ∼ 120 accounts for fewer than 1 million polytopes. The truncated graph (Figure 4b) shows more clearly the prevalence of S3S.
The most common facet is the 3d polyhedron known as the standard 3-simplex (S3S), which has normal form vertices The S3S appears a total of 1, 528, 150, 671 times in the database, accounting for 20.45% of all facets, and appears in 87.8% of the polytopes. It first appears as a dual facet at h 1,1 = 1 and maintains a consistent presence through the database, as evidenced by the graphs in Figure 4. However, the S3S, being itself a simplex with no interior points, has only one FRT, and thus has essentially no effect on the total number of polytope FRSTs.

Machine Learning Numbers of Triangulations
In supervised machine learning (often called simply supervised learning), the output of the machine learning algorithm is a function, commonly called a model, which takes a specified set of inputs and produces a unique output value. This function contains numerical parameters, or weights, which are adjusted by the machine learning algorithm. The algorithm is trained on a set of input → output pairs, and attempts to minimize a predetermined loss function by adjusting the weights of the function. In this section we will estimate the number of FRSTs of 4d reflexive polytopes by using supervised learning to predict the number of FRTs of the facets. For this application, the input is a set of features which describe a facet, and the desired output is the number of FRTs for that facet. We describe in this section the features which we used as input and the structure of the model that we obtained via supervised learning.

Training data
The first step in our machine learning process was to generate data on which to train a model. In this case, this meant it was necessary to triangulate as many of the 3d facets as possible, in order to obtain the best possible training set.
Due to computational restrictions, we were only able to find the actual number of FRTs for 472, 880 of the facets, which represents 1.03% of the total. These consist almost entirely of facets which first appear in the dual polytope at relatively low h 1,1 values. Table 1 shows our progress. As one can see, h 1,1 = 14 was the last value at which we were able to obtain an appreciable fraction of the real FRT values. While the triangulated set may constitute only a small fraction of the unique facets, they account for over 88% of all facet appearances.
To obtain the FRT values, we first computed the 2-skeleton of each facet using our C++ code. The n-skeleton of a lattice polytope consists of all points which are not interior to any (n+1)-dimensional face. We then passed the 2-skeleton to the TOPCOM executable points2nfinetriangs in order to obtain the FRT number. As with the classification process, we used Northeastern's Discovery cluster to run simultaneous calculations for several hundred facets at a time.
When training with this data, and throughout our process of finding a good model, we organized the facets by the first h 1,1 at which they appeared as a dual facet, which we will refer to from here on as the h 1,1 value of the facet. Our reason for this was that there is a rough correlation between this h 1,1 value and the number of triangulations. This is reflected in Table 1: as h 1,1 increased, fewer cases were able to complete as the facets became more complex. For this reason, the training set for each model only contained facets with h 1,1 up to some maximum value (typically 11). This allowed us to see how well the model would  extrapolate to FRT numbers higher than it had seen, something that would prove to be an issue throughout our attempts.

Initial attempts
Our first attempt at obtaining a model was to follow the same pattern used for the 3d polytopes in Section 3 of [33]. As was done in the 3d case, for each polytope we constructed the 4-tuple (n p , n i , n b , n v ), consisting of the number of points, interior points, boundary points, and vertices, respectively. Our metric for determining the model's success was the mean absolute percent error (MAPE), which is defined as where n is the number of data points, and P i and A i are the predicted and actual values for the output, which here is ln(N F RT ) for the i th facet. We trained models using the same four algorithms as discussed in [33]: Linear Discriminant Analysis (LDA), k-Nearest Neighbors (KNNR), Classification and Regression Trees (CART), and Naive Bayes (NB). Of these, the CART algorithm gave impressive performance on the training set, as well as on the test data for the h 1,1 values that it had trained on. However, its ability to extrapolate to higher h 1,1 values was poor. As the majority of the facets lie at higher h 1,1 values, the inability to extrapolate well was unacceptable.
As an example, we show here the results from one of these attempts. The model is an ExtraTreesRegressor model from the scikit-learn Python package, using 35 estimators. We trained the model on a data set consisting of 60% of the known values from the range 5 ≤ h 1,1 ≤ 10. On the training set, the model achieved a MAPE of 5.723. It performed similarly on the test set consisting of the other 40% of these values, with a MAPE of 5.821. If we could be confident of a similar accuracy for the rest of the facets, this could be an acceptable model. However, when we used the model to predict on facets from higher h 1,1 values whose FRT numbers are known, we obtained the results shown in Table 2.
As one can see, the MAPE value consistently increases with the h 1,1 value, indicating that the model is less accurate farther away from its training region. Further, the rightmost two columns illustrate the problem with this model: it is under-predicting on facets with a higher number of triangulations. The model never predicts a ln(N F RT ) value greater than 12.467, which is similar to 12.595, the highest value in its training set.
This example is emblematic of the problems we faced with traditional machine learning. Models either fit the training data poorly, or had trouble extrapolating outside of their training regions. The primary difference between the 3d case (where this simple approach worked), and the 4d case, is the inability to generate a representative training set. In the 3d case, the number of FRTs for all facets that appear up through h 1,1 = 22 were triangulated, in a set where the maximum value of h 1,1 is 35. Conversely, in the 4d case, the number of FRTs was only obtained for all facets up through h 1,1 = 11, and for a majority of cases up to h 1,1 = 14, in a data set where the maximum h 1,1 value is 491.

Neural networks
Given the poor performance of traditional supervised machine learning as described above, we shifted our focus to artificial neural networks. In particular, a feed-forward network seemed the most suited to our purposes.
We recall briefly here the definitions of a neuron and a neural network. A neuron is a function f ( k i w i ·x i +b) whose argument x i is the input. The value of f is called the output. The parameters w i are called weights, b is called the bias (or offset), and · represents the appropriate tensor contraction. The function f itself is called the activation function and the choice of f is one of the characteristics that define the neuron.
A neural network is a (finite) directed graph, each node of which is a neuron. For each arrow in the graph, the output of the neuron at the tail is used as the input for the neuron at the head. Conceptually, a neural network is organized into layers, such that there are no connections between nodes in the same layer. The set of nodes whose arguments explicitly involve the original input data is known as the input layer, while the set whose output involves the actual output data is called the output layer. All other layers are referred to as hidden layers.
In the general case, a neural network may contain cycles, allowing for complicated connections between layers. In a feed-forward neural network, there are no cycles, and so information moves only in one direction. Cycles in a neural network allow for the current output to be influenced by the previous output. As the number of triangulations of a given facet is independent of the number for other facets, we can restrict ourselves to the feed-forward case.
We initially tried applying a feed-forward network to our 4-tuples of data, but despite trying a variety of architectures faced similar issues as in the previous subsection, with models consistently underpredicting outside of the training regime. To aid our networks in their prediction attempts, additional input data was generated. In addition to the numbers of points, interior points, boundary points, and vertices, the following quantities were computed for each facet to be used as inputs to the neural network: • Several quantities obtained from a single fine, regular triangulation of the facet: -The total numbers of 1-, 2-, and 3-simplices in the triangulation -The numbers of 1-and 2-simplices in the triangulation, without accounting for redundancy between higher-dimensional simplices -The numbers of 1-and 2-simplices shared between N 2-and 3-simplices, respectively, for N up to 5 As as example we consider the facet F with normal form is given by This facet has 11 lattice points, of which 10 lie on the boundary (leaving 1 in the interior). Of these 10 boundary points, 8 are vertices. The 1-and 2-skeletons of this facet contain 9 and 10 points, respectively. It first appears in a dual polytope with h 1,1 = 8, and has 10 faces and 16 edges. Its flip graph has 13 nodes. Obtaining one FRT for this facet from TOPCOM, we find that it contains 72 1-simplices, 48 2-simplices, and 12 3-simplices. Of the 1-simplices, there were 29 unique ones, while there were 32 unique 2-simplices. There were 13 1-simplices shared between two 2-simplices, 6 shared between three, 3 shared between 4, and 1 shared between 5. There were 16 2-simplices shared between two 3-simplices, with none shared between more than two. Hence the input vector describing this facet was Adding this additional data to our original inputs improved results, but not to the point of satisfaction. As an example of the struggles faced by a traditional neural network, we show here the results from one such model. This neural network has two hidden layers, each with 30 nodes. The first layer has a sigmoid activation function, while the second has a tanh activation function. The final layer is a rectified linear unit (ReLU), meaning that its activation function f act is equal to the positive part of its argument: A ReLU unit was chosen for the final layer as we want ln(N F RT ) to be positive. The model was trained on a data set consisting of an equal number of randomly chosen points from 6 ≤ h 1,1 ≤ 11 to avoid biasing the model towards higher h 1,1 values. The model performed well on the test set of values from the same h 1,1 , with a MAPE of 6.304. However, when we evaluated the model's progress on higher h 1,1 values, our results were as shown in Table 3. This model performs much better than the ExtraTreesRegressor outside of the training region, with a MAPE of 10.915 at h 1,1 = 14. However, the MAPE values are increasing with h 1,1 , and the mean predicted values are falling behind the true mean values, leading one to believe that the model is underpredicting at higher h 1,1 . This belief is confirmed by examining the histograms of the percent error values in the extrapolation range, shown in Figure 5. As h 1,1 increases, the percent error distribution skews to the left, indicating that the model is not keeping up with the increasing values. Hence this model is not suitable for extrapolation to the higher h 1,1 values which make up the bulk of our data set.

The EQL architecture
In this section we will instead utilize an equation learner (EQL) architecture, which was first introduced in [32], and will find that it has significantly better extrapolation ability. Each layer in an EQL network consists of two stages: a standard linear stage followed by a non-linear stage. The non-linear stage essentially replaces the activation function in a standard neural network layer, but differs significantly in that it changes the shape of the linear stage's output tensor. Like any feed-forward neural network layer, an EQL layer accepts some number of input values n i , and outputs some number n o of output values. The linear stage maps the n i -dimensional input x to an intermediate n m -dimensional representation z via an affine transformation. That is, for some weight matrix W ∈ R nm×n i and some bias vector b ∈ R nm . The second stage takes this intermediate representation z to the final n o -dimensional output y via a non-linear transformation. For this transformation, the elements of z are divided into two parts which are acted upon differently. To u of the elements, an activation function is applied. These nodes are called unary units. In principle, a different activation function can be applied to each unary unit. The other v elements are pairwise multiplied, giving v

Model selection
Our final neural network model is simple, with only one hidden EQL layer between the input and output layers. The input layer consists of 22 nodes as previously described. The linear stage of the EQL layer contains 45 nodes. Of these, 15 are unary units, with the other 30 being the binary units. The output from this layer thus consists of 15 + 30 2 = 30 nodes. The unary activation which was found to be the most successful was to square each of the unary units. The output stage consisted of one node, whose value represents the natural logarithm of the number of FRTs. This final output had a ReLU activation function to ensure that the output was nonnegative.
The Adam optimizer was used, with hyperparameters β 1 = 0.9, β 2 = 0.99, and = 1 × 10 −8 . At each layer, an L1 regularizer with λ = 0.001 was used on all weights in order to select for the most important features. In addition, a dropout rate of 0.1 was used for the EQL hidden layer to help select the most optimal neuron configuration.
To select a model, we trained across multiple ranges h 1,1 min ≤ h 1,1 ≤ h 1,1 max and examined the results. As the facets at h 1,1 ≤ 5 have few triangulations and constitute relatively few data points, we chose h 1,1 min ≥ 6 in each case. In order to have an adequate range on which to test our model's extrapolation, we also chose h 1,1 max ≤ 11. The training data sets consisted of an equal number of randomly chosen data points from each h 1,1 value. This was done rather than taking a percentage of the data to avoid biasing the model towards fitting to the higher h 1,1 values; there are, for example, over 37 times as many facets at h 1,1 = 11 as at h 1,1 = 6. We then tested each model on the full set of values across its training range to see how it performed. Our metric of choice was the mean absolute percentage error (MAPE), as this gives a size-independent measurement of how well the model is performing. The results of our training are shown in Table 4.
As the table shows, the model trained on the largest range of h 1,1 values, from 6 to 11, performed the best on both the test set and when extrapolating to higher values. This is the model that we used to generate predictions for the rest of the data set. We note that the extrapolation range contains a large number of points: there are 92162, 108494, and 124700    (Table 5) and the distributions of percent errors (Figure 7), we can see that this is not the case. The mean predicted values stay close to the true means, and the percent errors stay centered around zero as h 1,1 increases.
On the other hand, the extrapolation here is only to three h 1,1 values that are not in the training set, which provides some cause for concern since we will be interested in the polytope with maximal h 1,1 , which has h 1,1 = 491. For this reason we will perform an analogous analysis of 3d polytopes in Section 5, and will demonstrate that in that case the EQL accurately predicts the number of FRTs for h 1,1 values significantly beyond that in the training set; e.g., training up to h 1,1 = 11, we will accurately predict numbers of FRTs for all available facets, which go up to h 1,1 = 25. This suggests that the EQL may also make accurate predictions in the 4d polytope case at values of h 1,1 significantly beyond the training set. For completeness, we present in Figure 8 the formula learned by our EQL model. The variables x i are in the same order as in the example in Section 4.3. This formula does not seem to lend itself to any intuitive interpretation. Its interpretability is also hindered by the fact that its variables will generally be of different scales. To attempt to adjust for this, we normalized each variable so that it has an expectation value of 1. This was done using the mean value of each variable in the training data. We then calculated log 10 (|c|) for each coefficient c in the normalized expression to determine their relative importance on the value. The results of this are shown in Figure 9. We can see that there are two brightest squares, which correspond to the interactions of the number of unique 1-simplices in the seed triangulation with both the number of 3-simplices, and the number of 1-simplices without accounting for redundancy. However, while these terms have the highest values, there are many others at a similar order of magnitude. Thus, while we can say that some terms are more important than others, the complicated form of the function, along with a lack of knowledge of how these variables will change at larger h 1,1 , makes any stronger statements difficult.

Estimate for the total upper bound
Having obtained a satisfactory model, we generated the necessary input data for all of the facets. We then fed the data into the neural network to obtain the estimated number of FRTs for each facet.
Naively, for a given reflexive polytope, its FRSTs would be given by the set of all possible combinations of its facets' FRTs. But in reality, the number of FRSTs is reduced by two considerations. The first is that given two 3d facets F 1 and F 2 , the triangulation of each induces a triangulation on the intersection F 1 F 2 , and these induced triangulations may not overlap. The second consideration is that even if the induced triangulations do overlap, the aggregated triangulation of the reflexive polytope may fail to be regular even though the individual facet FRTs are regular. Thus, using the estimated number of FRTs for each facet, we are only able to estimate an upper bound for each 4d reflexive polytope. So, for a given reflexive polytope ∆ with facets F i , we know that where N F RST and N F RT are the number of FRSTs and FRTs of the 2-skeleton of a given reflexive polytope and facet, respectively. Via this inequality, the upper bound is given by the product of the facet FRTs. One can then calculate the estimate for each 4d polytope and sum the results. Determining the degree to which the product of FRTs overestimates the number of FRSTs is left to a future work. In practice, since our neural network predicted the natural logarithm of the number of FRSTs, we calculated the estimate for each polytope as 20046x 2 x 9 + 0.2274x 3 x 9 − 0.01136x 4 x 9 + 0.25023x 5 x 9 − 0.13467x 6 x 9 + 0.23766x 7 x 9 + 0.08246x 8 x 9 − 0.11071x 2 9 + 0.02683x 0 x 10 − 0.00254x 1 x 10 + 0.02553x 2 x 10 + 0.01278x 3 x 10 + 0.00554x 4 x 10 − 0.04584x 5 x 10 + 0.03714x 6 x 10 + 0.01352x 7 x 10 + 0.00105x 8 x 10 + 0.20665x 9 x 10 − 0.00099x 2 10 − 0.34428x 0 x 11 + 0.05335x 1 x 11 − 0.00715x 2 x 11 + 0.33024x 3 x 11 − 0.01736x 4 x 11 + 0.16713x 5 x 11 − 0.08848x 6 x 11 + 0.08456x 7 x 11 + 0.03144x 8 x 11 − 0.11092x 9 x 11 − 0.00151x 10 x 11 − 0.07556x 2 11 + 0.02349x 0 x 12 + 0.0031x 1 x 12 − 0.01155x 2 x 12 − 0.03125x 3 x 12 − 0.00223x 4 x 12 + 0.01197x 5 x 12 − 0.00373x 6 x 12 − 0.0372x 7 x 12 + 0.02864x 8 x 12 − 0.07238x 9 x 12 + 0.01227x 10 x 12 + 0.0125x 11    In these calculations, we used the actual value for ln(N F RT (F i )) whenever it was known, and the neural network prediction otherwise. As noted in Section 3.3, the triangulated facets account for over 88% of all facets. Performing this calculation for each polytope, it was found that the single polytope whose dual has h 1,1 = 491 (the greatest h 1,1 value) dominated the count. Our estimation method predicts that it has 1.465×10 10,505 FRSTs. This is more than any other polytope by over 1300 orders of magnitude, and so this value effectively serves as the total triangulation number for the entire set. Considering that this polytope has 680 integral points, 40 more than any other, it is unsurprising that our method has identified this polytope as possessing the most triangulations.

The h 1,1 = 491 polytope
The polytope whose FRST count dominates the database is the polytope dual to the single h 1,1 = 491 polytope, which we will call ∆ • 491 . This polytope has 680 integral points and five facets, of which only four are unique. The polytope and its four facets The facet F 1 first appears as a dual facet at h 1,1 = 23, and appears twice in ∆ • 491 . The facets F 2 , F 3 and F 4 each appear once in ∆ • 491 and nowhere else in the database.

20
Our model predicts that F 1 will have e 29.32 = 5.41 × 10 12 FRTs. As it appears twice, it will contribute e 2×29.32 = 2.93 × 10 25 to the triangulation number for this polytope, which is a subdominant contribution. F 2 has a larger contribution, as our model predicts that it will have e 2391.5 = 4 × 10 1038 FRTs. However, the triangulation number for the polytope is dominated by F 3 and F 4 , which our model predicts to have e 10,753.0 = 1 × 10 4670 and e 10,985.9 = 1.25 × 10 4771 FRTs, respectively. Combining these, we get the estimate of the number of FRSTs for this polytope:    Propagating these errors, we find that for ∆ • 491 we have log 10 (N F RT ) = 10, 505.2 ± 292.6. This gives a range for our estimate of 10 10,505.2±292.6 = [10 10,212.6 , 10 10,797.8 ].

Comparison to 3d Polytopes
The biggest point of uncertainty involving our analysis of the 4d reflexive polytopes is the inability to test our model on facets that first appear beyond h 1,1 = 15. As all evidence suggests that these facets -and in particular those that first appear at high h 1,1 valueswill be responsible for the dominant contribution to the total number of FRSTs, it would be preferable to have some confidence in our model's accuracy in this region.
Though obtaining the true number of FRTs for these largest facets is currently infeasible, we hope to gain confidence in our model by showing that we can achieve the same goal for the 2d facets of the 3d reflexive polytopes. As these facets are smaller, the number of FRTs for the majority of the facets can be triangulated. Therefore a model's extrapolation ability can be more thoroughly tested. Specifically, we will demonstrate that EQLs trained to predict ln(N F RT ) predictions on data with relatively low h 1,1 nevertheless accurately extrapolate to much higher h 1,1 values, lending some credence to our above assumption the 4d case.

Classification of the 2d facets
As in Section 2, we began by classifying the 2d facets of the 3d reflexive polytopes to avoid redundancy in our calculations. The same classification method as in Section 2 -for each facet, we obtained a 3d polytope by attaching the origin, calculated the normal form, and removed the origin afterwards.
In Section 2 we organized the facets by the first h 1,1 value at which they appeared. As h 1,1 , as obtained by Batyrev's formula, is equal to 20 for every 3d reflexive polytope, we instead organized the facets by the number of integral points |∆| in the smallest polytope ∆ in which they appeared. We note that this quantity is equal to h 1,1 (B) + 4, where B is the smooth weak Fano toric threefold associated to ∆. For the remainder of this section, by h 1,1 we will mean h 1,1 (B) = |∆| − 4.
As in Section 2, the calculations were performed using our C++ lattice polytope implementation. Due to the small size of the 3d reflexive dataset, the classification was completed in just over 20 seconds.
Performing our classification, we found that the 4, 319 3d reflexive polytope contain 344 unique 2d facets. The numbers of facets and polytopes that first appear at each h 1,1 value are shown in Figure 10.

Training data
Having classified the 344 unique facets, we set out to triangulate as many as possible in order to test our model. We again used TOPCOM to obtain the FRT count for each facet. We were able to successfully obtain the number of FRTs for 322 of the 344 facets, with all facets that appear up to h 1,1 = 25 being completed. Additionally, for 6 of the remaining 12 facets, as well as all of the other 322, we were able to obtain the number of fine triangulations (FTs). For the 322 facets for which the number of FRTs was found, there is very close agreement between the numbers of FRTs and FTs. More specifically, the number of FTs was at most 1.026 times the number of FRTs. Thus, we used the number of FTs as an approximation for the number of FRTs for the six additional facets. This gave us values for all facets that appear up through h 1,1 = 30. 22

Machine learning and extrapolation
Having triangulated as many facets as possible, we trained models with an EQL hidden layer. Our input data was simpler than for the 3d facets, and consisted of As before, we had the neural network train on ln(N F RT ). As the goal of this exercise was to mimic the 4d case, where we only have data for the smallest facets, we restricted our training data to facets that first appear at 4 ≤ h 1,1 ≤ 11. There are 243 facets that first appear at h 1,1 ≥ 12, meaning that our extrapolation region contains 70.6% of the facets.
Our trained model contains a single EQL layer, which used a combination of linear and quadratic unary activation functions. The performance of the model is shown in Table 6.  Table 6: Performance of our model across h 1,1 of the 3d reflexive polytopes.
Recall that h 1,1 = 25 is the highest value for which all of the FRT values are known; for 26 ≤ h 1,1 ≤ 30 we approximate the number of FRT by the number of FTs for some facets. As we can see, the model performs well through the extrapolation region of 17 ≤ h 1,1 ≤ 25. With our trained model, we predicted the number of FRTs for all 344 facets, and then estimated the number of FRSTs for any given polytope as the product of the predictions for each facet. We also computed these estimates using the real FRT numbers for polytopes where all facets have been triangulated. Taking the mean value at each value of h 1,1 , we obtain the graph shown in Figure 11. From this graph we see that the estimates from the model predictions are in good agreement with the estimate made from known values well outside of the training region of 4 ≤ h 1,1 ≤ 11. In particular, we note that at the highest h 1,1 value available to us, h 1,1 = 30, our model is in nearly perfect agreement with the true result.

Conclusion
In this paper we have provided the first concrete estimate of the number of fine, regular, star triangulations of 4d reflexive polytopes, n FRST 10 10,505.2±292.6 .
This provides an upper bound on the number of topologically distinct Calabi-Yau threefold hypersurfaces in toric varieties. We attempted a variety of supervised learning techniques for predicting n FRST and found that a neural network with the equation learning (EQL) architecture performed best. The estimation was computed by taking products of numbers of FRTs of facets, where the EQL was trained on 4d polytopes up to h 1,1 = 11 and made accurate predictions up to h 1,1 = 14, which was the highest h 1,1 at which we were able to compute FRTs of facets for the sake of validating the trained EQL. While encouraging that the EQL extrapolated to higher h 1,1 , it is so much smaller than the maximum h 1,1 = 491 that it is hard to trust such a high extrapolation without further evidence. For that reason, we performed an analogous analysis in the case of 3d polytopes and found that the EQL was able to extrapolate accurately to h 1,1 = 30, where the maximum possible value is h 1,1 = 35, despite the fact that it was only trained up through h 1,1 = 11. This provides some evidence that the corresponding extrapolation in the 4d case may be trustworthy, despite needing to extrapolate far beyond the training region. Indeed, such extrapolations were part of the motivation for the EQL architecture [32] in the first place.
After training successful models, we extracted the equations that they learned. The variables were rescaled to have an expectation value of 1 in an attempt to interpret these equations. We found that some variables and terms were more important than others, but the majority of variables made significant contributions, making interpretation of the equation difficult. This may be due to the functional form utilized in our EQLs.
In the process of making the prediction, we have demonstrated the overall utility of deep neural networks in this context, in particular their ability to extrapolate to regions of higher topological complexity. This provides motivation for further studies of triangulations and Calabi-Yau manifolds using techniques from data science.
Though this result is a modest prediction of a single number, it is a necessary step in understanding the ensemble of Calabi-Yau threefold hypersurfaces as a whole. We wish to turn to refined structures in the ensemble in the future, and in particular their implications for cosmology. For instance, studying axion-like particles (ALPs) in this context is particularly well-motivated since the ensemble is strongly dominated by the polytope that gives rise to the large number of ALPs.