Patterns in Calabi-Yau Distributions

We explore the distribution of topological numbers in Calabi-Yau manifolds, using the Kreuzer-Skarke dataset of hypersurfaces in toric varieties as a testing ground. While the Hodge numbers are well-known to exhibit mirror symmetry, patterns in frequencies of combination thereof exhibit striking new patterns. We find pseudo-Voigt and Planckian distributions with high confidence and exact fit for many substructures. The patterns indicate typicality within the landscape of Calabi-Yau manifolds of various dimension.


Introduction
A Calabi-Yau n-fold is a Kähler manifold of n complex dimensions with a trivial canonical bundle. In superstring theory, it serves as a compactification manifold wherein a ten dimensional theory at high energies reduces to an effective theory in four spacetime dimensions. In particular, global SU (n) holonomy ensures that 2 1−n of the original supersymmetry is preserved. Thus, confronted by the vacuum selection problem, Calabi-Yau compactifications present an avenue for Standard Model building, especially in the context of the heterotic string [1][2][3][4]. Indeed, the basis of the landscape is to consider flux compactifications on these geometries [5,6].
Let us briefly recollect what all this means. The (possibly singular) toric variety A n+1 is specified by an integer polytope in R n+1 , which is a collection of vertices (dimension 0) each of which is an (n + 1)-vector with integer entries and such that each pair of neighboring vertices defines an edge (dimension 1), each pair of edges defines a face (dimension 2), etc., all the way up to a facet (dimension n). Alternatively, can be defined by a set of integer linear inequalities, each of which slices a facet. The polytope is then the convex body in R n+1 enclosed by these facets. We will always include the origin as being contained in . Using the usual dot product , inherited from R n+1 , the dual polytope is defined by The polytope is reflexive if all the vertices of • are integer vectors. In this case, we can define the Calabi-Yau hypersurface X n explicitly as the polynomial equation x m,v r +1 where v r =1,...,k are the vertices of • with k being the number of vertices of • (or equivalently the number of facets of ), x r are the coordinates of A n+1 , and c m are numerical coefficients parameterizing the complex structure of X n . Indeed, the reflexivity of ensures that the exponents are integral thereby making the hypersurface polynomial as required.
The classification of these Calabi-Yau manifolds thus amounts to that of reflexive polytopes in various dimensions, and the intense computer work of Kreuzer and Skarke was to combinatorially find such polytopes. For n = 1, there are 16 such polytopes in R 2 , and we have Calabi-Yau onefolds, or elliptic curves. For n = 2, there are 4319 such polytopes in R 3 , and we have Calabi-Yau twofolds, or K3 surfaces. For n = 3, there are 473, 800, 776 such polytopes (which was a formidable computer task!), and we have the Calabi-Yau threefolds. This sequence {1, 16, 4319, 473800776, . . .} (1.3) of remarkable growth rate can be found in the Online Encyclopedia of Integer Sequences [38]. The numbers in higher dimension are still not known, nor has there been an asymptotic analysis of their growth. It should be emphasized that generically a reflexive polytope corresponds to a singular toric variety even though the hypersurface is chosen (by generic coefficients c m ) to miss the singularities and hence ensuring the smoothness of the Calabi-Yau X n . For example, of the some half-billion reflexive polytopes in R 4 , only 136 A 4 are in fact smooth [39]. As we desingularize the toric variety by various star-triangulations of , we are led to potentially inequivalent Calabi-Yau manifolds. In principle, the same Calabi-Yau geometry can arise from different reflexive polytopes or triangulations of a given reflexive polytope. Whereas K3 is essentially unique, we do not know how many Calabi-Yau threefolds there are. A systematic study to classify the desingularizations, to compute the necessary topological data, and to build an interactive online database [19] is under way. The moral is that there are almost certainly far more than half a billion Calabi-Yau threefolds! Luckily, the Hodge numbers depend only on the polytope and not on the choice of desingularization. (The intersection numbers, however, do depend on the choice.) For Calabi-Yau threefolds, the pair of Hodge numbers (h 1,1 , h 1,2 ) is a famous quantity. Indeed, the plot in Part (a) of Fig. 1 has become iconic. Here, the sum h 1,1 + h 1,2 is plotted against the Euler number χ = 2(h 1,1 − h 1,2 ), and the left-right symmetry supplies "experimental evidence" for mirror symmetry. There is enormous redundancy in this data: of the some half a billion reflexive polytopes, there are only 30, 108 distinct pairs of Hodge numbers and the pair (27,27) dominates the multiplicity, totaling almost one million. In Part (b) of Fig. 1 we have attempted to visualize the distribution of the multiplicity by having a color density plot of the logarithm of the number over each Hodge pair. Understanding this multiplicity forms the inspiration for the present work. While there have been analyses on the shape of the funnel-like plot [28,33,35], there has not been much work on its density, i.e., the distribution of the multiplicity of Hodge data for the Calabi-Yau manifolds of various dimension. Of course, fundamentally, this is entirely due to the combinatorics of reflexive polytopes and might in principle be analytically determined. However, given the complexity of the problem it is expedient to analyze the available data which have been compiled over the years, observe intriguing patterns, and draw statistical inferences before turning to analytic treatments. This is what we achieve in this work.
The organization of the paper is as follows. We perform a detailed analysis on the structure and behavior of the threefold data in Sect. 2. This is motivated by looking for an exact function describing the relationship of the distribution of the Hodge pairs (h 1,1 , h 1,2 ) with frequency.
In Sect. 2.1, we study the distribution of (h 1,1 −h 1,2 , f ). We find that this distribution is composed of a family of curves, for which each curve can be described using a modified pseudo-Voigt model. Although an approximation, the model is able to describe the general trend of the data, as well as some additional fine structure within each individual data point. Performing an analysis on the parameter relationships shows that three out of the five parameters can be expressed as a single variable, but we conclude that additional modifications need to be introduced in the model to overcome certain shortfalls.
Subsequently, Sect. 2.2 performs an analysis on the structure of (h 1,1 + h 1,2 , f ). Similarly, this distribution is composed of a family of curves for which each curve can be described using a Planckian profile. Combining the regression analysis for each curve within the distribution, we construct a single function able to approximately model the entire distribution of (h 1,1 + h 1,2 , f ) with only two variables. Section 2.3 uses the model developed in Sect. 2.1 to describe the distribution of the Euler number χ . Section 2.4 is dedicated to the description of model validation in our context, as the usual statistical tests are inadequate. Section 2.5 discusses possible implications to physics by referencing recent advancements in F theory and further investigations of structures within the Kreuzer-Skarke database. In Sects. 3 and 4, we perform primary analyses of Calabi-Yau twofolds (Picard number and multiplicity) and Calabi-Yau fourfolds. Due to the lack of a complete data set, we are unable to provide a thorough analysis of the fourfolds as with threefolds. Finally, the Appendix presents many supplementary plots and figures for the various sections. We conclude with a summary and outlook in Sect. 5.

Calabi-Yau Threefolds
As advertised in the Introduction, we will begin with the analysis of threefolds and identify patterns within this rich distribution of Hodge numbers and their frequency as plotted in Fig. 1. It turns out striking patterns do exist, pointing to a definite structure within the threefold data, which consists of the triple (h 1,1 , h 1,2 , f ), where f is the number of reflexive polytopes in the Kreuzer-Skarke database with the given Hodge pair. Here, h 1,1 and h 1,2 respectively count the Kähler and complex structure moduli of the Calabi-Yau obtained from the reflexive polytope. More precisely [8], we have that (2.1) In the above, is the defining polytope for the Calabi-Yau threefold X and * is its dual. Moreover, θ and θ * are the faces of specified codimension of these polytopes respectively; ( ) is the number of integer points of the polytope while * ( ) is the number of interior integer points. Indeed, our analysis of the distribution of Hodge numbers ultimately reduces to counting these integer points.
To facilitate the analysis, we plot (h 1,1 − h 1,2 , f ) and (h 1,1 + h 1,2 , f ) as shown in (a) and (b) of Fig. 2, respectively. Recall that the Euler number χ = 2(h 1,1 − h 1,2 ). We will use the difference h 1,1 − h 1,2 rather than the Euler number. In the simplest heterotic constructions, |h 1,1 − h 1,2 | corresponds to the index of the Dirac operator and gives the number of generations of particles in the low-energy spectrum [1].
By inspection, these plots already exhibit two patterns. Firstly, in both the h 1,1 − h 1,2 and h 1,1 + h 1,2 plots, there appears to be an inner distribution contained within the outer distribution. We find that these inner and outer distributions are related to the parity of h 1,1 ± h 1,2 . Figure 3 elucidates this point by having the odd and even values in different colors.
Though this parity structure may be a result of the Kreuzer-Skarke algorithm, its consistent appearance means we need to treat the distributions of even and odd distinctly for now.
The second evident structure which can been seen by inspection, is that the outer edge of the distribution of h 1,1 − h 1,2 (Fig. 3a) appears to follow a normal like curve, whereas the edge of h 1,1 + h 1,2 (Fig. 3b) follows a Planck like curve. It is through the analysis of these distributions that we deduce their characteristic behavior and underlying structure. In the main body of this paper, we outline the results and analysis of only the even distributions for h 1,1 − h 1,2 and h 1,1 + h 1,2 , except where it is important to present both. It turns out that any structure and patterns which are found in the even distributions for h 1,1 − h 1,2 and h 1,1 + h 1,2 are found identically in the odd distribution (see "Appendix" for various plots).

2.1.
Analysis of h 1,1 − h 1,2 . Before we can present the results, it is important to explain some notation. When working with the distribution of h 1,1 − h 1,2 , we find that it is composed of many curves, whose individual structure is the same as the "edge" or boundary of the distribution mentioned earlier. As a consequence of this, we refer to h 1,1 − h 1,2 as being composed of a "family of curves." Each curve is then classified by its r -value, where r = h 1,1 + h 1,2 . It is important to be clear that in this analysis, although h 1,1 − h 1,2 is just half the Euler number, we are not summing over all the possible values of h 1,1 + h 1,2 . We are keeping these values distinct: hence, the r -curves we obtain. Later on in Sect. 2.3 we sum over all possible values of h 1,1 + h 1,2 to get two plots representing the full Euler number distribution.
Consider the example in Fig. 4a. By ordering the data in terms of h 1,1 + h 1,2 , one can classify data sets within h 1,1 − h 1,2 by an r -value. Holding r fixed, we can plot the frequency f versus the difference h 1,1 − h 1,2 . We call each value of r a curve, which we can overlay on the same plot. In this example, we tabulate data for curves identified by r = 28 and r = 29. As a further illustration, we show explicitly the curves of the even distribution within h 1,1 − h 1,2 for r = 42, 54, 66 in Fig. 4b. By mirror symmetry, the curve is symmetric about the vertical axis, where h 1,1 − h 1,2 = 0. We can now perform a regression analysis for each individual curve, in the quest of obtaining a function describing the distribution. In the analysis, we indeed find an approximate function predicting the fine structure of the data. We operate with one caveat: we ignore data points which have a frequency lower than 2000. At large r , the data, whose frequency is below 2000, begins to deviate from our model. The reason for such deviations, comes down to the fact that our model, though remarkably accurate, is still an approximation. We suspect that with further modifications, such deviations can be accounted for and that consequently, it may be possible to find an exact function to map the frequency distribution of h 1,1 − h 1,2 . Such statements also apply to the distribution of h 1,1 + h 1,2 .

A pseudo-Voigt fit
Due to the normally-distributed, peak-like nature of these curves, we performed a regression analysis using the following models: Gaussian; Cauchy (Lorenztian); Pearson7; Breit-Wigner; Voigt; and pseudo-Voigt. In the "Appendix A.1.2", we perform a side by side comparison. It turns out that both the Voigt model (25e) as well as the pseudo-Voigt model (25f) give excellent fits.
We focus on the pseudo-Voigt model as it gives the best fits. This is a linear combination of a Gaussian and Lorentzian (Cauchy) distribution: with amplitude ( A), center (μ), Gaussian width (σ ), and fractional parameter alpha (α). However, we can modify the above distribution slightly so that the amplitude A of the distribution has an oscillating component where A 0 is the original amplitude of a particular curve described by the pseudo-Voigt distribution, a is the amplitude of oscillations, and b represents the period. By doing a regression analysis one curve at a time using this modified pseudo-Voigt model, we are almost able to replicate not just the basic structure of each curve, but even the individual behavior of each data point in the entire distribution. (See "Appendix A.1.3" for a comparative plot of the all the regression curves using the standard, unmodified, pseudo-Voigt model.) We plot the frequency against h 1,1 − h 1,2 for various values of r (odd and even). Figures 5 and 6 are striking in their accuracy.
As these figures illustrate, each curve follows a pseudo-Voigt profile, however the individual data points seem to "jump" up and down, as if oscillating. It is this behavior of the data points which can be accounted for by the modified pseudo-Voigt model. To do the regression analysis, we used Python lmfit with a custom model which is just the modified pseudo-Voigt model. The parameters that were fitted are (A 0 , a, b, σ, α). Due to mirror symmetry, μ = 0. In "Appendix A.1.4", one can find a table with the value of every parameter for every curve as well as their reduced χ 2 values.
A few comments explicate the regression lines and the behavior of the distributions.
1. When we refer to the model as being an "excellent fit," it is principally a statement made by inspection of the curves and the data. If one inspects the reduced χ 2 values  Fig. 29), the numbers are large, which statistically does not refer to a good fit. This is misleading however. Firstly, we need to consider that the number of parameters used in the model is five. This allows for a larger χ 2 R value. Secondly, the distribution is based on a discrete set of data. When doing a regression analysis using the modified pseudo-Voigt model, one obtains an equation which describes a continuous curve. Lastly, the frequency values span over several orders of magnitude. The tiniest deviation from a parametric model-in this case, the modified pseudo-Voigt profile-will be detected in cases where there is such a huge sample size. Typically the predicted model gives data points which are in the range of 0.02-3% accuracy from the actual data point. The tail behavior of the model is less accurate however, here the predicted values can be off from between 60 and 80%. For cases with a very poor fit, the last data point (large value of h 1,1 − h 1,2 ) can have an error of up to 300%-this is another example of the model being less accurate at lower frequency. When one is dealing with such Fig. 7. These two plots serve two purposes. The first is to show how the modeled data should really look by using data points (red points) instead of the (perhaps misleading) lines (refer to Comment 1 below). The second purpose is to illustrate that as r becomes large (left plot has r = 99, right plot has r = 118), the actual data points deviate more and more from the modeled data, implying that there is a missing function in the modified pseudo-Voigt model which would allow one to describe the data at much lower frequencies (color figure online) sample sizes, even a 1% error can give a difference of up to a couple of thousand. This difference summed over all the data points for a particular curve result in a large χ 2 R value. Due to the discussion in Sect. 2.4 we from now on ignore the χ 2 R as a test for model validation. Instead we opt for probability plots-which can also be seen in Sect. 2.4. 2. One obtains a continuous model to describe the discrete data, in reality, we should not be plotting fitted curves, but rather fitted data points-as can be seen in Fig. 7.
It is just illustratively more clear to display the curves. One could in principal work out what the discrete approximation is to our continuous model. 3. Although the modified pseudo-Voigt distribution does a good job to model the behavior of the data, one still needs to address the problems experienced with our model at low frequency. A problem which is hidden, by virtue of our cut-off frequency, is that the tail of our models predicts negative values, Fig. 8. There is a possibility that by having different variances σ g , σ c for the mixing of the two distributions (Gaussian, Cauchy), one could adjust the tail behavior. Introducing more and more parameters however does not always resolve the problem, as it is possible to over-fit the data. Yes, the model may be more accurate, but one loses physical significance. In a situation like ours, where one does not have any physical backing for choice in models, this line between fitting and over fitting is not so clear. 4. The odd distribution's behavior is more regular. In comparison to the even distribution, as one increases in r value, the behavior of the individual data points remain somewhat constant relative to the fitted curve. The even distribution becomes more and more irregular as one increases the r value. This suggests that there is an added parameter which seems as if it should be function of r . By regular and irregular we are referring to how well the data point is described by the model. 5. Both distributions become very irregular as the value of r becomes large (r > 100 and r > 120 for odd and even distributions respectively-see Fig. 7). A large r value refers to curves which have a relatively low frequency. Again this suggests that the pseudo-Voigt model needs to some how have some function of r which "distorts" the behavior of the curves as r increases (by the looks of how the real data deviates from the modeled one, it seems that the missing functions is also oscillating in nature). We obtain a good fit to the data. The right plot has a cutoff frequency of 460, which is equivalent to a percentage cut off of 9.68% (calculated relative to the peak frequency for that r -curve). This curve is exact There exist, however, certain cases where the model is exact. In other words predicted values are the same as the actual values. This happens when one adjusts the frequency cutoff for each r curve individually. That is to say, we only examine data points with at least f 0 reflexive polytopes with a given value of r and h 1,1 − h 1,2 . If there are fewer than f 0 cases, the data is ignored.
This trend persists for all values of r , however what becomes apparent is that it's not the percentage cutoff frequency that determines whether or not one gets an exact fit, but rather, the number of data points that remains after the percentage cut of has been effected. Figure 30 gives a table of how many data points remain after an appropriate cut off percentage has been chosen to achieve a perfect fit. From this table we see that for even curves, one almost always requires 7 data points to achieve a perfect fit; for the odd curves, the number of data points is 10. The reason for this constant number throughout all the curves is that the centers of all the distributions for the various curves are all similar. As soon as one includes a larger number of data points we cannot achieve exact fits, and the model becomes approximate. At very low r values the number of data points remaining after cutoff are not too different to the total number of points. As r increase, the total number of points increase-the fact that we can achieve exact fits becomes less meaningful. The other models-even when including an oscillatory component were unable to give exact fits.
The model is thus much more accurate at low r values, and as r increases the actual data deviates more and more from the fit. This reinforces the statements from the comments that the pseudo-Voigt model can be modified further with some function g(r, x) such that it will greatly improve the accuracy of the fit, and perhaps even become exact.
After the above analysis, we return to our goal of finding a single function describing the distributions. It is clear from the above that the function has to be a function of at least two variable, f = f (x, r ). We thus continue the analysis by plotting all the parameters versus r , in search for any relationships. We find that three parameters σ ,b and α can be expressed in terms of r , the other parameters, while they show trends, do not give a precise relationship with r . For the even distribution of h 1,1 − h 1,2 , the r values range from 36 to 110, whereas for the odd distribution (see Fig. 24a, b) the r values range from 37 to 99. By looking at Fig. 10a, it turns out that: (2.4) Our model of h 1,1 − h 1,2 now looks as follows: where A 0 (r ) and a(r ) are two unknown functions yet to be determined (see Fig. 10b for relationship plots). For replicating the plots as precisely as possible, one would need to keep the parameters, as they are, up to their 17 decimal values, without excluding terms as we have done. If one wants to reproduce the data from the model, one has to use the exact expressions. Making an approximation from an already approximate model leads to large errors. The first plot in Fig. 10a in particular evinces a sinusoidal fluctuation about the mean. This again indicates the possibility of refining the plots by adding an extra function.

Analysis of h
We begin by classifying the curves within the h 1,1 + h 1,2 distribution (Fig. 2) in an analogous way to how it was explained before. This time, we order the data by h 1,1 − h 1,2 such that a single curve within h 1,1 + h 1,2 can be identified by its q-value, where q = h 1,1 − h 1,2 . Due to mirror symmetry, the curve for q = −a is the same curve as q = a, thus within our two-dimensional plots will only have q > 0. In continuation to the analysis on h 1,1 − h 1,2 , we use a cutoff frequency of 2000 and only present results from the even distribution within h 1,1 +h 1,2 , unless stated otherwise. As an example, illustrating the classification of curves within h 1,1 + h 1,2 , consider the curves q = 0, 18, 30 in Fig. 11.

A Planckian fit
Each curve within the h 1,1 + h 1,2 distribution behaves the same. Just like in the h 1,1 − h 1,2 distribution, we do a regression analysis for each curve within a The width parameter σ has a linear relationship with r such that σ (r ) = 0.5097r − 12.7142. The amplitude period parameter, b, also has a linear relationship, however, since r is at most order 3 in magnitude, we can regard it as a constant such that b(r ) = 0.6629 ∼ 2/3. The same goes for the fraction parameter,α; we can regard it as a constant such that α(r ) = −0.0345. For odd parameter fit statistics see Fig. 24a; b plots of A 0 versus r (left) and a versus r (right). Both exhibit a similar pattern, however it is difficult to discern any nice relationships. For odd parameter plots see Fig. 24b the distribution independently, in the quest to describe the entire h 1,1 + h 1,2 with a single function. The model we chose to describe h 1,1 + h 1,2 is the simplest possible Planckian model The parameter names in the fit results are the amplitude A, the power n, and some real constant b. The shift in x-axis is so that the distribution begins at 0 as the smallest h 1,1 + h 1,2 above the cutoff is 22. The choice of a Planckian model in the above form is greatly motivated by the blackbody distribution f (T, λ). The q curves within h 1,1 + h 1,2 appear to behave in a manner analogous to the curves of constant T within the blackbody distribution. This is an initial trial. Later, we will discover additional structure in the distribution by trying to mimic the blackbody distribution exactly. It turns out that the general behavior of the distribution is modeled very well, cf. Fig. 12a. Consider the maximum of each of the curves. As indicated in Fig. 12a, we can fit the maxima to a curve as indicated using the data plotted for the given values of q. From the above analysis, the h 1,1 + h 1,2 distribution behaves analogously to a blackbody spectrum-except for one small subtlety. It is in this subtlety that the added structure within h 1,1 + h 1,2 is observed.
Just as was seen in Fig. 2, h 1,1 + h 1,2 appears to split up into two smaller distributions based on the parity of h 1,1 + h 1,2 . One can then further break up both the even and odd distributions into three further sets. The manner we observed this added fine structure is again motivated by a blackbody spectrum. In a true blackbody distribution, the curves of constant T never overlap. However, if you consider the lines of best fit only, when looking at our distribution one sees an overlap of certain curves. For example, observe the following plot of curves which clearly cross in Fig. 12b.
It turns out that this overlapping occurs consistently to the point where one can classify the curves (defined by their q value) into residue classes q n distinguished by n mod 6. On the left hand side of the h 1,1 + h 1,2 axis, the curves are ordered with red (residue class q 2 ) above yellow (residue class q 4 ) above blue (residue class q 0 ), whereas on the right hand side of the axis, the order is reversed. Similar behavior is observed in the odd distribution of h 1,1 + h 1,2 with the curves in the residue classes q 1 , q 3 , and q 5 (see Fig. 32b).
The clusters of curves constitute an entire set of mod 6 residue classes. These classes now define a set of curves which belong to very "nice" distributions that behave exactly like a blackbody distribution. 1 Compare, for example, a plot of the all the curves for even distribution of h 1,1 + h 1,2 , separated into their residue classes, Fig. 13  In the attempt to describe the data analogously to a blackbody distribution (a), we discover some subtle structure (b). a Lines of best fit from a regression analysis for a few select curves. The black data points represent the maximum frequency for that particular q-curve. The Black line is a line of best fit to describe the points of maximum frequency-this is analogous to a blackbody spectrum. See Fig. 32a for the curves within the odd distribution. b The curves segregate into three classes determined by the value of the even integer modulo 6. A similar pattern occurs in the odd distribution; see Fig. 32b As a first approximation we have successfully modeled the general trend of the data. There is, however, a fine structure to the individual data points that we would like to model. Introducing an oscillating term in the amplitude, as seen in the analysis of h 1,1 − h 1,2 , unfortunately did not seem to improve the fits.
Again, it appears that the least number of variables our functions can have is two, f = f (x, q). This function will be slightly different in the values of coefficients, depending on which residue class one is modeling.
Just as for h 1,1 − h 1,2 , we wish to express the parameters for the h 1,1 + h 1,2 model (2.6) in terms of q. We therefore write A = A(q), b = b(q), n = n(q) and seek to find expressions for the coefficients. While the x-axis of h 1,1 + h 1,2 has only positive q values-due to the fact the data points will overlap-when plotting them against the parameter values, we also have to consider the negative values of q. We present the various relationships (see Fig. 34 for the plots for the odd distribution of h 1,1 + h 1,2 analogous to Fig. 14).
Each distribution has an equation with different parameter values. However, the fact that we can express all the parameters in terms of q means we are able to get a generalized formula to describe the entire h 1,1 + h 1,2 distribution-as long as the frequency is above 2000. For succinctness we use the following notation for the coefficients where the subscript k = 0, 1, 2, 3, 4, 5 refers to residue class q k , and i = 0, 1, 2, 3, 4 refers to the coefficient of the i th power of q. Thus, we have: where the matrix of coefficient values for A k,i , n k,i and b k,i can be found in "Appendix A.2.2". 2 Our function (2.6) now is able to approximately describe the entire h 1,1 + h 1,2 distribution: Of course there are certain constraints on the values of q. For a given k, q has to be an integer which falls within the residue class q k . For even values of k, x = 2m, and for odd k, x = 2m + 1. We have m > 12.
A few comments about the analysis on the h 1,1 + h 1,2 distribution are in order.
1. The Planckian model used in (2.6) could be modified in some manner such that there is some oscillating behavior in the amplitude. Any kind of oscillatory term we introduce, only has a mild effect on the model's behavior. As the q values exceed 100, the model is not able to describe the data very well. 2. Assuming one adds an oscillatory component to the model, the module used in python to do the regression analysis called lmfit is sensitive to the initial conditions set by the user. Since the model is a custom model, it is difficult to find the correct initial conditions such that the best fit line oscillates close to every point (as with h 1,1 −h 1,2 ). 3. It is possible that the model used does not have the features required to describe the oscillatory "up and down" behavior of the data points. The Planckian model was chosen in that the h 1,1 + h 1,2 distribution resembled a blackbody distribution. 4. In choosing a polynomial model for Fig. 14a-c, we picked the lowest order polynomial that gave the best fit. Choosing the order to be four for all the plots appeared to be convenient. However, it is apparent that the parameter relationship plot in Fig. 14b would be better described by a polynomial of order 6. One could use an order 6 polynomial for all the other relationships plots too, but doing so might not have any physical significance. One can achieve an arbitrarily good fit the larger the order of the polynomial used, but that does not necessarily mean the chosen model is the correct model.

The distribution of the Euler number. The Euler number for Calabi-Yau threefolds is
As mentioned previously, we are summing over all the various r -curves to obtained the full-Euler number distribution. A plot of χ versus frequency yields the pseudo-Voigt distribution. In particular, we can model the behavior of the distribution almost perfectly using the modified pseudo-Voigt curve (2.11) and (2.12), which is repeated here for convenience: The results of the regression analysis for the Euler number distribution is presented in Fig. 16a. The fitted parameter values for f (χ ) E corresponding to even values of h 1,1 − h 1,2 are: σ, α, b, a)  Although χ is only even, the two curves originate from the fact that if you take χ/2 you get even and odd values. The two curves arise from the parity of χ/2 and are presented in Fig. 16a.

Goodness-of-fit.
A goodness-of-fit test is implemented as a means of testing how well a given model describes some given data. Typically the model validation process consists of only quoting a single statistically generated number like the R 2 , χ 2 or p values. Based on the size of this number, one then makes inferences on how well the chosen model fits the observation. One needs to be careful however of misusing such indicators as an absolute measure for assessing goodness-of-fit. For a structural equation model (SEM)-in our case, the modified pseudo-Voigt and Planckian models-this assessment is not so straight forward as it would be for a simple regression analysis. To quantify the predictive power of an SEM, a single statistical test does not suffice -in fact, there is no single test. According to [41], the best one can do is assess three different aspects of what it means to have a good fit, these are: overall fit, comparative fits to a test model and model parsimony. 3 The only real test available is the chi-squared (χ 2 ) test, when it comes to overall fit, this χ 2 statistic is the most popular test. The χ 2 test compares observed and predicted correlation matrices with each other, and so, statistical significance is evaluated based on the value of χ 2 . A large χ 2 value signifies a considerable difference between the correlation matrices. A low value indicates there is little statistical difference between matrices. Since the χ 2 test is between actual and predicted matrices only, when looking for overall fit, one searches for non-significant differences between the correlation matrices. Often, rather than presenting the χ 2 or χ 2 R (the chi-squared value relative to the degrees of freedom for the model) value, a p value is given instead. The p value, in a way, informs us whether one should reject a null hypothesis or not. A small p-value suggests that the differences in observed versus predicted are too large to be consistent with the null-hypothesised model i.e. assuming the null-hypothesised model, the probability of observing what we did is relatively small, suggesting either an absolutely fluke experimental outcome or an incorrect model null-hypothesis. The p-values can be determined by a p-value calculator by inputting the χ 2 R value. There is no standard way of choosing a significance level for the p-value, but typically p < 0.05 is considered statistically significant.
In general, statistical non-significance given by appropriate values of the χ 2 fit statistics is adequate. However, one must be careful of drawing similar conclusions for structural equation modeling. The fit statistic makes a statement of the correlation matrices only, not about whether or not the correct model is identified. This is largely due to the sensitivity to sample size of the χ 2 test. In our analysis, the sample size (number of reflexive polytopes) is enormous-almost one billion! For large samples (> 200) the χ 2 test will give significant differences for any model used. This sensitivity to a sample size, together with an effect size and alpha value, is related to what one calls the power of a test -the probability of not incorrectly accepting a null hypothesis that is actually false.
Without worrying too much about what an effect size and alpha value is; for any alpha value, the greater the sample size, the greater the power of the statistical test. However, increasing the sample size beyond a certain amount, can result in the test having "too much" power. 4 Perceived effects in very large sample sizes, will always become significant. 5 Observe how in Figs. 29 and 36 the χ 2 R values for all the different curves is extremely large, naively indicating that we have a horrible fit-which would be an incorrect conclusion.
It is clear from the above discussion that we cannot use the χ 2 or p values in validating our choice in model. What is not so clear, is the additional subtlety in using purely statistical means to asses goodness-of-fit for our data. This subtlety lies at the heart of almost all statistical tests-the construction of a null hypothesis. The term frequency, as used in the statistical sense, refers to the number of outcomes for a certain event. The measurement of this outcome will often have certain known or unknown factors affecting it. These tests check for the probability that the errors found are too significant to be solely due to random variations in the data. For example, assume that statistical tests give non-significant results. If the residuals are small enough to be considered random errors in the measurement of the frequency, we could say that the model is appropriate. If however, the residuals are too large or present additional structure, we could say the model is good, but not quite the correct one as the residual errors are not "random enough". In our case, there is no notion of measured frequency and error in measurement of frequencies. Our frequencies are generated as a result of a combinatoric calculation. Statistical tests assume that the input is from measurement and observations (obeying some null-hypothesis), thus they are inherently constructed with this notion in mind. By inputting our data, the tests are trying to calculate something from a data set which does not obey the very assumption they use in their calculations. We are not exactly clear how much this affects statistical outcomes, but it is important to keep in mind.
How do we validate then, that our chosen models are a good fit, or that our model is the best one at describing the data? We implement graphical methods. The first graphical method is obviously through pure inspection-this is not quite statistically quantifiable. There is a statistically based graphical method to asses goodness-of-fit called probability plots, Q-Q plots or P-P 6 plots. These plots were initially constructed to test the "normality" of a data set when the sample size is too large to depend on the χ 2 and p values. In principle, a standard probability plot tells you the likelihood that the a sample's distribution of data obeys a normal distribution-hence checking for normality. The answer to the question is not given by a statistical value, but rather by a graphical representationfrom which one can extract statistical numbers. If the plotted data on this probability plot is a straight line, then we can determine that the sample set is normally distributed.
We can extend this concept further: we can take two different samples, and take a probability plot to determine if two data sets come from populations with a common distribution. Such a probability plot is referred to as a Q-Q (quantile-quantile) plot. Extending this concept one more time-as for our use-we will take the quantiles of our theoretical distribution (the modified pseudo-Voigt and Planckian profiles) as our "first sample" and plot them against the quantiles of our data as our "second sample"; this will give us our probability plot. In all the probability plots, it is the quantiles of the respective data sets which are plotted against each other.
Quantiles are basically just a generalization of quartiles. For example, the kth percentile of a set of values divides them, such that the number of values which lie below is k%, and the number of values which lie above is (100 − k)%. The 25th percentile is the lower quartile or the 1 4 quantile. Quantiles are the same as percentiles, but indexed by sample fractions rather than by sample percentages. Suppose that p ∈ [0, 1], the aim is to find the value that is the fraction p of the way through the ordered data set. As an example, if p = 1 2 = 0.5, we want to know what is the value that sits at p = 0.5 of the way through i.e. half way. The value that sits there (this value may have to be interpolated) will be called the quantile for the fraction p = 0.5. There are many different algorithms for generating the quantiles for a given data set, we use python to generate the quantiles in a manner similar to that discussed above. For an ordered data set, x 1 ≤ x 2 ≤ x 1 . . . ≤ x n−1 ≤ x n , the most common way of calculating quantiles is to first compute the empirical distribution function: (2.15) and then define the quantile function to be the inverse of F(x): By generating the quantiles of some theoretical model and comparing them to the quantiles of a given data set of equal length, one can determine if the data set belongs to the same distribution as the data set belonging to the theoretical model-i.e., does the data fit the model. If the quantiles are roughly equal the plots will all be more or less on a straight line. In probability plots: 1. The length of data set needs to be equal. For unequal lengths, one must perform an interpolation of data. 2. If two identical data sets were compared to one another, the points would lie exactly on a 45 degree line. Thus, for two different data sets, the deviation from this reference line determines the likelihood that the sets belong to similar distributions. To quantify this likelihood, one can calculate the R 2 -value of the data, relative to the y = x reference line. 3. Q-Q plots are not only limited to determining similarity in data sets. By analyzing the deviations which occur, one can determine how the scale and location of the data is shifted -the data would follow some line y = mx + c, where m, c would be the estimates of these shifts in scale and location. Also, from the distribution of points above or below the reference line, one can infer aspects of the tails and skewness in the data.
Consider the following curves for the h 1,1 − h 1,2 distribution with r = 60 in Fig. 17a, b. For the h 1,1 + h 1,2 distribution we just plot the data of q = 2 together with the corresponding probability plot in Fig. 18.  Fig. 17. Using probability plots, we are able to statistically see which model provides the better fit. We employ such graphical methods as standard goodness-of-fit tests such as the χ 2 fail to give meaningful results. a Best fit curve for r = 60 based on the left Gaussian model, right modified pseudo-Voigt model. b Probability plot for Fig. 17a. The x-axis represents the quantiles for the actual data, the y-axis represents the theoretically predicted quantiles-dependent on the model chosen (red modified pseudo-Voigt model (R 2 = 0.99974); blue Gaussian model (R 2 = 0.99334). The R 2 values are not relative to the best fit lines, but are relative to the 45 • reference line y = x. The closer the R 2 value is to 1, the more similar the predicted quantiles are to the actual ones, thus, the better the model describes the data (color figure online) In its current form, the probability plots do not allow us to calculate p-values of the various models. This due to the same issue encountered previously. If one however standardizes the data according to the Z-standardization: (2.17) where μ and σ are the mean and standard deviation, it is possible to calculate the pvalues since the magnitude of each sample gets rescaled. The probability plot of all the models is displayed in the "Appendix", with the relative p-values for each model- Fig. 25g, h. What we see is that the modified pseudo-Voigt is statistically the model which provides the best fit.

Implications for physics.
Calabi-Yau threefold compactifications of string theory have been the traditional approach to obtaining interesting phenomenological models. The plethora of geometries and configurations, ranging from heterotic strings on Calabi-Yau threefolds endowed with stable bundles, to D-brane probes on local Calabi-Yau varieties, to F-theory compactification on elliptic fibrations, has over the years justified the landscape and inspired various statistical analyses of the space of vacua.
Of particular interest has been the investigation of further structures in the Kreuzer-Skarke database, including identification of "the tip" where Hodge numbers are small [21,35,46], the top bounding curves where Hodge numbers are large [43], identifying elliptically fibered threefolds [28,29,42,44], finding further fibrations such as K3-fibers [33,45], or a step-by-step construction of all possible smooth Calabi-Yau hypersurfaces from the reflexive polytope data [19], etc. Now, it should be emphasized that each of the some 473 million reflexive polytopes admits, as an ambient toric variety, many 7 so-called maximal projective crepant partial (MPCP) desingularization, each of which gives rise to a different Calabi-Yau threefold. Therefore, the actually number of Calabi-Yau threefolds from the Kreuzer-Skarke database is many orders of magnitude larger than 10 10 . While manifolds coming from the same reflexive polytope have different geometrical data such as triple intersection numbers, which in the standard embedding in heterotic compactification correspond to Yukawa couplings, they do share the same Hodge numbers because these, by virtue of (2.1), depend only on the combinatorics of the polytope. We need to wait for significant theoretical and/or computational advances to have the full data of the Hodge pairs in view of the Calabi-Yau manifolds themselves, which might give new statistics. It would be perhaps even more interesting if the statistics remain largely the same, thereby hinting at some universality in the distribution of such topological data.
In the context of the recent works on F-theory, it is an important fact the vast majority of the Kreuzer-Skarke threefolds are elliptic fibrations over some complex surface, and in fact birational to [42,44,45] a Weierstrass model. For example, some 10 6 alone [42] come from elliptic fibrations over P 2 . Therefore the Kreuzer-Skarke dataset is directly relevant to F-theory. In the more classical context of heterotic strings, the Hodge numbers dictate the number of (anti-)generations in the standard embedding. In our above plots, the Euler number ±6 indicate the three generation models. The generic paucity of χ = ±6 manifolds led to the industry of non-standard embedding where extra vector bundle and Wilson line information is needed. The advantage of F-theory models is that the compactification data comes only from the Calabi-Yau manifold. In particular, the intersection theory of the cycles and fiber-degeneration structure determine the gauge group, anomaly cancellation, matter content, and Yukawa couplings. Much of this can be extracted from the polytope data.
F-theory compactifications on threefolds, resulting in six dimensional gauge theories have been considered from the point of view of systematically classifying the base complex surfaces [44] and the statistics have been performed therein. Non-toric bases were considered and a number of Calabi-Yau threefolds beyond the Kreuzer-Skarke data were found. It is remarkable that the overall distribution of Hodge numbers remains largely unchanged. Indeed, in unpublished work of Kreuzer-Skarke, where they extended the hypersurface in toric fourfolds to double hypersurfaces in fivefolds, obtaining some 10 10 more manifolds and the shape of Fig. 1 persists. All these point to the Kreuzer-Skarke data being a robust representative in the space of Calabi-Yau threefolds. Our distribution subsequently seems a representative sample, and we speculate that analyses of string vacua, in any context, should be thus weighted. For example, in study of the "typical" number of generations in four dimensional heterotic compatification, or of charged matter in six dimensional F-theory compactification, one should superpose our pseudo-Voigt profile.

Calabi-Yau Twofolds: K3 Surfaces
As noted in the Introduction, there are 4319 data points, corresponding to hypersurfaces as Calabi-Yau twofolds, i.e., K3 surfaces, in reflexive three dimensional polytopes. Being algebraic K3 surfaces, there is only one relevant topological invariant, the Hodge number, h 1,1 = 19. However, there is a further refined algebraic quantity for the K3 surface X , the rank of the Neron-Severi lattice H 2 (X ; Z) ∩ H 1,1 (X ), which is the Picard Number ρ(X ) and which enumerates the number of divisors on the surface up to algebraic equivalence. The Picard numbers of the 4319 K3 surfaces were computed in [12]. We present the distribution thereof in Fig. 19a.
We only used the standard pseudo-Voigt profile as the modified one did not change the fit significantly. Here are the fit statistics for best fit curve: (A, μ, σ, α) = (4517. 45,10.76, 2.97, −0.031), as shown in Fig. 19.
What is interesting about Fig. 19a is that the "oscillations" of the actual data points above and below the modeled curve is very apparent, yet modifying the pseudo-Voigt profile is unable to give any significant improvement. This leads to two potential conclusions: (a) the pseudo-Voigt profile is not the best profile to use in combination with an oscillatory component; (b) the manner in which the oscillations occur is not so straight forward as introducing simple cosine function. An interesting exercise would be to superimpose a cosine function along the distribution, by rotating it as one traverses the profile. As long as the wavelength, amplitude and angle of rotation are all small enough, the continuously rotated cosine function should remain a function everywhere along the profile.

Calabi-Yau Fourfolds
The analysis of the four fold data is performed in the same spirit as the threefold data. We aim to look for patterns in the frequency plots. Due to complex conjugation and Poincaré duality, the only topological invariants of fourfolds that vary are h 1,1 , h 1,2 , h 1,3 , and h 2,2 . Three of these are independent [15]: Fig. 19. Using probability plots, we are able to statistically see which model provides the better fit. We employ such graphical methods as standard goodness-of-fit tests, such as the χ 2 test, fail to give meaningful results. a For K3 surfaces, the multiplicity is plotted against Picard number with a pseudo-Voigt fit. b Probability plot for the multiplicity quantiles versus the fitted standard pseudo-Voigt quantiles. The R 2 value is 0.99908 h 2,2 = 44 + 4h 1,1 − 2h 1,2 + 4h 1,3 . (4.1) We compiled a database for the frequency of the triplets (h 1,1 , h 1,2 , h 1,3 ) to then obtain the following data structure Since one expects mirror symmetry within the invariants (h 1,1 ± h 1,3 ) [40], a plot of h 1,1 −h 1,3 against h 1,1 +h 1,3 (Fig. 20) should be symmetric about the line h 1,1 −h 1,3 = 0.
Doing a quick analysis of the data yields the following observations: only partial mirror symmetry is found. For 69.77% of data points, the point (h 1,1 − h 1,3 , h 1,1 + h 1,3 ) is accompanied by the point (−h 1,1 + h 1,3 , h 1,1 + h 1,3 ). Taking frequency into account, For now, we have performed a primary analysis on the Euler distribution only. The Euler number for fourfolds is [15]: Interestingly enough, the distinction between even and odd distributions persist in the fourfold data base. For illustrative purposes, we show the distribution of χ/6 against frequency. It is not immediately clear what is the reason for the gap, presumably it could be a cluster of data points which is missing from the data base. Until one obtains the complete fourfold data base of Hodge numbers, one can't say much else. We also preset plots of the individual Hodge numbers h i, j versus frequency.

Conclusions and Outlook
By examining the distribution of Hodge numbers of Calabi-Yau manifolds of complex dimension two, three and four, realized as hypersurfaces in toric varieties of one higher dimension as constructed by Kreuzer and Skarke based on the results of Batyrev and Borisov, we have found many hithertofore undiscovered patterns. We summarize our key points as follows.
• For threefolds, there are 30108 distinct pairs of Hodge numbers (h 1,1 , h 1,2 ) from 473800776 reflexive polytopes, the frequency of both the half-Euler number h 1,1 − h 1,2 and the sum h 1,1 + h 1,2 are distributed according to whether the value is odd or even; -The half-Euler number h 1,1 − h 1,2 follows a modified pseudo-Voigt distribution where the modification is made in the amplitude A of the distribution, such that There is fine periodic substructure in terms of curves indexed by an integer r . Our model is accurate for low r -values (r ∈ [36,110] and r ∈ [37,99]); using probability plots as test for goodness of fit, this modified pseudo-Voigt model is indeed the best one out of several standard candidates (cf. Fig. 29 for all the R 2 and p values). Among A, σ, α, b, a, the parameters σ, b, α have a strong linear relationship with r : Even r Odd r σ (r ) = 0.5097r − 12.7142 0.51379r − 13.2494 α(r ) = 2 × 10 −4 r − 0.0345 2.25 × 10 −4 r − 0.0388, b(r ) = 3.7299 × 10 −5 r + 0.6629 7.9101 × 10 −5 r + 0.65956 For a small subset of curves with a low r -value and an appropriate cut-off frequency, it is extraordinary that the model exactly fits the data. That is, it appears that the number of data points for each curve required, such that the model will result in a perfect fit is: 7 for even r -valued curves and 10 for the odd valued r -curves, see Fig. 30.
-The quantity h 1,1 + h 1,2 follows a Planckian distribution There is a substructure of curves, indexed by an integer q, each Planckian and with some periodic behavior. The curves q n appear clustered into groups of residue classes distinguished by n mod 6, and the parameters log(A), n, b all have extremely strong relationships with the q value. By substituting this relationship into the model, we have a function f k (x, q) that approximately describes the entire h 1,1 + h 1,2 distribution up to a q value of 69, 100: with k = 0, 1, . . . 5 and the coefficients given in A.8, A.9, A.10. -The Euler number χ = 2(h 1,1 − h 1,2 ) follows the modified pseudo-Voigt distribution composed with a sinusoidal A + A 0 + a cos(2π b · x) which is almost an exact fit, with the coefficients given by (A 0 , σ, α, b, a)  It is remarkable how well the pseudo-Voigt distribution, modified with a sinusoidal component, fits the distribution of topological numbers of toric Calabi-Yau manifolds, often giving an exact fit. Of course, what we are studying at heart is the number of integer points inside (cf. (2.1)) reflexive polytopes. This is a highly non-trivial counting problem whose answer will ultimately give full analytic results for our distributions and we suspect that the answer should be some generalized pseudo-Voigt function.
Now, in addition of Calabi-Yau manifolds, stable vector bundles over various such manifolds in a variety of construction beyond Kreuzer-Skarke have also been studied algorithmically over the years in the context of heterotic compactification (cf. e.g., [23][24][25][26]). One can see a somewhat pseudo-Voigt profile in these as well, even though there is no underlying polytope and the counting problem is dictated by certain Diophantine system. It would be interesting to see why this shape is universal in such classifications.
where β is the Beta function.

Breit-Wigner Model
This model is based on the Breit-Wigner function.
The Voigt model is a convolution of the Gaussian and Lorentzian models.

Pseudo-Voigt Model
We present the standardized and shifted probability plots for the above comparisons: Fig. 24. The plots of the various parameters A, σ, α, b, a versus r for odd values of r . a The width parameter σ has a linear relationship with r such that σ (r ) = 0.51379r − 13.2494. The amplitude period parameter, b, also has a linear relationship, however, since r is at most order 3 in magnitude, we can regard it approximately as a constant such that b(r ) = 0.65956 ∼ 2/3. The same goes for the fraction parameter,α, we can regard it as a constant such that α(r ) = −0.0388. For even parameter fit statistics see Fig. 10. b Plots of A 0 versus r (left) and a versus r (right). Both exhibit a similar pattern, however it is difficult to find any nice relationships. For even parameter plots see Fig. 10 Fig . 25. For all models, the left hand graph is for r = 54 and the right is for r = 51. The probability plot presents all the models together. All the above mentioned modeled are included to compare their resemblance with the actual data. The larger the p value the better the line y = x fits the data, implying the better the model is at describing the data. a Gaussian model.    Left list of best fit coefficients for all even curves r ∈ [34,120]. Right list of best fit coefficients for all odd curves r ∈ [35,99]. In both figures, the last two columns represent the R 2 and p values for the probability plot for each curve. The p-values were obtained by first performing a Z-Standardization on the data Fig. 30. A list showing the number of data points left after increasing the cut off frequency to achieve a perfect fit. Conversely, one may state is as, the number of data points for each curve required such that the model will result in a perfect fit  In the attempt to describe the data analogously to a blackbody distribution (a), we discover some subtle structure (b). These are the odd counterparts to Fig. 12. a Lines of best fit from a regression analysis for a few select curves. The black data points represent the maximum frequency for that particular q − curve. The black line is a line of best fit to describe the points of maximum frequency-this is analogous to a blackbody spectrum. See Fig. 12a for the curves within the even distribution. b The curves segregate into three classes determined by the value of the even integer modulo 6. A similar pattern occurs in the even distribution; see   . 33. We illustrate the added structure for odd h 1,1 + h 1,2 data, by displaying how the regression curves can be divided into residue classes. For the list of even curves, refer to Fig. 13. a All the curves color coded according to what residue class their curves q n belongs to. b Family of curves all belonging to q 1 . c Family of curves all belonging to q 3 . d Family of curves all belonging to q 5