Topological Data Analysis for the String Landscape

Persistent homology computes the multiscale topology of a data set by using a sequence of discrete complexes. In this paper, we propose that persistent homology may be a useful tool for studying the structure of the landscape of string vacua. As a scaled-down version of the program, we use persistent homology to characterize distributions of Type IIB flux vacua on moduli space for three examples: the rigid Calabi-Yau, a hypersurface in weighted projective space, and the symmetric six-torus $T^6=(T^2)^3$. These examples suggest that persistence pairing and multiparameter persistence contain useful information for characterization of the landscape in addition to the usual information contained in standard persistent homology. We also study how restricting to special vacua with phenomenologically interesting low-energy properties affects the topology of a distribution.


Introduction
String theory appears to have an enormous number of vacua, making up what is called the string landscape. The seminal work of [1] pointed out that the presence of multiple fluxes leads to a discretuum of values for physical observables like the cosmological constant. A statistical approach to studying the landscape was advocated in [2]. It was further argued in [3] that the existence and size of the landscape necessitated the use of anthropic arguments. Some subsequent work used statistics and explicit constructions to explore corners of the landscape and make arguments about stringy naturalness, especially with regard to the scale of supersymmetry breaking, the presence of various symmetries, and the cosmological constant [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19]. Efforts were also made to propose alternatives to anthropics for vacuum selection in the landscape [20][21][22]. Despite the widespread belief in the landscape paradigm, skepticism has also been raised [23,24]. Now, with the advent of Big Data, we are perhaps well-positioned to ask previously out-of-reach questions about the landscape (see footnote 1). Several recent papers have applied techniques from machine learning to the landscape and other data sets in string theory [27][28][29][30][31][32][33][34][35][36][37]. One drawback of machine learning approaches, however, is their lack of interpretability: while a neural network may be effective in classifying pictures of cats and dogs, its method for making this decision is generally opaque. If the goal is physical insight, a black box algorithm is not sufficient. For example, if one is classifying string theory vacua, one would like to understand how the classifier works.
In this paper, we propose studying distributions of string vacua using persistent homology [38][39][40][41] (see [42] for a recent historical review). Persistent homology is a technique from the field of Topological Data Analysis (TDA) that allows one to formalize the notion of the shape of a data set. Roughly, this is done by thickening each point in the data set and computing topological invariants at various stages of thickening. The persistence of topologically nontrivial features like connected components, loops, voids, etc. throughout this thickening defines the multiscale topology of the data set.
We argue that persistent homology is useful for characterizing distributions of string vacua, with the ultimate goal of understanding how structure in the distribution correlates with low-energy physics. One advantage that persistent homology has over machine learning techniques is its clear interpretation: we are simply computing topological invariants of simplicial complexes. Persistent homology has proven useful in diverse fields including sensor networks [43], image processing [44], bioinformatics [45], genomics [46], protein structure [47], neuroscience [48], cosmology [49][50][51][52][53][54], and many more. For example, persistent homology has been used to study the Cosmic Web of large-scale structure, composed of interlocking voids, filaments, and sheets at a variety of scales [51]. Void structure is also found in moduli space for the simplest string toy example, flux vacua on a rigid Calabi-Yau, studied in [6]. In the case of the Cosmic Web, it seems that at a scale of 300 Mpc, our galaxy resides within a void [55]. We might ask where our universe lives within the distribution of string theory vacua. As with the Cosmic Web, the answer to this question is necessarily multiscale.
Footnote 1: We should not confuse the Big in Big Data with the Big that characterizes the string theory landscape. Estimates suggest that the number of flux vacua for a typical geometry is around $10^{500}$ [4,6] and can be as large as $10^{272,000}$ [25], with the number of geometries in a particular F-theory ensemble bounded below by $\frac{4}{3} \times 2.96 \times 10^{755}$ [26]. However, we might hope that studying larger subsets of the landscape than previously possible will provide hints toward a more complete picture.
Persistent homology can be used to compare different distributions of string vacua (by summarizing what topological features are present and how they relate) and to understand where individual vacua reside within a distribution. We imagine that applying persistent homology to distributions of string vacua could be potentially useful for understanding vacuum selection (which becomes even more interesting when issues of computational complexity are considered [56][57][58][59]) or tunneling in the landscape. Moreover, by choosing only special vacua with certain phenomenologically interesting properties and studying the restricted distributions with persistent homology, we may learn which low-energy properties of a string vacuum are simultaneously allowed, and where they live.
The structure of this paper is as follows: in Section 2, we outline how persistent homology describes the multiscale topology of a point cloud. In Section 3, we briefly review the construction of Type IIB flux vacua. We then apply persistent homology to distributions of Type IIB flux vacua on various backgrounds: the rigid Calabi-Yau, a hypersurface in weighted projective space, and the symmetric $T^6$. While these are simplified setups, they contain useful lessons for more general (and realistic) constructions. In Section 4 we conclude.
It is worth noting that an earlier work [60] applied some of these techniques to string vacua, albeit in an incomplete fashion. In this paper we carry out a thorough study, with an eye towards a large-scale analysis of the landscape. As we shall see, the new techniques and observables we develop have wider applicability than the simple examples we study.

Persistent Homology
Persistent homology is a multiscale approach that can robustly characterize the shape of a data set. Though data sets are generally discrete, it is often the case that they contain patterns that are topological in nature. For example, if enough points are uniformly sampled from an annulus, the presence of a topologically nontrivial feature in the set of points is clearly visible (see Fig. 1). Persistent homology systematically describes the presence of such features by embedding the data set in a sequence of discrete complexes. Often each complex in the sequence can be thought of as a thickening of the points. In these cases, we can associate each complex in the sequence with a length scale given by the thickening. For each complex in the sequence, we are able to compute the number of topologically nontrivial features. Moreover, as we move through the sequence of complexes (to larger scales), we are able to track individual topological features as they are created and destroyed. The persistence of a topological feature in the sequence of complexes allows us to assign to it a notion of significance. The zeroth-order intuition is that long-lived features are real, robust aspects of the data, while short-lived features can be attributed to noise. (We will see in Sec. 3.4 that short-lived features can sometimes encode useful information about the data set's structure.) In this section we briefly review simplicial complexes and persistent homology. We describe how witness complexes [61] allow us to efficiently reconstruct the topology of a point cloud using a small sample of points. We then review persistence diagrams, which encode the output of a persistent homology computation. We emphasize that potentially useful information regarding persistence pairs is usually thrown out when making persistence diagrams. For a more in-depth discussion of persistent homology and computational topology in general, see [41,62,63].

Simplicial complexes and persistent homology
Consider a collection of points and pairwise distances between them. We call the collection of points a point cloud. We would like to formalize a notion of topology for the point cloud.
To do this in a nontrivial way, we need to endow the point cloud with some extra structure. In this paper, we will embed the point cloud in a simplicial complex. More precisely, we will represent the point cloud with a sequence of simplicial complexes. We will then use persistent homology to track the creation and destruction of topological features as we move through the sequence of complexes.

Simplicial complexes and simplicial homology
A simplicial complex is a collection of vertices (0-simplices), edges (1-simplices), triangles (2-simplices), etc. such that (i) all faces of a given simplex in the complex are contained in the complex and (ii) the intersection of two simplices in the complex is either empty or a face of both. A simplicial complex and a collection of simplices that is not a simplicial complex are shown in Fig. 2. Given a simplicial complex, we can compute topological invariants. Consider some simplicial complex $S$. We define $k$-chains as collections of $k$-simplices in $S$. A $k$-chain may be formally represented as a sum $c = \sum_i a_i \sigma_i$, where $i$ runs over $k$-simplices $\sigma_i$ in $S$ and we use $\mathbb{Z}_2$ coefficients, $a_i \in \mathbb{Z}_2$. The $k$-chains form a group under element-wise addition, which we denote $C_k$. There are two important subgroups of $C_k$. To define them, we first define the boundary operator $\partial_k : C_k \to C_{k-1}$. Writing a $k$-simplex in terms of its vertices as $[v_0, \ldots, v_k]$, the action of the boundary operator is defined by $\partial_k [v_0, \ldots, v_k] = \sum_{i=0}^{k} [v_0, \ldots, \hat{v}_i, \ldots, v_k]$ and linear extension. Here the hatted vertex is omitted. The boundary map $\partial_k$ is a homomorphism from $C_k$ to $C_{k-1}$.
We call a $k$-chain $\sigma$ a $k$-cycle if its boundary vanishes, $\partial_k \sigma = 0$. By linearity of the boundary map, the $k$-cycles form a subgroup of $C_k$, which we denote $Z_k$. In other words, $Z_k = \ker \partial_k$. Analogously, we define the group of $k$-boundaries, denoted $B_k$, as the image of the $(k+1)$-th boundary map, $B_k = \mathrm{im}\, \partial_{k+1}$. In other words, $B_k$ is made up of $k$-chains $\sigma$ such that, for some $(k+1)$-chain $\tau$, $\partial_{k+1} \tau = \sigma$.
Importantly, the boundary of a boundary is always empty, $\partial_k \partial_{k+1} = 0$. This means that every $k$-boundary is also a $k$-cycle, so $B_k \subseteq Z_k$. However, the converse is not always true. For example, if a $k$-cycle wraps a $(k+1)$-dimensional hole, it cannot be written as the boundary of any $(k+1)$-chain in the complex. Thus we define the $k$-th homology group as $H_k(S, \mathbb{Z}_2) = Z_k / B_k$. In other words, elements of $H_k$ are $k$-cycles subject to the equivalence relation $\sigma \equiv \sigma + \tau$ for $\tau$ a $k$-boundary. The Betti numbers $b_k$ are the ranks of the homology groups. The 0-th Betti number $b_0$ counts the number of connected components and the higher Betti numbers $b_i$ count $(i+1)$-dimensional holes (by counting the cycles wrapping them). For example, the simplicial complex on the left side of Fig. 2 has $b_0 = 1$ and $b_1 = 1$. For a general simplicial complex, the Betti numbers can be calculated by reducing the matrix representation of the boundary operator $\partial$ on the space of simplices in the complex [63].
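As a concrete illustration of the last statement, the following minimal sketch computes Betti numbers with $\mathbb{Z}_2$ coefficients from the ranks of the boundary matrices, using $b_k = \dim Z_k - \dim B_k$. The function names (`closure`, `rank_gf2`, `betti_numbers`) and the two test complexes are illustrative choices, not taken from the references; the complex of Fig. 2 is not reproduced here.

```python
from itertools import combinations

def closure(top_simplices):
    """Return every face of the given simplices, grouped by dimension."""
    by_dim = {}
    for simplex in top_simplices:
        vertices = tuple(sorted(simplex))
        for size in range(1, len(vertices) + 1):
            for face in combinations(vertices, size):
                by_dim.setdefault(size - 1, set()).add(face)
    return {k: sorted(faces) for k, faces in by_dim.items()}

def rank_gf2(columns):
    """Rank over Z_2 of a matrix whose columns are stored as integer bitmasks."""
    pivots = {}  # highest set bit -> reduced column owning that pivot
    for col in columns:
        while col:
            high = col.bit_length() - 1
            if high not in pivots:
                pivots[high] = col
                break
            col ^= pivots[high]  # cancel the leading bit and keep reducing
    return len(pivots)

def betti_numbers(top_simplices):
    """Betti numbers b_k = dim Z_k - dim B_k with Z_2 coefficients."""
    simplices = closure(top_simplices)
    max_dim = max(simplices)
    ranks = {0: 0}  # the boundary of a vertex is empty
    for k in range(1, max_dim + 1):
        index = {s: i for i, s in enumerate(simplices[k - 1])}
        columns = []
        for s in simplices[k]:
            col = 0
            for omit in range(len(s)):  # drop one vertex at a time
                col ^= 1 << index[s[:omit] + s[omit + 1:]]
            columns.append(col)
        ranks[k] = rank_gf2(columns)
    betti = []
    for k in range(max_dim + 1):
        dim_cycles = len(simplices[k]) - ranks[k]
        dim_boundaries = ranks.get(k + 1, 0)
        betti.append(dim_cycles - dim_boundaries)
    return betti

hollow_triangle = [(0, 1), (1, 2), (0, 2)]   # three edges, no face
filled_triangle = [(0, 1, 2)]                # the same edges plus the 2-simplex
print(betti_numbers(hollow_triangle))  # [1, 1]
print(betti_numbers(filled_triangle))  # [1, 0, 0]
```

Adding the single 2-simplex turns the 1-cycle into a boundary, which is the elementary step that persistent homology tracks across a filtration.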
So far we have only mentioned one simplicial complex. However, if we were to use just one simplicial complex to represent our data set, we would be asking for trouble. Given a point cloud, there are very many ways one might choose to form a simplicial complex. It is natural to associate each point with a vertex, but when it comes to connecting vertices with edges and other higher-dimensional simplices, we must make choices. With slightly different choices, one gets very different topological invariants. This is not a desirable outcome. To be confident about what we are learning from the data set, we need a more sophisticated approach.

Persistent homology
Persistent homology solves this problem of representational ambiguity by using the point cloud to construct a sequence of simplicial complexes, called a filtration. Specifically, a filtration is a monotonic sequence of simplicial complexes $S_1 \subset S_2 \subset \cdots \subset S_n$. As we move through the filtration, nontrivial cycles are created as simplices are added to the complex, and other cycles are made topologically trivial as they are filled in by simplices. Inclusion maps from $S_i$ to $S_{i+1}$ induce chain maps $C_k(S_i) \to C_k(S_{i+1})$. Importantly, the corresponding homology maps $H_k(S_i) \to H_k(S_{i+1})$ are homomorphisms, since the inclusion maps commute with the boundary maps. This allows us to track individual cycles in the homology as the complex grows. The output of a persistent homology computation is roughly a list of ordered pairs $(\nu_{\rm birth}, \nu_{\rm death})$ detailing when in the filtration cycles of various dimensions are created and destroyed. (In fact, there is more information to harvest; see Sec. 2.3.) This is stronger information than the Betti number curves $b_k(S_i)$, which merely count nontrivial cycles for each step in the filtration. In addition to resolving the problem of representational ambiguity, persistent homology has been proven stable (under a suitable metric) against perturbations to the point cloud [64]. Moreover, moving through the filtration is often associated with some thickening scale, so persistent homology can rightfully be called a multiscale technique.
As an example, consider the Vietoris-Rips complex. The vertex set of the Vietoris-Rips complex is the set of points in the cloud. Then, for $r > 0$, we include an edge $[v_i v_j]$ whenever the distance $d(v_i, v_j)$ between the two vertices is within the threshold set by $r$. Higher-dimensional simplices are included if all of their faces are. An illustration is shown in Fig. 3. In this construction, the filtration parameter $r$ has a natural interpretation in terms of length. We are thickening each point to a ball of radius $r$ and connecting overlapping balls. As $r$ is increased, we consider the topology of the point cloud at larger and larger scales. At large enough $r$, the topology will become trivial, with $b_0 = 1$ and $b_i = 0$ for $i > 0$. This is true for any Vietoris-Rips complex.
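The thickening picture can be made concrete with a small sketch that builds the Vietoris-Rips edges at a given scale and counts connected components. It assumes, as one reading of "connecting overlapping balls", that two points are joined when their distance is at most $2r$; other conventions rescale $r$ by a constant. The function names and the eight-point circle are illustrative.

```python
import math

def rips_edges(points, r):
    """Edges of the Vietoris-Rips complex at scale r.

    Assumption: two points are joined when balls of radius r around them
    overlap, i.e. when their distance is at most 2 * r.
    """
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= 2 * r:
                edges.append((i, j))
    return edges

def count_components(n_vertices, edges):
    """Zeroth Betti number b_0, computed with a union-find structure."""
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(v) for v in range(n_vertices)})

# Eight points evenly spaced on the unit circle; neighbours are about 0.765 apart.
points = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
          for k in range(8)]

for r in (0.3, 0.4):
    edges = rips_edges(points, r)
    print(r, len(edges), count_components(len(points), edges))
```

Below the nearest-neighbour scale every point is its own component; just above it the eight points join into a single ring, which is where the 1-cycle of the sampled circle is born.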

Witness complexes
For our purposes, it will prove useful to use a more sophisticated construction than the Vietoris-Rips filtration. One issue with the Vietoris-Rips filtration is that it is very inefficient in terms of simplices. For example, a dense cluster of points leads to many edges and higher-dimensional simplices, but these are generally not arranged in a topologically interesting fashion. Computing and storing these simplices is wasteful.
One way to circumvent this problem is to use witness complexes [61]. Witness complexes use a small subset of the point cloud as a landmark set, whose points form the vertex set of the complex. The presence of higher-dimensional simplices is determined by witness points, which can be any points in the data set. More precisely, let $L$ denote the set of landmark points, and $Z$ the full point cloud. Let $m_k(z)$ be the distance from $z \in Z$ to its $(k+1)$-th closest landmark point. Then for $k > 0$ and vertices $l_i$, we include the $k$-simplex $[l_0 l_1 \ldots l_k]$ in the witness complex $W(Z, L, r)$ if all of its faces are included and there exists a witness point $z \in Z$ such that $\max\{d(l_0, z), d(l_1, z), \ldots, d(l_k, z)\} \le r + m_k(z)$. Here $r$ is the filtration parameter, analogous to the Vietoris-Rips filtration parameter. A simpler computation is the lazy witness complex. For the lazy witness complex, one chooses $\nu \in \mathbb{N}$. If $\nu = 0$, $m(z)$ is taken to be 0. If $\nu > 0$, $m(z)$ is the distance from $z$ to the $\nu$-th closest landmark point. The lazy witness complex is then constructed by taking the vertex set to be the landmark set and including an edge $[l_i l_j]$ if there exists a witness point $z \in Z$ such that $\max\{d(l_i, z), d(l_j, z)\} \le r + m(z)$. Then, as in the Vietoris-Rips construction, one includes a higher-dimensional simplex if all of its faces are included. This construction is called lazy because one only needs to compute the edges. For our purposes, we will generally use the lazy witness complex with $\nu = 1$.
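The edge rule of the lazy witness complex can be sketched in a few lines. The version below assumes the standard witness condition, namely that an edge between two landmarks is included when some witness $z$ has both landmarks within distance $r + m(z)$; the function name and the five-point example are hypothetical, and no attempt is made at the efficiency a real implementation would need.

```python
import math

def lazy_witness_edges(points, landmark_indices, r, nu=1):
    """Edges of the lazy witness complex at filtration value r.

    Edges are returned as pairs of positions in `landmark_indices`.
    Assumed rule: [l_i, l_j] is included if some witness z satisfies
    max(d(l_i, z), d(l_j, z)) <= r + m(z), where m(z) = 0 for nu = 0 and
    otherwise the distance from z to its nu-th closest landmark.
    """
    landmarks = [points[i] for i in landmark_indices]
    edges = set()
    for z in points:  # every point of the cloud may act as a witness
        dists = [math.dist(z, l) for l in landmarks]
        m = 0.0 if nu == 0 else sorted(dists)[nu - 1]
        close = [a for a, d in enumerate(dists) if d <= r + m]
        for i in range(len(close)):
            for j in range(i + 1, len(close)):
                edges.add((close[i], close[j]))
    return sorted(edges)

# Five collinear points; every other one is a landmark.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
landmarks = [0, 2, 4]

print(lazy_witness_edges(points, landmarks, r=0.0, nu=1))  # [(0, 1), (1, 2)]
print(lazy_witness_edges(points, landmarks, r=0.0, nu=0))  # []
```

With $\nu = 1$ the non-landmark points sitting midway between landmarks already witness edges at $r = 0$, whereas with $\nu = 0$ no edge appears until $r$ reaches the witness-to-landmark distance.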
In general there are two ways to select a landmark set from a point cloud. One is to simply randomly choose points. The other is to use a maxmin algorithm, choosing the first point randomly and selecting subsequent points by maximizing the distance from the nearest landmark point. While maxmin gives more evenly spaced landmark points than a random selection, it tends to select outlier points; in point clouds with dense regions, one often obtains a more representative landmark set by using a random selector. We will generally use a maxmin selector. In Sec. 3.5 we will see that multiparameter persistence is also well suited for point clouds with dense regions.
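A minimal sketch of the maxmin selector described above follows; the function name, the `seed` argument, and the toy point cloud are illustrative. It keeps, for every point, the distance to its nearest landmark so far and repeatedly promotes the point for which that distance is largest.

```python
import math
import random

def maxmin_landmarks(points, n_landmarks, seed=0):
    """Greedy maxmin selection: random first landmark, then farthest-first."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(points))]
    # Distance from each point to its nearest landmark chosen so far.
    nearest = [math.dist(p, points[chosen[0]]) for p in points]
    while len(chosen) < n_landmarks:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return chosen

# A tight cluster near the origin plus one distant outlier (index 4).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (10.0, 0.0)]
print(maxmin_landmarks(points, 2))
```

Whatever the random starting point, the outlier is among the first two landmarks here, which illustrates the tendency of maxmin to pick up outliers noted above.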

Persistence pairing and persistence diagrams
Given a filtration, the persistent homology calculation is a matrix reduction of the boundary operator. Columns of the boundary operator represent individual simplices and are ordered by when a simplex is added to the filtration. See [39,62,63] for details of the reduction algorithm. The reduced matrix encodes a persistence pairing among simplices in the filtration. When a $k$-simplex is added to the filtration, it either creates a $k$-cycle or destroys a $(k-1)$-cycle. The persistence pairing links $k$-simplices and $(k+1)$-simplices (for all $k$ relevant to the complex) as persistence pairs, with the $(k+1)$-simplex destroying the cycle created by the $k$-simplex. (Some cycles may have infinite persistence, i.e. they are not destroyed by the end of the computed filtration, in which case there is no second simplex in the pair.) By looking at when the relevant simplices for a particular cycle were added to the filtration, one may assign to a cycle the filtration time of the cycle's birth as well as the filtration time of the cycle's death.
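The reduction can be sketched as follows. This is the textbook left-to-right column reduction over $\mathbb{Z}_2$, written for clarity rather than speed, and is not necessarily the variant used by any particular software package; the function name and the small triangle filtration are illustrative.

```python
def persistence_pairs(filtration):
    """Pair simplices by column reduction of the boundary matrix over Z_2.

    `filtration` is a list of (time, simplex) entries, ordered so that every
    face appears before the simplices containing it. Returns the list of
    (creator_index, destroyer_index) pairs and the indices of unpaired
    simplices (cycles that never die).
    """
    index = {tuple(sorted(s)): i for i, (_, s) in enumerate(filtration)}
    reduced = []       # reduced columns, each a set of row indices
    pivot_owner = {}   # lowest (latest) row index -> column holding that pivot
    pairs = []
    for j, (_, simplex) in enumerate(filtration):
        s = tuple(sorted(simplex))
        column = set()
        if len(s) > 1:
            for omit in range(len(s)):  # codimension-one faces
                column ^= {index[s[:omit] + s[omit + 1:]]}
        # Add earlier columns until the pivot is new or the column is empty.
        while column and max(column) in pivot_owner:
            column ^= reduced[pivot_owner[max(column)]]
        reduced.append(column)
        if column:  # simplex j destroys the cycle created by its pivot row
            low = max(column)
            pivot_owner[low] = j
            pairs.append((low, j))
    paired = {i for pair in pairs for i in pair}
    unpaired = [i for i in range(len(filtration)) if i not in paired]
    return pairs, unpaired

# Three vertices, then three edges one at a time, then the filling triangle.
filtration = [
    (0.0, (0,)), (0.0, (1,)), (0.0, (2,)),
    (1.0, (0, 1)), (2.0, (1, 2)), (3.0, (0, 2)),
    (4.0, (0, 1, 2)),
]
pairs, unpaired = persistence_pairs(filtration)
for creator, destroyer in pairs:
    print(filtration[creator][1], "born", filtration[creator][0],
          "killed by", filtration[destroyer][1], "at", filtration[destroyer][0])
```

In this example two 0-cycles die when edges merge components, the third edge creates a 1-cycle that the triangle destroys, and vertex 0 is left unpaired as the component of infinite persistence. The pairs themselves, not just the birth and death times, are the data referred to as the persistence pairing.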
Generally, there are two (equivalent) ways the output of a persistent homology calculation is represented. Barcodes are collections of horizontal lines, each starting at the birth time of a particular cycle and ending at that cycle's death time. One draws a barcode for each dimension of the homology. Persistence diagrams are scatter plots of the birth and death times $(\nu_{\rm birth}, \nu_{\rm death})$ of individual cycles. Two examples of persistence diagrams (and their corresponding Betti number curves) are shown in Fig. 4. Our experience is that relationships between cycles of different dimensions are easier to see using persistence diagrams. Since these relationships seem to matter for the distributions we are considering, we will use persistence diagrams.
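To see why a persistence diagram carries more information than a Betti number curve, the sketch below counts living cycles at a set of filtration times for two different sets of intervals. The intervals are made up for illustration and are not the ones plotted in Fig. 4; the half-open convention $[\nu_{\rm birth}, \nu_{\rm death})$ is also an assumption of this sketch.

```python
def betti_curve(intervals, times):
    """Number of cycles alive at each time, treating intervals as [birth, death)."""
    return [sum(1 for birth, death in intervals if birth <= t < death)
            for t in times]

# Two hypothetical diagrams in a single homological dimension.
diagram_a = [(0.0, 2.0), (1.0, 3.0)]
diagram_b = [(0.0, 3.0), (1.0, 2.0)]
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

print(betti_curve(diagram_a, times))  # [1, 1, 2, 2, 1, 1, 0]
print(betti_curve(diagram_b, times))  # [1, 1, 2, 2, 1, 1, 0]
```

The two diagrams differ in which cycle dies first, yet the counts of living cycles agree at every time, so the Betti curve alone cannot distinguish them.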
It should be noted that compressing persistent homology's output to barcodes and persistence diagrams actually erases some information about the structure of a data set. The persistence pairing computed by matrix reduction is used only to generate $(\nu_{\rm birth}, \nu_{\rm death})$. However, the simplices appearing in the persistence pairing can encode more about the structure of a point cloud than just these two numbers. We will encounter a scenario in Sec. 3.4 where explicitly sorting through the persistence pairing and looking at overlapping "destroying simplices" allows us to recover more refined information about the point cloud than is represented in the persistence diagram.

Flux Vacua
In this section, we consider type IIB string theory on Calabi-Yau orientifolds in the presence of background fluxes, as reviewed in [66,67]. In these setups, the axiodilaton and complex structure moduli are stabilized by 3-form fluxes threading the internal manifold. Flux quantization and tadpole cancellation give rise to a discretuum of vacua distributed over the moduli space [1,68]. This discretuum generally features quite a bit of structure, well-suited for analysis via persistent homology. In particular, we will study the distribution of stabilized axiodilaton and complex structure moduli vevs using persistent homology.
Figure 4: Two persistence diagrams. To calculate the Betti numbers at a given filtration time, one counts "living" cycles. Persistence diagrams contain more information than the Betti number curves. These two diagrams give rise to the same Betti number curve, shown below.
First we review the construction of flux vacua. We then study three examples using persistent homology: the rigid Calabi-Yau, a hypersurface in weighted projective space, and the symmetric $T^6$. Each example contains lessons that should be useful when applying persistent homology to more explicit and phenomenologically viable models. Specifically, we learn how persistence pairing, information that is often thrown out in a persistent homology calculation, can be useful. We also observe that a notion of multiparameter persistence should be used to characterize overdense regions in addition to the underdense regions one conventionally studies with persistent homology. Additionally, we study how different restrictions to special vacua (like those with enhanced symmetry or vanishing tree-level superpotential) can affect the topology of a distribution of string vacua.
As part of our simplified setup, we will study just the distribution of stabilized vevs for the axiodilaton and complex structure moduli, neglecting the distributions of Kähler moduli, open string moduli, and fluxes. For the Kähler moduli, we can appeal to a separation of scales, with the complex structure moduli and axiodilaton stabilized by fluxes and Kähler moduli stabilized by non-perturbative corrections to the no-scale models we consider. For further discussion of the motivation for neglecting the Kähler moduli, see [2,4,69]. We plan to eventually include these degrees of freedom, applying the lessons we learn from the following examples.

Review
We follow the conventions of [69]. Consider a Calabi-Yau threefold $M$ with $h^{2,1}$ complex structure moduli. Take a symplectic basis $\{A^a, B_b\}$ for the $b_3 = 2h^{2,1} + 2$ three-cycles, with $a, b = 1, \ldots, h^{2,1} + 1$. We have dual cohomology elements $\alpha_a$, $\beta^b$ satisfying $\int_{A^a} \alpha_b = \delta^a_b$, $\int_{B_b} \beta^a = -\delta^a_b$, $\int_M \alpha_a \wedge \beta^b = \delta_a^b$. From the unique holomorphic three-form $\Omega$, we have the periods $z^a \equiv \int_{A^a} \Omega$ and $G_b \equiv \int_{B_b} \Omega$, which we assemble into the $b_3$-vector $\Pi(z) \equiv (G_b, z^a)$. We also introduce the symplectic matrix $\Sigma \equiv \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$, whose entries are $(h^{2,1} + 1) \times (h^{2,1} + 1)$ matrices. The NSNS and RR 3-form fluxes are quantized and may be written in the $\alpha$, $\beta$ basis as $F_3 = -(2\pi)^2 \alpha' (f_a \alpha_a + f_{a+h^{2,1}+1} \beta^a)$, $H_3 = -(2\pi)^2 \alpha' (h_a \alpha_a + h_{a+h^{2,1}+1} \beta^a)$, where we have defined the integer-valued $b_3$-vectors $f$ and $h$. From now on we set $(2\pi)^2 \alpha' = 1$. The fluxes induce a superpotential for the complex structure moduli and axiodilaton, $W = \int_M G_3 \wedge \Omega = (f - \phi h) \cdot \Pi$, with $G_3 \equiv F_3 - \phi H_3$. We are interested in vacua with vanishing F-terms, $D_\phi W = D_a W = 0$, where $D_a W \equiv \partial_a W + W \partial_a K$, $a$ runs over complex structure moduli, and the Kähler potential truncated to the axiodilaton and complex structure moduli is $K = -\log(-i\, \Pi^\dagger \cdot \Sigma \cdot \Pi) - \log(-i(\phi - \bar\phi))$. The F-flatness conditions (3.6) and (3.7) imply that the (3,0) and (1,2) parts of the fluxes vanish, so that $G_3$ is imaginary self-dual, $*_6 G_3 = i G_3$. The fluxes also contribute to the D3-brane charge, $N_{\rm flux} = \int_M F_3 \wedge H_3 = f \cdot \Sigma \cdot h$. For imaginary self-dual fluxes, we have that $N_{\rm flux} > 0$. Therefore tadpole cancellation requires the presence of negative D3-brane charges. A fixed amount of negative charge is induced by orientifolding. If the orientifold can be viewed as arising from a fourfold compactification of F-theory [71], the orientifold charge $L$ is proportional to the Euler character of the fourfold [72]. For cancellation, any difference $L - N_{\rm flux}$ can be made up by mobile D3-branes spanning the four-dimensional spacetime. For two of our three examples, we will not consider explicit orientifolds. Instead, we will take $L_{\max}$ as an adjustable parameter, so that $N_{\rm flux} \le L_{\max}$. This is in line with the conventions of [4,6,69]. As we will see, $L_{\max}$ sets the scale at which we observe interesting structure in the moduli space distribution.

Gauge-fixing
We have gauge symmetries relating equivalent vacua that must be fixed. In general, our symmetry group is $G = SL(2,\mathbb{Z})_\phi \times \Gamma$. Here $SL(2,\mathbb{Z})_\phi$ is the S-duality group from type IIB string theory and $\Gamma$ is the modular group of the complex structure moduli space.
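Under $SL(2,\mathbb{Z})_\phi$, the axiodilaton and fluxes transform as $\phi \to \frac{a\phi + b}{c\phi + d}$, $\begin{pmatrix} f \\ h \end{pmatrix} \to \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} f \\ h \end{pmatrix}$, with integers $a, b, c, d$ satisfying $ad - bc = 1$. These transform solutions of (3.6), (3.7) to other solutions, and preserve $N_{\rm flux}$. They act as Kähler transformations on $W$ and $K$ and are thus symmetries of $N = 1$ supergravity, with the scalar potential $V \equiv e^K (|DW|^2 - 3|W|^2)$ manifestly invariant.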
We also have the complex structure modular group $\Gamma$. For our examples, under a transformation of the complex structure moduli $z^a \to z'^a$, the periods change as $\Pi(z^a) \to \Pi(z'^a) = \Lambda(z^a) M \cdot \Pi(z^a)$ (3.13), where $M$ is a symplectic matrix with integer entries that is independent of $z^a$. This transformation then induces a Kähler transformation (3.12) and is thus a symmetry as long as the fluxes transform as $f \to f \cdot M^{-1}$, $h \to h \cdot M^{-1}$, which also preserves $N_{\rm flux}$. Note that this transformation takes solutions to the F-flatness conditions to other solutions with moduli $\phi$, $z'^a$. Since we are ignoring the distribution of fluxes, the moral here is that we can apply modular transformations to the axiodilaton and complex structure moduli vevs independently and without keeping track of the fluxes.
Our gauge-fixing prescription will be to map each vacuum to a fundamental domain. One could also choose some gauge-image of the fundamental domain. In this case, continuity of the gauge map combined with the topological nature of our analysis would seem to suggest that the persistent homology would be stable under this operation (although the scales of certain features may change). There are interesting subtleties to this argument due to discreteness of the data set; see Fig. 5. We have checked our results for stability under large gauge transformations.
Figure 5 (caption fragment): Such features can accommodate more deformation than their poorly-sampled counterparts and still be recovered by persistent homology. Right: a deformed poorly-sampled circle. The deformation overtakes the characteristic distance between points, and the 1-cycle will be poorly recovered by persistent homology.

Enhanced symmetries and W = 0 vacua
Given the full set of flux vacua on some background with fixed L max , we are often interested in special vacua obeying certain phenomenologically desirable conditions. For example, in more involved setups including intersecting D-branes, one might enforce conditions such as appropriate ranks of gauge groups, number of generations of chiral matter, etc., along the lines of [18,19]. One interesting question to ask is how the persistent homology of the distribution of vacua changes when one restricts to special vacua. Following [69], we will consider two types of special vacua: those with enhanced symmetries and those with vanishing tree-level superpotential.
Enhanced symmetries are low-energy symmetries (involving transformations of just the moduli) that descend from transformations of the moduli and the fluxes. The authors of [69] consider enhanced symmetries that descend from the modular group as well as a complex conjugation transformation. One necessary condition for such a symmetry is that the moduli are invariant under the transformation [69]. In other words, restricting to vacua with enhanced symmetries involves restricting to fixed points in moduli space of the set of transformations. For $SL(2,\mathbb{Z})$, the fixed points are isolated in moduli space. For complex conjugation symmetries $\phi \to -\bar\phi$, $z^a \to \pm\bar z^a$, the fixed points form a half-dimensional space, with each modulus restricted to its imaginary axis. Due to the dimensionalities of these subsets of moduli space, there is an upper bound on the dimensionalities of cycles in the restricted sets. For example, in a moduli space with $2n$ real dimensions, the region allowing vacua with a complex conjugation symmetry is $n$-dimensional. An $n$-dimensional region could in theory support an $n$-cycle, but the trivial topology of the $n$-dimensional plane means the highest cycle is an $(n-1)$-cycle. Restricting to these vacua erases cycles of dimension $n$ and higher that may be present in the full distribution of vacua.
Vacua with vanishing tree-level superpotential are interesting to study for a variety of reasons. By nonrenormalization theorems, one expects the condition $W = 0$ is not corrected perturbatively. Assuming only nonperturbative (in $\alpha'$ or $g_s$) corrections, F-term SUSY breaking of these vacua leads to a plausibly naturally small cosmological constant [69]. Moreover, it is known that there are deep connections between $W = 0$ vacua and R-symmetries [74]. In the $T^6$ example of Sec. 3.6, we will find that restricting to vacua with $W = 0$ gives the distribution a richer topology, with more long-lived higher-dimensional features.
This is possible in part because the condition $W = 0$ itself does not select a subregion of the moduli space. In the continuous flux approximation and ignoring subtleties due to tadpole cancellation, there are $W = 0$ vacua everywhere in the $T^6$ moduli space. The topology of the restricted set is then an effect of having quantized fluxes and imposing tadpole cancellation, which is also the reason for interesting structure in the full distribution of vacua.
Understanding restrictions via their effects on the distribution's topology could be useful in a scaled-up problem where one wants to search for vacua that simultaneously satisfy several conditions, such as a certain number of generations, particular gauge groups, etc. In principle this could be done in a top-down fashion by analyzing large systems of equations, but in the limit of many conditions, a bottom-up tool like persistent homology seems potentially useful.

Rigid Calabi-Yau
Consider a rigid Calabi-Yau, with no complex structure moduli, studied in [4,6,69]. Since there are no complex structure moduli, $b_3 = 2$. We can write our symplectic basis for $H_3(M)$ as $\{A, B\}$. Take the periods of the holomorphic three-form $\Omega$ to be $\Pi = (1, i)$. The superpotential is $W = (f_1 + i f_2) - \phi (h_1 + i h_2)$. The D3-brane charge induced by the fluxes is $N_{\rm flux} = f_1 h_2 - h_1 f_2$ and the vacuum equation is $\bar\phi = \frac{f_1 + i f_2}{h_1 + i h_2}$. We are interested in the distribution of flux vacua on the axiodilaton moduli space at fixed $L_{\max}$. To avoid repeat copies of individual vacua, we need to fix the gauge symmetry. For the rigid model, the entire modular group is $SL(2,\mathbb{Z})_\phi$. We can fix this gauge by either imposing conditions on the axiodilaton (i.e. restricting to a particular domain) or by imposing conditions on the fluxes. A natural condition is to restrict the axiodilaton to the $SL(2,\mathbb{Z})$ fundamental domain: $F_D = \{\phi \in \mathbb{C} : -\frac{1}{2} \le \mathrm{Re}\,\phi \le \frac{1}{2},\ |\phi| \ge 1\}$. Alternatively, one can use $SL(2,\mathbb{Z})$ to impose conditions on the fluxes. For example (as in [4,69]), one may impose $h_1 = 0$, $0 \le f_2 < h_2$, $f_1 h_2 \le L_{\max}$, which entirely fixes $SL(2,\mathbb{Z})$. This strategy is particularly helpful for listing all vacua for a finite $L_{\max}$, or alternatively for counting vacua.
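The listing strategy can be sketched in a few lines of exact rational arithmetic. The explicit formulas used here are assumptions of this sketch rather than quotations: periods $\Pi = (1, i)$, so that $\phi = (f_1 - i f_2)/(h_1 - i h_2)$ and $N_{\rm flux} = f_1 h_2 - h_1 f_2$, and the flux gauge $h_1 = 0$, $0 \le f_2 < h_2$. All function names are hypothetical, and the residual identification of points on the boundary of the fundamental domain is not handled.

```python
from collections import Counter
from fractions import Fraction
from math import floor

def gauge_fixed_fluxes(l_max):
    """Yield (f1, f2, h1, h2) with h1 = 0, 0 <= f2 < h2, 0 < f1 * h2 <= l_max.

    Assumption: this flux gauge fixes SL(2, Z) completely.
    """
    for h2 in range(1, l_max + 1):
        for f1 in range(1, l_max // h2 + 1):
            for f2 in range(h2):
                yield (f1, f2, 0, h2)

def axiodilaton(f1, f2, h1, h2):
    """Assumed vacuum value phi = (f1 - i f2) / (h1 - i h2), as (Re, Im)."""
    norm = h1 * h1 + h2 * h2
    return (Fraction(f1 * h1 + f2 * h2, norm),
            Fraction(f1 * h2 - f2 * h1, norm))

def to_fundamental_domain(phi):
    """Apply T: phi -> phi + 1 and S: phi -> -1/phi until
    -1/2 <= Re phi < 1/2 and |phi| >= 1 (boundary identifications ignored)."""
    x, y = phi
    while True:
        x -= floor(x + Fraction(1, 2))   # integer shift into the strip
        norm = x * x + y * y
        if norm >= 1:
            return (x, y)
        x, y = -x / norm, y / norm       # inversion raises Im phi

def vacua_by_axiodilaton(l_max):
    """Count gauge-fixed flux choices landing on each value of phi."""
    counts = Counter()
    for fluxes in gauge_fixed_fluxes(l_max):
        counts[to_fundamental_domain(axiodilaton(*fluxes))] += 1
    return counts

counts = vacua_by_axiodilaton(2)
for (re, im), n in sorted(counts.items()):
    print(f"phi = {re} + {im} i : {n} flux choices")
```

Even at $L_{\max} = 2$ the four gauge-fixed flux choices land on only two points, $\phi = i$ and $\phi = 2i$, so projecting onto the axiodilaton already introduces degeneracy. Feeding the keys of `counts` for larger `l_max` into a persistent homology code would give the kind of point cloud analyzed in this section.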
To generate the distribution of flux vacua on the rigid Calabi-Yau, we use the flux conditions (3.21) to list the vacua for finite $L_{\max}$. We then use $SL(2,\mathbb{Z})$ to map each vacuum to the axiodilaton fundamental domain. Despite the simplicity of the toy model, the distribution exhibits rich structure. In particular, projecting onto $\phi$, one observes voids with no vacua except for at accumulation points in their centers [6] (see Fig. 6). These voids can be understood as arising from the combination of flux quantization, tadpole cancellation, and the vacuum equation. The vacuum equation (3.19) tells us that when we project onto $\phi$, we are introducing some degeneracy. There are multiple flux configurations that give the same stabilized value for $\phi$. These are inequivalent vacua, and represent different low energy theories. Specific values for $\phi$ correspond to 2-dimensional hyperplanes (intersecting with some gauge-fixing conditions) in the 4-dimensional flux space. Quantization of the fluxes forces these slopes to be rational. For example, $\phi = i\beta$ for rational $\beta$ corresponds to the hyperplane $f_1 = \beta h_2$, $f_2 = -\beta h_1$. The hyperplanes intersect the origin in flux space, although no vacua are located there since the moduli are not stabilized for vanishing fluxes. Moving around in the axiodilaton moduli space corresponds to rotating these hyperplanes along two axes. (Sometimes the plus sign takes one outside the tadpole bound, so that the nearest neighbor above is farther away than the nearest neighbor below.) Similar arguments to the above apply for nearest neighbors in other directions. Since $b$ scales as $\sqrt{L_{\max}}$ in this discussion, we also come to understand the previously known fact that the voids shrink as $1/L_{\max}$ as $L_{\max}$ is increased [6,69]. In terms of persistent homology, large voids should correspond to long-lived 1-cycles. Moreover, we expect the presence of vacua at the center of a void to have a specific signature in the persistence diagram.
That is, the death of the 1-cycle corresponding to the void will be correlated with the death of a 0-cycle. In the case of a perfectly symmetric void, the 1-cycle and 0-cycle will die at exactly the same time. For a more oblique void, the 0-cycle will die first, possibly with the formation of a short-lived 1-cycle (see Fig. 7).
Persistence diagrams for a subregion of the rigid Calabi-Yau axiodilaton distribution are shown in Fig. 8. We observe the presence of many 1-cycles, some of which are relatively long-lived. The longer-lived 1-cycles correspond to the larger voids in Fig. 6. We also observe the expected correlated deaths of 1-cycles and 0-cycles. While many correlated deaths occur at the exact same filtration time, some of the voids slightly outlive their isolated center vacua, including the longest-lived void. We also note several 0-cycles with $\nu_{\rm death} \sim 0.06$ that do not seem to correlate with any 1-cycles. In fact, these 0-cycles correspond to vacua that would be at the centers of voids, but whose voids are cut off by the boundary of the fundamental domain $\mathcal{F}_D$. This boundary prevents the voids from being recognized topologically, other than through the late deaths of the 0-cycles corresponding to isolated interior vacua.

Persistence pairing
We should note that while the correlated deaths are suggestive, they are not sufficient to recover the isolated vacua in the centers of the voids. Instead, one must turn to the persistence pairing output by the persistent homology algorithm. This pairing tells us which simplex causes the death of a particular p-cycle. For example, the death of the 0-cycle corresponding to an isolated vacuum in the center of a void is caused by the addition of an edge connecting that vacuum to a point on the edge of the void. This same edge is contained in the triangles that fill in the void, causing the 1-cycle's death. Persistence pairing is not shown in persistence diagrams$^4$ and represents finer-grained information. However, noting the correlated deaths from the previous section, we can use persistence pairing to connect 1-cycles and 0-cycles whose "destroying simplices" overlap. These connections are also shown in Fig. 8. They confirm the isolated vacuum-void structure, which we might have only suspected from the persistence diagram.

Figure 8: Left: persistence diagram for rigid Calabi-Yau flux vacua projected onto axiodilaton with $L_{\max} = 150$, using a lazy witness complex with 700 landmark points. The filtration has 1,140,182 simplices. The orange points represent 1-cycles and the blue points represent 0-cycles. We observe many long-lived 1-cycles, corresponding to voids in the distribution. There are also correlations in the deaths of long-lived 1-cycles and 0-cycles, corresponding to the vacua in the centers of the voids. Right: making use of the information contained in persistence pairing, we can further investigate the correlated deaths. 0-cycles are destroyed by the addition of 1-simplices, and 1-cycles are destroyed by the addition of 2-simplices. We link with a red line 0-cycles and 1-cycles whose destroying simplices overlap.
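To make the notion of a destroying simplex concrete, the sketch below computes persistence pairs with the standard boundary-matrix reduction over $\mathbb{Z}_2$. It is an illustration under stated assumptions, not the computation behind Fig. 8: it uses a plain Vietoris-Rips filtration truncated at dimension 2 instead of a lazy witness complex, and the function names `rips_filtration` and `persistence_pairs` are hypothetical.

```python
from itertools import combinations
from math import dist


def rips_filtration(points, max_scale):
    """Vietoris-Rips simplices (dimension <= 2) with value <= max_scale.
    Returns (value, simplex) pairs ordered so faces precede cofaces."""
    n = len(points)
    # Round distances so that equal lengths compare equal despite float noise.
    d = {(i, j): round(dist(points[i], points[j]), 9)
         for i, j in combinations(range(n), 2)}
    simplices = [(0.0, (i,)) for i in range(n)]
    simplices += [(v, e) for e, v in d.items() if v <= max_scale]
    for tri in combinations(range(n), 3):
        v = max(d[pair] for pair in combinations(tri, 2))
        if v <= max_scale:
            simplices.append((v, tri))
    # Sort by filtration value, then by dimension, then lexicographically.
    return sorted(simplices, key=lambda s: (s[0], len(s[1]), s[1]))


def persistence_pairs(filtration):
    """Return (birth_index, death_index) pairs into `filtration`.
    filtration[death_index] is the simplex that destroys the cycle
    created by filtration[birth_index]."""
    index = {simplex: i for i, (_, simplex) in enumerate(filtration)}
    columns = []        # reduced boundary columns, as sets of row indices
    low_to_col = {}     # lowest nonzero row -> column that owns it
    pairs = []
    for j, (_, simplex) in enumerate(filtration):
        col = set()
        if len(simplex) > 1:
            col = {index[face]
                   for face in combinations(simplex, len(simplex) - 1)}
        # Add earlier reduced columns (mod 2) until the lowest row is unique.
        while col and max(col) in low_to_col:
            col ^= columns[low_to_col[max(col)]]
        columns.append(col)
        if col:
            low_to_col[max(col)] = j
            pairs.append((max(col), j))
    return pairs
```

As a toy void, one can take eight points on the unit circle plus one point at the center and truncate at scale 1. In that configuration the center's 0-cycle and the octagon's 1-cycle both die at scale 1, and the edge and triangle that destroy them both contain the center vertex, which is the kind of overlap used to link cycles in Fig. 8.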

Special vacua
We can also consider discrete symmetries in this model. There is a low-energy symmetry descending from complex conjugation for vacua with imaginary axiodilaton.$^5$ We might consider the persistent homology of the distribution of these special vacua. Restricting to the imaginary axis, we can only reduce the topological complexity of the distribution. As previously discussed, the restricted distribution cannot feature 1-cycles. All 0-cycles are born at the beginning of the filtration, so instead of a persistence diagram we show a histogram of the 0-cycle deaths in Fig. 9. There are many 0-cycles that die at $\nu_{\rm death} = 1$. These vacua live at large Im φ, where tadpole cancellation and flux quantization dictate that the nearest neighbor should be a unit distance away. There are also a fair number of 0-cycles dying at $\nu_{\rm death} = 0.5$ for similar reasons. For smaller Im φ, relatively long-lived 0-cycles will correspond to vacua at the centers of voids (or rather, what would be voids in the full distribution). Thus certain aspects of the full distribution show up (albeit in the form of lower-dimensional cycles).
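For points confined to a line, the 0-cycle deaths can be read off without a general persistence algorithm. The sketch below assumes a distance-based filtration in which two points are joined once the scale reaches their separation, so that each 0-cycle dies at a gap between consecutive points and one component survives indefinitely. The function names are hypothetical, and a witness complex may rescale these values.

```python
from collections import Counter


def zero_cycle_deaths(positions):
    """Death scales of 0-cycles for points on a line.
    Each gap between consecutive distinct points merges two components."""
    pts = sorted(set(positions))
    return sorted(b - a for a, b in zip(pts, pts[1:]))


def death_histogram(positions, ndigits=6):
    """Count how many 0-cycles die at each (rounded) scale."""
    return Counter(round(d, ndigits) for d in zero_cycle_deaths(positions))
```

For instance, points at heights 1, 2, 3, 3.5 and 4.5 along the axis give three deaths at scale 1 and one at scale 0.5.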
In this example we could have simply looked at the projection in Fig. 6 and understood the presence of voids with isolated vacua in their centers. However, persistent homology also proves useful in higher-dimensional examples where we do not have a simple visualization. Moreover, we learned that persistence pairing can be used to confirm suspicions arising from correlations in persistence diagrams. We also had our first example of restricting to vacua satisfying special properties. In this case the restriction only reduced the topological complexity of the distribution.

Figure 9: For rigid Calabi-Yau vacua with a low energy symmetry descending from complex conjugation, the topological complexity of the distribution is reduced. As the points live on a line, there are only 0-cycles. Long-lived 0-cycles correspond to vacua at the centers of voids in the full distribution.

Calabi-Yau Hypersurface and Multiparameter Persistence
In this section we consider the hypersurface defined by in the weighted projective space $WP^4_{1,1,1,1,4}$. The hypersurface has $h^{1,1} = 1$ and $h^{2,1} = 149$. For this hypersurface, a particular orientifold (taking $x_0 \to -x_0$, $\psi \to -\psi$ along with worldsheet parity reversal) arises from F-theory compactified on a Calabi-Yau fourfold defined as a hypersurface in $WP^5_{1,1,1,1,8,12}$, giving a tadpole condition $L_{\max} = 972$ [75]. As described in [69,75,76], the hypersurface equation (3.22) has a discrete symmetry group $\Gamma = \mathbb{Z}_8^2 \times \mathbb{Z}_2$. Any deformation to the complex structure besides the ψ term is charged under Γ. Thus, if only fluxes consistent with Γ are turned on, these charged moduli can only appear at higher order in the superpotential. We can then consistently set the charged moduli to zero, solve for the periods, and compute the vevs for the axiodilaton and uncharged modulus ψ.
We will focus on flux vacua near the conifold point ψ = 1. To first order in $x \equiv 1 - \psi$, the F-flatness conditions give where constants can be found in [76]. Monte Carlo sampling in [76] explicitly confirmed the expectation from [4] that vacua would cluster near the conifold point, including the scaling of density with distance from the conifold point.$^6$ We would like to study this clustering using persistent homology. In this case, since we have four real dimensions (two from φ and two from x), there is no simple visualization of the space of vacua, although we may plot projections onto specific planes. Although the clustering is manifest in the projection onto x, there could be more structure relating the clustering to φ. Persistent homology in the full four-dimensional space allows us to search for higher-dimensional features in the distribution of vacua.
We also encounter a new issue. While a length-based filtration like Vietoris-Rips is well-suited for void identification, it does not perform as well in identifying clusters. Using a length-based filtration, the persistence diagram for a cluster is not very different from the persistence diagram for a uniform distribution: the cluster will give rise to many 0-cycles that die very early in the filtration. A length-based filtration alone therefore makes poor use of persistent homology here.
One way around this is to use a multiparameter filtration [78]. In addition to the length parameter, we can assign to each point a density (e.g. the inverse of the distance to the n-th nearest neighbor). We can then take as an orthogonal filtration parameter a threshold density $\rho_{\rm th}$ and only include points with $\rho < \rho_{\rm th}$. For low $\rho_{\rm th}$, points in the cluster will not yet be included, and the distribution will feature a void, easily picked up by a length-based filtration. This density filtration is related to the sublevel filtration the present authors used to study the CMB in [52]. We can imagine smoothing out the point cloud to define a density function on the moduli space. We are then performing a sublevel filtration on this function. How well this represents the underlying point cloud depends on how well-sampled the space is.

$^6$ One might be tempted to worry about the breakdown of our EFT here due to the masslessness of a wrapped D-brane state [77]. We can consistently leave this state out of our EFT by going to larger volume, since the Kähler moduli are unfixed in our setup. Increasing the Calabi-Yau volume increases the wrapped D-brane's mass and lowers the masses of the moduli under consideration, which scale as the flux density. It would be interesting to see how including this state in the full moduli space modifies the clustering as calculated by the approximation of [4,6].

Figure 10: Left: with the density filter, there is a long-lived 1-cycle corresponding to the excised cluster. Right: without the density filter, there is no long-lived 1-cycle. We observe a few 3-cycles (red), but, like the rest of the features, they are short-lived.
While multiparameter persistence is well-defined, it lacks a simple summary statistic. Instead, one has a persistence diagram for each (well-defined) path through $(r, \rho_{\rm th})$ space. (Software implementing interactive visualization of two-dimensional persistence is described in [79].) For our purposes, we will only consider a length-based filtration at two density thresholds to demonstrate the successful identification of the cluster and diagnose whether there is any more interesting structure in the four-dimensional space.
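A single density slice of this kind can be sketched as follows. The snippet assumes the nearest-neighbor density mentioned above as an example (the inverse of the distance to the n-th nearest neighbor) and uses a brute-force neighbor search; the function names `knn_density` and `density_slice` and the parameter values are hypothetical, and duplicate points would need to be removed first to avoid a division by zero.

```python
from math import dist


def knn_density(points, n):
    """Density estimate per point: inverse distance to its n-th nearest
    neighbor (brute force, O(N^2 log N))."""
    densities = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        densities.append(1.0 / d[n - 1])
    return densities


def density_slice(points, n, rho_th):
    """Points admitted at density threshold rho_th, i.e. those with
    rho < rho_th. A length-based filtration is then run on this subset."""
    rho = knn_density(points, n)
    return [p for p, r in zip(points, rho) if r < rho_th]
```

For a low threshold the dense cluster is excised and only the sparse surroundings remain, so a length-based filtration of the slice sees a void where the cluster was. Raising the threshold readmits the cluster points.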
We generate the point cloud by drawing random flux vectors and computing the stabilized moduli with (3.23), (3.24). We have to be careful to fix the gauge symmetry and stay within the regime of perturbative validity $|x| \ll 1$. In addition to S-duality for the axiodilaton, the complex structure modulus has a logarithmic monodromy around the conifold point. We fix the gauge by mapping φ to the $SL(2,\mathbb{Z})$ fundamental domain (also acting on the fluxes) and taking arg x ∈ [−π, π).
Computing the persistence diagrams for the point cloud with and without a density filter, we find a long-lived 1-cycle with the filter and no nontrivial features without the filter (Fig. 10). Thus we have recovered the conifold clustering using persistent homology. Moreover, we find no higher-dimensional features linking the clustering in x to the distribution for φ. In other words, the clustering in x does not correlate with any topological structure in φ. This is an aspect we could not have diagnosed with simple visualization.

Symmetric $T^6$
In this section we consider toroidal compactifications where a $T^6$ can be viewed as a direct product of three two-tori with equal modular parameter τ. The moduli space has two complex dimensions, so attempts at simple visualization require projecting onto arbitrary planes. Persistent homology, however, has no trouble with higher-dimensional spaces, and can be used to characterize the distribution of vacua. The symmetric $T^6$ also features vacua with vanishing tree-level superpotential. These vacua give us the opportunity to study how imposing phenomenologically desirable conditions affects the persistent homology of a distribution of vacua that cannot be directly visualized.
We follow the conventions of [69]. Take $x^i, y^i$ for $i = 1, 2, 3$ to be coordinates with periodicity $x^i \equiv x^i + 1$, $y^i \equiv y^i + 1$. Then choose three holomorphic 1-forms $dz^i = dx^i + \tau^{ij} dy^j$. We take the orientation The holomorphic 3-form is $\Omega = dz^1 \wedge dz^2 \wedge dz^3$. We can expand the fluxes as So far, we have written our parameterization for a general $T^6$. Now we specialize to the symmetric case, taking $\tau^{ij} = \tau \delta^{ij}$. In other words, we take the $T^6$ to be factorizable as three two-tori with equal modular parameter. This corresponds to turning on only a special subset of the fluxes (as in the hypersurface example): In this case, the superpotential takes the form $W = P_1(\tau) - \phi P_2(\tau)$ (3.29), where $P_i$ are cubic polynomials in τ over the integers The Kähler potential for τ and φ is and the flux-induced D3-brane charge is The F-flatness conditions can be simplified to be The axiodilaton can be solved for using one of these equations and plugged into the other, giving the equations for $\tau = x + iy$ where the $q_i$ are polynomials in x whose form can be found in the appendix of [69]. When these equations are multiplied to eliminate y, miraculous cancellations leave one with a cubic rather than sextic equation in x, where the $\alpha_i$ are combinations of flux integers that can also be found in [69].

Generic vacua
We generate random flux vectors, solve for τ and φ, and map both to their fundamental domains to fix the gauge. For $L_{\max} \le 18$, lattice effects suppress the number of vacua (essentially, (3.36) and (3.38) are not compatible). As $L_{\max}$ is increased above 18, a dense region develops for small Im τ with mostly trivial topology at our resolution and sampling, pushing an underdense region that is topologically similar to the $L_{\max} = 18$ distribution to larger Im τ. However, it seems that this underdense region loses some topological complexity as it is pushed (Fig. 11). Moreover, the dense region at small Im τ is not entirely trivial: voids similar to those in the rigid model develop on the imaginary axis, shrinking as $L_{\max}$ is increased (Fig. 12).

Figure 11: Left: persistence diagram for generic vacua and $L_{\max} = 18$. Right: persistence diagram for generic vacua and $L_{\max} = 54$, looking at the underdense region. We see that as $L_{\max}$ is increased, the higher-dimensional cycles become shorter-lived.

Figure 12: Long-lived 1-cycles in the generic distribution of $T^6$ vacua. Here we used 150 landmark points for a subregion of the generic $T^6$ distribution with $L_{\max} = 54$.

W = 0 vacua
The symmetric $T^6$ features vacua with vanishing tree-level superpotential, W = 0. Considering only these vacua, we find that their distribution in moduli space exhibits different topological structure than the generic vacua. In particular we find that restricting to W = 0 vacua results in additional higher-dimensional cycles in certain regions of the distribution.
Combined with the F-flatness conditions, enforcing W = 0 means the simultaneous vanishing of $P_1(\tau)$ and $P_2(\tau)$. For details on enumerating fluxes giving rise to W = 0 vacua, see [69]. The important point is that the condition W = 0, when combined with flux quantization and tadpole cancellation, maps to different structure in the moduli space than the generic vacua exhibit. Moreover, as we will see, restricting to W = 0 vacua, unlike the discrete symmetry restriction of Sec. 3.4, induces richer topology in the distribution. This is possible because W = 0 is less restrictive about the stabilized vevs of the moduli.
Given $L_{\max}$, we generate all W = 0 vacua with $N_{\rm flux} \le L_{\max}$. We again fix gauge by mapping to the fundamental domain. We observe a large-scale structure that is insensitive to $L_{\max}$ (Fig. 13). Moreover, for sufficiently large $L_{\max}$, there are enough W = 0 vacua to form a complex topology in moduli space (Fig. 14). The multiscale topology of such vacua contains higher-dimensional cycles that are long-lived. In this case, we see that restricting to W = 0 vacua gives rise to a more complex topology. Not only is the topology more complex than the dense subregion of the generic distribution in which we are making our cut, but it is also more complex than the underdense subregions of the distribution (cf. Fig. 11). This is perhaps to be expected, as the extra requirement W = 0 imposes richer number-theoretic structure on the flux space.

Conclusion
Persistent homology can be used to study the shape of a data set. For string theory, we are interested in understanding the structure of the landscape. We showed that persistent homology can be used to effectively characterize generic and special vacua in toy models. Despite the simplicity of our toy examples, we learned a few things that should prove useful in a scaled-up program. For one, we learned that persistence pairing can be used to recover more refined information than is expressed in a persistence diagram. In particular, we  were able to reconstruct the presence of isolated vacua inside voids in the rigid Calabi-Yau construction. We also learned that to robustly characterize not only underdense regions like voids but also overdense regions like clusters, a notion of multiparameter persistence is useful. In addition to the usual length-based filtration parameter, one can consider an orthogonal density threshold parameter. As the density threshold is raised, we include points in dense regions, allowing persistent homology to recognize the presence of clusters by their excision. This notion could have interesting cosmological applications, e.g., in the context of [49][50][51][52][53][54]. We also studied the persistent homology of restricted distributions of special vacua (like those with discrete low-energy symmetries or vanishing tree-level superpotential). Understanding the topology of special vacua could have interesting consequences in more realistic models if we want to ask about the simultaneous satisfiability of multiple low-energy criteria and the distribution of those very special vacua.
There are plenty of future directions to consider. Obviously, we would like to study distributions arising from more realistic constructions. In this work we only considered the distribution of complex structure moduli and axiodilaton vevs stabilized by fluxes in type IIB string theory. In principle we also know how to stabilize the Kähler moduli and open string moduli. We also largely ignored the fluxes. Trying to combine the fluxes with the moduli vevs seemingly presents a problem, since the fluxes are discrete while the moduli vevs in our cases were rational or irrational. However, the fluxes do show up in the low-energy theory as coupling constants, and we could include them in a unified analysis by considering e.g. the masses of stabilized moduli.
Of course, as we move on to more realistic data sets, the constructions necessarily become more complicated. While we made use of persistent homology's ability to compute in high dimensions, we did not consider models with hundreds of moduli. Understanding how to adapt our techniques to such situations might require advances on the algorithmic/software side (see [80] for a comparison of different software packages). Another difficulty is in choosing parameters (number of witness points, subregion of moduli space, maximum filtration parameter) for the persistent homology computation. For this paper, we largely chose parameters via a "guess and check" method, as certain parameter values would e.g. freeze the program or not find what we were looking for. Ultimately we would like to perform a systematic scan over a large database of string models. Automating the choice of parameters then presents another difficulty. With such an automated scan, we would also have the problem of too much output. Systematically processing persistence diagrams provides another challenge. It could be here that persistent homology and machine learning techniques might be usefully coupled.
From a more physical perspective, it would be interesting to ask what structure in the landscape means for vacuum selection and tunneling between vacua. This is where related ideas in the study of energy landscapes may become useful. Morse theory relates the topology of sublevel sets with critical points of a potential. Therefore, TDA can be used to sample the topology of the string landscape to effectively find vacua and study their transitions. We plan to return to this interesting idea in a future work. An alternative approach to these questions from the perspective of network science was advocated in [81]. However, much of the interesting dynamics is suppressed in treating string vacua as nodes of a network. More understanding is also needed on the physics side here to study the transitions. For example, tunneling in the presence of dynamics remains poorly understood [82].
In this paper we were concerned with methods for mapping out the structure of the landscape itself. A complementary approach is to study the even vaster swampland of effective field theories with gravity that do not admit UV completions [83]. Recent conjectures put some interesting constraints on the shape of the energy landscape of string theory [84,85]. It would also be interesting to apply TDA or other data science techniques to understand the shape of the boundary between the landscape and the swampland, or to test various conjectures about the topology of moduli space [86].