Persistent Homology and String Vacua

We use methods from topological data analysis to study the topological features of certain distributions of string vacua. Topological data analysis is a multi-scale approach used to analyze the topological features of a dataset by identifying which homological characteristics persist over a long range of scales. We apply these techniques in several contexts. We analyze N=2 vacua by focusing on certain distributions of Calabi-Yau varieties and Landau-Ginzburg models. We then turn to flux compactifications and discuss how we can use topological data analysis to extract physical information. Finally we apply these techniques to certain phenomenologically realistic heterotic models. We discuss the possibility of characterizing string vacua using the topological properties of their distributions.


Introduction
When studying string compactifications one is often faced with a choice among a discrete set of parameters. For instance in certain N = 2 models one has to fix a Calabi-Yau variety, labelled for example by its Hodge numbers; similarly certain N = 1 effective models are parametrized by the choice of a collection of integers, representing flux quanta, subject to various constraints. In all these cases we are presented with many possibilities, all of which seem equivalent before a detailed study of the low energy effective theory. There are by now many available techniques to perform such an analysis, yet one can wonder if there is a simpler setting in which one can understand qualitative features of this set of choices.
We can represent this collection of choices with a set of points. Each point could represent a particular compactification manifold, or the set of parameters determining a vacuum for a fixed background geometry. For example in a flux compactification these points could be associated with integral fluxes which stabilize the physical moduli. Or their origin could be more mysterious, for example associated with a (not yet understood) distribution of manifolds with certain properties. In the former case these points arise from studying the equations associated with the physical model, as minima of a superpotential;

JHEP03(2016)045
in the latter from some ad hoc mathematical construction, as part of the open problem of classifying higher dimensional manifolds.
In this note we pose the following question: is there any particular topological structure in these sets of points? For example one can ask if vacua in a given distribution have the tendency to cluster in distinct regions, or if the distribution of vacua presents holes or analogous higher dimensional structures. Similarly one can wonder if a distribution of vacua characterized by certain physical properties is "simple", with almost no topological features, or "complex", with many non-trivial homology generators. We will study topological features of distributions of vacua, in the appropriate sense, and consider the possibility that such topological information could be of physical relevance. We will do so by applying techniques from topological data analysis to the problem of counting vacua in string theory.
Topological data analysis studies how homological features of a dataset persist over a long range of scales. This is obtained by constructing a family of simplicial complexes out of a dataset and studying its homologies at various length scales. This approach to topological spaces is called persistence [1,2]. The basic idea is that those homological features which are more persistent over the range of length scales can be used to give a topological characterization of the original dataset, as reviewed for example in [3][4][5]. We will discuss how this characterization can be used to analyze the physical significance of a distribution of vacua; for example by measuring its topological complexity in terms of its non-trivial persistent homology classes in higher degree, or if there are certain physical requirements on the parameters which correspond to distinctive topological features.
Such techniques are becoming standard practice in data analysis, with a wide range of applications, from biology and neuroscience to complex networks, natural images and syntax; see [6][7][8][9][10] for a sample of the literature. Among the strengths of topological data analysis is its robustness with respect to noisy samples, since spurious features typically show up as very short-lived persistent homology classes. For an application to the study of BPS states and enumerative problems in string and field theory see [11].
In plain words, the idea behind this note is simple. First of all, we take a set of string vacua, for example obtained as critical points in a flux compactification, or as a collection of compactification spaces. Out of this collection we construct a point cloud, a set of points in $\mathbb{R}^N$ each representing a vacuum. Then we "fatten" each point into a sphere of radius $\epsilon$. To this configuration we associate a continuous family of simplicial complexes by declaring that a number of points form an edge, a face or a higher dimensional simplex, if the associated spheres of radius $\epsilon$ intersect pairwise. The main idea of topological data analysis is to study such complexes as a function of $\epsilon$. To obtain topologically invariant information, we then pass to the homology of this continuous family of complexes, and obtain a family of homology groups parametrized by $\epsilon$. A little thought shows that the only thing that can happen is for a homology class to be born at some value $\epsilon_1$ and then to die at $\epsilon_2 > \epsilon_1$. We finally plot the lifespans (their persistence) of all the homology classes: these are called barcodes. At the end, we have associated a collection of barcodes to each distribution of vacua. Such a collection captures the topological features of the distribution at every length scale.
One could envision a program to study string vacua systematically from this perspective and ask which subsets of physically relevant vacua are characterized by which topological features. In this note we take a small step in this direction by studying the persistent homology associated with a few distributions of vacua, with N = 2 and N = 1 supersymmetry. In this context we will learn how to interpret and extract physical information from the barcodes.
As a first step we will discuss N = 2 vacua obtained from Calabi-Yau compactifications of the type II string or from Landau-Ginzburg models. Many Calabi-Yau varieties are known and have been constructed explicitly. An example is the Skarke-Kreuzer list [12], which parametrizes Calabi-Yau varieties obtained from reflexive polyhedra in four dimensions. Such lists offer only samples of the full set of Calabi-Yau varieties and indeed no systematic general construction is known. From a more geometrical perspective, studying these vacua is akin to studying the topological properties of the distributions of known Calabi-Yau varieties. We will consider the question of whether such distributions present any particular homological feature, and whether distinctive characteristics appear when restricting to geometries with certain properties, for example low Euler characteristic. We will take a similar approach to Landau-Ginzburg models.
Next we will discuss flux vacua in type IIB compactifications. As originally pointed out in [13,14] this is a very promising avenue to apply statistical techniques to distributions of string vacua. A great amount of work has been dedicated to understanding the distribution of the number of flux vacua with certain properties, as reviewed for example in [15][16][17], and part of these results have been put on firmer ground using tools from random algebraic geometry [18]. Here we take a topological approach, which is different from the statistical counting of vacua with certain properties and can in principle address the topological structure of the whole distribution of vacua. In this note we content ourselves with discussing the counterpart of results already known in the literature from the perspective of persistence, leaving a more detailed analysis for the future.
Finally we end with a very cursory look at a class of promising heterotic models constructed in [19][20][21]. Such models are characterized by a Calabi-Yau and a holomorphic bundle, plus a series of additional choices. We will consider N = 1 models which give rise to an SU(5) GUT, which can then be higgsed by Wilson lines giving a Standard Model-like spectrum. We will study the topological features of a distribution of vacua parametrized by the Hodge numbers of the underlying Calabi-Yau and the Chern classes of the holomorphic bundle. Then we will take a somewhat different perspective, and study in a simple example how these topological features change as we vary a certain physical parameter in the GUT spectrum. We will use this example to make more precise the statement that a collection of vacua has a higher degree of topological complexity compared to others.
In this note we content ourselves with understanding how to use techniques of computational topology to extract physical information from distributions of string vacua, and with discussing which kind of facts one might hope to learn. Of course this is just a first step and one could obtain a deeper intuition by enlarging the datasets available or considering different collections of vacua. Hopefully by working with collections of larger datasets corresponding to phenomenologically viable models, one could use these methods to gain physical insights, such as whether specific physical characteristics are accompanied by certain topological features. At this stage we have no evidence for this intriguing idea, and plan to investigate it more thoroughly in the future.

All the persistent homology computations in this paper have been done with matlab using the javaplex library from [22]. The accompanying software and dataset can be found at [23]. Manipulations of datasets have been carried out with mathematica.
This note is organized as follows. Section 2 gives a basic introduction to topological data analysis, presenting all the elements that we will need. In section 3 we discuss N = 2 vacua obtained from Calabi-Yau compactifications and Landau-Ginzburg models. Sections 4 and 5 contain applications of persistent homology to the study of flux vacua in type IIB compactifications and to certain phenomenologically realistic heterotic models, respectively. Finally in section 6 we summarize our findings.

An introduction to topological data analysis
Topological data analysis is a relatively recent approach to managing large sets of data using techniques based on computational topology [1,2]. It applies homological methods to collections of data arranged as point clouds to extract qualitative information. The results are only sensitive to topological information and not to geometrical quantities, such as a choice of a metric or of a system of coordinates. Furthermore it has the advantage of being functorial by construction, since it studies the relations between objects as the parameters of the model are varied. The analysis of data based on topology is rapidly gaining momentum in fields such as biology, computer vision, neuroscience, languages or complex systems [6][7][8][9][10].
In this section we will survey some basic ideas and techniques of topological data analysis. The reader interested in a more in depth discussion should consult the reviews [3][4][5], which we will follow. For a more detailed discussion aimed at physicists and various examples, see [11]. After reviewing some basic elements of algebraic topology, we introduce the Vietoris-Rips complex and persistent homology. We also discuss some approximation schemes in the computation of barcodes and discuss how to set up the topological analysis.

Elements of algebraic topology
We start with a quick review of some notions of algebraic topology that we shall use in the following. In many applications it is useful to approximate a topological space by a triangulation. This can be done by means of a simplicial complex. A simplex is the convex hull of a set of points, its vertices. Its dimension is the number of vertices minus one: a 0-simplex is a single vertex, a 1-simplex is an edge between two vertices, a 2-simplex is a triangle, and so on. A face of a simplex is the convex hull of a subset of its vertices. A simplicial complex is basically a collection of simplices with the property that if a simplex is part of it, then so are all of its faces. We can state this more abstractly as follows: a simplicial complex $S$ is a collection $\Sigma$ of non-empty sets, its simplices, such that if $\sigma \in \Sigma$ and $\tau \subseteq \sigma$ is non-empty, then $\tau \in \Sigma$. We say that $\sigma \in \Sigma$ is a $k$-simplex if it has cardinality $k + 1$. One can define maps between simplicial complexes in the natural way. A simplicial map $f$ between two simplicial complexes $S_1$ and $S_2$ is a map between the corresponding vertex sets such that a simplex $\sigma$ of $S_1$ is mapped into a simplex $f(\sigma)$ of $S_2$. A simplicial map takes a $p$-simplex into a $k$-simplex, with $k \le p$.
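The abstract definition can be made concrete with a few lines of Python. The following is only an illustrative sketch: the function names `closure` and `is_simplicial_complex` are ours and are not part of any library used in this paper. Simplices are represented as sets of vertex labels, and the defining property is checked directly.

```python
from itertools import combinations

def closure(simplices):
    """Smallest simplicial complex containing the given simplices.

    Every non-empty subset of each input simplex is added as a face.
    """
    family = set()
    for s in simplices:
        s = tuple(s)
        for size in range(1, len(s) + 1):
            family.update(frozenset(t) for t in combinations(s, size))
    return family

def is_simplicial_complex(simplices):
    """Check that every non-empty subset of every simplex is also present."""
    family = {frozenset(s) for s in simplices}
    for sigma in family:
        for size in range(1, len(sigma)):
            for tau in combinations(sigma, size):
                if frozenset(tau) not in family:
                    return False
    return True

# A filled triangle: one 2-simplex together with all of its faces.
triangle = closure([(0, 1, 2)])
# Dimension of a simplex = number of vertices minus one.
dimensions = sorted(len(sigma) - 1 for sigma in triangle)
```

Here `triangle` contains three vertices, three edges and one 2-simplex, while a bare edge such as `[(0, 1)]` without its endpoints fails the check.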

Let us consider a couple of examples. A simple simplicial complex which can be attached to a topological space $X$ is the nerve of a covering $\mathcal{U}$, $\mathrm{Nerv}\,\mathcal{U}$. Consider a covering $\mathcal{U} = \{U_i\}_{i \in I}$ labelled by a set $I$. The nerve of $\mathcal{U}$ is then defined in terms of non-empty sub-collections of sets as
$$ \mathrm{Nerv}\,\mathcal{U} = \Big\{ \sigma \subseteq I \ : \ \sigma \neq \emptyset , \ \bigcap_{i \in \sigma} U_i \neq \emptyset \Big\} , $$
that is, as the simplicial complex whose vertex set is $I$ and where a set $\sigma = \{i_0, \dots, i_p\} \subseteq I$ defines a simplex if and only if $U_{i_0} \cap \cdots \cap U_{i_p} \neq \emptyset$. This construction does not depend on the particular details of the covering. In particular, under general assumptions, $\mathrm{Nerv}\,\mathcal{U}$ is homotopy equivalent to the space $X$. For a metric space $X$ consider for example the covering $\{B_\epsilon(x)\}_{x \in X}$ given by balls of radius $\epsilon > 0$. In particular assume we can write $X = \bigcup_{i \in I} B_\epsilon(i)$ for a subset $I \subseteq X$. Then the nerve construction applied to the covering $\{B_\epsilon(i)\}_{i \in I}$ gives the Čech complex $\check{C}\mathrm{ech}_\epsilon(I)$ associated to the set $I$ and to $\epsilon$. Note that as $\epsilon$ increases, the balls get bigger and bigger and therefore whenever $\epsilon_1 \le \epsilon_2$ we have that $\check{C}\mathrm{ech}_{\epsilon_1}(I) \subseteq \check{C}\mathrm{ech}_{\epsilon_2}(I)$.
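As an illustration of the nerve construction, the sketch below computes the nerve of a finite covering whose members are given as finite sets of sample points. This is a toy setting chosen for simplicity; the function name `nerve` and the example covering are hypothetical and are not taken from the software accompanying this paper.

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Nerve of a finite cover given as {label: set of points}.

    A collection of labels spans a simplex iff the corresponding
    sets have a non-empty common intersection.
    """
    labels = sorted(cover)
    simplices = []
    for k in range(max_dim + 1):
        for sigma in combinations(labels, k + 1):
            common = set.intersection(*(set(cover[i]) for i in sigma))
            if common:
                simplices.append(sigma)
    return simplices

# Three overlapping "arcs" covering six points arranged on a circle.
arcs = {"a": {0, 1, 2}, "b": {2, 3, 4}, "c": {4, 5, 0}}
circle_nerve = nerve(arcs)
```

The three arcs intersect pairwise but have no common point, so the nerve is a hollow triangle, which has the homotopy type of the circle being covered.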
From a simplicial complex we can get interesting topological information by passing to its homology. Assume we have a simplicial complex $S$, with an ordering of the vertex set. We form the vector space of $k$-chains $C_k$ by considering linear combinations $c = \sum_i a_i \sigma_i$, where $\sigma_i$ is a $k$-simplex in $S$ and $a_i \in \mathbb{Z}_p$ (typically for $p$ a small prime). The boundary of a $k$-simplex $\sigma$ is the union of its $(k-1)$-subsimplices $\tau \subseteq \sigma$. One defines the boundary operator $\partial_k : C_k \longrightarrow C_{k-1}$ on simplices as
$$ \partial_k [v_0, \dots, v_k] = \sum_{i=0}^{k} (-1)^i \, [v_0, \dots, \hat{v}_i, \dots, v_k] , $$
where the hatted variables are omitted. We can now form the chain complex
$$ \cdots \xrightarrow{\ \partial_{k+1}\ } C_k \xrightarrow{\ \partial_k\ } C_{k-1} \xrightarrow{\ \partial_{k-1}\ } \cdots $$
and define the spaces of $k$-cycles $Z_k = \ker \partial_k$ and $k$-boundaries $B_k = \mathrm{Im}\, \partial_{k+1}$. We define the homology $H_k(C_\bullet; \mathbb{Z}_p)$ as the quotient $Z_k / B_k$ and its Betti numbers
$$ b_k = \dim H_k(C_\bullet; \mathbb{Z}_p) . $$
If the simplicial complex $S$ is derived from an underlying topological space $X$, for example via the Nerv or Čech construction, the Betti numbers give information about the topology of $X$. Heuristically $b_k$ measures the number of independent holes of dimension $k$. Perhaps the most important feature of this construction is that it is functorial. Consider a simplicial map $f : S_1 \longrightarrow S_2$ between two simplicial complexes $S_1$ and $S_2$, for example induced by a map between the two underlying topological spaces. This map induces the chain map $C_\bullet(f) : C_\bullet(S_1) \longrightarrow C_\bullet(S_2)$ between chain complexes, such that the diagram
$$ \begin{array}{ccc} C_k(S_1) & \xrightarrow{\ \partial_k\ } & C_{k-1}(S_1) \\ \downarrow {\scriptstyle C_k(f)} & & \downarrow {\scriptstyle C_{k-1}(f)} \\ C_k(S_2) & \xrightarrow{\ \partial_k\ } & C_{k-1}(S_2) \end{array} $$
commutes. The map $f$ induces the homomorphism $f_* : H_k(S_1; \mathbb{Z}_p) \longrightarrow H_k(S_2; \mathbb{Z}_p)$. We will use these facts extensively in the following; in our case the map between two different simplicial complexes $S_1$ and $S_2$ will be the inclusion. Then functoriality can be used to understand the fate of the homology classes of $H_\bullet(S_1; \mathbb{Z}_p)$ in $H_\bullet(S_2; \mathbb{Z}_p)$. This idea leads to persistent homology.
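To see how Betti numbers follow from the boundary operators, one can use $b_k = \dim C_k - \mathrm{rank}\, \partial_k - \mathrm{rank}\, \partial_{k+1}$. The following is a minimal sketch over $\mathbb{Z}_2$, where signs can be ignored; the helper names `rank_gf2` and `betti_numbers` are ours, and the input is assumed to be a complex that already contains all faces of its simplices.

```python
from itertools import combinations

def rank_gf2(rows):
    """Rank over Z_2 of a matrix whose rows are encoded as integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit, used as the pivot column
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def betti_numbers(simplices, max_dim):
    """Betti numbers b_0..b_max_dim over Z_2 of a closed simplicial complex."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, set()).add(tuple(sorted(s)))
    ranks = {0: 0}  # the boundary operator on 0-chains is zero
    for k in range(1, max_dim + 2):
        column = {f: j for j, f in enumerate(sorted(by_dim.get(k - 1, ())))}
        rows = []
        for s in by_dim.get(k, ()):
            mask = 0
            for face in combinations(s, k):  # the (k-1)-faces of s
                mask |= 1 << column[face]
            rows.append(mask)
        ranks[k] = rank_gf2(rows)
    return [len(by_dim.get(k, ())) - ranks[k] - ranks[k + 1]
            for k in range(max_dim + 1)]

vertices = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
hollow = betti_numbers(vertices + edges, max_dim=1)
filled = betti_numbers(vertices + edges + [(0, 1, 2)], max_dim=1)
```

The hollow triangle has one connected component and one loop, while adding the 2-simplex fills the loop.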

Point clouds and the Rips-Vietoris complex
We will be interested in a version of the previous constructions. The starting point is no longer a topological space, but a finite collection of points $\{x_i\}_{i \in I}$ in $\mathbb{R}^n$. We will call such a collection $X$ a point cloud. In most practical applications a point cloud is constructed out of a multidimensional data set. Topological data analysis is basically a framework which associates topological information to a point cloud, via the homology of a certain complex. Given a point cloud $X$ it is natural to define a simplicial complex whose vertices correspond to the set of points in $X$. To define $k$-simplices we use a version of the Nerv construction. From $X$ we define the space $X_\epsilon = \bigcup_{x_i \in X} B_\epsilon(x_i)$, by fattening the points of $X$. In $X_\epsilon$ each point of $X$ is replaced by a ball of radius $\epsilon > 0$, also called the proximity parameter. For example now we can associate to $X$ the Čech complex $\check{C}\mathrm{ech}_\epsilon(X)$ whose vertex set is the set of points of $X$ and whose $k$-simplices are collections of points $\sigma = \{x_{i_0}, \dots, x_{i_k}\}$ such that
$$ \bigcap_{j=0}^{k} B_\epsilon(x_{i_j}) \neq \emptyset . $$
The problem with the Čech complex $\check{C}\mathrm{ech}_\epsilon(X)$ is that it is computationally lengthy to check all the intersections. It is useful to approximate the Čech complex with a simpler variant, the Vietoris-Rips complex $\mathrm{VR}_\epsilon(X)$. To the set of points in $X$ we associate the Vietoris-Rips simplicial complex as follows. Given a proximity parameter $\epsilon$, a $k$-simplex in $\mathrm{VR}_\epsilon(X)$ is a collection of $k + 1$ points $\{x_{i_0}, \dots, x_{i_k}\}$ whose pairwise distances are at most $\epsilon$, that is
$$ d(x_{i_a}, x_{i_b}) \le \epsilon \qquad \text{for all } 0 \le a < b \le k . $$
Equivalently we assign balls of radius $\epsilon/2$ to each point, we connect two points by an edge anytime their balls intersect, and we fill in a higher dimensional simplex whenever all of its edges are present. The natural orientation is given by declaring that the $k$-simplex $[v_0, \cdots, v_k]$ changes sign under an odd permutation. The main difference with the Čech complex $\check{C}\mathrm{ech}_\epsilon(X)$ is that in defining the Vietoris-Rips complex $\mathrm{VR}_\epsilon(X)$ we only have to compute the distance for a pair of points at a time.
Furthermore the fact that in the former the parameter $\epsilon$ is the radius of the balls, while in the latter it is the distance between the centers, leads to the inclusions
$$ \check{C}\mathrm{ech}_\epsilon(X) \subseteq \mathrm{VR}_{2\epsilon}(X) \subseteq \check{C}\mathrm{ech}_{2\epsilon}(X) . \qquad (2.6) $$
Finally, now that we know how to construct the Vietoris-Rips complex $\mathrm{VR}_\epsilon(X)$ of a point cloud $X$, we can compute its homology $H_i(\mathrm{VR}_\epsilon(X); \mathbb{Z}_p)$.
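The Vietoris-Rips construction is simple enough to be sketched in a few lines. The following brute-force Python function is only meant to illustrate the definition and is not the javaplex implementation used for the computations in this paper; the name `vietoris_rips` and the convention of including pairs at distance exactly equal to the proximity parameter are assumptions of this sketch.

```python
from itertools import combinations
import math

def vietoris_rips(points, eps, max_dim=2):
    """Simplices (as tuples of point indices) of the Vietoris-Rips complex.

    A set of k+1 points spans a k-simplex when all pairwise
    distances are at most eps. Brute force, for small clouds only.
    """
    n = len(points)
    close = [[math.dist(points[i], points[j]) <= eps for j in range(n)]
             for i in range(n)]
    simplices = [(i,) for i in range(n)]
    for k in range(1, max_dim + 1):
        for sigma in combinations(range(n), k + 1):
            if all(close[a][b] for a, b in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices

# Four points at the corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
small = vietoris_rips(square, 1.1)   # sides only: a hollow square
large = vietoris_rips(square, 1.5)   # diagonals and triangles appear
```

For the unit square, a proximity parameter between the side length and the diagonal gives a loop made of four edges, while a larger value fills it in with triangles.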

Persistent homology and barcodes
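The idea behind persistent homology is to study the homology spaces $H_i(\mathrm{VR}_\epsilon(X); \mathbb{Z}_p)$ as a function of $\epsilon$. Instead of a simplicial complex, now we have a collection of them, $\mathrm{VR}_\epsilon(X)$ parametrized by $\epsilon$, which leads to a family of homology spaces $H_i(\mathrm{VR}_\epsilon(X); \mathbb{Z}_p)$, again parametrized by $\epsilon$. While these in principle are continuous families, there will be only a finite number of inequivalent simplicial complexes, which appear at finitely many values of $\epsilon$. We label these values of $\epsilon$ as $\{\epsilon_a\}_{a \in J}$ where $J$ is a finite set. From the point cloud $X$ we construct the sequence of inclusions of spaces
$$ X_{\epsilon_0} \hookrightarrow X_{\epsilon_1} \hookrightarrow X_{\epsilon_2} \hookrightarrow \cdots $$
for $0 = \epsilon_0 < \epsilon_1 < \epsilon_2 < \cdots$. For each $X_{\epsilon_a}$ we can construct the associated Vietoris-Rips complex $\mathrm{VR}_{\epsilon_a}(X)$. This leads to the filtration of simplicial complexes
$$ \mathrm{VR}_{\epsilon_0}(X) \hookrightarrow \mathrm{VR}_{\epsilon_1}(X) \hookrightarrow \mathrm{VR}_{\epsilon_2}(X) \hookrightarrow \cdots \qquad (2.8) $$
Taking the $i$-th homology gives
$$ H_i(\mathrm{VR}_{\epsilon_0}(X); \mathbb{Z}_p) \longrightarrow H_i(\mathrm{VR}_{\epsilon_1}(X); \mathbb{Z}_p) \longrightarrow H_i(\mathrm{VR}_{\epsilon_2}(X); \mathbb{Z}_p) \longrightarrow \cdots \qquad (2.9) $$
We will see momentarily that this is an example of an N-persistence module, and that it is completely characterized by its barcode. Let us however begin with a couple of remarks.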
First of all what we have defined is technically an R-persistence module, since $\epsilon$ is a real variable. However only finitely many simplicial complexes will be distinct and therefore we can consider this an N-persistence module. To do this we have to choose an order-preserving map $f : \mathbb{N} \longrightarrow \mathbb{R}$; any choice will do, but a choice has to be made. Secondly, it can be useful to think a bit more about (2.9). The maps in (2.9) are the lift to homology of the inclusions between the Vietoris-Rips complexes in (2.8). Since homology is functorial, these maps keep track of the corresponding homology classes. In other words we know if the same homology class is present both in $H_i(\mathrm{VR}_{\epsilon_a}(X); \mathbb{Z}_p)$ and $H_i(\mathrm{VR}_{\epsilon_b}(X); \mathbb{Z}_p)$ for arbitrary $\epsilon_a$ and $\epsilon_b$. Therefore the only thing that can happen is for a homology class to be born at a certain "time" $\epsilon_a$ and then to die at a subsequent "time" $\epsilon_b$, with $\epsilon_a < \epsilon_b$ (we allow for the case $\epsilon_b = +\infty$). Barcodes are a simple tool to visualize homological features of a data set. They are precisely what keeps track of these births and deaths. The idea of persistence is to look for features which persist over a large range of values of $\epsilon$. Here "large" can have different meanings, depending on the point cloud or on the particular questions one is asking. Persistent features are a measure of the underlying topological structure of the dataset. On the other hand, short-lived signatures are interpreted as noise, which depends on the particular approximations one is using when constructing the dataset.
Let us make the above discussion a bit more precise. Let $K$ be a field. An N-persistence module over $K$ is a family of vector spaces $\{V_i\}_{i \in \mathbb{N}}$ over $K$, together with a collection of homomorphisms $\rho_{i,j} : V_i \longrightarrow V_j$ for every $i \le j$, such that whenever $i \le k \le j$, the homomorphisms are compatible (in the sense that $\rho_{k,j} \circ \rho_{i,k} = \rho_{i,j}$). Persistence modules can be added, to create a new persistence module. Conversely we can ask if a persistence module can be decomposed into simpler modules.
The usefulness of persistence modules lies in the existence of a classification result. This is a generalization of the similar result from elementary linear algebra, which states that finite dimensional vector spaces are classified by their dimension, up to isomorphism. In the same fashion, certain classes of persistence modules are classified by their barcodes.

We say that the persistence module $\{V_i\}_{i \in \mathbb{N}}$ is tame if each $V_i$ is finite dimensional and $\rho_{n,n+1} : V_n \longrightarrow V_{n+1}$ is an isomorphism for large enough $n$. Given two integers $(m, n)$ so that $m \le n$, we introduce the "interval" N-persistence module $K(m, n)$, given by
$$ K(m,n)_i = \begin{cases} K & \text{if } m \le i \le n \\ 0 & \text{otherwise} \end{cases} , \qquad \rho_{i,j} = \begin{cases} \mathrm{id}_K & \text{if } m \le i \le j \le n \\ 0 & \text{otherwise} \end{cases} . $$
In words $K(m, n)$ assigns the vector space $K$ to the interval $[m, n]$. Note that we can extend this definition for $n = +\infty$. Then the classification result states that any tame N-persistence module $V$ over $K$ can be decomposed as
$$ V \simeq \bigoplus_{i=1}^{N} K(m_i, n_i) $$
for a certain $N$, and the decomposition is unique up to the ordering of the factors. In plain words a tame persistence module is completely determined by the intervals in $\mathbb{N}$ where we assign non zero vector spaces. Therefore a tame N-persistence module is equivalent to the assignment of a collection of pairs of non-negative integers $(m_i, n_i)$, where $0 \le m_i \le n_i$ and we allow $n_i$ to be $+\infty$. We call such an assignment a barcode and we represent it graphically by a collection of bars, each one associated with the aforementioned intervals. Each collection of maps and vector spaces (2.9) is clearly an N-persistence module. Since the point cloud $X$ is finite, it is also tame. Each N-persistence module (2.9), obtained by taking homology in degree $i$, is then uniquely determined by its barcode. In the rest of this paper we will compute the barcodes associated to N-persistence modules which arise from certain point clouds and discuss the physical interpretation of their persistent features.
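To make the passage from a filtration to a barcode concrete, the following sketch implements the standard column reduction of the boundary matrix over $\mathbb{Z}_2$ for a Vietoris-Rips filtration. It is a textbook algorithm written for illustration and is not the javaplex code used in this paper; the function names are ours, bars are labelled by the values of the proximity parameter rather than by integers, and zero-length bars are discarded. Since simplices above `max_dim` are not built, classes in degree `max_dim` cannot be trusted and infinite bars in that degree are dropped.

```python
from itertools import combinations
import math

def rips_filtration(points, max_dim=2):
    """List of (appearance value, simplex), ordered so faces precede cofaces."""
    def d(i, j):
        return math.dist(points[i], points[j])
    filt = [(0.0, (i,)) for i in range(len(points))]
    for k in range(1, max_dim + 1):
        for s in combinations(range(len(points)), k + 1):
            # a simplex enters the filtration together with its longest edge
            filt.append((max(d(a, b) for a, b in combinations(s, 2)), s))
    filt.sort(key=lambda t: (t[0], len(t[1]), t[1]))
    return filt

def barcodes(filt):
    """Z_2 column reduction; returns {degree: [(birth, death), ...]}."""
    index = {s: i for i, (_, s) in enumerate(filt)}
    top_dim = max(len(s) for _, s in filt) - 1
    reduced = {}   # pivot row -> reduced column (set of row indices)
    paired = set()
    bars = {}
    for j, (eps, s) in enumerate(filt):
        col = set()
        if len(s) > 1:
            col = {index[f] for f in combinations(s, len(s) - 1)}
        while col and max(col) in reduced:
            col ^= reduced[max(col)]   # add columns modulo 2
        if col:
            i = max(col)               # simplex j kills the class born at i
            reduced[i] = col
            paired.update((i, j))
            birth = filt[i][0]
            if eps > birth:
                bars.setdefault(len(filt[i][1]) - 1, []).append((birth, eps))
    for j, (eps, s) in enumerate(filt):
        if j not in paired and len(s) - 1 < top_dim:
            bars.setdefault(len(s) - 1, []).append((eps, math.inf))
    return bars

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
bars = barcodes(rips_filtration(square))
```

For the four corners of a unit square, the degree zero barcode contains three bars that end when the sides appear and one infinite bar, while the degree one barcode contains a single bar, born at the side length and dying at the length of the diagonal.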

Approximation schemes and witness complexes
When a point cloud X consists of a large number of points, the computation of the Vietoris-Rips simplicial complexes and of the associated N-persistence modules can become quite intractable. We will now discuss certain approximation schemes, based on the idea that one could select only a limited subset L of X as vertices of the simplicial complexes.
A landmark selector is an operator which chooses the subset $L$ from $X$. Each landmark selector has its own advantages and disadvantages and it is important to be aware of them. An obvious landmark selector picks the elements of $L$ at random. This is quite useful, although depending on how scattered the dataset is, it may miss important features. We will mostly use a so-called maxmin selector. The idea is to select points by induction, in order to maximize the distance from the already chosen set. More precisely we start from a randomly chosen point. The remaining ones are chosen by induction: if $L_i$ consists of $i$ chosen landmark points, then the $(i+1)$-th landmark point is chosen in order to maximize the function $z \longmapsto d(L_i, z)$, where $d(L_i, z)$ is the distance between the landmark set $L_i$ and the point $z \in X$, that is the distance from $z$ to its closest landmark. Note that with a maxmin landmark selection, the landmark set will consist of points spreading apart from each other as much as possible. As a consequence the set will cover the dataset, in principle better than a random selection. The drawback is that the maxmin algorithm will generically choose outlier points.
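The maxmin selection can be sketched as follows. This is an illustrative Python version written under the assumption of the Euclidean distance; the function name `maxmin_landmarks` and the `seed` parameter are ours and do not refer to the javaplex selectors.

```python
import math
import random

def maxmin_landmarks(points, n_landmarks, seed=0):
    """Indices of landmark points chosen by the maxmin rule."""
    rng = random.Random(seed)
    first = rng.randrange(len(points))   # random starting point
    landmarks = [first]
    # dist_to_L[z] = d(L_i, z): distance from z to its closest landmark
    dist_to_L = [math.dist(p, points[first]) for p in points]
    while len(landmarks) < min(n_landmarks, len(points)):
        nxt = max(range(len(points)), key=dist_to_L.__getitem__)
        if dist_to_L[nxt] == 0:
            break                        # only duplicates of landmarks remain
        landmarks.append(nxt)
        dist_to_L = [min(d, math.dist(p, points[nxt]))
                     for d, p in zip(dist_to_L, points)]
    return landmarks

# Three nearby points and one outlier on a line.
cloud = [(0.0,), (1.0,), (2.0,), (10.0,)]
chosen = maxmin_landmarks(cloud, 3)
```

In this small example the outlier at position 10 is selected whatever the starting point is, which illustrates the drawback mentioned above.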

A landmark selector can greatly simplify the computation of the homology. We will use landmark selection to define approximations to the Vietoris-Rips complex, the witness complexes. Assume we have chosen a landmark set $L$ from a point cloud $X$. We define the witness complex $W(X, L, \epsilon)$ as follows. The vertex set of $W(X, L, \epsilon)$ is given by $L$. To define simplices, we pick a point $x \in X$ and we denote by $m_k(x)$ the distance between $x$ and its $(k+1)$-th closest landmark point. Then a collection of $k + 1$ vertices $l_i$, with $k > 0$, forms a $k$-simplex $[l_0 \dots l_k]$ if all its faces are in $W(X, L, \epsilon)$ and there exists a witness point $x \in X$ so that
$$ \max \{ d(l_0, x), \dots, d(l_k, x) \} \le \epsilon + m_k(x) . \qquad (2.12) $$
In particular we have the inclusion $W(X, L, \epsilon_1) \subseteq W(X, L, \epsilon_2)$ when $\epsilon_1 < \epsilon_2$. Therefore we can construct a filtration of witness complexes as a function of the proximity parameter $\epsilon$.
Passing to the homology defines N-persistence modules and we can study their persistent features by looking at the barcodes. Another approximation scheme is the lazy witness complex $LW(X, L, \epsilon)$. Again this complex depends on a landmark selection, but this time an extra parameter $\nu \in \mathbb{N}$ is involved. The vertex set of $LW(X, L, \epsilon)$ is again given by $L$. To define simplices we need to introduce a notion of distance. For $x \in X$ we define $m(x)$ as the distance between $x$ and the $\nu$-th closest landmark point if $\nu$ is non zero, and set $m(x) = 0$ otherwise. Then, given two vertices $l_1$ and $l_2$ in $L$, the edge $[l_1 l_2]$ is in $LW(X, L, \epsilon)$ if we can find a witness point $x \in X$ so that
$$ \max \{ d(l_1, x), d(l_2, x) \} \le \epsilon + m(x) . $$
Finally a higher dimensional simplex is an element of the lazy witness complex $LW(X, L, \epsilon)$ if all of its edges are. Note that again the inclusion $LW(X, L, \epsilon_1) \subset LW(X, L, \epsilon_2)$ holds whenever $\epsilon_1 < \epsilon_2$ and again we can construct filtrations by inclusion. The usefulness of the lazy witness complex is that it is less computationally involved since it is determined by its 1-skeleton. The lazy witness complex is the simplest way to study the persistent homology of a dataset. In this paper we will always set $\nu = 1$.
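Since the lazy witness complex is determined by its 1-skeleton, it is enough to illustrate how the edges are found. The sketch below follows the witness condition that the distances from a witness to both landmarks are at most the proximity parameter plus $m(x)$; it is our own illustrative Python code with hypothetical names, assumes the Euclidean distance, and is not the javaplex implementation.

```python
from itertools import combinations
import math

def lazy_witness_edges(points, landmarks, eps, nu=1):
    """Edges of the lazy witness complex as pairs of landmark indices.

    points    : the full point cloud (every point may act as a witness)
    landmarks : indices into points selecting the landmark set
    nu        : m(x) is the distance to the nu-th closest landmark (0 if nu = 0)
    """
    edges = set()
    for x in points:
        dists = [math.dist(x, points[l]) for l in landmarks]
        m = sorted(dists)[nu - 1] if nu > 0 else 0.0
        # landmarks witnessed by x at this scale; any two of them share an edge
        near = sorted(l for l, d in zip(landmarks, dists) if d <= eps + m)
        edges.update(combinations(near, 2))
    return edges

# Four points on a line, with the two endpoints as landmarks.
line = [(0.0,), (1.0,), (2.0,), (3.0,)]
no_edge = lazy_witness_edges(line, [0, 3], eps=0.0)
one_edge = lazy_witness_edges(line, [0, 3], eps=1.0)
```

Higher dimensional simplices are then obtained by filling in every set of landmarks whose edges are all present, exactly as in the flag construction used for the Vietoris-Rips complex.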
In this paper we will use matlab to perform all the homology computations and our programs will employ the library javaplex, available from [22]. We will use mathematica for the manipulations of the datasets. Our matlab programs and datasets are available at [23].

Topological analysis
Finally we collect some qualitative ideas on how the topological analysis based on persistent homology will be used in our context. Again we refer the reader to [3][4][5] for more examples of the practical uses of these techniques.
1. Topological data analysis provides qualitative information about a dataset. Heuristically it determines the topological properties of a dataset, such as its clustering in connected components, or the presence of loops or in general higher dimensional surfaces. Being topological, this information is independent of the set of coordinates or any metric used for the analysis. For example by regarding a dataset as a statistical approximation to an underlying topological manifold at a certain length scale, one computes the Betti numbers of such a manifold. From the barcodes one obtains information about the homologically non-trivial n-cycles as well as their characteristic length scale as measured by the proximity parameter $\epsilon$.
2. The presence of barcodes in higher degree indicates that the data are organized in higher dimensional homologically non-trivial cycles, at least at a certain length scale. This can be seen as evidence of existing correlations between the data. For example, if in a certain region of the point cloud the points are arranged along an n-cycle, it is possible that there exist relations among them, in the form of a series of algebraic equations which describe the n-cycle.
3. Short-lived persistent homology classes are generically regarded as noise, while long-lived classes point towards homologically robust features. This is intuitively clear and follows from the definition of the Vietoris-Rips complex and its variations. Therefore one is led to look for long-lived bars, encoding the persistent homology classes. Of course, what it means to be short- or long-lived is somewhat subjective and depends sensitively on the physical problem. We will see examples of physically interesting but short-lived persistent homology classes. In particular this can happen when the existence of a symmetry or modular property forces the same behavior at various length scales.
4. The topological analysis can be most effective when comparing different datasets. This perspective is commonly used in fields such as biology or neuroscience, where the barcodes of different datasets can reveal if a certain drug was effective or not. In our case one can for example select string vacua with or without a certain feature, say a certain particle present in the low energy spectrum, and ask if this entails qualitative differences, and of which type.
In the following we will apply these ideas to certain distributions of string vacua.

N = 2 vacua
Having set up our main computational tools, we proceed to use them in a few specific examples of N = 2 vacua of the type II string. Examples of such vacua include Calabi-Yau manifolds and Landau-Ginzburg models. The construction of Calabi-Yaus is an art in its own right and while hundreds of thousands of examples are known explicitly, the list is far from exhaustive. The classification problem is still wide open, and the origin of Calabi-Yau varieties is still rather mysterious. It is natural to wonder if the techniques we have presented so far can be applied to the known distributions of Calabi-Yau varieties, and what kind of information we can hope to gain. Similar arguments hold in the case of Landau-Ginzburg models, which play an important part in the classification of N = 2 superconformal field theories.
In particular we will study the following (incomplete) set of string vacua:

• The Skarke-Kreuzer list from [24], containing Calabi-Yaus which can be realized as hypersurfaces in a toric variety, and which correspond to four dimensional reflexive polyhedra. Of this list, 30108 varieties have distinct Hodge numbers.
• Complete Intersection Calabi-Yaus (CICYs), which are constructed via a complete intersection of polynomials within a product of projective spaces. We take the list from [25], which contains the 7890 CICYs constructed in [26]. From this list we take the 266 pairs of distinct Hodge numbers and add their mirrors.
• A list of Landau-Ginzburg models, their abelian orbifolds and certain models with discrete torsion, taken from [27]. We parametrize these models by (χ, n +n) (for simplicity we remove by hand all those models which need an extra label, which are characterized by χ = 0).

Calabi-Yau compactifications
We begin by considering vacua of the type II string which have the form $\mathbb{R}^{3,1} \times X$ where $X$ is a Calabi-Yau threefold, a complex three dimensional manifold with trivial canonical bundle. These compactifications preserve N = 2 supersymmetry in $\mathbb{R}^{3,1}$. The Calabi-Yau theorem states that for each Kähler class $\omega \in H^{1,1}(X; \mathbb{C})$ there exists a unique Ricci flat Kähler metric in that class. The moduli space of Calabi-Yau metrics is parametrized by $h^{2,1}(X) = \dim H^{2,1}(X; \mathbb{C})$ complex structure deformations and $h^{1,1}(X) = \dim H^{1,1}(X; \mathbb{C})$ Kähler moduli, which correspond to scalar fields in the effective four dimensional theory. These Hodge numbers characterize the low energy effective action by determining geometrically the number of vector multiplets and hypermultiplets. The Euler characteristic of a Calabi-Yau is given by
$$ \chi(X) = 2 \left( h^{1,1}(X) - h^{2,1}(X) \right) . $$
The superconformal theories which describe the propagation of strings on Calabi-Yaus come in pairs, a phenomenon known as mirror symmetry. Mirror symmetry has been established for many pairs of Calabi-Yaus. In this case we say that two Calabi-Yaus form a mirror pair $(X, Y)$ and they represent the same physical vacuum. The Hodge numbers for mirror pairs are related as $h^{1,1}(X) = h^{2,1}(Y)$ and $h^{2,1}(X) = h^{1,1}(Y)$. More deeply, complexified Kähler moduli and complex structure moduli are exchanged. Often certain quantities associated with a Calabi-Yau can be computed exactly and this leads to interesting mathematical predictions for the mirror manifold. For example quantum corrections due to worldsheet instantons, which modify the low energy effective action, are associated with an extremely interesting enumerative problem, Gromov-Witten theory, counting holomorphic curves on $X$. Gromov-Witten theory produces symplectic invariants of $X$. In this note we will only consider the cruder topological information contained in the Hodge numbers and Euler characteristic; applications of persistent homology to enumerative problems of Calabi-Yaus appear in [11].
Therefore as a first approximation we label a Calabi-Yau compactification by its Hodge numbers and Euler characteristic as $(h^{1,1}(X) + h^{2,1}(X), \chi(X))$. Of course this is a rather crude approximation, since different Calabi-Yaus can have the same Hodge numbers. We would like to consider a distribution of these values as a point cloud $\mathbb{X}$, where a Calabi-Yau $X$ corresponds to the point $x = (h^{1,1}(X) + h^{2,1}(X), \chi(X))$, and study its persistent homology. A related question is whether point clouds $\mathbb{X}$ obtained in this way from collections of varieties arising from different constructions have different topological features, as seen from their barcodes. We will address these questions with certain known lists of Hodge numbers of Calabi-Yaus.
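The labelling just described can be sketched in a few lines of Python. This is only an illustration, not the program used for the analysis: the function names `hodge_point` and `point_cloud` are hypothetical, and the three Hodge pairs in the usage line (the quintic, the bicubic in $\mathbb{P}^2 \times \mathbb{P}^2$ and a self-mirror pair) are sample inputs standing in for the full lists.

```python
from typing import Iterable, List, Set, Tuple

Point = Tuple[int, int]


def hodge_point(h11: int, h21: int) -> Point:
    """Map Hodge numbers to the point (h11 + h21, chi) with chi = 2 (h11 - h21)."""
    return (h11 + h21, 2 * (h11 - h21))


def point_cloud(hodge_pairs: Iterable[Tuple[int, int]],
                add_mirrors: bool = True) -> List[Point]:
    """Build the point cloud, optionally adding the mirror Hodge numbers.

    Different manifolds with the same Hodge numbers collapse to one point,
    which is the crude approximation discussed in the text.
    """
    points: Set[Point] = set()
    for h11, h21 in hodge_pairs:
        points.add(hodge_point(h11, h21))
        if add_mirrors:
            # The mirror manifold exchanges h11 and h21, so chi flips sign.
            points.add(hodge_point(h21, h11))
    return sorted(points)


# Illustrative input only: (h11, h21) for three sample Calabi-Yaus.
cloud = point_cloud([(1, 101), (2, 83), (19, 19)])
```

Adding the mirrors makes the cloud symmetric under $\chi \to -\chi$, which is the symmetry that later shows up as paired bars in the barcodes.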
The simplest construction of a Calabi-Yau manifold is as a hypersurface in a complex projective space. For example the quintic threefold can be seen as the vanishing locus of a homogeneous polynomial of degree five, $f_5(z_1, \dots, z_5) = 0$, where $[z_1, \dots, z_5]$ denote the homogeneous coordinates of $\mathbb{P}^4$. The Kähler form $\omega$ of the quintic descends from the Kähler form of $\mathbb{P}^4$, while the complex structure moduli correspond to the independent deformations of the defining equation $f_5(z_1, \dots, z_5) = 0$. More general constructions can be thought of as more elaborate versions of this simple example and many lists of Calabi-Yau families are available in the literature.
An example of such a list is the class of complete intersection Calabi-Yaus (CICYs), that is those which can be constructed via the complete intersection of polynomials in a product of projective spaces. Their classification was completed in [26,28] and their Hodge numbers computed in [29]. Such a list is available at [25] (to which we add the mirror Hodge numbers). Another list was obtained by Kreuzer and Skarke in [12], by classifying reflexive polyhedra in four dimensions. Reflexive polyhedra in four dimensions describe Calabi-Yau threefolds by realizing them as hypersurfaces in toric varieties [30], and the classification proceeds using the powerful combinatorial techniques of toric geometry.
Out of each one of these two lists we construct a point cloud and study its persistent homology using the lazy witness complex $\mathrm{LW}(\mathbb{X}, L, \epsilon)$. To do so we wrote a program in MATLAB, available at [23], using the library javaPlex. The results are shown in figure 1.
On the left we have the barcodes corresponding to the modules $H_\bullet(\mathrm{LW}(\mathbb{X}, L, \epsilon); \mathbb{Z}_2)$ for the Kreuzer-Skarke list of Calabi-Yaus corresponding to reflexive polyhedra, on the right those for the list of CICYs to which we have added the mirror Hodge numbers. Recall that bars corresponding to N-persistence modules in degree zero (we will informally call these "Betti number 0") are a measure of the number of connected components, at every length scale. Similarly a barcode in degree one is a signature of non-trivial 1-cycles. The collection of varieties under consideration does not have any barcode in higher degree. The appearance of 1-cycles in figure 1 on the left, measured by the barcode in degree one, means that there are zones where no Calabi-Yau is present in the list we are considering (see footnote 2). The empty regions in the distribution of Calabi-Yaus are detected at values of $\epsilon$ for which the points surrounding these areas are closer than $\epsilon$ (at this value we see the birth of a homology class) and disappear at values of $\epsilon$ greater than the characteristic length of the empty region, which is now filled up (and the homology class dies because it becomes a boundary).
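The birth and death of a 1-cycle around an empty region can be illustrated with a small self-contained Python sketch. It does not reproduce the javaPlex computation: it uses a plain Vietoris-Rips complex rather than the lazy witness complex, works at one fixed $\epsilon$ at a time instead of tracking persistence, and the names `rips_skeleton`, `rank_gf2` and `betti_numbers` are hypothetical. The test data, eight points on a circle, are chosen only because they surround a single void.

```python
import itertools
import math


def rips_skeleton(points, eps):
    """Edges and triangles of the Vietoris-Rips complex at scale eps."""
    n = len(points)
    edges = [(i, j) for i, j in itertools.combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    triangles = [t for t in itertools.combinations(range(n), 3)
                 if all(pair in edge_index
                        for pair in itertools.combinations(t, 2))]
    return edges, edge_index, triangles


def rank_gf2(rows):
    """Rank over Z_2 of a matrix whose rows are encoded as integer bitmasks."""
    rows = [r for r in rows if r]
    rank = 0
    while rows:
        pivot = rows.pop()
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
        rows = [r for r in rows if r]
    return rank


def betti_numbers(points, eps):
    """Return (b0, b1) with Z_2 coefficients at a fixed scale eps."""
    edges, edge_index, triangles = rips_skeleton(points, eps)
    # Boundary of an edge: its two endpoints.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    # Boundary of a triangle: its three edges.
    d2 = [sum(1 << edge_index[pair] for pair in itertools.combinations(t, 2))
          for t in triangles]
    r1, r2 = rank_gf2(d1), rank_gf2(d2)
    return len(points) - r1, len(edges) - r1 - r2


# Eight points surrounding an empty disk.
circle = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
          for k in range(8)]
```

For this toy cloud the 1-cycle is absent at $\epsilon = 0.5$ (no edges yet), present at $\epsilon = 0.8$ (neighbouring points are joined but the interior is empty), and gone again at $\epsilon = 2.1$, when triangles fill the disk and the cycle becomes a boundary.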
Most bars in the barcode in degree zero are relatively short-lived. This is a sign that the distribution of manifolds is generically rather dense, since at small values of $\epsilon$ nearby points become part of the same simplex, ceasing to be isolated connected components. For visible values of $\epsilon$ in figure 1 (left) the vast majority of connected components associated with individual points has already disappeared. This is not true for every bar, pointing to the existence of isolated points or clusters in the distribution. An interesting feature of figure 1 is that certain bars come in pairs: this is a consequence of mirror symmetry, or more down to earth the symmetry of the Hodge numbers. The mirror bars correspond to two areas of the distribution with exactly the same behavior. Of course this is not true for all bars: areas which are close to the symmetry axis in the Hodge numbers distribution will start to "interact" with each other as $\epsilon$ increases before showing any mirror structure, becoming a single connected component.
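The statement that dense regions produce short bars in degree zero, while isolated points or clusters produce long ones, can be checked on a toy example. The sketch below is an assumption-laden illustration: it computes the degree-zero barcode of a Vietoris-Rips filtration (not the lazy witness complex used for the figures) with a union-find structure, and the function name `degree_zero_barcode` is hypothetical.

```python
import itertools
import math


def degree_zero_barcode(points):
    """Degree-zero bars (birth, death) of the Vietoris-Rips filtration.

    Every point is born at scale 0. Processing edges by increasing length,
    each edge that joins two distinct components kills one bar at that length.
    One bar never dies: the final connected component.
    """
    parent = list(range(len(points)))

    def find(i):
        # Find the component representative, with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in itertools.combinations(range(len(points)), 2))
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(length)
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]


# Two well separated clusters: three points near the origin, two far away.
bars = degree_zero_barcode([(0, 0), (1, 0), (0, 1), (10, 0), (11, 0)])
```

Here three bars die at scale 1, when points inside each cluster merge, and a single long bar survives until scale 9, when the two clusters join; the infinite bar is the overall connected component.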
Of course all these features could equivalently be "seen" by plotting the Hodge number distribution. The purpose of this discussion is to turn them into a mathematically precise statement concerning its topological features. The formalism of N-persistence modules provides the necessary tools.
A similar discussion holds for the distribution of CICYs, in figure 1 (right). Note that all signs of non-trivial homological structure disappear at values $\epsilon \sim 30$ of the proximity parameter, much smaller than in the case of the KS list. Now the barcode in degree one shows even less structure, a sign that the CICYs are more evenly distributed and the formation of 1-cycles is accidental. This is an example of short-lived homology classes which can be regarded as "noise" and excluded from the persistence analysis (see footnote 3). From this perspective the set of CICYs is topologically simpler than the set of varieties in the KS list.
Now we focus on a particular zone, namely the "tip" of the distribution, that is the region with small $h^{1,1}(X) + h^{2,1}(X)$. This region was identified as special in [31], on the grounds that heterotic models with small Hodge numbers engineer supersymmetric extensions of the Standard Model with few generations of fermions. Such Calabi-Yaus appear to be rather rare and the corresponding area quite unpopulated. We wish to examine this area.
Footnote 2: Of course one should also take into account that the Hodge numbers are integers while the proximity parameter is a real variable. This will induce a few spurious 1-cycles, which are however too small to be noticed in figure 1.
Footnote 3: We have confirmed this conclusion by repeating the computation for different landmark selections.
We can try to give a topological rephrasing of the philosophy of [31], according to which Calabi-Yau varieties with small Hodge numbers are "special". In our language this would translate into the statement that the distribution of Calabi-Yaus with small values of the Hodge numbers has certain distinctive topological features. Indeed these specific 1-cycles continue to exist if we restrict our point cloud to Calabi-Yaus with $h^{1,1}(X) + h^{2,1}(X) < 22$, as shown in figure 2 on the right. The two odd features at Betti number one appear clearly, even if rather short-lived. 
The fact that they come in a pair is a consequence of mirror symmetry.
We have seen an instance of one of the main themes of this paper, that special vacua are associated with characteristic topological features. Clearly the analysis so far has been rather limited; however we think it illustrates the principle that topologically interesting structures at the level of persistent homology can be expected to correspond to physically interesting settings.

Landau-Ginzburg vacua
Now we turn to Landau-Ginzburg vacua. Such models describe two dimensional N = 2 superconformal field theories. They are governed by a number of chiral superfields $\Phi_i$, $i = 1, \dots, N$, interacting via a quasi-homogeneous superpotential $W(\Phi_i)$. We will assume that $W(\Phi_i)$ has isolated and non-degenerate critical points only. Ground states are related to elements of the chiral ring $R$, obtained by taking the quotient of the space spanned by the chiral superfields by the Jacobian ideal of the superpotential $W(\Phi_i)$,
$$R = \frac{\mathbb{C}[\Phi_1, \dots, \Phi_N]}{\langle \partial_j W(\Phi_i) \rangle}.$$
The degeneracies of chiral primaries are encoded into the Poincaré polynomial
$$P(t, \bar t) = \mathrm{tr}_R \, t^{J_0} \, \bar t^{\bar J_0} = \sum_{q, \bar q} p_{q \bar q} \, t^{q} \, \bar t^{\bar q}, \tag{3.2}$$
expressed in terms of the generators $J_0$ and $\bar J_0$ of the superconformal algebra. We will also consider only Landau-Ginzburg models with central charge c = 9. Certain orbifolds of N = 2 Landau-Ginzburg models describe conformal field theories at certain points of the Calabi-Yau moduli spaces. Such models can also be used in heterotic string compactifications, where they engineer an effective four dimensional theory with $E_6$ gauge symmetry and matter in the 27 representation. The fermionic spectrum is determined by the chiral ring $R$ through (3.2), namely $p_{11} = n_{27}$ is the number of fermions in the $27$ representation and $p_{12} = n_{\overline{27}}$ the number in the $\overline{27}$. If the Landau-Ginzburg model has a geometrical interpretation as strings propagating in a Calabi-Yau $X$, then we can identify the Hodge numbers $h^{1,1}(X) = p_{12}$ and $h^{2,1}(X) = p_{11}$. We will assume that the other non-trivial coefficients of (3.2) vanish, if necessary discarding the respective models from the lists of consistent vacua. We consider the collection of 10839 models constructed in [32] by the classification of non-degenerate quasi-homogeneous polynomials which can play the role of a superpotential of a c = 9 model. The massless spectra can be computed from (3.2) and result in 2997 different spectra (or pairs of Hodge numbers). 
This list was extended in [33] by classifying all the abelian symmetries of the above potentials which can be used to construct abelian orbifolds. This results in 3798 inequivalent spectra, which we will use for our analysis. An additional list of models can be obtained by considering discrete torsion as in [34], which results in 138 extra spectra. We have studied the persistent homology for the distributions of abelian orbifolds and discrete torsion models, taking as input the set of inequivalent spectra. We have labeled a Landau-Ginzburg spectrum by the vectors $x = (\chi = 2(p_{12} - p_{11}), \; p_{11} + p_{12})$, in analogy with the Calabi-Yau case, and used these to define our point cloud $\mathbb{X}$. We have then computed the associated homology groups using the lazy witness complex $\mathrm{LW}(\mathbb{X}, L, \epsilon)$ as a function of $\epsilon$.
The barcodes resulting from the homology computation for $H_i(\mathrm{LW}(\mathbb{X}, L, \epsilon); \mathbb{Z}_2)$ are shown in figure 3. The distribution of abelian orbifold vacua, on the left, appears to be the most complex from a topological viewpoint, also with respect to figures 1 and 2. This is most apparent in degree one, where several 1-cycles appear with no obvious regularity. In degree zero the figure shows the existence of many long-lived connected components, a signal that the vacua tend to cluster in certain regions. On the other hand the distribution of discrete torsion models is extremely simple, with no features in degree one.
From this perspective the two types of vacua appear topologically very different, despite arising from similar constructions. In this sense the topological analysis can discriminate between two physically similar situations.

Persistence in flux vacua
Now we proceed to apply our techniques to flux vacua of the type IIB string. These are realized by an orientifold of a Calabi-Yau compactification, with fluxes turned on along compact cycles and a collection of D-branes. We will only consider complex structure and axion-dilaton moduli, stabilized by the flux induced superpotential. We are not (yet) interested in fully stabilized and phenomenologically viable models; rather we wish to analyze simple examples to show how to use techniques of topological data analysis and which kind of results one can expect.

Flux compactifications
Before discussing the uses of persistent homology, we briefly review some basic elements of flux compactifications. We will consider IIB/F-theory vacua obtained by an orientifold of a Calabi-Yau compactification, as reviewed in [15][16][17]. In general one can use D3 branes, extended in the directions transverse to $X$, and D7 branes, wrapping holomorphic cycles, to obtain a quasi-realistic N = 1 model where moduli are stabilized by fluxes and quantum corrections. We will be mostly interested in the distributions of flux vacua, and ignore issues of phenomenological viability of the model or of the simplifying assumptions.
We take a Calabi-Yau $X$, with $h^{2,1}(X)$ complex structure deformations and $h^{1,1}(X)$ Kähler moduli. For simplicity we will fix the Kähler class, so that Calabi-Yau metrics are parametrized by complex structure deformations. The relevant moduli space parametrizing physical configurations is non-compact and has the form $\mathcal{M} = \mathcal{M}_c \times \mathcal{H}/SL(2; \mathbb{Z})$, where $\mathcal{M}_c$ is the complex structure moduli space and $\mathcal{H}$ is the upper half plane where the axion-dilaton $\tau$ takes values. The axion-dilaton $\tau = C_0 + i/g_s$ is a function of the Ramond-Ramond scalar $C_0$ and the string coupling constant $g_s$. Equivalently such a space can be interpreted in F-theory as the moduli space of Calabi-Yau metrics on $X \times T^2$ where the $T^2$ is elliptically fibered over $X$.
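Let $\{A^a, B_a\}$ with $a, b = 1, \dots, h^{2,1} + 1$ be a symplectic basis of $H_3(X, \mathbb{R})$ and $\{\alpha_a, \beta^a\}$ its Poincaré dual integral cohomology basis, so that
$$\int_{A^a} \alpha_b = \delta^a_b, \qquad \int_{B_a} \beta^b = -\delta_a^b, \qquad \int_X \alpha_a \wedge \beta^b = \delta_a^b. \tag{4.1}$$
A choice of the complex structure $z \in \mathcal{M}_c$ determines the Hodge decomposition
$$H^3(X, \mathbb{C}) = H^{3,0}_z(X) \oplus H^{2,1}_z(X) \oplus H^{1,2}_z(X) \oplus H^{0,3}_z(X).$$
On a Calabi-Yau $H^{3,0}_z(X)$ is one dimensional and the fibration $H^{3,0}_z(X) \longrightarrow \mathcal{M}_c$ defines the Hodge bundle. Take $\Omega_z \in H^{3,0}_z(X)$. Similar arguments hold for the axion-dilaton modulus, so the actual Hodge bundle of physical interest here is
$$H^{3,0}_z(X) \otimes H^{1,0}_\tau(T^2) \longrightarrow \mathcal{M},$$
where we parametrize elements of $H^{1,0}_\tau(T^2)$ as $\omega_\tau = dx + \tau \, dy$. In our basis the (3, 0) form $\Omega_z$ for $z \in \mathcal{M}_c$ can be represented as $\Omega_z = z^a \alpha_a - G_a \beta^a$ in terms of its periods
$$z^a = \int_{A^a} \Omega_z, \qquad G_a = \int_{B_a} \Omega_z,$$
where $z^a$ are local projective coordinates on the complex structure moduli space, and the functions $G_a(z)$ can be expressed as the derivatives of a single function, the prepotential $G(z)$, as $G_a(z) = \partial_a G(z)$. The period vector $\Pi(z) = (G_b(z), z^a)$ plays an important role in the determination of the Kähler potential and flux superpotential. The Hodge bundle has a natural hermitian metric, the Weil-Petersson metric, and the associated Kähler potential on $\mathcal{M}$ is
$$K(z, \tau) = -\log \left( i \int_X \Omega_z \wedge \bar\Omega_z \right) - \log \left( -i (\tau - \bar\tau) \right),$$
where
$$\int_X \Omega_z \wedge \bar\Omega_z = -\Pi^\dagger \, \Sigma \, \Pi, \qquad \Sigma = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix},$$
and $\Sigma$ is the symplectic matrix of rank $2h^{2,1} + 2$. Now we turn on fluxes in the RR and NS field strengths $F_3, H_3 \in H^3(X, \mathbb{Z})$. Since fluxes are quantized, they can be expressed in terms of the integer valued vectors $f$ and $h$ as
$$F_3 = -(2\pi)^2 \alpha' \left( f_a \, \alpha_a + f_{a + h^{2,1} + 1} \, \beta^a \right), \qquad H_3 = -(2\pi)^2 \alpha' \left( h_a \, \alpha_a + h_{a + h^{2,1} + 1} \, \beta^a \right), \tag{4.5}$$
with $a = 1, \dots, h^{2,1}(X) + 1$. We assemble these fluxes in the 4-form on $X \times T^2$ given by
$$G_4 = F_3 \wedge dy + H_3 \wedge dx.$$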
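As a consistency check on these conventions, the following short computation relates the integral $\int_X \Omega_z \wedge \bar\Omega_z$ entering the Kähler potential to the period vector. It is a sketch under stated assumptions: the intersection pairing is normalized as $\int_X \alpha_a \wedge \beta^b = \delta_a^b$, with $\alpha \wedge \alpha$ and $\beta \wedge \beta$ integrating to zero, and $\Sigma$ is taken in the block form $\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ acting on $\Pi = (G_a, z^a)$. Other sign conventions appear in the literature and change the result by an overall sign.

```latex
\begin{align*}
\int_X \Omega_z \wedge \bar\Omega_z
  &= \int_X \left( z^a \alpha_a - G_a \beta^a \right) \wedge
            \left( \bar z^b \alpha_b - \bar G_b \beta^b \right) \\
  &= - z^a \bar G_b \int_X \alpha_a \wedge \beta^b
     - G_a \bar z^b \int_X \beta^a \wedge \alpha_b \\
  &= \bar z^a G_a - z^a \bar G_a , \\[4pt]
\Pi^\dagger \, \Sigma \, \Pi
  &= \begin{pmatrix} \bar G_a & \bar z^a \end{pmatrix}
     \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}
     \begin{pmatrix} G_a \\ z^a \end{pmatrix}
   = \bar G_a z^a - \bar z^a G_a , \\[4pt]
\Longrightarrow \quad
i \int_X \Omega_z \wedge \bar\Omega_z &= - i \, \Pi^\dagger \, \Sigma \, \Pi .
\end{align*}
```

With these assumptions the complex structure part of the Kähler potential can therefore be written equivalently as $-\log(-i \, \Pi^\dagger \Sigma \, \Pi)$, which is the form that is convenient when the periods are known explicitly.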

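The presence of non-trivial fluxes generates a non-trivial superpotential [35] which stabilizes complex structure moduli. Such a superpotential is a section of the line bundle $\mathcal{L}$ dual to the Hodge bundle, given by the functional
$$W_G(z, \tau) = \int_{X \times T^2} G_4 \wedge \Omega_z \wedge \omega_\tau = \int_X \left( F_3 - \tau H_3 \right) \wedge \Omega_z = (2\pi)^2 \alpha' \, (f - \tau h) \cdot \Pi(z)$$
acting on sections $\Omega_z \wedge \omega_\tau$ of the Hodge bundle.
In local coordinates the F-term equations which follow from the superpotential $W_G(z, \tau)$ are
$$D_\tau W_G = \partial_\tau W_G + (\partial_\tau K) \, W_G = 0, \tag{4.9}$$
$$D_a W_G = \partial_a W_G + (\partial_a K) \, W_G = 0, \tag{4.10}$$
where $D$ is the connection associated with the Weil-Petersson metric on $\mathcal{M}$ and $a$ runs over the complex structure moduli. The critical point equations (4.9) and (4.10) define supersymmetric vacua. Turning on fluxes gives a contribution to the total D3 brane charge
$$N_{\rm flux} = \frac{1}{(2\pi)^4 \alpha'^2} \int_X F_3 \wedge H_3 = f^T \, \Sigma \, h,$$
which enters the tadpole cancellation condition
$$N_{\rm flux} + N_{D3} = L. \tag{4.12}$$
In our cases, the orientifold can be seen as arising from an F-theory compactification. This involves a four-fold $Z$ which is an elliptic fibration over $X/g$, with $g$ the orientifold involution. In this case $L = \chi(Z)/24$. In particular this sets a bound $N_{\rm flux} \le L$ on the total amount of flux available. Therefore to find supersymmetric critical points, one has to solve the equations (4.9) and (4.10) in a certain region of the moduli space, as the fluxes take values satisfying the tadpole constraint.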
Compactifications of this sort have been widely studied since they provide workable models where ideas about the statistical distributions of string vacua can be tested [13,14,36]. We will have nothing to say about this approach, but we mention a few points for completeness. A simple distribution of string vacua is the density
$$d\mu(z, \tau) = \sum_{N_{\rm flux} \le L} \delta \left( D_i W_G(z, \tau) \right) \left| \det{}_{i,j} \, D^2 W_G(z, \tau) \right|, \tag{4.13}$$
which is just a sum of delta function contributions, each one centered at a supersymmetric vacuum (here $i$ runs over axion-dilaton and complex moduli and we are again neglecting Kähler moduli). This distribution is in general pretty intractable and often it is simpler to deal with an approximate continuum distribution $\rho(z)$. The integral of such a distribution over a certain region $U$ of the moduli space corresponds to the number of supersymmetric vacua satisfying certain properties, which enter in the definition of $\rho(z)$. A slightly different approach is to define an index density $dI$ which differs from (4.13) only in that each delta function is weighted by the sign of the Jacobian $(-1)^F = \mathrm{sgn} \det_{i,j} D^2 W_G(z, \tau)$. One motivation for doing so is that such an index appears to be the proper generalization of the Morse index for dealing with vacua in supergravity. Indeed while in rigid supersymmetry the total number of vacua is often given by a topological formula, in supergravity vacua can be created or destroyed in pairs, precisely the situation of Morse theory. This analogy, discussed more in depth in [36], suggests that techniques based on persistent homology, which can be thought of as a version of Morse theory, could be usefully applied to the statistical distributions of vacua. In this paper we will focus on actual solutions of the F-term equations, and leave the analysis of statistical distributions to the future. An example of such an index density is [14]
$$I \sim \int_{\mathcal{M}} c_{\rm top} \left( T^* \mathcal{M} \otimes \mathcal{L} \right), \tag{4.14}$$
the top Chern class of the bundle where $D W_G(z, \tau)$ takes values. Physically such an integral gives an estimate of the number of supersymmetric vacua. One can refine such a formula by counting vacua with specific properties, say a certain value of the cosmological constant. We will not consider such counting problems in this paper. However it is interesting to note that the number of vacua from (4.14) is almost a topological quantity (it would be for compact $\mathcal{M}$). 
The point which is important for us is that a high degree of topological complexity of the moduli space of physical configurations $\mathcal{M}$, or of the dual Hodge bundle $\mathcal{L}$, corresponds to a large number of critical points for the superpotential $W_G(z, \tau)$ and therefore to a large number of supersymmetric vacua. In other words we expect a distribution of vacua to have a high degree of topological complexity. Persistent homology is precisely a tool which can measure the topological complexity of a distribution of points. We therefore expect to obtain interesting information by applying techniques of statistical topology to ensembles of string vacua. In the remainder of this section we will see examples of such applications in simple models.

Rigid Calabi-Yau
We begin with the simplest case, a rigid Calabi-Yau with no complex structure moduli, as studied in [14]. Having no complex structure moduli, $h^{2,1}$ vanishes and therefore $b_3 = 2 + 2h^{2,1} = 2$. This implies that the periods of the holomorphic three form $\Omega$ over the symplectic basis $\{A, B\}$ of (4.1) can be taken to be $\Pi = (1, i)$. Therefore the flux superpotential is
$$W_G(\tau) \propto (f - \tau h) \cdot \Pi = A \tau + B, \qquad A = -h_1 - i h_2, \qquad B = f_1 + i f_2,$$
and the F-term equation $D_\tau W_G = 0$ reduces to $\bar\tau = -B/A$. The moduli space is the upper half plane $\mathcal{H}$ modulo the action of $SL(2, \mathbb{Z})$. To solve this equation we use Mathematica to generate random vectors of integral fluxes obeying the tadpole constraint and only retain those values of $\tau$ which lie in a fundamental domain $F$ for $SL(2, \mathbb{Z})$ in $\mathcal{H}$. This restriction is to avoid overcounting of physically equivalent vacua which lie in the same $SL(2, \mathbb{Z})$ orbit. Then we use the values $x = (\mathrm{Re} \, \tau, \mathrm{Im} \, \tau)$ to construct a point cloud $\mathbb{X}$ and study its persistent homology via the lazy witness complex $\mathrm{LW}(\mathbb{X}, L, \epsilon)$. The corresponding barcodes are shown in figure 4. From the distribution of the barcodes we learn the following facts. The structure of the barcode with Betti number one implies the presence of relatively long-lived and short-lived regularly distributed 1-cycles in the point cloud. These correspond to the "voids" in the distribution of vacua, already noted in [14]: the distribution contains holes, at the center of which there is a big degeneracy of vacua. The presence of these holes implies that at certain values of the filtration parameter $\epsilon$, non-trivial 1-cycles will form in the filtered homology. These cycles have different sizes: some are smaller (short-lived bars) and others are larger (long-lived bars). The presence of several degree one bars which are born and die at the same value of $\epsilon$ implies that the corresponding cycles have roughly the same size. The long-lived bars correspond to bigger cycles, since it takes more time to cover them up.
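The sampling procedure described above can be sketched in Python. This is not the Mathematica code used for figure 4. It assumes the superpotential is normalized as $W = A\tau + B$ with $A = -h_1 - i h_2$ and $B = f_1 + i f_2$, so that the vacuum sits at $\bar\tau = -B/A$ with $\mathrm{Im}\,\tau = N_{\rm flux}/|A|^2$ and $N_{\rm flux} = f_1 h_2 - f_2 h_1$. The function names, the flux range and the tadpole bound in the usage line are all hypothetical choices for illustration.

```python
import random


def vacuum_tau(f1, f2, h1, h2):
    """Solve D_tau W = 0 for W = A tau + B, i.e. conj(tau) = -B / A."""
    a = complex(-h1, -h2)
    b = complex(f1, f2)
    return -(b.conjugate() / a.conjugate())


def reduce_to_fundamental_domain(tau, max_steps=10000):
    """Map tau (Im tau > 0) to |Re tau| <= 1/2, |tau| >= 1 by T and S moves."""
    for _ in range(max_steps):
        # T^n: shift the real part into [-1/2, 1/2].
        tau = complex(tau.real - round(tau.real), tau.imag)
        if abs(tau) >= 1.0 - 1e-12:
            return tau
        # S: inversion, which increases Im tau when |tau| < 1.
        tau = -1.0 / tau
    raise RuntimeError("reduction did not converge")


def sample_vacua(tadpole_bound, n_samples, flux_range, seed=0):
    """Random integral fluxes with 0 < N_flux <= L, mapped to points (Re, Im)."""
    rng = random.Random(seed)
    cloud = []
    for _ in range(n_samples):
        f1, f2, h1, h2 = (rng.randint(-flux_range, flux_range) for _ in range(4))
        n_flux = f1 * h2 - f2 * h1
        # N_flux > 0 guarantees Im tau > 0; N_flux <= L is the tadpole bound.
        if 0 < n_flux <= tadpole_bound:
            tau = reduce_to_fundamental_domain(vacuum_tau(f1, f2, h1, h2))
            cloud.append((tau.real, tau.imag))
    return cloud


# Hypothetical parameters, chosen only to produce a small sample.
cloud = sample_vacua(tadpole_bound=20, n_samples=2000, flux_range=6)
```

The resulting list of pairs $(\mathrm{Re}\,\tau, \mathrm{Im}\,\tau)$ is the kind of point cloud that would then be passed to a persistent homology library. Points on the boundary of the fundamental domain are not identified here, which is a simplification.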

The presence of a big degeneracy of vacua at the center of these holes shows up as a connected component in the degree zero barcode. Such bars are the very short-lived classes in figure 4. This identification follows from the fact that a number of these connected components disappear from the barcode plot roughly at the same time as the 1-cycles. This is an example of what we can learn not just by looking at the barcodes by themselves but also at the correlations between the barcodes in different degrees. Note that such a perspective is not common in topological data analysis, where the very short-lived bars in the barcode in degree zero would have been interpreted as noise. We must be very careful in applying such techniques to string theory vacua since, due to the action of the mapping class group on the moduli space $\mathcal{M}$, it is possible that interesting physical features are mapped to short-lived homology classes in the fundamental domain. This mechanism is important and worth stressing. The $SL(2, \mathbb{Z})$ action is a symmetry which relates physically equivalent configurations. To avoid overcounting one has to restrict the attention to a fundamental domain $F$, which contains precisely a single representative for each $SL(2, \mathbb{Z})$ orbit. However nothing guarantees that topologically interesting features which are prominent in one fundamental domain will remain so in another. One way in which we can avoid misidentifying these features is to cross correlate the barcode plots at different degrees, at least in our example. Most importantly we have now learned how to recognize such features.
On the other hand the formation of long-lived bars in degree zero, which implies the existence of more connected components at larger values of $\epsilon$, is a consequence of the data set being limited. This is a sign that the flux vacua are not uniformly distributed but tend to accumulate at small values of $\mathrm{Im} \, \tau$. Since the data set is limited, points at higher values of $\mathrm{Im} \, \tau$ will be more rare when the sample of vacua is generated at random, and therefore will appear as long-lived connected components (uncorrelated with any structure at Betti number one).
All of these conclusions could easily be drawn just by looking directly at the plot of flux vacua on the upper half plane, see for example [36, figure 6]. We have discussed this dataset in detail as an example of how these features would show up from the perspective of the barcodes. Now we turn to higher dimensional point clouds for which a simple plot is not available.

A Calabi-Yau hypersurface example
Now we turn to a more sophisticated example, studied in [38][39][40], where the moduli space $\mathcal{M}_c$ parametrizing complex structure deformations is one dimensional. We will consider the Calabi-Yau $X_8(1^4, 4)$, defined as the hypersurface in the weighted projective space $\mathbb{P}^4(1^4, 4)$ (where the notation $w^m$ denotes the $m$ times repetition of the weight $w$). The relevant Hodge numbers are $h^{1,1}(X) = 1$ and $h^{2,1}(X) = 149$. We restrict to the one-parameter family
$$\sum_{i=1}^{4} x_i^8 + 4 x_0^2 - 8 \psi \, x_0 x_1 x_2 x_3 x_4 = 0,$$
which is invariant under a discrete group $\Gamma$; its parameter $\psi$ is the complex structure modulus of the mirror $\tilde X$, which according to the Greene-Plesser orbifold construction [37] is a crepant resolution of $X/\Gamma$. The complex structure moduli space of $\tilde X$ can be identified with $\mathbb{P}^1 \setminus \{0, 1, \infty\}$, where the three special points correspond to a Landau-Ginzburg point, a conifold point and the large complex structure point respectively.
In general complex structure moduli of $X_8(1^4, 4)$ are not invariant under $\Gamma$. If we restrict our attention to the periods which are invariant under the $\Gamma$ action, the corresponding Picard-Fuchs equations simplify greatly. One is left with a reduced period vector $\Pi = (G_1, G_2, z^1, z^2)$ which as a first approximation is not a function of the remaining moduli: those can only appear as higher order monomials which are invariant under the $\Gamma$ action. This period vector coincides with the period vector of $\tilde X$ [38].
We are interested in an orientifold of this model which breaks supersymmetry down to N = 1. The orientifold in question acts as $x_0 \longrightarrow -x_0$ and $\psi \longrightarrow -\psi$, as well as worldsheet parity reversal. We turn on flux vectors $h$ and $f$ taking values in $H^3(X_8(1^4, 4), \mathbb{Z})$ and compatible with the $\Gamma$ action, along the 3-cycles associated with the period vector $\Pi$. The tadpole cancellation condition (4.12) is now
$$N_{\rm flux} = f^T \, \Sigma \, h \le L, \tag{4.18}$$
where $f$ and $h$ are the four-component flux vectors dual to the reduced period vector. Complex structure moduli which transform non-trivially under $\Gamma$ can appear in the superpotential only at higher order as $\Gamma$-invariant monomials and will be stabilized at zero, the fixed point of $\Gamma$ [38]. They will be neglected in the following. To summarize, the only relevant moduli for the model at hand are the complex structure modulus $\psi$ and the axion-dilaton $\tau$.
To be more concrete, we will also consider a particular region of the moduli space, near the conifold point $\psi_{\rm con} = 1$. It will be useful to introduce an auxiliary variable $x = 1 - \psi$ which measures the distance from the conifold locus. In this region the period vector can be explicitly computed as a function of $x$ and we shall use the result of [39,40]. The explicit solutions of equations (4.9) and (4.10), referred to below as (4.19) and (4.20), are expressed in terms of a set of known constants, which the reader can find in [39, section 5]. We define a point cloud $\mathbb{X}$ where each point represents a vacuum, a solution given by (4.19) and (4.20) and parametrized by vectors of the form $x = (\mathrm{Re} \, x, \mathrm{Im} \, x, \mathrm{Re} \, \tau, \mathrm{Im} \, \tau)$. Note that contrary to the previous case, we cannot plot the whole point cloud $\mathbb{X}$, but only its projection onto special planes. The persistent homology of $\mathbb{X}$ provides a new way to look at the full set of physical values of the moduli without any projection. Before analyzing the barcodes, we have however to deal with the $SL(2, \mathbb{Z})$ symmetry acting on the axion-dilaton and on the flux vectors. Its action is
$$\tau \longrightarrow \frac{a \tau + b}{c \tau + d}, \qquad \begin{pmatrix} f \\ h \end{pmatrix} \longrightarrow \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} f \\ h \end{pmatrix}, \qquad ad - bc = 1,$$
and as in the rigid case we use it to restrict $\tau$ to a fundamental domain. To fix the conifold monodromy we only consider points such that $\arg x \in [-\pi, \pi)$, and because of the validity range of the approximated periods, we only keep points with $|x| < 1$.
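Why the $SL(2, \mathbb{Z})$ action maps vacua to physically equivalent vacua can be made explicit with a short computation. The sketch below assumes two conventions that are standard in the literature but should be treated as assumptions here: the flux vectors transform as a doublet, $(f, h) \to (af + bh, \, cf + dh)$ with $ad - bc = 1$, and the superpotential is proportional to $(f - \tau h) \cdot \Pi$.

```latex
\begin{align*}
f' - \tau' h'
  &= (a f + b h) - \frac{a \tau + b}{c \tau + d} \, (c f + d h) \\
  &= \frac{(a f + b h)(c \tau + d) - (a \tau + b)(c f + d h)}{c \tau + d} \\
  &= \frac{(ad - bc) \, f - (ad - bc) \, \tau h}{c \tau + d}
   = \frac{f - \tau h}{c \tau + d} , \\[4pt]
f'^{\,T} \, \Sigma \, h'
  &= (a f + b h)^T \, \Sigma \, (c f + d h)
   = ad \, f^T \Sigma \, h + bc \, h^T \Sigma \, f
   = (ad - bc) \, f^T \Sigma \, h
   = f^T \Sigma \, h .
\end{align*}
```

Under these assumptions the superpotential only rescales by the nowhere vanishing factor $(c\tau + d)^{-1}$, as expected for a section of a line bundle, and the flux contribution to the tadpole is invariant (the second line uses the antisymmetry of $\Sigma$, so that $f^T \Sigma f = h^T \Sigma h = 0$). A flux configuration and its transform therefore describe the same vacuum, which is why only one representative per orbit should enter the point cloud.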
We have computed the homology of the point cloud using the lazy witness complex $\mathrm{LW}(\mathbb{X}, L, \epsilon)$ and the barcodes are shown in figure 5. The only non-trivial barcode is for Betti number zero, higher persistent homology groups showing scant or no structure. In this case the dataset does not present any appreciable "voids". This is due to the tendency of flux vacua to cluster around the conifold singularity. Indeed this clustering shows up in figure 5 as the fact that generic bars are very short-lived. Figure 5 also shows three bars which are rather long-lived with respect to the others (a fourth bar corresponds to the overall connected component at large $\epsilon$). Isn't this a contradiction with the clustering of vacua? The answer is no, and the reason is that we must also be careful in understanding the features of the approximation scheme we are using for our computation. Indeed in the lazy witness scheme we have chosen the landmark selector with the min-max algorithm, which selects points spread apart as much as possible. Therefore points which are outside the clustering region will be privileged with respect to points inside, and this is the reason we see these extra bars. These bars correspond to points outside the clustering region, and since this is relatively empty the bars are longer than the average.
To clarify this point further, let us repeat the persistent homology computation, this time using a random selector to choose the landmark set $L$. Since the selector is random, from the clustering hypothesis we expect that almost all the landmark points will be in the clustering region, and therefore the barcode plot will show almost no structure. Indeed this is what we see in figure 6; in particular we note the different range of the proximity parameter $\epsilon$ with respect to figure 5.
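The difference between the two landmark selectors can be seen on a toy dataset. The following Python sketch is not the javaPlex implementation: `maxmin_landmarks` is a hypothetical name for the usual sequential max-min rule (each new landmark is the point farthest from the landmarks chosen so far), `random_landmarks` draws indices uniformly, and the data are an artificial tight cluster plus three outliers.

```python
import math
import random


def maxmin_landmarks(points, k, first=0):
    """Sequential max-min selection: repeatedly add the point that is
    farthest from the current landmark set. Assumes k <= len(points)."""
    landmarks = [first]
    # Distance of every point to its nearest landmark so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(landmarks) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        landmarks.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
    return landmarks


def random_landmarks(points, k, seed=0):
    """Uniformly random selection of k distinct landmark indices."""
    return random.Random(seed).sample(range(len(points)), k)


# Artificial data: 50 clustered points (indices 0..49) and 3 outliers (50..52).
cluster = [(0.01 * i, 0.0) for i in range(50)]
outliers = [(10.0, 0.0), (0.0, 10.0), (-10.0, -10.0)]
points = cluster + outliers

spread = maxmin_landmarks(points, 4)
sampled = random_landmarks(points, 4)
```

Starting from a point inside the cluster, the max-min rule picks all three outliers before any second cluster point, so isolated points are over-represented among the landmarks and produce long bars. A uniform random draw instead selects each point with equal probability, so most landmarks are expected to fall in the densely populated region.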
This example shows which kind of information we can get from the persistent homology of a point cloud of vacua, but also how different approximation schemes have to be handled with care.

A first look at heterotic models
Finally we conclude with a rather preliminary glance at a collection of phenomenologically interesting heterotic vacua. We will consider compactifications of the E 8 × E 8 heterotic string over a Calabi-Yau X. The heterotic string has very appealing features in the construction of phenomenological models of physics beyond the standard model [41].
Let us quickly review $E_8 \times E_8$ heterotic N = 1 vacua. The data required for the compactification on $X$ are two holomorphic vector bundles $V$ and $\tilde V$ (or more generically coherent sheaves) with structure group a subgroup of $E_8$ and a number of NS5 branes wrapping a holomorphic curve $C$ in $X$. The bundle $V$ is then used to construct a standard model-like effective action while $\tilde V$ corresponds to the hidden sector. The low energy effective action is N = 1 supergravity coupled to a number of gauge and matter multiplets. Consistency with the low energy Bianchi identity requires a cohomological condition relating $V$, $\tilde V$, $C$ and $X$. Typically one considers $\tilde V$ to be trivial and the number of NS5 branes to be an adjustable parameter so that the only non-trivial condition involves the visible bundle $V$ and $X$. Solving this condition determines a viable structure group $H$ for $V$. The low energy GUT theory is then based on a group $G$, the commutant of $H$ in $E_8$.
The simplest choice is the standard embedding, which identifies $H$ with the SU(3) holonomy group of $X$ giving an $E_6$ GUT. More general non-standard embeddings are possible as long as $V$ obeys a stability condition. To every holomorphic bundle or sheaf $V$ we associate its slope as the ratio between its degree and its rank, $\mu(V) = \deg V / \mathrm{rank} \, V$. Then a bundle $V$ is $\mu$-stable if for any sub-bundle $J \subset V$ with $0 < \mathrm{rank} \, J < \mathrm{rank} \, V$ we have $\mu(J) < \mu(V)$. A poly-stable bundle is a direct sum of stable bundles, all of them with the same $\mu$. A poly-stable holomorphic bundle $V$ corresponds to a solution of the Donaldson-Uhlenbeck-Yau equations and therefore preserves supersymmetry.
The compactifications that we will consider rely on the choice of a poly-stable vector bundle $V$ on $X$. The choice of this bundle breaks the $E_8$ gauge symmetry of the visible sector down to a GUT group $G$, for example SU(5) or SO(10), the commutant of the structure group $H$ of $V$ in $E_8$. A similar setup holds for the $E_8$ hidden sector. By turning on appropriate Wilson lines it is possible to further break $G$ to the standard model group plus a number of U(1) factors. To have non-trivial Wilson lines $X$ has to be non-simply connected, which is in general not the case for Calabi-Yaus constructed from complete intersections in weighted projective spaces. One can easily remedy this problem by considering a discrete group $\Gamma$ freely acting on $X$ and then working equivariantly with respect to $\Gamma$. In practice this is accomplished by working with the Calabi-Yau $\tilde X = X/\Gamma$, for which $\pi_1(\tilde X) = \Gamma$. Each element $\gamma \in \Gamma$ corresponds to a 1-cycle. Wilson line operators $W_\gamma$ can be defined along these cycles in terms of a flat rank one bundle on $\tilde X$. The choice of a bundle $V$ equivariant with respect to the $\Gamma$-action descends to a bundle $\tilde V$ on $\tilde X$. The physical spectrum and the relevant low energy quantities can then be computed from the data of $X$ and $V$. Physically the Wilson line operators break the GUT group $G$ to the subgroup commuting with all the Wilson lines. The low energy effective field theory on $\tilde X$ is based on said group and can be chosen to be a phenomenologically realistic model.
In an impressive series of works [19][20][21], a large class of realistic models was constructed along these lines, containing the standard model gauge group, the matter content of the MSSM and no exotics. With an extensive use of computational algebraic geometry many low energy properties of such models were derived explicitly and are available in the database [42].
If we choose V to have a rank five special unitary structure group, so that the first Chern class $c_1(V)$ vanishes, the GUT group will be SU(5), up to abelian factors. The latter are typically Green-Schwarz anomalous and decouple at high energies. The requirement of low energy supersymmetry implies that V has to be a µ-polystable bundle, a direct sum of µ-stable bundles all with the same slope. A clever choice is a sum of line bundles with $\mu(L_a) = 0$ for all a. It is quite remarkable that such a simple choice still allows for physically realistic models, and indeed a scan over such possibilities (and other closely related ones) done in [19][20][21] revealed a large class of viable models. We want to have a preliminary look at this database from the perspective of persistent homology. In particular we will only consider varieties X which are CICYs with V of the form (5.1), despite other possibilities being allowed. We want to construct a point cloud X out of this dataset, and we will do so in the simplest possible way, by collecting vectors of the form $x = \left( h^{1,1}(X), h^{2,1}(X), c_2(L_1), \ldots, c_2(L_5) \right)$. This simple choice certainly does not do proper justice to the database built in [19][20][21], which contains extensive geometrical and physical information on the models. In particular such a parametrization is very coarse, since a point in X represents many physically distinct models. This problem can be resolved by adding more and more parameters in the construction of the point cloud X, but we will leave such a detailed analysis for the future.
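The construction of the point cloud can be sketched as follows. The record layout (a dictionary with keys `h11`, `h21` and `c2`) is an assumption made for illustration and does not reflect the actual format of the database [42]; the numerical entries are made up. Distinct models that share the same seven entries collapse to a single point, which is the coarseness discussed above.

```python
def build_point_cloud(models):
    """Collect the distinct vectors (h11, h21, c2(L_1), ..., c2(L_5)).

    `models` is a list of dictionaries in a hypothetical layout:
    {"h11": int, "h21": int, "c2": [five numbers]}.
    """
    points = set()
    for m in models:
        c2 = tuple(m["c2"])
        if len(c2) != 5:
            raise ValueError("expected one entry per line bundle, five in total")
        points.add((m["h11"], m["h21"]) + c2)
    # Sorting only makes the output order reproducible.
    return sorted(points)


# Made-up records, for illustration only (not taken from any database).
toy_models = [
    {"h11": 2, "h21": 52, "c2": [0, -1, 2, 0, 1]},
    {"h11": 2, "h21": 52, "c2": [0, -1, 2, 0, 1]},  # same point as above
    {"h11": 3, "h21": 39, "c2": [1, 1, 0, -2, 0]},
]
cloud = build_point_cloud(toy_models)
print(len(cloud), "distinct points in R^7")
```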
On the other hand such a point cloud X is embedded in $\mathbb{R}^7$: it cannot be plotted in any simple way and there is no readily available tool to gauge the properties of this distribution of vacua. We propose that persistent homology provides such a tool. We have considered around a hundred models with distinct values of $x \in X$ and computed the N-persistence modules given by the homology of the filtered Vietoris-Rips complex $\mathrm{VR}_\epsilon(X)$. We show the barcodes in figure 7. The barcodes show plenty of structure, even in degree four, although the bars there are very short-lived. Figure 7 is a concrete, qualitative example of one of the main ideas of this note: that distributions of physically interesting vacua show a certain degree of topological complexity. In the distribution at hand we see the appearance of many n-cycles. This could imply for example that there exist subtle correlations between the vacua, in the form of algebraic equations which describe the n-cycle. Of course to establish this decisively and quantitatively one has to go beyond the topological analysis presented here.
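To make the notion of a barcode concrete, the following minimal sketch computes only the degree-zero bars of the Vietoris-Rips filtration of a finite point cloud. It is not the computation behind figure 7: higher degrees require a dedicated persistent homology package and are not attempted here. The sketch assumes the Euclidean distance and the convention that an edge enters the complex at a scale equal to its length. In degree zero every point is born at scale 0, and a bar dies when its connected component merges with another one, which can be tracked with a union-find structure.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Degree-zero barcode of the Vietoris-Rips filtration of `points`.

    Returns a list of (birth, death) pairs. All births are 0; the deaths
    are the edge lengths at which two components merge. One bar, for the
    component that survives, has infinite length.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Root of the component containing i, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by the scale at which they appear.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components: one bar dies
            parent[ri] = rj
            bars.append((0.0, length))
    if n > 0:
        bars.append((0.0, math.inf))
    return bars


print(h0_barcode([(0, 0), (1, 0), (5, 0)]))
```

Long finite bars in degree zero signal well separated clusters, while many short bars indicate points that merge almost immediately.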
On the other hand the topological analysis is very efficient in comparing different distributions of vacua. As we have explained, our working assumption is that distributions of physically realistic models are not random but have topological features. In the light of this idea, one could start comparing different distributions of vacua and look for the ones which exhibit more structure. Distributions with higher complexity are, in a certain sense, singled out.
Let us try to explore this possibility in more detail and directly compare two distributions of vacua by fixing certain parameters. Out of the database of [42] we can form more refined point clouds with more information. For example, one readily available piece of information is the number $n_5$ of vector-like pairs $5$ and $\bar{5}$ at the GUT level, or the number $n_H$ of physical Higgs doublets in the low energy spectrum. We take these two parameters as inputs together with X to create an augmented point cloud Y. Each point $y \in Y$ contains the same geometrical information as above plus this extra information on the spectrum and has the form $y = \left( h^{1,1}(X), h^{2,1}(X), c_2(L_1), \ldots, c_2(L_5), n_5, n_H \right)$. However this time, instead of just studying the topological features of Y, we select two subsets of data $Y_1$ and $Y_3$ as follows. All the points in both subsets have a single Higgs doublet, $n_H = 1$, but they have $n_5 = 1$ and $n_5 = 3$ respectively. In other words the datasets $Y_1$ and $Y_3$ contain the geometrical parameters of a Calabi-Yau and a holomorphic bundle which give rise to different spectra with fixed characteristics. We want to compare their distributions from the perspective of their persistent homology. Again we study the filtered Vietoris-Rips complexes $\mathrm{VR}_\epsilon(Y_1)$ and $\mathrm{VR}_\epsilon(Y_3)$ and collect the information about their N-persistence modules in the barcodes shown in figures 8 and 9.
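The selection of the two subsets amounts to a simple filter on the last two components of each point. The sketch below assumes that every point is stored as a 9-tuple with $n_5$ and $n_H$ in the last two positions; the function name and the sample entries are hypothetical and made up for illustration.

```python
def select_subset(points, n5, nH=1):
    """Return the points whose last two components equal (n5, nH)."""
    return [y for y in points if y[-2] == n5 and y[-1] == nH]


# Made-up augmented points (h11, h21, c2(L_1..L_5), n5, nH).
toy_Y = [
    (2, 52, 0, -1, 2, 0, 1, 1, 1),
    (3, 39, 1, 1, 0, -2, 0, 3, 1),
    (3, 39, 1, 0, 0, -1, 0, 1, 2),  # two Higgs doublets: excluded from both
    (4, 36, 0, 0, 1, -1, 0, 1, 1),
]
Y1 = select_subset(toy_Y, n5=1)
Y3 = select_subset(toy_Y, n5=3)
print(len(Y1), len(Y3))
```

Since $n_5$ and $n_H$ are constant within each subset, they contribute nothing to the pairwise distances, so the persistent homology of each subset depends only on the seven geometrical entries.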
Although comparisons have to be made carefully since the two datasets have different numbers of entries, one feature is immediately clear: among the models in the database of [42] with a single Higgs doublet, the distribution of vacua with $n_5 = 1$ is topologically much more complex than that of vacua with $n_5 = 3$. The latter indeed shows clear regularities in the barcodes. This is clear by comparing figures 8 and 9, but is also evident from the number of simplices generated during the persistent homology computation, which is greater for $Y_1$ by a factor of 50.
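The role of the simplex count as a rough measure of complexity can be illustrated with a brute-force sketch. It counts the simplices of a Vietoris-Rips complex at one fixed scale, which is only a proxy for the number of simplices generated over a whole filtration. It assumes the Euclidean distance and the convention that k+1 points span a k-simplex when all their pairwise distances are at most the scale parameter. The enumeration is exponential in the dimension and meant for small point clouds only.

```python
import math
from itertools import combinations


def count_simplices(points, eps, max_dim=2):
    """Number of k-simplices, k = 0..max_dim, of the Vietoris-Rips
    complex of `points` at scale `eps` (pairwise distance <= eps)."""
    n = len(points)
    close = [
        [math.dist(points[i], points[j]) <= eps for j in range(n)]
        for i in range(n)
    ]
    counts = []
    for k in range(max_dim + 1):
        # A set of k+1 vertices is a simplex iff it is a clique.
        counts.append(
            sum(
                1
                for s in combinations(range(n), k + 1)
                if all(close[i][j] for i, j in combinations(s, 2))
            )
        )
    return counts


square = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(count_simplices(square, eps=1.0))  # sides only, no triangles
print(count_simplices(square, eps=1.5))  # diagonals included
```

When two datasets of different size are compared, the counts should be read together with the number of points, since the number of possible simplices grows rapidly with it.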

If we assume that topologically rich structures correspond to physically interesting features, one would conclude that compactifications with $n_5 = 1$ are preferred over compactifications with $n_5 = 3$. Again, this lends (modest) support to the idea that string vacua can be characterized by the topological properties of their distributions. In particular it opens up the possibility that vacua with realistic features can be singled out by the complexity of their persistent homology. In other words we are led to propose a qualitative vacuum selection principle based on the topological features of the distributions of vacua: topologically complex distributions of vacua are preferred over topologically simple ones. We believe this idea is worth further investigation.

Conclusions
In this paper we have explored the possibility that distributions of string vacua can be characterized by certain topological properties. A readily available tool to capture the topological features of a distribution of points is topological data analysis. In this framework persistent homology extracts topological invariants at every length scale. To a collection of vacua we associate persistence modules in various degrees; the latter in turn are completely characterized by their barcodes. Barcodes are a graphical representation of the lifespan of the persistent homology classes of the distribution of vacua, as the length scale at which we look at the distribution is varied. They correspond to homological signatures which characterize the point cloud by the appearance of patterns within the dataset. In this framework patterns emerge as the length scale is varied.
To exemplify these ideas we have studied three classes of vacua. In section 3 we have studied vacua with N = 2 supersymmetry, arising from compactifications of the type II string on a Calabi-Yau, or from Landau-Ginzburg models. By appropriately labeling families of such vacua we have constructed the corresponding point clouds and computed their persistent homologies. We have looked for topological structures in the distribution of certain Calabi-Yau manifolds, or superconformal fixed points. Both of these are interesting mathematical problems on their own, regardless of the physical applications. During the process we have learned how to use topological data analysis to reproduce the known features of such distributions. We have also established that certain distributions of vacua, such as Calabi-Yau manifolds with small Hodge numbers, have non-trivial topological features. Note that, regardless of the interpretation as string vacua, the study of Landau-Ginzburg distributions could also be seen as a very preliminary attempt at the study of the topology of the space of two dimensional superconformal field theories. It would be very interesting to understand if persistent homology could be a useful tool for this direction of investigation.
In section 4 we have applied our formalism to certain classes of flux vacua in type IIB/F-theory compactifications. The models we have discussed are typically presented as examples of the statistical approach to the landscape of flux vacua, where one is interested in counting the number of vacua with certain properties. In this framework persistent homology has the advantage of allowing a simple visualization of the basic properties of a distribution of vacua, even when such a distribution has high dimension. We have discussed how the features of such distributions can be extracted from their barcodes in two simple cases. We have given concrete examples of how the techniques of topological data analysis can be used to determine properties of the distributions of flux vacua, not just by counting their numbers but also by using the invariants of persistent topology.
In this note we have obtained distributions of flux vacua by solving directly the equations for supersymmetric vacua. Of course another possibility would be to use directly a continuum approximation to the index density of vacua as in [13,14,18,36]. One can easily construct a point cloud by discretizing an index density such as (4.14) and study its properties using statistical topology. Indeed we expect that this should be a rich area of investigation.
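One way to discretize a density into a point cloud is rejection sampling, sketched below for a two dimensional domain. This is only an illustration of the general idea: the density passed to the function is a placeholder supplied by the user, not the index density (4.14), and the function name, the rectangular bounding box and the bound `density_max` are assumptions of the sketch. The density must be non-negative, bounded by `density_max` on the box, and not identically zero, otherwise the loop does not terminate.

```python
import random


def sample_point_cloud(density, bounds, density_max, n_points, seed=0):
    """Draw `n_points` points in a rectangle with probability
    proportional to `density(x, y)`, by rejection sampling.

    bounds      : ((x_min, x_max), (y_min, y_max))
    density_max : an upper bound for the density on the rectangle
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    (x0, x1), (y0, y1) = bounds
    points = []
    while len(points) < n_points:
        x = rng.uniform(x0, x1)
        y = rng.uniform(y0, y1)
        # Accept the candidate with probability density / density_max.
        if rng.random() * density_max <= density(x, y):
            points.append((x, y))
    return points


# Placeholder density, chosen only to exercise the code.
def toy_density(x, y):
    return 1.0 / (1.0 + x * x + y * y)


cloud = sample_point_cloud(toy_density, ((-2.0, 2.0), (-2.0, 2.0)), 1.0, 200)
print(len(cloud))
```

The resulting list of points can then be fed to any persistent homology computation in the same way as a point cloud obtained by solving the vacuum equations directly.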
Finally in section 5 we have considered quasi-realistic compactifications of the heterotic string. We have considered a class of vacua labeled by a Calabi-Yau and the topological data of a certain holomorphic bundle. A striking aspect of such a class of vacua is that it exhibits non-trivial topological features, in the form of higher dimensional, if short-lived, n-cycles. The presence of such features suggests that distributions of phenomenologically viable vacua might be characterized by a high degree of topological complexity, as seen from their persistent homology modules. In other words it leads to the possibility that physically interesting features are associated with topologically interesting structures at the level of persistent homology. This possibility can be made rather concrete by comparing the persistent topological features of different classes of vacua. We have given a specific example by comparing two different sets of vacua which differ in the number of $5$-$\bar{5}$ pairs in the GUT spectrum. The class of vacua with $n_5 = 1$ is topologically much richer and therefore from our perspective should be preferred.
Let us summarize the main ideas we have encountered in this note. First of all we have shown how techniques of topological data analysis can be applied to the problem of characterizing distributions of string vacua. Persistent homology can be used to extract qualitative information from the set of string vacua. The usefulness of such information depends on the questions one is asking. Certain features of string vacua can be understood from the more persistent homology classes in the barcodes; others cannot. Hopefully such techniques could be efficiently applied in conjunction with the usual tools from statistics and algebraic geometry typically used in studying distributions of vacua. We also hope to call the attention of the computational topology community to the extremely rich and diversified problem posed by the landscape of string vacua.
A particularly interesting point in this respect is that, also due to the approximation schemes available, studying the persistent topology of a distribution of vacua is considerably easier than studying its geometrical properties. In other words, to sample the relevant features one typically needs only a smaller number of vacua and the whole process is computationally less involved. On the other hand the usual ideas used in the study of persistent homology need to be partly revisited to be applied in this context: we have seen an example of very short-lived bars which reveal physically interesting features only when regarded in correlation with barcodes in other degrees. To what extent topological data analysis is physically relevant is at the moment not clear; certainly more work and ideas are needed to understand how much information and how many new insights we can gain from it. One could hope to learn something by analyzing much bigger datasets, or by systematically studying families of M- or F-theory compactifications.

On a more speculative level the results of this note have led us to the idea that sets of vacua with a higher degree of topological complexity are singled out with respect to topologically simpler ones. It is tempting to suggest that the presence of topologically interesting features within a distribution of vacua is associated with physically interesting features. As we have already remarked, the presence of higher dimensional cycles can imply subtle correlations between the vacua distributed along the cycle, although a more refined case by case analysis is needed to establish this concretely. These correlations can assume the form of a set of equations which force the vacua to lie over a certain n-cycle. If they exist, such correlations would distinguish a family of vacua from others. Even if at this stage we have no general understanding of the physical significance of such correlations (except when they arise from a superpotential, where in principle it should be possible to derive them analytically), persistent homology provides a computational framework to visualize them. This leads us to the possibility of formulating a qualitative topological vacuum selection principle: that topologically complex distributions of vacua are physically preferred over topologically simple ones. Here the concept of topological complexity is concretely provided by persistent homology. Clearly much more work is required to turn this into a quantitative statement.