Black Holes and the loss landscape in machine learning

Understanding the loss landscape is an important problem in machine learning. One key feature of the loss function, common to many neural network architectures, is the presence of exponentially many low lying local minima. Physical systems with similar energy landscapes may provide useful insights. In this work, we point out that black holes naturally give rise to such landscapes, owing to the existence of black hole entropy. For definiteness, we consider 1/8 BPS black holes in $\mathcal{N} = 8$ string theory. These provide an infinite family of potential landscapes arising in the microscopic descriptions of corresponding black holes. The counting of minima amounts to black hole microstate counting. Moreover, the exact numbers of the minima for these landscapes are a priori known from dualities in string theory. Some of the minima are connected by paths of low loss values, resembling mode connectivity. We estimate the number of runs needed to find all the solutions. Initial explorations suggest that Stochastic Gradient Descent can find a significant fraction of the minima.

Most machine learning tasks amount to choosing an optimal Neural Network (NN). This is usually done by first defining a loss function, which depends on the parameters of the NN (numbering in the millions in realistic applications), and then finding a suitable minimum of the loss function, often by Stochastic Gradient Descent (SGD) [57]. Thus gaining an understanding of the loss function landscape is crucial for machine learning.
The loss function, also referred to as the cost function, is in general a non-convex function. Finding a global minimum of a general non-convex function is hard [58,59,60], in fact NP-hard [61], [62]. Nevertheless, in practice optimizing a Deep Neural Network (DNN) is usually possible. This has led many to believe that, after all, for DNN-s the loss function does not have spurious local minima [63,64,65]. This is indeed the case for shallow linear NN-s, which have only saddles and no local minima [66]. For deep linear networks, all local minima of the squared loss function are global, with other critical points being saddles [67]. It has been further argued that the same holds true for non-linear networks, under certain assumptions [67]. For a fully connected network with squared loss and analytic activation function, all minima have been argued to be global [68]. Similar results were previously shown in [69,70,71].
Although optimization of deep linear models exhibits some similarities with optimization of deep non-linear models [72], the "local implies global" picture seems not to carry over to non-linear DNN-s. E.g. spurious minima have been shown to exist for shallow Rectified Linear Unit (ReLU) networks [73]. It has been suggested that the absence of spurious local minima might be a feature specific to deep linear networks and might be destroyed as soon as non-linearity is introduced [74,75,76]. For Convolutional Neural Networks (CNN) with piecewise linear activation function, spurious minima have been argued to be common [77].
Local or global, exponentially many minima seem to be a generic feature of the loss function. Typically DNN-s involve millions of parameters, but even for moderately sized networks, the number of local optima and saddle points already shows exponential growth in the number of parameters [78,79,80,81]. Existing evidence suggests that critical points that lie high on the loss landscape are more likely to be saddles, whereas low-lying critical points are more likely to be local minima [82,83,84]. It has been argued that the main obstacle in finding a global minimum comes not from local minima, but from saddles [85,86].
Apart from these generalities, realistic DNN-s are too complicated to admit much analytical understanding. Thus it is useful to study simpler models, which capture broad features of DNN-s [87]. Physical systems are particularly appealing in this regard, as they can often provide a physics way of thinking about NN optimization.
Connection to physics is usually made by interpreting the loss function as some sort of energy. Glassy systems present natural candidates, as their energy landscapes necessarily have many minima. Indeed, the connection between NN-s and spin glasses has been known for quite some time [88,89]. More recently, the loss function of a fully-connected feed-forward deep network with ReLUs has been related to the Hamiltonian of a spin glass model [80,81,90]. Critical points in these models have a layered structure [91,92]. Nevertheless, the similarities between DNN-s and glassy systems are limited [93]. Most significantly, getting stuck at local minima for a long time is the quintessential glassy behaviour [94,95,96], whereas DNN-s somehow manage to find their minima.
Other approaches to understanding the loss landscape include the algebraic geometric approach [97] and the energy landscape approach [98]. The latter is known to be particularly useful in the study of molecules [99].
Our goal in this work is rather humble. We point out that the physical potential of certain microscopic descriptions of black holes in string theory captures some key features of the loss landscape. The motivation for this work is to relate the exponential degeneracy of loss function minima to the exponential degeneracy of quantum states in a statistical mechanical system. This connection can be made explicit for supersymmetric black holes, where the black hole entropy implies the existence of exponentially many degenerate states, which in turn can be related to the minima of a potential through supersymmetric quantum mechanics. Whereas the connection between statistical mechanics and machine learning has been explored in the literature [100,101], this particular angle has not been explored, to the best of our knowledge.
A black hole carries an entropy
$$ S_{BH} = \frac{A}{4 \ell_P^2} . $$
Here $A$ is the area of the event horizon of the black hole and $\ell_P = \sqrt{G \hbar / c^3}$ is the Planck length, where $G$ is Newton's constant, $\hbar$ is the (reduced) Planck constant and $c$ is the velocity of light. Providing a statistical mechanical interpretation of this entropy is a highly non-trivial challenge, supposed to be addressed by a theory of quantum gravity, such as string theory.
Indeed, $S_{BH}$ is reproduced from microscopic state counting for a wide variety of black holes in string theory [106,107,108]. This is done by providing a "microscopic description" of a black hole as a bound state of more fundamental solitonic objects in string theory, then coming up with a low energy description of the dynamics of that bound state, and finally counting states carrying appropriate quantum numbers. In all known cases, this microscopic counting accounts for $S_{BH}$ when the black hole is large, which is reflected in the largeness of the quantum numbers.
Microscopic descriptions of black holes come in a wide variety in string theory. For definiteness, we consider 1/8 BPS black holes in $\mathcal{N} = 8$ string theory, carrying only Ramond-Ramond charges. The microscopic description of such black holes is given by a bound state of D-branes, whose low energy dynamics is captured by a supersymmetric quantum mechanics with matrix and vector-like degrees of freedom [109,110]. The number of potential minima of this supersymmetric quantum mechanics approaches $e^{S_{BH}}$ asymptotically for large charges, i.e. for large black holes. Nevertheless, thanks to the dualities in string theory, the exact number of potential minima is known (without actually finding them) for arbitrary charges [108,111]. This is a notable advantage over spin glass models for the loss landscape. E.g. testing the efficiency of an algorithm in finding different global minima is easier in the current setting, as the exact number of minima is already known.
The black holes under consideration carry four kinds of electro-magnetic charges, three of which we take to be 1, and we consider the one parameter family of black holes labelled by the fourth charge, which we denote by $N$. For large $N$, the number of minima grows as $e^{2\pi\sqrt{N} - 2 \ln 4N}$. For the first few values of $N$, the numbers of minima are tabulated; for instance, there are 12, 56 and 208 minima for $N = 1, 2, 3$ respectively.
The paper is organized as follows. In Section 2, we give a quick introduction to black hole entropy, and explain how it arises in string theory. Readers familiar with the topic can jump to Section 3, where we discuss the relevant potential to be minimized (3.5). In Section 4, we estimate the number of runs required to find all the minima and discuss the performance of Stochastic Gradient Descent in this regard. In Section 5 we discuss future directions.
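To get a feel for how quickly the number of minima grows, the leading large-$N$ expression quoted above can be evaluated numerically. The Python sketch below is purely illustrative: the function name `asymptotic_count` is ours, and the exact counts 12, 56 and 208 for N = 1, 2, 3 are the ones quoted in the case study later in the paper. Since the expression only captures the large-$N$ growth, one should not expect more than order-of-magnitude agreement at such small $N$.

```python
import math


def asymptotic_count(N):
    """Leading large-N estimate exp(2*pi*sqrt(N) - 2*ln(4N)) of the number of minima."""
    return math.exp(2 * math.pi * math.sqrt(N) - 2 * math.log(4 * N))


# Exact numbers of minima quoted in the paper for N = 1, 2, 3.
exact_counts = {1: 12, 2: 56, 3: 208}

for N, exact in exact_counts.items():
    estimate = asymptotic_count(N)
    print(f"N={N}: exact={exact}, estimate={estimate:.1f}, ratio={estimate / exact:.2f}")
```

The ratio of the estimate to the exact count shrinks as $N$ grows, as one would hope for an asymptotic formula, although three data points are of course no proof of convergence.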

A brief introduction to black hole entropy in string theory
This section contains well known facts about black holes in string theory and readers familiar with the subject can skip this section.
A quick introduction to black hole entropy: Black holes are astronomical objects, formed by the collapse of dead stars. Their claim to fame is that nothing, not even light, can escape their gravitational pull. From a theoretical perspective, black holes are solutions to Einstein's equations with an event horizon. A curious feature of such solutions is that they are entirely fixed by a handful of parameters, such as mass, angular momentum, electric and magnetic charges etc., therefore leaving no scope for any microstructure and hence any entropy. However, this is rather problematic, since one could violate the second law of thermodynamics by simply throwing an entropic object into a black hole. Making use of an existing result in classical gravity, that the area of the event horizon of a black hole does not go down in a physical process [112], Bekenstein proposed that a black hole should be assigned an entropy proportional to the area of its event horizon [102,103]. In fact it was shown in [104] that the "laws of black hole mechanics" have a striking resemblance with the laws of thermodynamics, although it was unclear whether this was a coincidence or not. In particular, a black hole with non-zero temperature is expected to emit thermal radiation, which would contradict the very definition of a black hole.
This issue was settled by Hawking, as he found that quantum mechanically black holes emit thermal radiation [105], with a temperature consistent with the findings of [104]. In particular, this work led to a precise formula for black hole entropy,
$$ S_{BH} = \frac{A}{4 \ell_P^2} . \qquad (2.1) $$
This remarkable finding, however, led only to deeper puzzles. Now that black holes are understood to have an entropy, as per the standard statistical mechanical understanding of entropy, there must also exist $e^{S_{BH}}$ black hole microstates. Classical gravity leaves us clueless, whereas the appearance of the Planck length $\ell_P$ in the black hole entropy formula (2.1) suggests that the answers lie in a theory of quantum gravity, such as string theory.
How black holes arise in string theory: The low energy physics of string theory is captured by supergravity theories. These supergravity theories have black hole solutions.
The black holes considered in this paper are of this type. By virtue of being embedded in string theory, it is often possible to have an understanding of their microstates. We will restrict to black hole solutions that preserve some supersymmetry. Such black holes are extremal, namely "their charge equals mass". Here charge refers to charge (both electric and magnetic) under various $U(1)$ gauge fields of supergravity. These charges are meaningful even beyond four dimensional supergravity; in particular, their carriers can be identified as membrane like objects (called D-branes) in full string theory. This provides an alternative description of black holes as bound states of D-branes. This description, usually referred to as the microscopic description, makes the microstates manifest.
Extremal black holes carry zero temperature and hence the entropy is simply the logarithm of the ground state degeneracy$^1$.

The potential landscape in microscopic description of black holes
In most well known examples, the microscopic description of black holes is a two dimensional Conformal Field Theory (CFT) [106,107]. In such a description, one looks for the degeneracy of states at a given CFT level, which is captured by the Cardy formula [112]. In this approach there is no energy landscape involved.
Since we are looking for a potential landscape with ground states corresponding to its minima, quantum mechanical, i.e. 0+1 dimensional, microscopic descriptions are better suited. For supersymmetric black holes, this quantum mechanics is supersymmetric, i.e. it preserves some supercharges, which annihilate the ground states. For supersymmetric quantum mechanics (SQM), it is known that the index $I := N_B - N_F$ is given by the Euler characteristic of the vacuum manifold, i.e. the manifold on which the potential vanishes [113,114,115]. Since the potential is a sum of squares in this case, the vacuum manifold is also the space of minima. In general, this vacuum manifold is a continuous space and its cohomology corresponds to the space spanned by the supersymmetric ground states. The rank of a form decides the bosonic/fermionic nature of the corresponding state.
For spherically symmetric supersymmetric black holes in four spacetime dimensions, the Zero Angular Momentum Conjecture [110] further implies that the black hole microstates lie in the middle cohomology$^2$, and carry zero angular momentum [111,118,119,120]. This in particular implies $N_F = 0$, thus the degeneracy $d = N_B + N_F$ equals the index $I$.
The Zero Angular Momentum Conjecture is compatible with the vacuum manifold being a set of points, in which case the cohomology only contains 0-forms. Each discrete global minimum is then associated with a 0-form, i.e. a black hole microstate. Consequently, the number of global minima equals the degeneracy, which is the exponential of the black hole entropy. This possibility is realized by 1/8 BPS black holes in $\mathcal{N} = 8$ string theory, in a duality frame where all charges are carried by D-branes [109,110].
Other relevant quantum mechanical descriptions include quiver quantum mechanics [116], especially scaling quivers [117,121,122], which share many similarities with single centered black holes. Whereas such quivers do exhibit exponentially many states in the middle cohomology, the vacuum manifold is not a discrete set of points, hence they are not particularly suitable for the present purpose.

The system
The system involves a single D2 brane stretched along the spatial directions $x^4, x^5$, a single D2 brane stretched along the spatial directions $x^6, x^7$, a single D2 brane stretched along the spatial directions $x^8, x^9$, and $N$ D6 branes stretched along the spatial directions $x^4, x^5, x^6, x^7, x^8, x^9$. This D-brane configuration is summarized in Table 2. Note that there is no spatial direction along which all the D-branes extend. Thus the low energy dynamics is that of an effective particle. This feature is rather distinctive, as very often the low energy dynamics of microscopic descriptions of black holes in string theory is captured by an effective string [106,107].

Field content and gauge symmetry
Excitation modes of the system correspond to the modes of open strings stretched between these branes. Further, the massless modes are the most relevant ones for low energy physics. Such modes/fields can be arranged in supersymmetric multiplets. We shall use the 4 dimensional supermultiplets, dimensionally reduced to 1 dimension, i.e. only time.
There are two different types of open strings: strings starting and ending on the same brane and strings starting and ending on different branes.
1. For each brane, the first kind of strings gives rise to the field content of $\mathcal{N} = 4$ super Yang-Mills theory, which has an $\mathcal{N} = 1$ vector multiplet and three $\mathcal{N} = 1$ chiral multiplets. The bosonic content of these multiplets is summarized in Table 3, where the superscripts denote the stack number.
Table 3: Bosonic fields coming from strings starting and ending on the same D-brane.
A stack of $N$ D-branes comes with a $U(N)$ gauge symmetry. Thus the three D2 branes are associated with three different $U(1)$ symmetries, whereas the D6 brane stack is associated with a $U(N)$ symmetry. The fields described above furnish adjoint representations of the respective gauge groups. Since $U(1)$ acts trivially on adjoints, this means the fields on the first three rows of Table 3 are gauge invariant, i.e. scalars, whereas those on the last row are $N \times N$ matrices transforming in the adjoint representation of $U(N)$.
2. The second kind of strings, for every pair $(kl)$ of branes, gives rise to an $\mathcal{N} = 2$ hypermultiplet, or equivalently two $\mathcal{N} = 1$ chiral multiplets $Z^{kl}, Z^{lk}$. We will abuse notation and denote the complex scalars of these multiplets by the same names. $Z^{kl}$ transforms as a fundamental under $U(N_k)$ and as an anti-fundamental under $U(N_l)$. Here $N_k = 1$ for $k = 1, 2, 3$ and $N_k = N$ for $k = 4$.
Note that no field is charged under the overall $U(1)$. Hence the gauge symmetry of the system is $U(1) \times U(1) \times U(N)$, where the two $U(1)$-s can be taken to be any two relative $U(1)$-s. A D-brane system spontaneously breaks various translational symmetries of the ambient spacetime. These broken symmetries appear as Goldstone modes in the worldvolume theory, see (3.1). The 7 supermultiplets containing these fields decouple from the dynamics. Fermions in these multiplets, which account for 28 off-shell degrees of freedom, correspond to the broken supersymmetries. This perfectly matches the fact that the system preserves 4 real supercharges ($\mathcal{N} = 1$ in 4 dimensional language) out of a total of 32 ($\mathcal{N} = 8$ in 4 dimensional language).

The potential or loss function
The detailed Lagrangian of the system is discussed in [109,110]. Here we directly jump to the potential. The potential (to be treated as the loss function) is a sum of 3 non-negative terms,
$$ V = V_{gauge} + V_D + V_F . $$
$V$ is minimized when each of $V_{gauge}$, $V_D$, $V_F$ individually vanishes$^4$. Among these, $V_{gauge}$ can be set to 0 simply by putting the $X$ fields to zero, i.e. taking the 3 non-compact coordinates of different stacks of D-branes to coincide. The $X$ fields do not appear in $V_D$ and $V_F$, and hence can be safely forgotten. In fact, for non-zero values of the $Z$ fields, which is the case for all minima, this is the only way to make $V_{gauge}$ vanish.
Setting $V_D$ to zero has the effect of fixing the scale of various fields. This effect can be incorporated by complexifying the $U(1) \times U(1) \times U(N)$ gauge invariance to $\mathbb{C}^* \times \mathbb{C}^* \times GL(N, \mathbb{C})$. Hence it suffices to minimize $V_F$, subject to the complexified gauge invariance. Thus in the following we shall discuss only $V_F$ and relegate the details of $V_{gauge}$ and $V_D$ to Appendix B.
The "F-term potential" $V_F$ is derived from an underlying superpotential $W$, which is detailed in Appendix C. It is useful to first define some intermediate objects, the $F$-s and $G$-s, in terms of which $V_F$ is given in (3.5). We will use the words minima and solutions interchangeably. It suffices to think of $V_F$ as the loss function. Minima of $V_F$ are solutions to the equations (3.6), which are invariant under the complexified gauge transformations.
If the gauge redundancy is not fixed, then each discrete minimum turns into a continuous space of minima, corresponding to the gauge orbit. This is similar to, and can be thought of as a trivial realization of, the phenomenon of mode connectivity found in some NN architectures, i.e. the existence of paths of almost constant loss connecting two minima of the loss function [123,124].
The minima of $V_F$ can also be thought of as minima of a slightly simpler potential (E.4). We relegate this to Appendix E, since this simpler potential (E.4) is not the physical potential. But if one's sole concern is to find a potential landscape with exponentially many minima, then this simpler potential is as good as the physical one. For the special case of $N = 1$, the system can be further simplified to one with only 3 complex variables. This is discussed in Appendix F.
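The statement that an unfixed gauge redundancy turns each isolated minimum into a continuous orbit of minima can be illustrated with a deliberately tiny toy model. The Python sketch below is not the potential (3.5): the loss `toy_loss`, the single constraint `z12 * z21 = c`, and the helper names are all hypothetical stand-ins, chosen only because the solution set is invariant under a complexified $U(1)$ acting as $z_{12} \to a z_{12}$, $z_{21} \to z_{21}/a$.

```python
def toy_loss(z12, z21, c=1.0):
    """Toy stand-in for V_F: a single squared equation, zero iff z12 * z21 == c."""
    return abs(z12 * z21 - c) ** 2


def gauge_transform(z12, z21, a):
    """Complexified U(1) action with nonzero complex parameter a."""
    return a * z12, z21 / a


def fix_gauge(z12, z21):
    """Pick the orbit representative with z12 = 1 (one possible gauge choice)."""
    return gauge_transform(z12, z21, 1.0 / z12)


# One solution and a few points on its gauge orbit: all have vanishing loss.
solution = (2.0 + 0.0j, 0.5 + 0.0j)
orbit = [gauge_transform(*solution, a) for a in (1.0, 0.3j, -2.0 + 1.0j)]
orbit_losses = [toy_loss(z12, z21) for z12, z21 in orbit]

# After gauge fixing, every point of the orbit collapses to the same representative.
representatives = [fix_gauge(z12, z21) for z12, z21 in orbit]
print(orbit_losses, representatives)
```

In this toy example the whole orbit is a flat valley of zero loss, which is the trivial form of connectivity alluded to above; fixing the gauge removes the valley and leaves a single point.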

Searching for multiple minima with Stochastic Gradient Descent
Different minima of the loss function represent different ways of learning the same data, with minima having lower loss value representing better learning. When all the minima are global, as in linear DNN-s [67], all minima ought to be equally good by this standard. However, since the weights and biases of NN-s corresponding to different global minima are different, something must be different about the learnings corresponding to different minima. Whether this difference is merely technical or translates to some qualitative differences is not obvious at this stage. It is conceivable that at least some different minima might actually differ qualitatively. E.g. wider minima might offer better generalisability [125,126,127].
To study this important issue, first of all we need to be able to find all (or at least a large fraction of all) the minima. The ability of an algorithm to find various minima thus becomes relevant. The black holes discussed in the last section provide an excellent setting to test this. In this section, we discuss how SGD performs in this regard, i.e. whether or not, and in how many runs, it can find all the minima.

Estimated number of runs required to find all the minima
Before we set out, it is useful to estimate the typical number of runs in which one might expect to find all the minima. More precisely, if the loss function has $K$ minima and an algorithm randomly hits one minimum in each run with equal likelihood, then what is the probability $p(K, n)$ of finding all the solutions after $n$ runs? Such a $p(K, n)$ must satisfy
$$ p(K, n) = 0 \ \text{ for } n < K , \qquad \lim_{n \to \infty} p(K, n) = 1 . \qquad (4.1) $$
Running the algorithm $n$ times will produce a sequence (of minima) of length $n$. There are $K^n$ such sequences. The question at hand can be phrased as a combinatorial question about such sequences, and the answer will likely involve objects like binomial coefficients. We make the following ansatz for $p(K, n)$,
$$ p(K, n) = \sum_{a=0}^{K} (-1)^a \binom{K}{a} \left( 1 - \frac{a}{K} \right)^n , \qquad (4.2) $$
which satisfies the conditions (4.1) $\forall\, n, K \geq 1$, $n, K \in \mathbb{Z}$. Here $\binom{K}{a} = \frac{K!}{a!(K-a)!}$ are the binomial coefficients$^5$. Satisfaction of the second condition of (4.1) is obvious, since only the $a = 0$ term survives as $n \to \infty$, whereas satisfaction of the first condition can be verified. For the simple case of $K = 3$, a derivation of this formula is given in Appendix G.
For $K = 12$, which is the number of minima of (3.5) for $N = 1$, the function $p(K, n)$ is plotted in Fig. 1. Note that this estimate ignores the possibility that a run may fail to find any minimum whatsoever. As we discuss later, we actually find this to be a frequent occurrence.
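The estimate discussed above is an instance of the classic "coupon collector" problem, for which the probability of having seen all $K$ equally likely outcomes after $n$ independent draws follows from inclusion-exclusion. The Python sketch below evaluates that standard expression exactly with rational arithmetic; we assume, but cannot confirm from the text alone, that it coincides with the ansatz of the paper. The function names `prob_all_found` and `runs_needed` are ours.

```python
from fractions import Fraction
from math import comb


def prob_all_found(K, n):
    """Probability that n uniform, independent draws from K minima hit every minimum.

    Standard inclusion-exclusion result: sum_a (-1)^a C(K, a) (1 - a/K)^n.
    """
    return sum(
        Fraction((-1) ** a * comb(K, a) * (K - a) ** n, K ** n)
        for a in range(K + 1)
    )


def runs_needed(K, eps):
    """Smallest n for which the probability of having found all K minima is >= 1 - eps."""
    n = K
    while 1 - prob_all_found(K, n) > eps:
        n += 1
    return n


# K = 12 corresponds to the N = 1 case discussed in the text.
print(float(prob_all_found(12, 80)))
print(runs_needed(12, Fraction(1, 100)))
```

For $K = 12$ the probability is already close to 1 at around 80 runs, in line with the reading of Fig. 1 quoted later in the text. As noted above, none of this accounts for runs that fail to reach any minimum.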
A measure of efficiency of an algorithm in finding all the minima might be introduced as follows. Let $n_\epsilon$ be the number of runs needed to reach probability $1 - \epsilon$ for some small $\epsilon$, i.e. $p(K, n_\epsilon) = 1 - \epsilon$. Let the fraction of minima found by an algorithm after $n_\epsilon$ runs be $f(n_\epsilon)$. Then we might define
$$ E_\epsilon := \frac{f(n_\epsilon)}{1 - \epsilon} . \qquad (4.3) $$
As the number of runs approaches infinity, or equivalently $\epsilon \to 0$, the efficiency $E_\epsilon$ approaches its asymptotic value $E_0$. If the algorithm is unable to find all the minima, then $E_\epsilon < 1$ for small enough $\epsilon$. If the algorithm is able to find all the minima after $n_c$ runs, then $E_{\epsilon(n_c)} > 1$. In this case, if one keeps running the algorithm, $E_\epsilon$ approaches its asymptotic value $E_0 = 1$.
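The efficiency measure can be made concrete with a small simulation of the idealised random algorithm. The Python sketch below rests on two assumptions of ours: that the efficiency is the fraction of minima found divided by $1 - \epsilon$ (a form consistent with the limiting statements above), and that failed runs can be modelled by a fixed failure probability `fail_prob`, a hypothetical parameter not taken from the paper.

```python
import random


def simulate_fraction_found(K, n_runs, fail_prob=0.0, seed=0):
    """Fraction of K equally likely minima hit in n_runs; a run fails with prob fail_prob."""
    rng = random.Random(seed)
    found = set()
    for _ in range(n_runs):
        if rng.random() < fail_prob:
            continue  # this run did not converge to any minimum
        found.add(rng.randrange(K))
    return len(found) / K


def efficiency(fraction_found, eps):
    """Assumed form of the efficiency: fraction of minima found divided by (1 - eps)."""
    return fraction_found / (1.0 - eps)


# Idealised baseline for K = 12, with and without failed runs.
for fail_prob in (0.0, 0.5):
    fraction = simulate_fraction_found(12, 80, fail_prob=fail_prob)
    print(fail_prob, fraction, efficiency(fraction, eps=0.01))
```

With this definition, an algorithm that has found every minimum by the time the random baseline reaches probability $1 - \epsilon$ scores slightly above 1, while one that saturates at a fraction below 1 stays below 1 for small $\epsilon$.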

Black holes: A case study

N = 1
The case $N = 1$ is the simplest one and admits 12 global minima. The minima can be found using Mathematica [128] in one go, and rather quickly. The goal here is however to check whether SGD can find all 12 minima at all. This question is of some interest since, to the best of our knowledge, it is not clear as of now whether or not SGD can find all the minima of a loss function. The TensorFlow library [129] is particularly helpful for this task.
Note that although we are using the machinery of machine learning, we are not quite doing machine learning. In particular, there is no neural network involved.
Coming to the loss function, we first note that if the parameters $c_{ij}$ are all taken to be real, then the equations (3.6) and the loss function (3.5) remain unchanged under simultaneous complex conjugation and/or sign change of all the fields. This implies that, given any minimum of the loss function (3.5), a field configuration obtained from it by complex conjugation and/or sign change should also be a minimum. Thus the minima occur in quadruplets. In the special case where all the fields in a minimum are either real or imaginary, these operations give doublets. To exploit these symmetries, we consider real $c_{ij}$-s. To be specific, we work with the set of $c_{ij}$ parameters given in (4.4).
There is a $\mathbb{C}^* \times \mathbb{C}^* \times \mathbb{C}^*$ complexified gauge redundancy, which affects the various $Z$-fields, but not the $\Phi$-fields. This gauge redundancy can be fixed, for example, by fixing $Z^{12}, Z^{13}, Z^{14}$ to arbitrary complex numbers. The specifics of the gauge choice affect the minima, but not the gauge invariant fields or combinations of fields. Thus the gauge invariants offer a useful way of labelling the minima. In Table 4, we list the $\Phi$-fields for all 12 minima obtained using Mathematica [128]. By virtue of being gauge invariant, the $\Phi$-fields at the minima will be those in Table 4 for any choice of gauge.
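The grouping of minima into quadruplets and doublets can be checked mechanically. The Python sketch below takes a field configuration (a list of complex numbers standing in for the fields at a minimum) and returns its distinct images under complex conjugation and overall sign change. The function name `symmetry_orbit` and the rounding tolerance are our own illustrative choices, and the sample configurations are made up rather than actual minima.

```python
def symmetry_orbit(fields, digits=9):
    """Distinct images of a configuration under conjugation and/or overall sign flip."""

    def key(config):
        # Round so that numerically equal configurations compare equal.
        return tuple((round(z.real, digits), round(z.imag, digits)) for z in config)

    conjugated = [z.conjugate() for z in fields]
    images = [
        list(fields),                 # identity
        [-z for z in fields],         # sign change
        conjugated,                   # complex conjugation
        [-z for z in conjugated],     # both
    ]
    return {key(config) for config in images}


generic = [1 + 2j, 3 - 1j]       # generic complex configuration
all_real = [1 + 0j, -2 + 0j]     # every field real
all_imaginary = [1j, -2j]        # every field imaginary

print(len(symmetry_orbit(generic)), len(symmetry_orbit(all_real)), len(symmetry_orbit(all_imaginary)))
```

A generic configuration yields four images, whereas purely real or purely imaginary configurations yield two, because conjugation then coincides with the identity or with the sign flip.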
One can further proceed to obtain some understanding of the loss function landscape. We only undertake some preliminary exploration in this direction. Given that we have the analytic form of the loss function (3.5) as well as the locations of the minima, one might wonder about mode connectivity, i.e. the existence of paths of relatively low loss connecting these minima.
With this question in mind we study the loss function along line segments connecting different minima. The resulting cross section of the loss function vanishes at the ends of such segments, and rises to some maximal value in the middle$^6$. This maximal value varies from segment to segment. E.g. in Figure 2, we plot the loss function along segments connecting minimum 1 to the other 11 minima. For some of the segments, the maximal value of the loss function is rather low. This feature resembles mode connectivity. The maximal value presumably depends on the parameters $c_{ij}$. It is unclear whether mode connectivity persists if the $c_{ij}$-s are taken to be larger than those in (4.4). Other questions of interest include the distribution of critical points, especially the saddle points. We do not attempt this here.
Now we comment on the number of runs needed to find the minima. If we use the information that the minima appear in doublets/quadruplets, then it suffices to find one minimum in every doublet/quadruplet. We could find at least one minimum in each doublet/quadruplet in 35 runs. Thus the efficiency $E_0$, as defined in (4.3), is unity. The number 35 is significantly smaller than the estimated number of runs a random algorithm would take to find all the minima (4.2): Figure 1 suggests it would take about 80 runs to find all 12 minima.
However, the situation worsens if we forget about the symmetries and try to find all 12 minima independently. Counting configurations with loss value $< 10^{-7}$ as valid minima, we found only 11 out of 12 minima in 200 runs$^7$. These accounted for 109 runs. Among the remaining runs, some reached comparatively larger but still small loss values, and the configurations were close to various minima. To be specific, the numbers of configurations reached with loss of the order of $10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}$ are 2, 1, 5, 3, 5, 3 respectively. The rest had even higher loss values and some showed runaway behaviour as well.
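The line-segment scan used above to probe mode connectivity is easy to sketch in code. Since the explicit form of (3.5) is not reproduced here, the Python example below uses a toy sum-of-squares loss $|z^3 - 1|^2$ of a single complex variable, whose three isolated global minima are the cube roots of unity. This toy loss and the helper `barrier_along_segment` are our own stand-ins, not the black hole potential.

```python
import cmath


def toy_loss(z):
    """Toy sum-of-squares loss with three isolated global minima (cube roots of unity)."""
    return abs(z * z * z - 1) ** 2


def barrier_along_segment(z_start, z_end, n_points=101):
    """Maximum of the loss sampled on the straight segment from z_start to z_end."""
    losses = []
    for i in range(n_points):
        t = i / (n_points - 1)
        losses.append(toy_loss((1 - t) * z_start + t * z_end))
    return max(losses)


minima = [cmath.exp(2j * cmath.pi * k / 3) for k in range(3)]

# Barrier heights on the segments joining minimum 0 to the other minima.
barriers = {(0, k): barrier_along_segment(minima[0], minima[k]) for k in (1, 2)}
print(barriers)
```

In the actual problem the same scan would be run with the physical loss and the gauge-fixed minima; a low barrier on a segment is then the signature of the mode-connectivity-like behaviour described above. A straight segment only gives an upper bound on the barrier, since curved paths may have lower loss.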
Curiously, different minima appear with different frequencies. A visual representation of the same is given in Figure 3. Five blocks represent five different eigenvalue sets. We choose to plot the logarithms, since the eigenvalues differ by several orders of magnitude. The widest block at the bottom of Figure 3 corresponds to minima 1, 2. The minima in the narrowest block, namely minima 9, 10, 11, 12, are however not the most frequent ones.
The loss function (3.5) becomes significantly more complicated for $N > 1$ due to the non-Abelian nature of the variables. But the essential task remains the same. The loss function (3.5) vanishes at the global minima, and we decide on some small cut-off for the loss function; if it is reached, we take the configuration to be a minimum. We take this cut-off to be $10^{-6}$. Again, using the fact that minima appear in doublets/quadruplets, it suffices to find one minimum for each doublet/quadruplet. With this fact utilised, for the $N = 2$ case we could find all 56 of 56 minima, whereas for the $N = 3$ case we could find 176 out of 208 minima.
The efficiency $E_0$ is respectively 1 and $11/13 \approx 0.846$ for $N = 2, 3$. The numbers of runs in both cases were hopelessly high and hence we did not keep track of them.
For $N = 4$, we were unable to reach small enough values of the loss function in reasonable time using SGD. We believe this obstacle can be overcome with more effort and perhaps more expertise. But since it is not indispensable for the main argument of this work, we do not attempt it here.
As in the case of $N = 1$, we use SGD to look for the minima. We have made extensive use of the TensorFlow library [129]. As previously stressed, we are not machine learning anything and there is no neural network in sight. We are merely using the machinery of machine learning to find the minima of (3.5).
Due to the large number of variables involved, we do not list the field configurations corresponding to the minima for $N > 1$. The interested reader is referred to [110] for the case $N = 2$.
In the following, we mention some relevant details of our quest for the loss minima. Many of these details are also present in the $N = 1$ case, but assume more significance for $N > 1$, due to the increased complexity.
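The overall search strategy (many runs from random starting points, a loss cut-off to accept a minimum, and a label to recognise repeated minima) can be sketched without any machine learning library. The Python example below is only a schematic stand-in: it uses plain, deterministic gradient descent rather than SGD or TensorFlow, a toy loss $|z^3 - 1|^2$ instead of (3.5), and parameter values (`learning_rate`, `cutoff`, the initial range, the runaway threshold) chosen by us for the toy problem.

```python
import cmath
import random


def toy_loss(z):
    """Toy loss with three global minima at the cube roots of unity."""
    return abs(z * z * z - 1) ** 2


def toy_grad(z):
    """Gradient of toy_loss w.r.t. (Re z, Im z), packed into one complex number."""
    zc = z.conjugate()
    return 6 * (z * z * z - 1) * zc * zc


def descend(z, learning_rate=0.01, max_steps=20000, cutoff=1e-12, runaway=10.0):
    """Plain gradient descent; returns the minimum reached, or None on failure."""
    for _ in range(max_steps):
        if abs(z) > runaway:
            return None  # runaway behaviour: give up on this run
        if toy_loss(z) < cutoff:
            return z     # loss below cut-off: accept as a minimum
        z = z - learning_rate * toy_grad(z)
    return None


def search(n_runs=60, seed=0):
    """Multi-start search; counts how often each distinct minimum is reached."""
    rng = random.Random(seed)
    found = {}
    for _ in range(n_runs):
        z0 = complex(rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5))
        z = descend(z0)
        if z is None:
            continue
        # Label the minimum by which cube root of unity it is (0, 1 or 2).
        label = round(cmath.phase(z) / (2 * cmath.pi / 3)) % 3
        found[label] = found.get(label, 0) + 1
    return found


print(search())
```

The dictionary returned by `search` plays the role of the frequency histogram of minima. In the real problem the label would be built from gauge invariant quantities, and the toy loss also has a critical point at $z = 0$ with loss of order 1, near which runs slow down, loosely mirroring the plateaus mentioned in the text.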
• Initial values: Since the run might hit different minima depending on the starting point, we chose the initial values of all the variables randomly in the range (-5,5) from a normal or uniform distribution.For initial values beyond this range, the runs usually manifested a runaway behaviour, presumably due to the quartically diverging nature of the loss function.We also considered purely real/imaginary initial values, which yielded only limited success.
We also explored some special initial values, where derivatives of the loss function along some directions vanish, and also initial values at which some of the F-s and G-s in (3.5), (3.6) vanish. But these did not lead to any new minima.
• Two Cycles Optimization: As many runs seem to get stuck at plateaus with loss value of order 1, we switched to a two cycle optimization, where we would only continue if the loss value had come below a set threshold after running for a set number of steps. If it had, we would use those values of the parameters to run a second cycle, using the same optimizer or a different one.
• Hyperparameter Tuning: The most important hyperparameter is the learning rate, which we took to be smaller than 10^{-2}, as the probability of runaway behaviour increased above this value. The number of steps needed to reach a minimum was of the order of 10^5 (sometimes more). Smaller learning rates (e.g. < 10^{-3}) required more steps (of the order of 10^7 or more) and more time (of the order of hours) to find a minimum. However, once a small enough loss value has been reached, a smaller step size is desirable, so we sometimes adjusted the learning rate in the middle of a run accordingly. Another important hyperparameter was momentum, whose default value is 0.9. Even with these precautions, only roughly one in every ten runs converged to a minimum for the case of N = 3.
• Excluding already found minima: As for N = 1, some minima are found more frequently than others for N > 1 as well. To avoid this repetition, one may modify the loss function by adding a hump around an already found minimum x_0. This modified loss function does not have x_0 as a global minimum, but the other minima of the original loss function continue to be minima of the modified loss function. A smooth hump function will also affect the locations and values of the other minima, but these effects can be made arbitrarily small by an appropriate choice of the hump. A more serious problem arises from the new local minima and/or saddles introduced by the hump function. Not surprisingly, we did not find such a modification of the loss landscape to be helpful. A simple realization of this idea is presented in Appendix H.
• Gauge choice: For N = 2 and higher, finding all the minima in a single gauge proved difficult, hence we ran the search for multiple gauge choices. For N = 2, two gauge choices sufficed to produce all the minima, but more were required for N = 3. In Appendix D, we discuss a wide class of gauge choices for arbitrary N > 1.
In order to identify the same minimum found in different gauges, gauge invariant markers were calculated and used to count the number of linearly independent minima found (given the choice of special c_ij-s, minima can have two or four fold symmetry).
• Scaling: Under the scale transformations Z → λZ, Φ → λΦ, c_ij → λ^2 c_ij, the loss function (3.5) changes by an overall scaling factor of λ^4. It follows that the minima of the rescaled loss function are related to those of the original one by simple scaling. This fact can be used to scale the c_ij-s up or down, find minima of the resultant loss function and then transform the minima back by the reverse scaling. This essentially has the effect of zooming in or out of the loss landscape. This technique led to some new minima for N = 3.
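The two cycle strategy described in the list above can be illustrated with a minimal sketch. It assumes a one-dimensional toy loss (x^2 − 1)^2 in place of the actual potential, and plain gradient descent in place of SGD with momentum. The names `gradient_descent` and `two_cycle`, the threshold and the step counts are all hypothetical choices made for illustration, not the settings used in this work.

```python
def loss(x):
    # Toy stand-in for the potential: degenerate minima at x = -1 and x = +1.
    return (x * x - 1.0) ** 2


def grad(x):
    # Analytic derivative of the toy loss.
    return 4.0 * x * (x * x - 1.0)


def gradient_descent(x, lr, steps):
    # Plain gradient descent (assumption: no momentum, no stochasticity).
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


def two_cycle(x0, threshold=1e-3, lr1=1e-2, steps1=200, lr2=1e-3, steps2=2000):
    # First cycle: a fixed budget of steps at the larger learning rate.
    x = gradient_descent(x0, lr1, steps1)
    # Abandon the run if the loss has not dropped below the threshold.
    if loss(x) >= threshold:
        return None
    # Second cycle: restart from the first-cycle parameters with a smaller step size.
    return gradient_descent(x, lr2, steps2)
```

In this toy setting a run started at x0 = 0 sits at a critical point with loss 1 and is abandoned after the first cycle, whereas a run started at x0 = 0.5 passes the threshold and is refined in the second cycle.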

Discussion
In this work we have focused on the presence of exponentially many low lying local minima of the loss landscape, which seems to be a generic feature found in diverse NN architectures. We have pointed out that the physical potential in the microscopic description of black holes in string theory naturally gives rise to similar landscapes. This is quite striking, since these potentials can have as few as a couple of dozen parameters in simple cases, compared to millions of parameters in realistic DNN models. Furthermore, in this case the exact number of minima is known a priori, owing to the stringent mathematical structure of string theory. Physically, the presence of exponentially many minima has its origin in black hole entropy. Apart from establishing a curious connection between quantum gravity and machine learning, our work provides a large class of computationally cheap testing grounds for studying questions related to the loss landscape, some of which we have explored in this work.
The possibilities offered by the connection made in this work far exceed the explorations carried out here. Firstly, our computational resources limited our exploration to landscapes for small charges, i.e. landscapes with a relatively small number of minima. It is imperative to carry out similar investigations for larger charges, where the exponential degeneracy of minima becomes apparent. We have mostly relied on SGD as the means of finding the minima, as preliminary experience indicated SGD to be the most efficient. However, a detailed study of the performance of other algorithms is desirable.
Apart from finding the minima, the landscapes discussed in this work offer perfect settings for testing the role of saddle points in search convergence. We have noted that the search gets slower with increasing N. It would be interesting to enumerate the saddles, which should not be too difficult as the potential has an analytic form, although we have not attempted this in the present work.
More realistic parallels of the loss landscapes of non-linear DNN-s have non-degenerate minima. As mentioned earlier, such landscapes can easily be obtained from the ones discussed in this work, simply by adding to the potential a slowly varying function bounded from below. It would be interesting to explore whether and how such changes affect the search for minima.
A more futuristic, yet perhaps most intriguing, possibility would be to machine learn the minima of the loss function themselves. At present, when one is usually content with finding any one set of hyperparameters with low enough loss value, this may sound too farfetched. But as technical advancements take place at a galloping pace, it may soon be commonplace to find several local minima of the loss function of a DNN. Questions pertaining to the multitude of minima would then be of practical interest. In the meantime, we might obtain theoretical insights into similar questions by studying the potential landscapes pointed out in this paper, which are much simpler to study due to the small number of parameters involved.

where the F-term equations entail that all Z-s and Φ-s are non-vanishing. Demanding a vanishing V_gauge then implies

The D-term potential has the general form

C. The superpotential
The F-term potential in a supersymmetric theory has the following structure

where ϕ_α stands for the various chiral multiplets (or the complex scalars therein) in the theory. The superpotential in the present case is given by [110]

where Φ^(12), Φ^(23), Φ^(31) are as defined in (3.2).

D. Gauge fixing
The system has a C^* × C^* × GL(N, C) gauge symmetry. The first and second C^* gauge symmetries can be fixed by fixing Z_12 and Z_23, respectively, to some complex numbers. GL(N, C) can be fixed in many ways. We choose to fix one of Z_41, Z_42, Z_43, Z_14, Z_24, Z_34 to a randomly chosen vector Z_fix and one of Φ^4_1, Φ^4_2, Φ^4_3 to an N × N matrix Φ_fix with all but N components fixed. The ability to take any configuration of, say, Z_41, Φ^4_1 to this configuration depends on the existence of a unique M ∈ GL(N, C) satisfying

These, being linear equations in the entries of M, will generally admit a unique solution.
The simplest choice of Φ_fix would be a diagonal matrix. We point out a more general choice, which is to fix the first N − 1 rows of Φ_fix to randomly chosen numbers. For different choices of these random numbers, different minima might be more accessible.
As an explicit example, consider N = 3. In this case, Φ_fix takes the form
$$ \Phi_{fix} = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ v_1 & v_2 & v_3 \end{pmatrix} $$
where the r_ij-s are randomly chosen complex numbers (held fixed in the course of SGD) and the v_i-s are dynamic variables (varied in the course of SGD).
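The gauge choice with the first N − 1 rows fixed can be sketched as follows. This is only an illustration of the bookkeeping: it assumes that the fixed rows occupy the top of the matrix and the dynamic entries the last row, and the helper names `make_gauge` and `assemble_phi`, as well as the Gaussian sampling of the random numbers, are hypothetical.

```python
import random


def random_complex(rng):
    # Random complex number with Gaussian real and imaginary parts (assumed distribution).
    return complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


def make_gauge(n, seed):
    # Draw the N - 1 fixed rows (the r_ij) and an initial value for the dynamic row (the v_i).
    rng = random.Random(seed)
    fixed_rows = [[random_complex(rng) for _ in range(n)] for _ in range(n - 1)]
    v = [random_complex(rng) for _ in range(n)]
    return fixed_rows, v


def assemble_phi(fixed_rows, v):
    # Build the N x N matrix: fixed rows on top, dynamic variables in the last row.
    return [list(row) for row in fixed_rows] + [list(v)]
```

An optimizer would then update only `v` and rebuild the matrix with `assemble_phi` at each step, and a different `seed` gives a different gauge choice of the same class.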

G. Derivation of p(3, n)
To start with, we recall (4.2), which entails (G.1). Let m_S denote the number of sequences of length n with entries drawn from the set S ⊂ {1, 2, 3}. Thus, m_{12} = m_{13} = m_{23} = 2^n. Note that m_{12} is the number of sequences without 3, m_{13} is the number of sequences without 2, and m_{23} is the number of sequences without 1. Thus, up to overcounting, the number of sequences without at least one of 1, 2, 3 is m_{12} + m_{13} + m_{23}. To take care of the overcounting, we note that the sequence 1^n is counted in both m_{12} and m_{13}. Similar statements hold for the sequences 2^n and 3^n. With the overcounting taken care of, the number of sequences without at least one of 1, 2, 3 is 3 · 2^n − 3, which precisely matches (G.1). Note that p(3, 1) = p(3, 2) = 0, as expected.
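The counting argument above can be checked by brute force for small n. The sketch below assumes that every run lands on one of the k minima independently and with equal probability, which may be simpler than the assumptions behind (4.2). The function names `count_missing_some` and `prob_all_found` are hypothetical, and the closed form used for general k is the standard inclusion-exclusion formula rather than a formula quoted from this paper.

```python
from itertools import product
from math import comb


def count_missing_some(n):
    # Enumerate all length-n sequences over {1, 2, 3} and count those
    # in which at least one of the three symbols never appears.
    return sum(1 for seq in product((1, 2, 3), repeat=n) if len(set(seq)) < 3)


def prob_all_found(k, n):
    # Probability that n independent, uniform draws from k minima hit every
    # minimum at least once (inclusion-exclusion over the set of missed minima).
    hits = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return hits / k ** n
```

For k = 3 the enumeration reproduces the inclusion-exclusion count of sequences missing at least one symbol, and `prob_all_found(3, n)` vanishes for n = 1 and n = 2, in line with the remark above. The same function evaluated at k = 12 gives the corresponding uniform-probability estimate for the N = 1 landscape.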

H. Excluding already found solutions
We consider a simple function f with multiple degenerate minima and try to modify it to another function f̃, such that one of the minima of f is not a minimum of f̃, but the others continue to be so, at least approximately. This can be achieved simply by adding a localized hump around the minimum to be excluded. A simple choice for the hump function around a point x_0 would be

where the parameter a controls the height of the hump and λ its width. If a function f(x) has a minimum at x_0, then we can take f̃(x) = f(x) + t(x_0) to be the modified function. As an example, in Fig. 4 we plot the function f(x) = (x^2 − 1)^2, the hump t(1) and f̃(x) = f(x) + t(x_0) with x_0 = 1. As can be seen in Fig. 4, the modified function (the green plot) and the original function (the blue plot) essentially coincide except in a small neighbourhood of the point x = 1. While the global minimum at x = 1 is avoided, new local minima show up.
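The one-dimensional example can be reproduced numerically with a short sketch. The hump used here, (1 − tanh(10(x − 1)^2 − 0.1))/2, is one reading of the formula in the caption of Fig. 4 and should be treated as an assumption. The grid-based minimum finder `grid_local_minima` is a hypothetical helper that only detects minima at the resolution of the grid.

```python
import math


def f(x):
    # Original function with degenerate minima at x = -1 and x = +1.
    return (x * x - 1.0) ** 2


def hump(x, x0=1.0):
    # Localized hump centred at x0 (assumed form, read off the Fig. 4 caption).
    return 0.5 * (1.0 - math.tanh(10.0 * (x - x0) ** 2 - 0.1))


def f_mod(x):
    # Modified function: original function plus the hump around x = 1.
    return f(x) + hump(x)


def grid_local_minima(g, lo=-2.0, hi=2.0, n=400):
    # Return the grid points that are strictly lower than both neighbours.
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    ys = [g(x) for x in xs]
    return [xs[i] for i in range(1, n) if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]]
```

Comparing `grid_local_minima(f)` with `grid_local_minima(f_mod)` shows the behaviour described above: the minimum at x = −1 survives, x = 1 is no longer a minimum, and new local minima appear on either side of x = 1.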

1. Introduction and summary
2. A brief introduction to black hole entropy in string theory
3. The potential landscape in microscopic description of black holes
3.1 The system
3.2 Field content and gauge symmetry
3.3 The potential or loss function
4. Searching for multiple minima with Stochastic Gradient Descent
4.1 Estimated number of runs required to find all the minima
4.2 Black holes: A case study
4.2.1 N = 1
4.2.2 N > 1
5. Discussion
A. The index
B. Details of V_gauge and V_D
C. The superpotential
D. Gauge fixing
E. Simplification of the potential landscape
F. Further simplification for N = 1
G. Derivation of p(3, n)
H. Excluding already found solutions

(3.7)
where λ_12 ∈ C^*, λ_23 ∈ C^*, M ∈ GL(N, C). The U(1) × U(1) × U(N) subgroup of this is the gauge group. Solutions to (3.6) that are related by C^* × C^* × GL(N, C) gauge transformations acting as in (3.7) are to be counted as the same solution. In other words, a physical solution really stands for a gauge orbit. In this work, we shall get rid of the C^* × C^* × GL(N, C) redundancy by fixing the gauge, and will minimize the loss function in the remaining variables. Various choices of gauge are discussed in Appendix D.

Figure 1: Probability p(12, n) of finding all 12 minima in n runs

Figure 2: Potential barriers along a straight line connecting minimum 1 to the other 11 minima for N = 1. The x-axis parameterises a line segment connecting solution 1 to solution j for j = 2, . . ., 12, such that x = 0 corresponds to solution 1 and x = 1 corresponds to solution j. "barrier_{1,j}" denotes the potential along such a line segment.

Figure 4: The blue graph is the original potential, the orange graph is the added hump and the green graph is the resultant modified potential. The hump function is (1 − tanh(10(x − 1)^2 − 0.1))/2.

Table 1: Number of minima of the potential landscape for small charges

For non-linear DNN-s, a more desirable landscape would have been one with exponentially many low lying local minima, only one of which is global. Such a landscape might be obtainable by spontaneously breaking the supersymmetry. But we shall not explore this possibility in this work.

Table 2: The D-brane configuration

(3.4)
Here c_12, c_13, c_23, c_14, c_24, c_34 are complex non-zero parameters. Note that F_41, F_42, F_43 are matrices, G_41, G_42, G_43 are column vectors, G_14, G_24, G_34 are row vectors and the rest of the combinations are numbers. The F-term potential V_F is the sum of the modulus squared of all terms in (3.4), i.e.

Table 5: Number of times various minima for N = 1 were found (out of 200 runs), i.e. the loss function dipped below 10^{-7}

Table 5 summarises how many times each minimum was hit. We note that minima in the same doublet/quadruplet often appear with the same or similar frequency, with minima 5, 6 being the most frequent and minima 1, 2 being the least. This raises the question: what is special about the minima that are found more frequently? We note that the least frequent minima host the largest as well as the smallest Hessian eigenvalues among all minima, and therefore have the widest range of eigenvalues. The Hessian eigenvalues of the various doublets/quadruplets are given in Table 6.