Depth-First Search performance in a random digraph with geometric outdegree distribution

We present an analysis of the depth-first search algorithm in a random digraph model with independent outdegrees having a geometric distribution. The results include asymptotic results for the depth profile of vertices, the height (maximum depth) and average depth, the number of trees in the forest, the sizes of the largest and second-largest trees, and the numbers of arcs of different types in the depth-first jungle. Most results are first order. For the height we show an asymptotic normal distribution. This analysis, proposed by Donald Knuth for the forthcoming volume of The Art of Computer Programming, gives interesting insight into one of the most elegant and efficient algorithms for graph analysis, due to Tarjan.


Introduction
The motivation of this paper is a new section in Donald Knuth's The Art of Computer Programming [12], which is dedicated to Depth-First Search (DFS) in a digraph. Briefly, the DFS starts with an arbitrary vertex and explores the arcs from that vertex one by one. When an arc is found leading to a vertex that has not been seen before, the DFS explores the arcs from that vertex in the same way, recursively, before returning to the next arc from its parent. This eventually yields a tree containing all descendants of the first vertex (which is the root of the tree). If there still are unseen vertices, the DFS starts again with one of them and finds a new tree, and so on until all vertices are found. We refer to [12] for details as well as for historical notes. (See also S1-S2 in Section 4.) Note that the digraphs in [12] and here are multi-digraphs, where loops and multiple arcs are allowed. (Although in our random model these are few and usually not important.) The DFS algorithm generates a spanning forest (the depth-first forest) in the digraph, with all arcs in the forest directed away from the roots. Our main purpose is to study the properties of the depth-first forest, starting with a random digraph G; in particular we study the distribution of the depth of vertices in the depth-first forest.
The random digraph model that we consider (following Knuth [12]) has n vertices and a given outdegree distribution P, which in the main part of the present paper is a geometric distribution Ge(1−p) for some fixed 0 < p < 1. The outdegrees (numbers of outgoing arcs) of the n vertices are independent random numbers with this distribution. The endpoint of each arc is uniformly selected at random among the n vertices, independently of all other arcs. (Therefore, an arc can loop back to its starting vertex, and multiple arcs can occur.) We consider asymptotics as n → ∞ for a fixed outdegree distribution.
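The model is straightforward to simulate. The sketch below is ours, for illustration only (the name `random_digraph` and its interface are not from the paper): each vertex draws its Ge(1−p) outdegree arc by arc, and each arc gets a uniform endpoint, so loops and multiple arcs occur naturally.

```python
import random

def random_digraph(n, p, rng=None):
    """Sample the arc list of the random multi-digraph: each vertex has an
    independent Ge(1-p) outdegree (P(k) = (1-p)p^k, k = 0, 1, ...), and each
    arc endpoint is uniform on the n vertices, independently of everything else."""
    rng = rng or random.Random()
    arcs = []
    for v in range(n):
        # each coin with probability p adds one more outgoing arc, so the
        # number of arcs from v is geometric with mean p/(1-p) = lambda
        while rng.random() < p:
            arcs.append((v, rng.randrange(n)))
    return arcs
```

For p = 1/2 the mean outdegree is λ = 1 (the critical case), so the expected number of arcs equals n.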
In the present paper, we study the case of a geometric outdegree distribution in detail; we also (in Section 3) briefly give corresponding results for the shifted geometric outdegree distribution Ge_1(1−p), and discuss the similarities and differences between the two cases. The case of a general outdegree distribution (with finite variance) will be studied in a forthcoming paper [10], where we use a somewhat different method which allows us to extend many (but not all) of the results in the present paper and obtain similar, but partly weaker, results; see also Section 4. One reason for studying the geometric case separately is that its lack-of-memory property leads to interesting features and simplifications not present for general outdegree distributions; this is seen both in [12] and in the proofs and results below. In particular, the depth process studied in Section 2 is a Markov chain, which is the basis of our analysis.
In addition to studying the depth-first forest, we also give (in Section 2.5) some results on the number of arcs of different types in the depth-first jungle; this is defined in [12] as the original digraph with arcs classified by the DFS algorithm into the following five types (see Figure 1 for examples):
• loops;
• tree arcs, the arcs in the resulting depth-first forest;
• back arcs, the arcs which point to an ancestor of the current vertex in the current tree;
• forward arcs, the arcs which point to an already discovered descendant of the current vertex in the current tree;
• cross arcs, all other arcs (these point to an already discovered vertex which is neither a descendant nor an ancestor of the current vertex, and might be in another tree).
(See further the exercises in [12].)
Remark 1.1. Some related results for DFS in an undirected Erdős-Rényi graph G(n, λ/n) are proved by Enriquez, Faraud and Ménard [5] and Diskin and Krivelevich [4], and DFS in a random Erdős-Rényi digraph has been studied for example in the proof of [13, Theorem 3]. These models are closely related to our model with a Poisson outdegree distribution P; they will therefore be further discussed in [10].
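The five-way classification can be reproduced by any standard DFS that records discovery order and the current stack of active vertices. The sketch below is our illustration (the name `classify_arcs` is not from the paper or [12]); it examines arcs in DFS order and counts each type. It uses recursion, so it is only intended for small n.

```python
import random

def classify_arcs(n, p, rng):
    """Run DFS on the random digraph and count loops, tree, back,
    forward and cross arcs, classifying each arc when it is examined."""
    adj = [[] for _ in range(n)]
    for v in range(n):
        while rng.random() < p:            # Ge(1-p) outdegree, arc by arc
            adj[v].append(rng.randrange(n))
    disc = [None] * n                      # discovery order
    on_stack = [False] * n                 # True while a vertex is an ancestor
    counts = {"loop": 0, "tree": 0, "back": 0, "forward": 0, "cross": 0}
    clock = 0

    def dfs(u):
        nonlocal clock
        disc[u] = clock
        clock += 1
        on_stack[u] = True
        for w in adj[u]:
            if w == u:
                counts["loop"] += 1
            elif disc[w] is None:          # new vertex: tree arc, recurse
                counts["tree"] += 1
                dfs(w)
            elif on_stack[w]:              # ancestor of u: back arc
                counts["back"] += 1
            elif disc[w] > disc[u]:        # already-found descendant: forward arc
                counts["forward"] += 1
            else:                          # neither: cross arc
                counts["cross"] += 1
        on_stack[u] = False

    for v in range(n):
        if disc[v] is None:                # v starts a new tree of the forest
            dfs(v)
    return counts
```

Note the forward/cross test: a discovered vertex w with disc[w] > disc[u] that is not on the stack must have been discovered while u was active, hence is a descendant of u, so comparing discovery times suffices.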
Remark 1.2. We consider only the case of a fixed outdegree distribution P. The results can be extended to distributions P_n depending on n, under suitable conditions. This is particularly interesting in the critical case, with expectations λ_n → 1; however, this is outside the scope of the present paper.
The main results for a geometric outdegree distribution are stated and proved in Section 2. We analyze the process d(t) of depths of the vertices, in the order they are found by the DFS. For a geometric outdegree distribution (but not in general), d(t) is a Markov chain, and we find its first-order limit by a martingale argument; moreover, we show Gaussian fluctuations. We also find results for the numbers of the different types of arcs defined above; this includes verifying some conjectures from previous versions of [12].
In Section 3 we briefly study the case of a shifted geometric outdegree distribution. The same method as in Section 2 works in this case too, but the explicit results are somewhat different. One motivation for this section is to show that some of the relations found in Section 2 for a geometric outdegree distribution do not hold for arbitrary distributions. We end in Section 4 with some comments on the case of general outdegree distributions.
Figure 1. Example of a depth-first forest (jungle) from [12], by courtesy of Donald Knuth. Tree arcs are solid (e.g. ⑨ → ③).
1.1. Some notation. We denote the given outdegree distribution by P. Recall that our standing assumption is that the outdegrees of the vertices are i.i.d. (independent and identically distributed).
The mean outdegree, i.e., the expectation of P, is denoted by λ. In analogy with branching processes, we say that the random digraph is subcritical if λ < 1, critical if λ = 1, and supercritical if λ > 1.
As usual, w.h.p. means with high probability, i.e., with probability 1 − o(1) as n → ∞. We use →_p for convergence in probability, and →_d for convergence in distribution of random variables.
Moreover, let (a_n) be a sequence of positive numbers, and X_n a sequence of random variables. We write X_n = o_p(a_n) if, as n → ∞, X_n/a_n →_p 0, i.e., if for every ε > 0 we have P(|X_n| > ε a_n) → 0. Note that this is equivalent to the existence of a sequence ε_n → 0 such that P(|X_n| > ε_n a_n) → 0, or in other words |X_n| ≤ ε_n a_n w.h.p. (This is sometimes denoted "X_n = o(a_n) w.h.p.", but we will not use this notation.) Furthermore, we write X_n = O_{L^2}(a_n) if (E|X_n|^2)^{1/2} = O(a_n). Note that X_n = O_{L^2}(a_n) implies X_n = o_p(ω_n a_n) for any sequence ω_n → ∞. Note also that X_n = O_{L^2}(a_n) implies E X_n = O(a_n); thus error terms of this type immediately imply estimates for expectations and second moments. In particular, for the most common case below, X_n = O_{L^2}(n^{1/2}) is equivalent to E X_n = O(n^{1/2}) and Var X_n = O(n).
Remark 1.3. We state many results with error estimates in L^2, which means estimates on the second moment; we conjecture that the results extend to higher moments and estimates in L^p for any p < ∞, but we have not pursued this.

Depth analysis with geometric outdegree distribution
In this section we assume that the outdegree distribution is geometric Ge(1−p) for some fixed 0 < p < 1, and thus has mean λ := p/(1−p). When doing the DFS on a random digraph of the type studied in this paper, it is natural to reveal the outdegree of a vertex as soon as we find it. (See S1-S2 in Section 4.) However, for a geometric outdegree distribution, because of its lack-of-memory property, we do not have to reveal the outdegree immediately when we find a new vertex v. Instead, we only check whether there is at least one outgoing arc (probability p), and if so, we find its endpoint and explore this endpoint if it has not already been visited; eventually, we return to v, and then we check whether there is another outgoing arc (again probability p, by the lack-of-memory property), and so on. This will yield the important Markov property in the construction in the next subsection.
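This lazy revelation is directly implementable: the search never stores an outdegree, only a Bernoulli(p) coin each time it sits at a vertex. The sketch below is ours (names hypothetical); by the lack-of-memory property it produces the discovery order and the depths d(t) with exactly the right distribution.

```python
import random

def dfs_lazy(n, p, rng):
    """DFS on the random digraph, revealing arcs one at a time:
    a coin with probability p means 'one more arc out of the current
    vertex'; each arc's endpoint is uniform on the n vertices."""
    seen = [False] * n
    order = []                  # vertices in discovery order: v_1, v_2, ...
    depth = []                  # depth d(t) of the t-th discovered vertex
    for root in range(n):
        if seen[root]:
            continue
        seen[root] = True
        order.append(root)
        depth.append(0)         # every root has depth 0
        stack = [0]             # depths of the vertices on the current path
        while stack:
            if rng.random() < p:           # current top vertex has another arc
                w = rng.randrange(n)       # uniform endpoint
                if not seen[w]:            # unvisited: descend into w
                    seen[w] = True
                    order.append(w)
                    depth.append(stack[-1] + 1)
                    stack.append(stack[-1] + 1)
                # if w was already seen, stay put; by lack of memory the next
                # coin for this vertex is again Bernoulli(p)
            else:
                stack.pop()                # no more arcs: backtrack
    return order, depth
```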
In the following, by a future arc from some vertex, we mean an arc that at the current time has not yet been seen by the DFS.
2.1. Depth Markov chain. Our aim is to track the evolution of the search depth as a function of the number t of discovered vertices. Let v_t be the t-th vertex discovered by the DFS (t = 1, ..., n), and let d(t) be the depth of v_t in the resulting depth-first forest, i.e., the number of tree edges that connect the root of the current tree to v_t. The first found vertex v_1 is a root, and thus d(1) = 0.
The quantity d(t) follows a Markov chain with the following transitions (1 ≤ t < n):
(i) d(t+1) = d(t) + 1. This happens if, for some k ≥ 1, v_t has at least k outgoing arcs, the first k−1 arcs lead to vertices already visited, and the k-th arc leads to a new vertex (which then becomes v_{t+1}). The probability of this is
    p(1 − t/n)/(1 − pt/n).   (2.2)
(ii) d(t+1) = d(t), assuming d(t) > 0. This holds if all arcs from v_t lead to already visited vertices, i.e., (i) does not happen, and furthermore the parent of v_t has at least one future arc leading to an unvisited vertex. These two events are independent. Moreover, by the lack-of-memory property, the number of future arcs from the parent of v_t also has the distribution Ge(1−p). Hence, the probability that one of these future arcs leads to an unvisited vertex equals the probability in (2.2), which we denote by π_t. The probability of (ii) is thus
    (1 − π_t)π_t.   (2.3)
(iii) d(t+1) = d(t) − ℓ, for 0 < ℓ < d(t). This happens if all arcs from v_t lead to already visited vertices, and so do all future arcs from the ℓ nearest ancestors of v_t, while the (ℓ+1)-th ancestor has at least one future arc leading to an unvisited vertex. The argument in (ii) generalizes and shows that this has probability
    (1 − π_t)^{ℓ+1} π_t.   (2.4)
(iv) d(t+1) = d(t) − ℓ, for ℓ = d(t) ≥ 0. By the same argument as in (ii) and (iii), except that the (ℓ+1)-th ancestor does not exist and we ignore it, we obtain the probability
    (1 − π_t)^{ℓ+1}.   (2.5)
Note that (iv) is the case when d(t+1) = 0, and thus v_{t+1} is the root of a new tree in the depth-first forest. We can summarize (i)-(iv) in the formula
    d(t+1) = max{d(t) + 1 − ξ_t, 0},   (2.6)
where ξ_t is a random variable, independent of the history, with the distribution
    P(ξ_t = k) = (1 − π_t)^k π_t,  k ≥ 0,   (2.7)
where
    π_t = p(n − t)/(n − pt).   (2.8)
In other words, ξ_t has the geometric distribution Ge(π_t). Define
    d̃(t) := Σ_{i=1}^{t−1} (1 − ξ_i),   (2.9)
and note that (2.9) is a sum of independent random variables. Then d(t) can be recovered from the simpler process d̃(t) as follows.
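The chain can be simulated directly, without building the digraph. In the sketch below (ours) we take π_t = p(n − t)/(n − pt), which is the success probability read off from case (i): an arc attempt succeeds if the arc is present (probability p) and hits one of the n − t unvisited vertices; this reconstruction of the transition probability should be checked against the original (2.8).

```python
import random

def depth_chain(n, p, rng):
    """Simulate d(t) via d(t+1) = max(d(t) + 1 - xi_t, 0),
    with xi_t ~ Ge(pi_t) independent; pi_t = p(n-t)/(n-pt) is our
    reading of the transition probability (2.8)."""
    d = [0]                          # d(1) = 0: the first vertex is a root
    for t in range(1, n):
        pi = p * (n - t) / (n - p * t)
        xi = 0
        while rng.random() > pi:     # Ge(pi): failures before first success
            xi += 1
        d.append(max(d[-1] + 1 - xi, 0))
    return d
```

Plotting d(t)/n against t/n for large n shows the deterministic first-order profile: in the supercritical case, a hump supported on [0, θ_1] with its peak near θ_0.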
Remark 2.2. Similar formulas have been used for other, related problems with random graphs and trees, where trees have been coded as walks; see for example [1, Section 1.3]. Note that in our case, unlike e.g. [1], d̃(t) may have negative jumps of arbitrary size.
After the maximum at θ_0, f(θ) decreases and tends to −∞ as θ ↗ 1. Hence there exists θ_1 with θ_0 < θ_1 < 1 such that f(θ_1) = 0; we then have f(θ) > 0 for 0 < θ < θ_1 and f(θ) < 0 for θ > θ_1. We will see that in this case the depth-first forest w.h.p. contains a giant tree, of order and height both linear in n, while all other trees are small.
On the other hand, if λ ≤ 1 (i.e., p ≤ 1/2; the subcritical and critical cases), then f'(0) ≤ 0 and f(θ) is negative and decreasing for all θ ∈ (0, 1). In this case, we define θ_0 := θ_1 := 0 and note that the properties just stated for f still hold (rather trivially). We will see that in this case w.h.p. all trees in the depth-first forest are small.
Note that in all cases 0 ≤ θ_0 ≤ θ_1 < 1, and that θ_1 is the largest solution in [0, 1) to
    log(1 − θ_1) = −λθ_1.   (2.17)
Remark 2.3. The equation (2.17) may also be written 1 − θ_1 = exp(−λθ_1), which shows that θ_1 equals the survival probability of a Galton-Watson process with the Po(λ) offspring distribution defined in (1.1).
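Numerically, θ_1 is easy to obtain from the fixed-point form 1 − θ_1 = exp(−λθ_1): for λ > 1, iterating θ ← 1 − exp(−λθ) from θ = 1 converges monotonically to the largest root. A short sketch (ours):

```python
import math

def theta1(lam, iters=200):
    """Largest solution in [0, 1) of log(1 - theta) = -lam * theta,
    i.e. the survival probability of a Galton-Watson process with
    Po(lam) offspring; 0 in the (sub)critical case lam <= 1."""
    if lam <= 1.0:
        return 0.0
    th = 1.0
    for _ in range(iters):
        th = 1.0 - math.exp(-lam * th)   # monotone fixed-point iteration
    return th
```

For example, theta1(2.0) is approximately 0.7968.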
We define f_+(θ) := [f(θ)]_+. Thus, by (2.15) and the comments above, f_+(θ) = f(θ) for 0 ≤ θ ≤ θ_1, while f_+(θ) = 0 for θ_1 ≤ θ < 1. We can now state one of our main results.
In Section 2.3 we will improve this when λ > 1, and show that then the height Υ is asymptotically normally distributed (Theorem 2.11).
Remark 2.7. When λ > 1, the height Υ and the average depth are thus linear in n, unlike for many other types of random trees. This might imply rather slow performance of algorithms that operate on the depth-first forest if it is built explicitly in a computer's memory.

2.3. Asymptotic normality.
In this subsection, we show that in the supercritical case λ > 1, Theorem 2.4 can be improved to yield convergence of d(t) (after rescaling) to a Gaussian process, at least on [0, θ_1). As a consequence, we show that the height Υ is asymptotically normal.
Recall that for an interval I ⊆ R, D(I) is the space of functions I → R that are right-continuous with left limits (càdlàg), equipped with the Skorohod topology. For definitions of the topology see e.g. [3], [7], [11, Appendix A.2], or [8]; for our purposes it is enough to know that convergence in D(I) to a continuous limit is equivalent to uniform convergence on compact subsets of I. (Note that it thus matters whether the endpoints are included in I or not; for example, convergence in D[0,1) and in D[0,1] mean different things.) We define d(0) := d̃(0) := 0.
uniformly on [0, θ*]. (The o(n^{1/2}) terms here are random, but uniform in θ.) For |θ − θ_0| ≤ n^{−1/6} we have Z(θ) = Z(θ_0) + o(1), since Z is continuous, and thus (2.56) yields, almost surely, the corresponding estimate with θ_0 in place of θ. Since max_θ f(θ) = f(θ_0), it follows that the maximum over this range is attained, up to o(1) errors, at θ_0. On the other hand, for |θ − θ_0| ≥ n^{−1/6} we have, by a Taylor expansion, for some c > 0,
    f(θ) ≤ f(θ_0) − c(θ − θ_0)^2 ≤ f(θ_0) − c n^{−1/3}.

2.4. The trees in the forest.
The arguments in the proof of Theorem 2.12 show that in the supercritical case λ > 1, the DFS w.h.p. first finds possibly a few small trees, then a giant tree containing all v_t with O_{L^2}(n^{1/2}) ≤ t ≤ θ_1 n + O_{L^2}(n^{1/2}), and then a large number of small trees. We give some details in the following lemma and theorem.
Lemma 2.13. Let (a, b) be a fixed interval with 0 ≤ a < b ≤ 1 and b > θ_1. Then w.h.p. there exists a root v_t in the depth-first forest with t/n ∈ (a, b).
Theorem 2.14. Let T_1 be the largest tree in the depth-first forest. Then |T_1| = θ_1 n + o_p(n).
Furthermore, the second largest tree has order |T_2| = o_p(n).
Proof. Let ε > 0. By covering [θ_1, 1] with a finite number of intervals of length < ε/2, it follows from Lemma 2.13 that w.h.p. every tree T having a root v_t with t > (θ_1 − ε/2)n has |T| ≤ εn.
Suppose now λ > 1. Consider the tree T in the depth-first forest that contains v_{⌊nθ_0⌋}, denote its root by v_r, and let v_s be its last vertex. By the proof of Theorem 2.12, d(t) > 0 for O_{L^2}(n^{1/2}) ≤ t ≤ θ_1 n − O_{L^2}(n^{1/2}), and thus r = O_{L^2}(n^{1/2}) and s ≥ θ_1 n − O_{L^2}(n^{1/2}).
Remark 2.15. As said in Remark 2.3, θ_1, the asymptotic fraction of vertices in the giant tree, equals the survival probability of a Galton-Watson process with Po(λ) offspring distribution. Heuristically, this may be explained by the following argument, well known from similar situations. Start at a random vertex and follow the arcs backwards. The indegree of a given vertex is asymptotically Po(λ), and the process of exploring backwards from a vertex may be approximated by a Galton-Watson process with this offspring distribution. Hence, the probability of a "large" backwards process converges to the survival probability θ_1. It seems reasonable that most vertices in the giant tree have a large backwards process, while most vertices outside the giant tree have a small backwards process.
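The heuristic is easy to test by simulation. The sketch below (ours) runs Po(λ) Galton-Watson processes, declaring "survival" when the population exceeds a large cutoff (a heuristic stand-in for an infinite process), and compares the empirical survival frequency with θ_1, which is about 0.797 for λ = 2 by (2.17).

```python
import math
import random

def poisson(lam, rng):
    """Po(lam) sample by the product-of-uniforms method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def gw_survives(lam, rng, cap=1000):
    """One Galton-Watson run with Po(lam) offspring; 'survival' is
    approximated by the population ever exceeding `cap` (heuristic cutoff)."""
    alive = 1
    while 0 < alive <= cap:
        alive = sum(poisson(lam, rng) for _ in range(alive))
    return alive > cap
```

Averaging gw_survives(2.0, rng) over many runs gives an estimate close to θ_1 ≈ 0.797; for subcritical λ the estimate is essentially 0.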
Note also that the asymptotic size of the giant tree thus equals the asymptotic size of the giant component in an undirected Erdős-Rényi random graph G(n, λ/n), which heuristically is given by the same argument. (See also Remark 1.1 and [10].)
2.5. Types of arcs. Recall from the introduction the classification of the arcs in the digraph G. Since we assume that the outdegrees are Ge(1−p) and independent, the total number of arcs, M say, has a negative binomial distribution with mean λn, and, by a weak version of the law of large numbers,
    M = λn + O_{L^2}(n^{1/2}).   (2.82)
In the following theorem, we give the asymptotics of the number of arcs of each type.
Theorem 2.16. Let L, T, B, F and C be the numbers of loops, tree arcs, back arcs, forward arcs, and cross arcs in the random digraph. Then
    T = τn + O_{L^2}(n^{1/2}),   (2.84)
    B = βn + O_{L^2}(n^{1/2}),   (2.85)
    F = ϕn + O_{L^2}(n^{1/2}),   (2.86)
    C = χn + O_{L^2}(n^{1/2}),   (2.87)
where
    τ := χ := 1 − ψ = θ_1 + (λ/2)(1 − θ_1)^2,   (2.88)
    β := ϕ := λα = λ/2 − τ.   (2.89)
Proof. Let η_t be the number of arcs from v_t, and let η_t^<, η_t^=, η_t^> be the numbers of these arcs that lead to some v_u with u < t, u = t and u > t, respectively. Then η_t = η_t^< + η_t^= + η_t^>. Furthermore, an arc from v_t to v_u with u > t is either a tree arc or a forward arc; conversely, every tree arc or forward arc is of this type. Consequently, T + F = Σ_t η_t^>.
B: Let B_t be the number of back arcs from v_t; thus B = Σ_{t=1}^n B_t. Let F_t be the σ-field generated by all arcs from v_i, i ≤ t (i.e., by the outdegrees η_i and the endpoints of all these arcs); note that this includes complete information on the DFS until v_{t+1} is found, but also on some further arcs (the future arcs from the ancestors of v_{t+1}). Then d(t) is F_{t−1}-measurable and B_t is F_t-measurable. Moreover, η_t is independent of F_{t−1}. Thus, conditioned on F_{t−1}, we still have η_t ~ P; we also know d(t), and each arc from v_t is a back arc with probability d(t)/n.
The equalities τ = χ and β = ϕ mean asymptotic equality of the corresponding expected numbers of arcs. In fact, there are exact equalities.
Theorem 2.18. For any n, E T = E C and E B = E F = λ E d.
Proof. Let v, w be two distinct vertices. If the DFS finds w as a descendant of v, then there will later be Ge(1−p) arcs from w, and each has probability 1/n of being a back arc to v. Similarly, there will be Ge(1−p) future arcs from v, and each has probability 1/n of being a forward arc to w. Hence, if I_vw is the indicator that w is a descendant of v, and B_vw [F_vw] is the number of back [forward] arcs between v and w, then E B_vw = E F_vw = (λ/n) E I_vw. Summing over all pairs of distinct v and w, we obtain E B = E F = λ E d, where d is the average depth.
Note that f(θ) in (3.4) is proportional to (2.15) for the (unshifted) geometric distribution with the same λ, but larger by a factor 1/p. Figures 3 and 4 show f(θ) for both geometric distributions with the same p (0.6) and the same λ (2.0), respectively.
In the proof of Theorem 2.16, (2.108) still holds, and we obtain (2.85) with β = λα, and then (2.87) with χ = λ/2 − β just as before (but recall that α now is different). On the other hand, now the expected numbers of back and forward arcs differ, since E B = λ E d = λαn and E F = (λ−1) E d = (λ−1)αn, because the average number of future arcs at a vertex after a descendant has been created is λ − 1. The asymptotic formula (2.84) holds as above with τ := 1 − ψ; hence (2.98) implies that (2.86) holds too, with ϕ = λ/2 − τ; as just noted, we now have ϕ = (λ−1)α ≠ β. Collecting these constants, we see that Theorem 2.16 holds with
    τ := 1 − ψ = θ_1 + (λ/2)(1 − θ_1)^2,
    β := λα,  ϕ := λ/2 − τ = (λ−1)α,  χ := λ/2 − β.
Thus the equality β = ϕ, and hence the equality of the expected numbers of back and forward arcs in Theorems 2.16 and 2.18, was an artefact of the geometric outdegree distribution. Similarly, χ = λ/2 − β < λ/2 − ϕ = τ, and the equality of the expected numbers of tree arcs and cross arcs in Theorem 2.18 also does not hold. We summarize the results above.
Theorem 3.1. Let the outdegree distribution P be the shifted geometric distribution Ge_1(1−p) with p ∈ (0, 1). Then Theorems 2.4-2.16 hold, with the constants now having the values described above (and always λ > 1), while Theorem 2.18 does not hold.

A general outdegree distribution: stack size
In this section, we consider a general outdegree distribution P, with mean λ and finite variance.
When the outdegree distribution is general, the depth no longer follows a simple Markov chain, since we would have to keep track of the number of children seen so far at each level of the branch of the tree leading to the current vertex. However, we recover a Markov chain if, instead of the depth, we consider the stack size I(t), defined as follows.
(2.21) is O(T*) = O(n). Consequently, (2.21) yields
    d(t) − n f_+(t/n) = O_{L^2}(n^{1/2}).   (2.28)
It remains to consider T* < t ≤ n. Then the argument above does not quite work, because π_t ↘ 0 and thus Var ξ_t ↗ ∞ as t ↗ n. We therefore modify ξ_t. We define π̂_t := max{π_t, π_{T*}}; thus π̂_t = π_t for t ≤ T* and π̂_t > π_t for t > T*. (Note that T* and M* depend on the choice of θ*.) We may then define independent random variables ξ̂_t such that ξ̂_t ~ Ge(π̂_t) and ξ̂_t ≤ ξ_t for all t < n. (Thus ξ̂_t = ξ_t for t ≤ T*.) Moreover, for t/n ≤ θ_1 we have min_{1≤j≤t} f(j/n) = O(1/n), while for t/n ≥ θ_1 we have min_{1≤j≤t} f(j/n) = f(t/n). Hence, for all t ≤ T*,
    min_{1≤j≤t} f(j/n) = f(t/n) − f_+(t/n) + O(1/n).
We then define d̂(t) from the variables ξ̂_i in the same way as d(t) is defined from the ξ_i, using the partial sums Σ_{i=1}^{t−1} (1 − ξ̂_i). (2.30) Since ξ̂_i ≤ ξ_i, (2.30) implies that d̂(t) ≥ d(t) for all t.
X ~ Bin(m, p) implies E X^2 = mp(1 − p) + (mp)^2.
F: By (2.100) and (2.84), we have (2.86) with ϕ = λ/2 − τ, which agrees with (2.89) by (2.88) and a simple calculation. In particular, ϕ = β.
C: Similarly, it follows from (2.97) and (2.85) that we have (2.87) with χ := λ/2 − β. Since we have found β = ϕ, and we always have τ + ϕ = λ/2 = β + χ, see (2.97) and (2.100), we thus have χ = τ, and thus (2.88) holds.
Finally, E T + E F = E C + E B by (2.98) and (2.95), and thus (2.110) implies E T = E C.
Note that T + F and B + C are asymptotically normal; this follows immediately from (2.91) and (2.92) by the central limit theorem.
Conjecture 2.17. All four variables T, B, F, C are (jointly) asymptotically normal.
Remark 2.19. Knuth [12] conjectures, based on exact calculations of generating functions for small n, that, much more strongly, B and F have the same distribution for every n. (Note that T and C do not have the same distribution; we have T ≤ n − 1, while C may take arbitrarily large values.)
Remark 2.20. A simple argument with generating functions shows that the number of loops at a given vertex v is Ge(1 − p/(n − np + p)); these numbers are independent, and thus L ~ NegBin(n, 1 − p/(n − np + p)) with E L = p/(1 − p) = λ = O(1) and Var(L) = p(1 − p + p/n)/(1 − p)^2 = λ(1 + λ/n) = O(1) [12]. Moreover, it is easily seen that asymptotically L has a Poisson distribution: L →_d Po(λ) as n → ∞.
For the shifted geometric distribution Ge_1(1−p), the mean is λ = 1/(1−p) > 1, and only the supercritical case occurs. As in Section 2, the depth d(t) is a Markov chain given by (2.6), but the distribution of ξ_t is now different. The probability (2.2) is replaced by (1 − t/n)/(1 − pt/n), but the number of future arcs from an ancestor is still Ge(1−p), and, with θ := t/n, instead of (2.14) we have E d̃(t) = n f(θ) + O(1), where now f(θ) takes the new value given in (3.4). The rest of the proof remains the same with minor modifications, and leads to the analogue of (2.78) with θ := t/n.