Solving String Problems on Graphs Using the Labeled Direct Product

Suffix trees are an important data structure at the core of optimal solutions to many fundamental string problems, such as exact pattern matching, longest common substring, matching statistics, and longest repeated substring. Recent lines of research focused on extending some of these problems to vertex-labeled graphs, either by using efficient ad-hoc approaches which do not generalize to all input graphs, or by indexing difficult graphs and having worst-case exponential complexities. In the absence of a ubiquitous and polynomial tool like the suffix tree for labeled graphs, we introduce the labeled direct product of two graphs as a general tool for obtaining optimal algorithms in the worst case: we obtain conceptually simpler algorithms for the quadratic problems of string matching (SMLG) and longest common substring (LCSP) in labeled graphs. Our algorithms run in time linear in the size of the labeled product graph, which may be smaller than quadratic for some inputs, and their run-time is predictable, because the size of the labeled direct product graph can be precomputed efficiently. We also solve LCSP on graphs containing cycles, which was left as an open problem by Shimohira et al. in 2011. To show the power of the labeled product graph, we also apply it to solve the matching statistics (MSP) and the longest repeated string (LRSP) problems in labeled graphs. Moreover, we show that our (worst-case quadratic) algorithms are also optimal, conditioned on the Orthogonal Vectors Hypothesis. Finally, we complete the complexity picture around LRSP by studying it on undirected graphs.


Introduction
Motivated by various application domains appearing during the last decades, a significant branch of string algorithm research has focused on extending string problems from texts to more complex objects, such as labeled rooted trees (e.g. modeling XML documents [12]) and labeled graphs (e.g. modeling pangenome graphs [13,29]). For example, the string matching in labeled graphs (SMLG) problem asks to find an occurrence of a given string S inside a labeled graph G, that is, a walk of G whose concatenation of vertex labels (spelling) is S. On rooted trees, SMLG can be solved in linear time [1], but on general graphs it admits both quadratic-time conditional lower bounds [4,9,10,14] and optimal algorithms of matching time complexity [3,28,21]. Despite this active interest in the SMLG problem, the graph extensions of three other fundamental string problems have received little or no attention so far: longest common substring, matching statistics, and longest repeated substring. On strings, the former two problems can also be seen as relaxations of the exact string matching problem (e.g. for handling approximate matching) [17,23], and all three problems can be seen as basic instances of pattern/motif discovery in strings [25]. In this paper we consider their natural generalizations to Σ-labeled graphs, namely to tuples G = (V, E, L), with V and E the sets of vertices and edges, respectively, and L : V → Σ assigning to each vertex a label from Σ (the original string problems can be obtained by taking all graphs to be labeled paths).

Problem 1 (Longest common string problem (LCSP)). Given two Σ-labeled graphs G_1 and G_2, find a longest string S occurring in both G_1 and G_2.
Problem 2 (Matching statistics problem (MSP)). Given two Σ-labeled graphs G_1 and G_2, compute for every vertex v of G_1 the length MS(v) of a longest walk of G_1 starting at v whose spelling has an occurrence in G_2.
Problem 3 (Longest repeated string problem (LRSP)). Given a Σ-labeled graph G, find a longest string S having at least two distinct occurrences in G.
When defined on strings, all problems can be solved in linear time and space as basic textbook applications of the suffix tree [17,8], or of the suffix array and the longest common prefix (LCP) array [24,26], under the standard assumption of an integer alphabet, i.e. one containing integers from a range that is linear-sized with respect to the input. On labeled graphs, only LCSP has been considered, by Shimohira et al. [31]. They solved it in time O(|E_1| · |E_2|), where E_1 and E_2 are the edge sets of G_1 and G_2, respectively, and only if one of the graphs is acyclic, and they left the general case of two cyclic graphs as an open problem [31]. Moreover, there are no known analogous algorithms for MSP and LRSP, and no characterization of the possible solutions for LRSP. Regarding hardness, note that the O(|E_1| · |E_2|)-time algorithm of [31] for LCSP is optimal under the same conditional lower bounds as for SMLG [4,9,10,14], since the decision version of SMLG (i.e. whether there is an occurrence of the string) is linear-time reducible to LCSP. In fact, the same holds also for MSP. Nevertheless, there exists no analogous lower bound for LRSP. Note that the three problems defined with occurrences as directed paths, i.e. visiting each vertex at most once, are NP-complete (Observation 1), and hence in this paper we consider walk occurrences.
SMLG, LCSP and LRSP also have connections to Automata Theory, since the spellings of all walks of a finite labeled graph form a regular language. Indeed, one can transform a labeled graph into an NFA by making every vertex a final state, adding a new initial state connected to all vertices, and moving the label of each vertex onto each of its incoming edges: a string occurs in a labeled graph if and only if it is accepted by the corresponding NFA; strings common to two labeled graphs correspond to strings accepted by the intersection of their corresponding NFAs; repeated strings of a labeled graph correspond to ambiguous words of the resulting NFA, namely words having at least two accepting computations. Ambiguity of automata (or the lack thereof) has been studied in the context of Descriptional Complexity Theory [18,7,16,5], not to be confused with Descriptive Complexity Theory. For example, the degree of ambiguity of an NFA is the maximum number of accepting computations of any word by the automaton. While there are works studying upper bounds on this metric [33,2], to the best of our knowledge there is no research on the longest ambiguous words of an NFA.
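To make the notion of a walk occurrence concrete, the following minimal Python sketch decides whether a string occurs in a labeled graph by simulating the NFA described above, one character at a time. It is only an illustration and not the algorithm of [3]; the function name `occurs` and the graph encoding (a dictionary `labels` from vertices to characters plus a list of directed `edges`) are assumptions made for this example.

```python
def occurs(labels, edges, s):
    """Return True if some walk of the labeled graph spells the string s."""
    if not s:
        return True  # the empty string is spelled by the empty walk
    out = {v: [] for v in labels}
    for u, v in edges:
        out[u].append(v)
    # frontier = vertices at which a walk spelling the current prefix can end
    frontier = {v for v, a in labels.items() if a == s[0]}
    for ch in s[1:]:
        frontier = {w for v in frontier for w in out[v] if labels[w] == ch}
    return bool(frontier)
```

Each character is processed with one pass over the out-edges of the current frontier, so the sketch takes O(|S| · |E|) time in the worst case, in line with the quadratic behaviour discussed above.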
As a first result on labeled graphs, we observe that on labeled directed trees (i.e. rooted trees with all edges oriented away from the root) LRSP and LCSP can also be solved in linear time and space as an easy application of the tree counterparts of the suffix tree and the suffix array: the suffix tree of a tree [22] and the XBW transform of a tree (XBWT) [12]. The former, introduced by Kosaraju in 1989, generalizes the suffix tree to represent all suffixes of the strings spelled by the upwards paths of a given tree and admits linear-time construction algorithms [6,30]. The latter, introduced by Ferragina et al. in 2005, is an invertible transform, also computable in linear time, encoding a tree as an ordered list of elements each corresponding to a vertex: the order of these elements first follows the lexicographical ordering of the unique path from the parent of the corresponding vertex to the root of the tree, then the pre-order visit of the tree. LRSP of a tree can be solved directly by either structure in the same way as LRSP of a string, while LCSP of two trees can be solved with a simple adaptation of either structure.
In this paper we introduce the labeled direct product of two labeled graphs G_1 and G_2, denoted G_1 ⊗ G_2, inspired by e.g. the Cartesian product construction for the intersection language accepted by two finite state automata (Section 2). While not a completely novel idea, this product cleanly encodes each and every pair of walks of the input graphs spelling the same string and it appears as the right conceptual tool to optimally solve string problems on graphs, in the same way as the suffix tree is a ubiquitous tool for optimal algorithms on text. Our results are as follows.

Conceptually simpler and more efficient algorithms
The current state-of-the-art algorithm for SMLG was introduced in 1997 by Amir et al. [3] in the context of hypertexts (i.e. directed graphs such that each vertex is labeled with a string). Given a Σ-labeled graph G = (V, E, L) and a string [...] is equivalent to finding a path of maximum length of the DAG G_1 ⊗ G_2, which is also solvable in time and space linear in the size of [...] space dynamic programming algorithm of Shimohira et al. [31] for LCSP, but can also be faster and use less space, if [...] or the alphabet Σ has constant size (Remark 3). Otherwise, our algorithm implies a greater space usage, since it stores G_1 ⊗ G_2: choosing not to store the edges of G_1 ⊗ G_2 and instead computing them when needed results in a time and space complexity closer to that of the existing algorithm for LCSP (Remark 5).

Simple solution to an open problem
In addition to providing simple algorithms on DAGs, the labeled product graph also allows for conceptually simple and efficient solutions on arbitrary graphs. For example, LCSP on two graphs containing cycles was left open by Shimohira et al. [31], and in Section 4 we show that it is solvable by just checking whether G_1 ⊗ G_2 has a cycle, and if not, still finding a path of maximum length in G_1 ⊗ G_2 (see Table 1 for a summary of the complexity results for LCSP).
Table 1: Summary, for the variants of LCSP, of the time complexities. The linear-time algorithms assume an integer alphabet and the quadratic-time algorithms are optimal under OVH (Theorem 5).

Solutions to new problems
The labeled direct product also allows for solutions to related problems. For MSP on DAGs we analogously find paths of maximum length from some vertices of G_1 ⊗ G_2 (Theorem 1). We generalize this algorithm to arbitrary graphs by computing the strongly connected components (SCCs) of G_1 ⊗ G_2 and by checking a condition analogous to that of LCSP for every vertex v of G_1 (Theorem 2). These algorithms use time and space linear in the size of G_1 ⊗ G_2. LRSP on a DAG G is equivalent to finding paths of maximum length passing through specific vertices of G ⊗ G (Theorem 1). On arbitrary graphs, we use further interesting connections between purely graph-theoretic concepts of the labeled product graph (SCCs) and string-theoretic ones (non-deterministic vertices). The difference of LRSP with respect to LCSP and MSP is that the problem may admit repeated strings of infinite length or repeated strings of unbounded length, and these two scenarios may not coincide. Even if the difference between these two concepts may seem artificial, their study is necessary for the natural characterization of LRSP solutions. Indeed, in Section 3 we show that such cases can be efficiently identified (infinite repeated strings can be identified in G ⊗ G by checking reachability from a certain set of vertices to a non-trivial SCC, while repeated strings of unbounded length can be identified by checking reachability from a non-trivial SCC to some non-deterministic vertex). If none of these cases occurs, we show that the problem is solvable with the DAG algorithm for LRSP on an acyclic subgraph of G ⊗ G (Theorem 3). The entire procedure takes time and space linear in the size of G ⊗ G. In addition, we can also output a linear-size representation of a longest repeated string, infinite or not.

Optimality under conditional lower bounds
In Section 4 we show that the above algorithms of worst-case quadratic-time complexity are also conditionally optimal. First, we note how the quadratic lower bounds of [9,14] imply the same quadratic lower bounds for LCSP and MSP. Second, in Theorem 4 of Section 4, we show that on DAGs that are deterministic (i.e. the labels of the out-neighbors of every vertex are all distinct) the SMLG problem has a linear-time reduction to LRSP, which thus implies the same lower bounds for LRSP as in [9,14] (holding also for deterministic DAGs). To the best of our knowledge such a reduction does not exist when the problems are defined on strings. Third, in Theorem 5 we show that, under the Orthogonal Vectors Hypothesis (OVH) [34], there can be no truly sub-quadratic algorithm solving LRSP, even when the graph is a deterministic DAG with vertex labels from a binary alphabet and maximum in-degree and out-degree of any vertex at most 2. Our reduction for LRSP is simpler than that of [9], but with an interesting difficulty arising from the fact that we must encode the orthogonal vectors input in the same graph, and must ensure that the occurrences of the longest repeated string are distinct.
Moreover, in the same way as the labeled direct product graph is a general tool for obtaining algorithms, the construction behind our reduction could also be a general approach to obtain conditional lower bounds for string problems on graphs. For example, our OVH construction (simpler than [9]) also provides a conditional lower bound for LCSP, and in Corollary 3 we show our reduction also proves the same conditional lower bound for a variant of MSP.

Table 2: Summary, for all the variants of LRSP on a graph (V, E, L), of the time complexities, by graph class, graph type (directed or undirected) and occurrence definition (on walks or on paths). The linear-time algorithms assume an integer alphabet and the quadratic-time algorithms for directed graphs are optimal under OVH (Theorem 5). We leave as an open problem improving our solution to LRSP on undirected trees, when the string occurrences are paths, or proving it is conditionally tight.

The full complexity picture of LRSP
Finally, since on directed graphs LRSP turned out to be the most complex problem to solve, in Section 5 we complete its complexity picture by studying it also on undirected graphs, by similarly considering undirected paths, trees and graphs, and the path and walk variants of the problem (see Table 2). While these results are simpler than for directed graphs, they exhibit some interesting complexity dichotomies on analogous classes of graphs. For example, for walk occurrences the problem is linear-time solvable on general undirected graphs, as opposed to having a conditional quadratic-time lower bound on general directed graphs. Note that the SMLG problem has the same complexity on both directed and undirected graphs [9], making this dichotomy for LRSP more interesting. Moreover, when defined on paths, we obtain only a quadratic-time algorithm on undirected trees, even though LRSP is linear on directed trees. As such, we put forward as an interesting open problem either improving this complexity, or proving a lower bound.

Notation and preliminaries
Given a non-empty and finite alphabet Σ, we denote with Σ* and Σ^ω the set of all finite and infinite strings over Σ, respectively. For convenience, we also define Σ^+ := Σ* \ {ε}, with ε the empty string. We say that Σ is an integer alphabet if it contains integers from a range that is linear-sized with respect to the input of the problem at hand, allowing linear-time lexicographical sorting. Given the Σ-labeled graph G = (V, E, L), a walk in G is any finite or infinite sequence of vertices p = (p_0, p_1, p_2, ...), such that there is an edge from any p_i to its successor in p. If all vertices of p are pairwise distinct, then p is called a path. The length of a finite walk p is its number of edges. Just as strings can be concatenated to form longer strings, walks can be concatenated to form longer walks under the condition that the result is still a walk in G. Two walks p = (p_0, p_1, p_2, ...), q = (q_0, q_1, q_2, ...) in G are distinct, in symbols p ≠ q, if there is an index i such that p_i ≠ q_i. A finite (resp. infinite) string occurring in G (or simply, a string of G) is any string S ∈ Σ* (resp. S ∈ Σ^ω) such that there is a finite (resp. infinite) walk p = (p_0, p_1, p_2, ...) in G with S = L(p) := L(p_0)L(p_1)L(p_2)···. We say that p is an occurrence of S in G, that p spells S in G or that S has a match in G. A string S occurring in G is repeated if there are at least two distinct occurrences of S in G, in symbols ∃ p, q walks in G such that p ≠ q ∧ L(p) = L(q) = S. Throughout the paper, we will assume that every vertex has at least one in-neighbor or out-neighbor, so it holds that |V| ≤ 2|E|, |V| ∈ O(|E|) and we can simplify a complexity bound such as O(|V| + |E|) into O(|E|).
Remark 1. In solving LCSP, MSP and LRSP we can assume that for any input labeled graph G = (V, E, L) it holds that |V| ∈ O(|E|), because:
• the problems become trivial when considering only paths of length 0, in the sense that there is a common or repeated string of length one if and only if there are different vertices labeled with the same character in the respective graphs;
• if in G there are vertices with neither incoming nor outgoing edges, they can be treated separately since the strings they generate all have length 1; thus we will assume throughout the rest of the paper that every vertex v ∈ V has at least one incoming or outgoing edge, meaning that |E| ≥ |V|/2.

The Labeled Direct Product
Recall that the direct product of two graphs is the graph whose vertex set is the Cartesian product of the vertex sets of the initial graphs, where we have an edge between two vertices if there are corresponding edges in the initial graphs between vertices on the first component and between vertices on the second component. This product has been studied in the literature in both the undirected and directed setting, under the names conjunction, tensor product, Kronecker product, and others (see [19, p. 21] and [20]). We will use instead the labeled direct product of G_1 and G_2, obtained as the subgraph of the direct product of G_1 and G_2 induced by the vertices for which their two components have the same label. Although this notion is similar to the automaton recognizing the intersection of two automata (see [27]), the key difference is that the labeled direct product graph does not contain any pair of edges/transitions with mismatching labels.

Definition and basic properties
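Consider the following definition, and see also Figures 2 and 3.
Definition (labeled direct product). Given two Σ-labeled graphs G_1 = (V_1, E_1, L_1) and G_2 = (V_2, E_2, L_2), their labeled direct product is the Σ-labeled graph G_1 ⊗ G_2 = (V', E', L'), where: V' := {(u, v) ∈ V_1 × V_2 : L_1(u) = L_2(v)}, E' := {((u, v), (u', v')) ∈ V' × V' : (u, u') ∈ E_1 ∧ (v, v') ∈ E_2}, and L'((u, v)) := L_1(u) = L_2(v).
Given a vertex q = (u, v) ∈ V', let π_1(q) := u and π_2(q) := v. Given a walk q = ((p_0, p'_0), (p_1, p'_1), ...) in G_1 ⊗ G_2, we denote with π_1(q) and π_2(q) the walks (p_0, p_1, ...) and (p'_0, p'_1, ...) in G_1 and G_2, respectively. We state the following basic fact about the correspondence between the pairs of walks in G_1 and G_2 and the walks in G_1 ⊗ G_2: two walks p in G_1 and p' in G_2 of the same length spell the same string if and only if their vertex-wise pairing is a walk in G_1 ⊗ G_2, whose projections are then p and p'. In particular, this implies that the projections of any cycle in G_1 ⊗ G_2 are two cycles in G_1 and G_2 reading the same string and vice versa.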
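As an illustration of the construction, here is a minimal Python sketch that builds the labeled direct product by keeping only the label-matching pairs of vertices and pairing edges whose endpoints carry matching labels. The encoding of a graph as a pair `(labels, edges)` and the name `labeled_product` are assumptions made for this example; grouping by label uses hashing rather than the lexicographic sorting discussed in the text.

```python
from collections import defaultdict

def labeled_product(g1, g2):
    """Return (vertex labels, edge list) of the labeled direct product."""
    (lab1, edges1), (lab2, edges2) = g1, g2
    by_label = defaultdict(list)   # label -> vertices of g2 with that label
    by_pair = defaultdict(list)    # (label, label) -> edges of g2
    for v, a in lab2.items():
        by_label[a].append(v)
    for x, y in edges2:
        by_pair[(lab2[x], lab2[y])].append((x, y))
    # vertices: pairs (u, v) whose two components have the same label
    vertices = {(u, v): a for u, a in lab1.items() for v in by_label[a]}
    # edges: pairs of edges whose tails match and whose heads match
    edges = [((u, x), (v, y))
             for u, v in edges1
             for x, y in by_pair[(lab1[u], lab1[v])]]
    return vertices, edges
```

Because only matching pairs are ever enumerated, the running time of this sketch is proportional to the input size plus the size of the product itself (assuming constant-time hashing of labels).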
Since all the algorithms we develop consist in analyzing the labeled direct product of the input graphs, we must take great care in the time and space spent on its construction. Moreover, in Remark 5 we show that its size can be computed efficiently, making the run time of our algorithms predictable.
Remark 3. The labeled direct product G_1 ⊗ G_2 can be built naively in O(|V_1| · |V_2| + |E_1| · |E_2|) time and space, because each pair of vertices and each pair of edges need to be considered at most once. Assuming Σ to be an integer or a constant-size alphabet we can do better than the naive construction algorithm with respect to time or space:
• if Σ is an integer alphabet, by first sorting lexicographically the lists of edges of G_1 and G_2, the product G_1 ⊗ G_2 can then be built in linear time with respect to its size, by simply pairing all edges of G_1 and G_2 with matching labels;
• if Σ has constant size, there is no need to store the edges of G_1 ⊗ G_2, since for all a ∈ Σ we can report in time linear in the solution all a-labeled out-neighbors of any vertex (u, v) by pairing all a-labeled out-neighbors of u and v in G_1 and G_2, respectively;
• if Σ is an integer alphabet, preprocessing G_1, G_2 in order to report the (number of) out-neighbors of any vertex (u, v) of G_1 ⊗ G_2 is equivalent to the SetIntersection problem, for which Goldstein et al. proved conditional lower bounds on the trade-off between the space and time used in its solution [15]; if we choose not to store at all the edges of G_1 ⊗ G_2, the algorithms exploiting [...]
Remark 4. Since vertices and edges in G_1 ⊗ G_2 correspond to vertices and edges in G_1 and G_2 with matching labels, the size of G_1 ⊗ G_2 can be smaller than quadratic in practice, or for some families of labeled graphs. In particular, if each a ∈ Σ is the label of at most O(1) pairs of vertices in [...]; this is not in contradiction with the conditional lower bounds of Section 4 because the graph obtained in the reduction of Theorem 5 uses only two labels, Θ(|V|) times each.
Remark 5. If Σ is an integer alphabet, the size of G_1 ⊗ G_2 can be computed in time linear in the size of the input graphs G_1 and G_2. Indeed, let V_1^a, V_2^a be the sets of a-labeled vertices of G_1, G_2, respectively, and let E_i^{a,b} be the set of edges of G_i connecting an a-labeled vertex to a b-labeled vertex, with i = 1, 2. Then it is easy to see that |V'| = Σ_{a ∈ Σ} |V_1^a| · |V_2^a| and |E'| = Σ_{a,b ∈ Σ} |E_1^{a,b}| · |E_2^{a,b}|, and that |V'| + |E'| can be easily computed after sorting the vertex and edge sets of G_1 and G_2. Note that if Σ has constant size, the size of G_1 ⊗ G_2 can be found in constant time after the independent sorting of G_1 and G_2.
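The counting argument of Remark 5 can be sketched in a few lines of Python. The sketch below counts vertices per label and edges per pair of endpoint labels in each graph, then multiplies matching counts; it uses hash-based counters instead of the sorting mentioned in the remark, and the function name `product_size` and the `(labels, edges)` encoding are assumptions of this example.

```python
from collections import Counter

def product_size(g1, g2):
    """Return (|V'|, |E'|) of the labeled direct product without building it."""
    (lab1, edges1), (lab2, edges2) = g1, g2
    # number of a-labeled vertices in each graph
    va1, va2 = Counter(lab1.values()), Counter(lab2.values())
    # number of edges from an a-labeled to a b-labeled vertex in each graph
    eab1 = Counter((lab1[u], lab1[v]) for u, v in edges1)
    eab2 = Counter((lab2[u], lab2[v]) for u, v in edges2)
    n_vertices = sum(c * va2[a] for a, c in va1.items())
    n_edges = sum(c * eab2[ab] for ab, c in eab1.items())
    return n_vertices, n_edges
```

Calling this before running any of the product-based algorithms gives their running time up to constant factors, which is what makes the run time predictable.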

Optimal algorithms for DAGs
We first consider the case when the labeled direct product graph is a DAG. Note that: a longest common string of G_1 and G_2 is spelled by a path of maximum length in G_1 ⊗ G_2; MS(v), with v ∈ V_1, is equal to one plus the length of a path of maximum length in G_1 ⊗ G_2 starting from any vertex in {v} × V_2; a longest repeated string of a DAG G is spelled by a path of maximum length of G ⊗ G visiting at least one vertex (u, v) such that u ≠ v.
Indeed, for every vertex (u, v) of the product graph (V', E', L') we can compute by dynamic programming the length ℓ^+(u, v) of the longest path starting at (u, v): ℓ^+(u, v) = 0 if (u, v) has no out-neighbors, and ℓ^+(u, v) = 1 + max{ℓ^+(u', v') : ((u, v), (u', v')) ∈ E'} otherwise. Then:
• LCSP is solved by finding a vertex (u, v) for which ℓ^+(u, v) has maximum value and by retrieving the string corresponding to a path of length ℓ^+(u, v) starting at (u, v);
• MSP is solved by finding for each v ∈ V_1 the maximum value of ℓ^+(v, w) + 1, with (v, w) a vertex of G_1 ⊗ G_2, and this can be done by iterating once over all vertices in V'.
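The dynamic programming for LCSP and MSP on DAGs can be sketched as follows. This is a minimal Python illustration under assumed conventions: graphs are pairs `(labels, edges)`, the product is rebuilt inside the block so that it is self-contained, and the helper names (`product`, `longest_from`, `lcs_dag`, `matching_statistics_dag`) are hypothetical. A topological order is obtained with Kahn's algorithm, and the longest-path values are filled in reverse topological order.

```python
from collections import defaultdict

def product(g1, g2):
    """Labeled direct product as (vertex labels, successor lists)."""
    (lab1, e1), (lab2, e2) = g1, g2
    by_label, by_pair = defaultdict(list), defaultdict(list)
    for v, a in lab2.items():
        by_label[a].append(v)
    for x, y in e2:
        by_pair[(lab2[x], lab2[y])].append((x, y))
    label = {(u, v): a for u, a in lab1.items() for v in by_label[a]}
    succ = {q: [] for q in label}
    for u, v in e1:
        for x, y in by_pair[(lab1[u], lab1[v])]:
            succ[(u, x)].append((v, y))
    return label, succ

def longest_from(succ):
    """l_plus[q] = number of edges of a longest path starting at q (DAG only)."""
    indeg = {q: 0 for q in succ}
    for q in succ:
        for r in succ[q]:
            indeg[r] += 1
    order = [q for q in succ if indeg[q] == 0]
    for q in order:  # Kahn's algorithm; the list grows while it is scanned
        for r in succ[q]:
            indeg[r] -= 1
            if indeg[r] == 0:
                order.append(r)
    if len(order) != len(succ):
        raise ValueError("the product graph contains a cycle")
    l_plus, nxt = {}, {}
    for q in reversed(order):  # successors are processed before q
        l_plus[q], nxt[q] = 0, None
        for r in succ[q]:
            if l_plus[r] + 1 > l_plus[q]:
                l_plus[q], nxt[q] = l_plus[r] + 1, r
    return l_plus, nxt

def lcs_dag(g1, g2):
    """A longest common string, spelled by a longest path of the product."""
    label, succ = product(g1, g2)
    if not label:
        return ""
    l_plus, nxt = longest_from(succ)
    q = max(label, key=l_plus.get)
    out = []
    while q is not None:
        out.append(label[q])
        q = nxt[q]
    return "".join(out)

def matching_statistics_dag(g1, g2):
    """MS(u) for every vertex u of g1 (0 if its label does not occur in g2)."""
    label, succ = product(g1, g2)
    l_plus, _ = longest_from(succ)
    ms = {u: 0 for u in g1[0]}
    for (u, v) in label:
        ms[u] = max(ms[u], l_plus[(u, v)] + 1)
    return ms
```

Both functions perform a constant number of passes over the vertices and edges of the product, so they run in time linear in its size once it has been built.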
We can analogously compute for each (u, v) ∈ V' the length ℓ^-(u, v) of the longest path in (V', E', L') ending at (u, v):
• LRSP is solved by iterating over all vertices (u, v) of G ⊗ G such that u ≠ v, and obtaining the length of the longest repeated string in G whose occurrences pass through the distinct vertices u and v, as ℓ^+(u, v) + ℓ^-(u, v) + 1 (a longest repeated string of this length can then be retrieved).
[...]) words in space. For all three problems plus SMLG, if the product graph is given, then the solution takes linear time in the size of the product graph.
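For LRSP on a DAG, the same idea needs both the longest path ending at and the longest path starting from each off-diagonal vertex of G ⊗ G. The following Python sketch is one possible way to write it, under assumed conventions: the graph is given as `(labels, edges)`, and the names `self_product` and `longest_repeated_string_dag` are hypothetical.

```python
from collections import defaultdict

def self_product(labels, edges):
    """Successor lists of the labeled direct product of G with itself."""
    by_label, by_pair = defaultdict(list), defaultdict(list)
    for v, a in labels.items():
        by_label[a].append(v)
    for x, y in edges:
        by_pair[(labels[x], labels[y])].append((x, y))
    succ = {(u, v): [] for u, a in labels.items() for v in by_label[a]}
    for u, v in edges:
        for x, y in by_pair[(labels[u], labels[v])]:
            succ[(u, x)].append((v, y))
    return succ

def longest_repeated_string_dag(labels, edges):
    succ = self_product(labels, edges)
    indeg = {q: 0 for q in succ}
    for q in succ:
        for r in succ[q]:
            indeg[r] += 1
    order = [q for q in succ if indeg[q] == 0]
    for q in order:  # Kahn's algorithm; the list grows while it is scanned
        for r in succ[q]:
            indeg[r] -= 1
            if indeg[r] == 0:
                order.append(r)
    if len(order) != len(succ):
        raise ValueError("G must be a DAG")
    # l_minus[q]: longest path ending at q; l_plus[q]: longest path starting at q
    l_minus, prev = {q: 0 for q in succ}, {q: None for q in succ}
    for q in order:
        for r in succ[q]:
            if l_minus[q] + 1 > l_minus[r]:
                l_minus[r], prev[r] = l_minus[q] + 1, q
    l_plus, nxt = {q: 0 for q in succ}, {q: None for q in succ}
    for q in reversed(order):
        for r in succ[q]:
            if l_plus[r] + 1 > l_plus[q]:
                l_plus[q], nxt[q] = l_plus[r] + 1, r
    # only vertices (u, v) with u != v witness two distinct occurrences
    diff = [q for q in succ if q[0] != q[1]]
    if not diff:
        return ""
    best = max(diff, key=lambda q: l_plus[q] + l_minus[q])
    left, q = [], prev[best]
    while q is not None:
        left.append(labels[q[0]])
        q = prev[q]
    right, q = [], best
    while q is not None:
        right.append(labels[q[0]])
        q = nxt[q]
    return "".join(reversed(left)) + "".join(right)
```

The returned string has length ℓ^+(u, v) + ℓ^-(u, v) + 1 for the chosen vertex (u, v), matching the quantity described above; the empty string is returned when no string is repeated.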

Optimal Algorithms for General Graphs
Since LCSP, MSP and LRSP defined on paths are NP-complete (see Section 4), we focus here on the three problems defined on walks. If we deal with graphs containing cycles, then the length of the walks and strings to consider is not bounded anymore, so we modify the three problems to require the detection of the respective cases. In fact, we will show that in all cases we can also report a linear-size representation of the corresponding common or repeated strings. As we stated in the introduction, the three problems admit worst-case quadratic-time solutions based on the labeled direct product graph, i.e. G_1 ⊗ G_2 for LCSP and MSP and G ⊗ G for LRSP. For a product graph (V', E', L') we define:
• V_cyc as the set of all vertices of the product graph involved in a cycle, namely those belonging to an SCC consisting of at least two vertices;
also, for G ⊗ G = (V', E', L') we define:
• V_diff as the set of all vertices (u, v) ∈ V' with u ≠ v;
• V_ndet as the set of all vertices (v, v) ∈ V' with v a non-deterministic vertex of G, that is, v has two out-neighbors labeled with the same character.

LCSP and MSP
Since the graphs can contain cycles, the common strings in LCSP and MSP can now have infinite length.
The algorithm solving LCSP on any two Σ-labeled graphs G_1, G_2 consists of the following simple checks on G_1 ⊗ G_2:
Infinite length common strings
Check if G_1 ⊗ G_2 contains a cycle; if so, return (i) the string spelled by any cycle and (ii) the symbol ω; otherwise
Finite length common strings
Proceed as in the algorithm for the DAG case from Section 2.2 on G_1 ⊗ G_2.
The correctness of this algorithm follows from the fact that there is a common string of infinite length if and only if there is a common string of infinite length of the form S^ω (see also Lemma 1 below).
MSP can be solved as well by studying the Strongly Connected Components (SCCs) of G_1 ⊗ G_2.
Infinite length matching statistics
For all (u, v) ∈ V_cyc set MS(u) = ∞.
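The two checks for LCSP on general graphs can be sketched in Python as follows. This is an illustrative sketch under assumed conventions, not a definitive implementation: graphs are pairs `(labels, edges)`, the names `product`, `lcsp` and the marker `INFINITE` are hypothetical, and cycle detection is done with Kahn's algorithm (any vertex that is never freed lies on, or is reachable from, a cycle).

```python
from collections import defaultdict

INFINITE = "ω"  # marker returned together with the spelling of a cycle

def product(g1, g2):
    """Labeled direct product as (vertex labels, successor lists)."""
    (lab1, e1), (lab2, e2) = g1, g2
    by_label, by_pair = defaultdict(list), defaultdict(list)
    for v, a in lab2.items():
        by_label[a].append(v)
    for x, y in e2:
        by_pair[(lab2[x], lab2[y])].append((x, y))
    label = {(u, v): a for u, a in lab1.items() for v in by_label[a]}
    succ = {q: [] for q in label}
    for u, v in e1:
        for x, y in by_pair[(lab1[u], lab1[v])]:
            succ[(u, x)].append((v, y))
    return label, succ

def lcsp(g1, g2):
    """Return (S, INFINITE) if S^ω is a common string, else (longest, None)."""
    label, succ = product(g1, g2)
    indeg = {q: 0 for q in succ}
    for q in succ:
        for r in succ[q]:
            indeg[r] += 1
    order = [q for q in succ if indeg[q] == 0]
    for q in order:  # Kahn's algorithm; the list grows while it is scanned
        for r in succ[q]:
            indeg[r] -= 1
            if indeg[r] == 0:
                order.append(r)
    if len(order) < len(succ):  # some vertex was never freed: a cycle exists
        done = set(order)
        pred = {}
        for q in succ:
            if q not in done:
                for r in succ[q]:
                    pred[r] = q  # every leftover vertex gets a leftover predecessor
        q = next(q for q in succ if q not in done)
        seen, walk = {}, []
        while q not in seen:  # walk backwards until a vertex repeats
            seen[q] = len(walk)
            walk.append(q)
            q = pred[q]
        cycle = walk[seen[q]:][::-1]
        return "".join(label[c] for c in cycle), INFINITE
    if not order:
        return "", None
    l_plus, nxt = {}, {}
    for q in reversed(order):  # DAG case: longest path by dynamic programming
        l_plus[q], nxt[q] = 0, None
        for r in succ[q]:
            if l_plus[r] + 1 > l_plus[q]:
                l_plus[q], nxt[q] = l_plus[r] + 1, r
    q = max(order, key=l_plus.get)
    out = []
    while q is not None:
        out.append(label[q])
        q = nxt[q]
    return "".join(out), None
```

When a cycle is found, the returned string is the spelling of that cycle starting from an arbitrary vertex of it, so different runs of a similar implementation may return different rotations.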

LRSP
In LRSP on general graphs we have one of the following three cases, as seen in Figure 4:
1. The graph has an infinite repeated string.
2. The graph does not have any infinite repeated string, but the length of the repeated strings is unbounded.
3. The length of the longest repeated string is bounded and there are repeated strings of a finite maximum length (as is the case for texts, trees and DAGs).
An undesirable feature of infinite strings is that they can be aperiodic. However, analogously to some results of Büchi automata theory stating that the "important" strings are ultimately periodic [32, p. 137], that is, they are of the form RS^ω with R ∈ Σ* and S ∈ Σ^+, in Lemma 1 we show that the presence of infinite repeated strings can be detected by looking for ultimately periodic strings. Its easy proof, which we omit, finds a cycle in G ⊗ G used by two distinct occurrences of an infinite repeated string w (since G is finite): we can build RS^ω by identifying R as the prefix of w spelled by the path reaching the cycle and S as the string spelled by the cycle.
Lemma 1. Given a Σ-labeled graph G = (V, E, L), there is an infinite repeated string occurring in G if and only if there is a string RS^ω ∈ Σ^ω in G spelled by two distinct walks rs^ω and r's'^ω in G, with [...]
The two distinct walks spelling RS^ω provided by Lemma 1 imply the existence of a walk in G ⊗ G passing through a vertex q in V_diff and reaching a vertex q' in V_cyc. Note that q = q' can hold, in which case the infinite repeated string is of the form S^ω. Thus, we obtain:
Corollary 1 (Infinite repeated strings). G has an infinite repeated string if and only if some q ∈ V_diff reaches some q' ∈ V_cyc, and if R is the spelling of a path from q to q' and S is the spelling of a cycle starting and ending in q', then RS^ω is an infinite repeated string in G.
If the graph has no infinite repeated string, the remaining difficulty is that of repeated strings of unbounded length. Formally, we say that G has repeated strings of unbounded length if for each n ∈ N there is a repeated string S ∈ Σ* occurring in G such that |S| > n. It is easy to see that in the graph of Figure 4 (right) there are no infinite repeated strings and the unbounded repeated strings are of the form R^+ S, with R, S ∈ Σ^+. Indeed, the occurrences of these strings have a common prefix after which they diverge. This divergence happens by visiting a non-deterministic vertex, as shown by the next two results.
Lemma 2. Given a Σ-labeled graph G = (V, E, L) without infinite repeated strings, G has repeated strings of unbounded length if and only if there are R, S ∈ Σ^+ such that R^m S is repeated in G for each m ≥ 1.
Proof. (⇐) This side is trivial. (⇒) Let G ⊗ G = (V', E', L'). If G has repeated strings of unbounded length, then there must be some k ∈ N such that there is a repeated string T ∈ Σ* of length k > |V'|, with p = (p_0, p_1, ..., p_{k−1}) and p' = (p'_0, p'_1, ..., p'_{k−1}) two distinct occurrences of T in G. Then q := (q_0, q_1, ..., q_{k−1}) := p ⊗ p' is a walk in G ⊗ G visiting more than |V'| vertices, so by the pigeonhole principle there must be a vertex visited more than once: let j, j' ∈ N be two indices such that 0 ≤ j < j' ≤ k − 1 and q_j = q_{j'}. Since p and p' are distinct walks in G, there must be also an index i ∈ N such that 0 ≤ i ≤ k − 1 and p_i ≠ p'_i. Index i can be in three different positions relative to j and j':
1. if i < j < j', then (q_i, ..., q_{j−1})(q_j, ..., q_{j'−1})^ω is an infinite and ultimately periodic walk in G ⊗ G and its projections are occurrences of an ultimately periodic, infinite and repeated string in G, since π_1(q_i) = p_i ≠ p'_i = π_2(q_i), contradicting our hypothesis;
2. if j ≤ i ≤ j', then (q_j, ..., q_{j'−1})^ω is a periodic walk in G ⊗ G and its projections are occurrences of a periodic, infinite and repeated string in G, a contradiction;
3. if j < j' < i, then (q_j, ..., q_{j'−1})(q_{j'}, ..., q_i) is a walk in G ⊗ G with a cyclic prefix that can be pumped, so (q_j, ..., q_{j'−1})^m (q_{j'}, ..., q_i) is a walk in G ⊗ G for each m ≥ 1; the projections in G of these walks are occurrences of repeated strings of the form R^m S, with R = L'((q_j, ..., q_{j'−1})) and S = L'((q_{j'}, ..., q_i)), since π_1(q_i) = p_i ≠ p'_i = π_2(q_i).
Point 3 of the above proof shows that the index where the projections of the distinct walks differ must occur after every cycle of the walk considered in G ⊗ G, proving that any of these walks has a proper prefix of vertices of the form (u, u) containing a cycle and this prefix ends in a vertex (v, v) ∈ V_ndet. Note that (u, u) = (v, v) can hold. We obtain the following result.

Corollary 2 (Unbounded repeated strings). If G has no infinite repeated string, then G has repeated strings of unbounded length if and only if some (u, u) ∈ V_cyc reaches some (v, v) ∈ V_ndet with a path, and if R is the spelling of a cycle starting and ending in (u, u) and S is the spelling of a path starting from (u, u), passing through (v, v) and ending with an out-neighbor of (v, v) that has a sibling with the same label, then R^m S is a repeated string for each m ≥ 1.
We solve LRSP on the general graph G by combining Corollaries 1 and 2 and Theorem 1:
Infinite length repeats
Check if any q ∈ V_diff reaches any q' ∈ V_cyc, even with an empty path. If so, return (i) the string spelled by the path from q to q', (ii) the string spelled by any cycle starting from q' and (iii) the symbol ω.
Unbounded length repeats
Check if any (u, u) ∈ V_cyc reaches any (v, v) ∈ V_ndet, even with an empty path. If so, return (i) the string spelled by any cycle starting (and ending) at (u, u), (ii) the symbol + and (iii) the string spelled by the path from (u, u) to an out-neighbor of (v, v) with a sibling having the same label (which exists since (v, v) ∈ V_ndet).

Finite length repeats
Remove from G ⊗ G all vertices in V_cyc (obtaining a DAG), and proceed as in the algorithm for the DAG case from Section 2.2 on this graph.
To see that this last step is correct, suppose now that both of the above checks return false. First, since the first check failed, V_diff ∩ V_cyc = ∅, because any vertex in V_diff ∩ V_cyc reaches itself with an empty path. Second, from any q ∈ V_cyc no vertex q' ∈ V_diff is reachable (with a non-empty path, since V_diff ∩ V_cyc = ∅). Indeed, suppose for a contradiction that q' ∈ V_diff is a vertex reached from q with a shortest (non-empty) path P, and let q* ∈ V' be the vertex on this path right before q'. Since P is shortest, q* ∉ V_diff, and thus q* ∈ V_ndet. However, this contradicts the assumption that the second check of the algorithm returned false.
Finally, since the two occurrences of a repeated string must pass through a vertex in V_diff, and no vertex in V_diff reaches, or is reached from, a vertex in V_cyc, we can remove all vertices in V_cyc from G ⊗ G, obtaining a DAG. In this DAG, as in Section 2.2, we look for the longest path passing through a vertex in V_diff.
The SCCs of G ⊗ G and the sets V_cyc, V_diff, V_ndet can be computed in time linear in the size of G ⊗ G. Reachability between two sets of vertices of G ⊗ G (and a corresponding path) can also be implemented in time linear in the size of G ⊗ G. The algorithm for the final DAG case runs in time linear in the size of G ⊗ G, by Theorem 1.
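The case distinction for LRSP on general graphs can be sketched in Python as below. The sketch only classifies the input into the three cases and does not build the output strings. It assumes the graph is given as `(labels, edges)`; all function names are hypothetical; SCCs are computed with an iterative version of Kosaraju's algorithm; and, as an addition of this sketch, a product vertex with a self-loop is also counted in V_cyc (the definition in the text mentions only SCCs with at least two vertices).

```python
from collections import Counter, defaultdict

def self_product(labels, edges):
    """Successor lists of the labeled direct product of G with itself."""
    by_label, by_pair = defaultdict(list), defaultdict(list)
    for v, a in labels.items():
        by_label[a].append(v)
    for x, y in edges:
        by_pair[(labels[x], labels[y])].append((x, y))
    succ = {(u, v): [] for u, a in labels.items() for v in by_label[a]}
    for u, v in edges:
        for x, y in by_pair[(labels[u], labels[v])]:
            succ[(u, x)].append((v, y))
    return succ

def scc_ids(succ):
    """Kosaraju's algorithm (iterative): map each vertex to an SCC representative."""
    order, seen = [], set()
    for s in succ:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(succ[s]))]
        while stack:
            q, it = stack[-1]
            for r in it:
                if r not in seen:
                    seen.add(r)
                    stack.append((r, iter(succ[r])))
                    break
            else:  # all successors explored: q is finished
                order.append(q)
                stack.pop()
    pred = {q: [] for q in succ}
    for q in succ:
        for r in succ[q]:
            pred[r].append(q)
    comp = {}
    for s in reversed(order):  # decreasing finishing time, on the reversed graph
        if s not in comp:
            comp[s] = s
            todo = [s]
            while todo:
                q = todo.pop()
                for r in pred[q]:
                    if r not in comp:
                        comp[r] = s
                        todo.append(r)
    return comp

def reachable(succ, sources):
    """All vertices reachable from the sources (the sources included)."""
    seen = set(sources)
    todo = list(seen)
    while todo:
        q = todo.pop()
        for r in succ[q]:
            if r not in seen:
                seen.add(r)
                todo.append(r)
    return seen

def classify_lrsp(labels, edges):
    edges = list(set(edges))  # ignore duplicate edges
    succ = self_product(labels, edges)
    comp = scc_ids(succ)
    size = Counter(comp.values())
    v_cyc = {q for q in succ if size[comp[q]] > 1 or q in succ[q]}
    v_diff = {q for q in succ if q[0] != q[1]}
    fanout = Counter((u, labels[v]) for u, v in edges)
    v_ndet = {(u, u) for (u, _a), c in fanout.items() if c > 1}
    if reachable(succ, v_diff) & v_cyc:
        return "infinite"
    if reachable(succ, v_cyc) & v_ndet:
        return "unbounded"
    return "finite"
```

In the "finite" case one would then delete V_cyc from the product and run the DAG procedure for LRSP on what remains, as described above.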

Hardness
The NP-hardness of the SMLG problem defined on path occurrences (implying the NP-hardness of LCSP and of MSP, also defined on path occurrences) was already observed in previous works such as [9]. We similarly observe that the same holds also for LRSP.
Observation 1. LRSP defined on paths is NP-hard, even if we restrict the alphabet Σ to contain just a single character. This follows by reducing from the Hamiltonian Path problem on directed graphs. Given a graph G, create a graph G' made up of two copies of G and label all vertices with the same character. It easily holds that G has a Hamiltonian path if and only if the length of the longest repeated string in G' equals the number of vertices of G.
As noted in the introduction, the quadratic lower bounds of [9,14] for SMLG imply the same quadratic lower bounds for LCSP and MSP, namely that the two problems cannot be solved in truly sub-quadratic time under the Orthogonal Vectors Hypothesis (which we discuss below in this section), and that shaving arbitrarily large, or sufficiently large, logarithmic factors from the quadratic running time would contradict other hardness conjectures. We now show a linear-time reduction from SMLG to LRSP on deterministic DAGs, which thus implies the same lower bounds for LRSP as in [9,14] (since they hold also for deterministic DAGs). Two aspects of this reduction are worth noting:
• the reduction does not hold if G contains cycles or if G is a DAG with some non-deterministic vertices, because there could be infinite repeated strings in G or the repeated strings of G could be extended by gadget H_1;
• to the best of our knowledge, such a reduction does not exist when the problems are defined on strings.
Nevertheless, this reduction creates an instance of LRSP with vertices of arbitrarily high in-degree and out-degree, and also increases the alphabet size by the number of vertices of the graph, since Σ' = Σ ∪ V. Therefore, in the rest of this section we give a direct reduction from the Orthogonal Vectors Problem (OVP) to LRSP, which will allow for both constant in- and out-degree, and a binary alphabet.
In OVP we are given two sets of binary vectors A, B ⊆ {0, 1}^d, with |A| = |B| = n and d = ω(log n), and we need to determine whether there exist a ∈ A, b ∈ B so that a · b = 0, where a · b = Σ_{i=1}^{d} a[i] · b[i]. The Orthogonal Vectors Hypothesis (OVH) states that no (randomized) algorithm can solve OVP on instances of size n in O(n^{2−ε} poly(d)) time for constant ε > 0 [34]. Given an instance A, B for OVP, we will construct a DAG G such that A and B contain a pair of orthogonal vectors if and only if the length of the longest repeated string in G is of a certain value, to be introduced at the end of the reduction.
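As a point of reference for the quadratic behaviour that OVH conjectures to be essentially unavoidable, the following is a minimal sketch of the naive OVP check, which compares all n^2 pairs in O(n^2 d) time. The function name `has_orthogonal_pair` and the tuple encoding of vectors are illustrative choices, not part of the paper.

```python
from itertools import product


def has_orthogonal_pair(A, B):
    """Return True if some a in A and b in B satisfy a . b = 0.

    A and B are sequences of equal-length 0/1 tuples (an assumed encoding).
    All |A| * |B| pairs are tested, so the running time is O(n^2 d).
    """
    return any(
        all(x * y == 0 for x, y in zip(a, b))
        for a, b in product(A, B)
    )


if __name__ == "__main__":
    A = [(1, 0, 1), (0, 1, 1)]
    B = [(1, 1, 0), (0, 1, 0)]
    # (1, 0, 1) and (0, 1, 0) share no position where both are 1.
    print(has_orthogonal_pair(A, B))
```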
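To start with, we use the alphabet Σ = {0, 1, c}, where c is used to simplify the proofs. At the end, we will observe that all c-labeled vertices can be relabeled with 0. We start by building two types of gadgets:
• for each a = (a[1], . . ., a[d]) ∈ A, graph G_a is a path consisting of a starting c-labeled vertex followed by d vertices, where the i-th vertex is labeled with a[i] (Figure 6, left);
• for each b = (b[1], . . ., b[d]) ∈ B, graph G_b is a DAG with d + 1 levels such that: (i) the zeroth level consists of a single c-labeled source vertex; (ii) the i-th level has both a 0-labeled vertex and a 1-labeled vertex if b[i] = 0, whereas (if b[i] = 1) it just has a 0-labeled vertex. All vertices in each level have edges going to all vertices in the next level, and there are no edges between vertices of the same level (Figure 6, right). This is the same type of gadget used also in [9].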
To build up the intuition, take a ∈ A and b ∈ B and observe, similarly as in [9], that the string spelled by G_a has an occurrence in G_b if and only if a and b are orthogonal. Thus, the graph made up of a copy of G_a and a copy of G_b has a longest repeated string of length d + 1 if and only if a and b are orthogonal. However, we cannot put together all gadgets G_a and G_b as separate components of the same graph, because such a simple construction cannot restrict the location of occurrences of the longest repeated string. Intuitively, we need the longest repeated string to have one occurrence in the part of the graph corresponding to the G_a gadgets, and one occurrence in the part of the graph corresponding to the G_b gadgets. We achieve this by (i) building a tree structure on top of the G_b gadgets that assigns to each gadget its own unique prefix; and by (ii) building a "universal" structure on top of gadgets G_a to make them reachable by reading any possible prefix added to gadgets G_b. More specifically, we introduce the following two gadgets with ⌈log_2 n⌉ + 1 = k + 1 levels:
• gadget T, seen in Figure 7 (left), is a complete binary tree of height k + 1 and with 2^k ≥ n leaves, in which the root is c-labeled, all left children are 0-labeled and all right children are 1-labeled; trivially, each root-to-leaf path in such a tree spells a different string;
• a "universal" DAG U with a c-labeled source followed by k levels of vertices where each level has two vertices, labeled with 0 and 1, and each vertex in a level is connected to the vertices of the next level, as can be seen in Figure 7 (right).
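The observation that the string spelled by G_a occurs in G_b exactly when a and b are orthogonal can be checked mechanically on small instances. The sketch below assumes the gadget shape used in [9]: level i of G_b holds both a 0-labeled and a 1-labeled vertex when b[i] = 0, and only a 0-labeled vertex when b[i] = 1, with consecutive levels fully connected. Under that assumption a levelled gadget can be represented simply by the set of labels available at each level; all function names are hypothetical.

```python
from itertools import product


def path_gadget_string(a):
    """String spelled by G_a: a c-labeled source followed by a[1], ..., a[d]."""
    return "c" + "".join(str(bit) for bit in a)


def level_gadget(b):
    """Label sets of the d + 1 levels of G_b (assumed shape, see lead-in)."""
    return [{"c"}] + [{"0", "1"} if bit == 0 else {"0"} for bit in b]


def occurs_from_source(s, levels):
    """Check whether s is spelled by a source-to-sink walk of the gadget.

    Because consecutive levels are fully connected, a walk exists as soon as
    every character of s is available at the corresponding level.
    """
    return len(s) == len(levels) and all(ch in lv for ch, lv in zip(s, levels))


def orthogonal(a, b):
    return all(x * y == 0 for x, y in zip(a, b))


if __name__ == "__main__":
    d = 3
    for a, b in product(product((0, 1), repeat=d), repeat=2):
        assert occurs_from_source(path_gadget_string(a), level_gadget(b)) == orthogonal(a, b)
    print("occurrence <=> orthogonality verified for d =", d)
```

Since the only c-labeled vertex of G_b is its source, any occurrence of the string of G_a must start there, which is why checking source-to-sink walks suffices in this sketch.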
Our gadgets can be arranged in a non-deterministic DAG as seen in Figure 8 (left): the two sinks of gadget U are connected to each source of gadgets G_a, with a ∈ A, and each leaf of the tree gadget T is connected to the source of a different gadget G_b, with b ∈ B; if n is not a power of two, some leaves of gadget T can be left without any out-neighbors. To have a deterministic DAG, we can further merge all gadgets G_a in a keyword tree (trie) K_{G_{a_1}, ..., G_{a_n}} (see Figure 8, right), so the set of strings spelled by the entire graph is unchanged (and the leaves of the keyword tree remain all distinct since all vectors in A are distinct). Note that the longest path in this graph has length k + d + 1, with k = ⌈log_2 n⌉.
Lemma 3. For an instance A and B for OVP, the deterministic DAG G built as in Figure 8 . . .
To make the alphabet binary, it suffices to relabel all c-labeled vertices with 0. This proves that, under OVH, there can be no truly sub-quadratic time algorithm.
Theorem 5. If OVH holds, then for no ε > 0 there is an O(|V|^{2−ε})-time or O(|E|^{2−ε})-time algorithm for LRSP, even when restricted to deterministic DAGs, labeled with a binary alphabet, in which both the maximum in-degree and out-degree of any vertex are at most 2.
Proof. It is easy to check that the maximum in-degree and out-degree of any vertex of the graph in the reduction of Figure 8 is at most 2. Lemma 3 guarantees the correctness of the reduction, so it remains to analyze its complexity. The resulting graph G has O(nd) vertices and O(nd) edges and can be constructed in O(nd) time, since the keyword tree can be constructed in time linear in the size of its inputs. Thus, if LRSP has an O(|V|^{2−ε})-time or an O(|E|^{2−ε})-time algorithm for some ε > 0, OVP has an O((nd)^{2−ε})-time algorithm, contradicting OVH.
Our OVH reduction for LRSP immediately proves an OVH reduction also for LCSP (by taking the two components of G as input graphs G 1 and G 2 for LCSP).
It also provides a quadratic lower bound for an apparently simpler version of MSP, which on two paths (i.e. strings) can be solved in a trivial manner. Let MSP* be defined as MSP, with the difference that we are given a single vertex v_1 of G_1, a single vertex v_2 of G_2, and we need to compute the length of the longest string having an occurrence in G_1 starting at v_1 and an occurrence in G_2 starting at v_2. To obtain the OVH reduction, it can be easily checked that it suffices to take as G_1 the subgraph of G built from B (Figure 8, middle) with v_1 being its source vertex, and as G_2 the graph built from A (Figure 8, right) with v_2 being its source vertex.
even when both input graphs are deterministic DAGs, labeled with a binary alphabet, in which the maximum in-degree and out-degree of any vertex is at most 2.
Even though the above result about MSP * holds also for deterministic DAGs, its hardness stems from the fact that we do not know which path in G 1 to match with a path in G 2 in order to maximise their length.However, if G 2 is just a path, then the problem is solvable in linear time.

LRSP on Undirected Graphs
In this section we study LRSP on undirected graphs, namely when the occurrences of a repeated string are allowed to use an undirected edge in either of its two directions. Even if these results are easier than the results on directed graphs, they complete the complexity picture of LRSP. We will study the variants of LRSP on undirected paths and undirected walks, and consider the same classes of undirected graphs: paths, trees and general graphs.
Theorem 6. LRSP on undirected graphs and defined on path occurrences can be solved as follows:
1. On an undirected graph G that is a path, LRSP can be solved in linear time.
2. On an undirected graph G that is a tree, LRSP can be solved in quadratic time.

3. On general undirected graphs, LRSP is NP-complete, since the same reduction as in Observation 1 works also for undirected graphs.
Proof. For 1., note that the occurrences of a repeated string are obtained by either moving only forward or only backward (because LRSP is defined on paths, and thus occurrences cannot repeat vertices). Thus, if T is the spelling of G from one end to the other, we can reduce LRSP on G to finding the longest repeated substring of the text T $ T^{−1}, where $ is a new separator character, and T^{−1} is T reversed. Thus, we can solve LRSP on G in linear time.
For 2., we show that LRSP on an undirected tree with n vertices, defined with path occurrences, can be reduced to LRSP on a directed tree with O(n^2) vertices (where there is no distinction between path and walk occurrences). Indeed, let v_1, . . ., v_n be the vertices of a Σ-labeled undirected tree T. For each i ∈ {1, . . ., n}, construct the directed tree T_{v_i} by setting v_i as root and orienting all edges away from v_i. Also, let (u_1, . . ., u_n) be a directed path of n new vertices. Construct the directed tree T', rooted at u_1, by combining the path (u_1, . . ., u_n) with trees T_{v_1}, . . ., T_{v_n}, adding the directed edges (u_n, v_i), for all i ∈ {1, . . ., n}. The vertices of each T_{v_i} are labeled as in T, and u_1, . . ., u_n are labeled with a new character $ ∉ Σ. Clearly, the number of vertices of T' is n^2 + n. We claim that T has a longest repeated string of length ℓ if and only if T' has a longest repeated string of length n + ℓ, spelled by a path starting with (u_1, . . ., u_n). This is proved by combining the following remarks:
• All occurrences of any repeated string of T' longer than n must start in the same vertex u_i, with i ∈ {1, . . ., n}, since $ ∉ Σ. Moreover, if a repeated string of maximum length in T' has length greater than n then its occurrences start at u_1.
• Consider two distinct occurrences of a longest repeated string in T' of length n + ℓ, with ℓ ≥ 1, both starting from u_1. By construction of T', their suffixes of length ℓ correspond to two distinct occurrences of a string of length ℓ in T. Vice versa, given two distinct occurrences of a repeated string w of length ℓ in T, there are two corresponding distinct occurrences of $^n w in T' of length n + ℓ, both starting from u_1.
Thus, the longest repeated strings in T' correspond to the longest repeated strings in T and vice versa (if there are no repeated strings in T then the repeated strings of T' have length less than n), so we can apply the linear-time solutions based on the suffix tree of a tree or the XBWT to obtain a globally quadratic-time algorithm. For 3., observe that the same reduction as in Observation 1 works also for undirected graphs (since the Hamiltonian path problem is NP-hard also on undirected graphs).
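The construction used for case 2. of the proof can be sketched as follows: a $-labeled directed path of n vertices, followed by n copies of the tree, the i-th one rooted at v_i and oriented away from it. The vertex encoding (tuples tagged "u" and "t"), the 0-based indices and the function name are illustrative assumptions; the sketch only builds the directed tree and does not solve LRSP on it.

```python
def build_directed_tree(labels, undirected_edges):
    """Build the directed tree of the proof from an undirected labeled tree.

    labels: list of vertex labels of the undirected tree (vertices 0..n-1).
    undirected_edges: list of pairs (x, y), assumed to form a tree.
    Returns (new_labels, new_edges) of the directed tree rooted at ("u", 0).
    """
    n = len(labels)
    adj = {v: [] for v in range(n)}
    for x, y in undirected_edges:
        adj[x].append(y)
        adj[y].append(x)

    # Directed path u_1, ..., u_n labeled with the new character "$".
    new_labels = {("u", j): "$" for j in range(n)}
    new_edges = [(("u", j), ("u", j + 1)) for j in range(n - 1)]

    for root in range(n):
        # Edge from the end of the path to the root of the copy rooted at v_root.
        new_edges.append((("u", n - 1), ("t", root, root)))
        # Orient all edges of this copy away from the chosen root.
        stack = [(root, None)]
        while stack:
            v, parent = stack.pop()
            new_labels[("t", root, v)] = labels[v]
            for w in adj[v]:
                if w != parent:
                    new_edges.append((("t", root, v), ("t", root, w)))
                    stack.append((w, v))
    return new_labels, new_edges


if __name__ == "__main__":
    labels = ["a", "b", "a"]
    edges = [(0, 1), (1, 2)]
    new_labels, new_edges = build_directed_tree(labels, edges)
    n = len(labels)
    print(len(new_labels) == n * n + n, len(new_edges) == len(new_labels) - 1)
```

The vertex count n^2 + n matches the bound stated in the proof, which is where the quadratic overall running time comes from once a linear-time algorithm for directed trees is applied to the result.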
For LRSP defined on walk occurrences, observe first that we can replace each undirected edge with a pair of edges oriented in opposite directions. Thus, we can solve the problem in quadratic time, using the algorithm from Section 3 for directed graphs. However, we show that LRSP can be solved in linear time on general undirected graphs, using the following lemma, which greatly simplifies the problem.
Lemma 4. Given a Σ-labeled undirected graph G = (V, E, L), G has a repeated string of length at least 2 if and only if G has an infinite repeated string.
Proof. Let p = (p_0, p_1, . . ., p_{k−1}) and p' = (p'_0, p'_1, . . ., p'_{k−1}) be distinct occurrences of a string, such that p_i ≠ p'_i for some 0 ≤ i ≤ k − 1. We can build an infinite repeated string with just two pairs of adjacent vertices visited by p and p':
• if i = 0 then (p_0, p_1)^ω, (p'_0, p'_1)^ω are distinct occurrences of (L(p_0)L(p_1))^ω;
• if i > 0 then (p_{i−1}, p_i)^ω, (p'_{i−1}, p'_i)^ω are distinct occurrences of (L(p_{i−1})L(p_i))^ω.
Lemma 4 implies we can just check if there is a repeated string of length 2 (in which case there is an infinite repeated string) and, if not, of length 1. These two checks can be done in linear time, provided that the vertex set and the edge set of G are already sorted in lexicographical order: there is a repeated string of length 2 if and only if there are two distinct edges with matching labels, and there is a repeated string of length 1 if and only if there are two distinct vertices with the same label.
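The two checks implied by Lemma 4 can be sketched as below. The sketch makes several assumptions that are not spelled out in the text: the graph is simple (no self-loops or parallel edges), "two distinct edges" is read over the two orientations of each undirected edge (so a single edge whose endpoints share a label already yields a repeated string of length 2), and hashing is used instead of pre-sorted inputs, giving expected rather than worst-case linear time. The function name and return convention are hypothetical.

```python
import math


def lrsp_undirected_walks(labels, edges):
    """Length of the longest repeated string with walk occurrences.

    labels: list of vertex labels; edges: list of undirected pairs (u, v).
    Returns math.inf if an infinite repeated string exists (equivalently,
    by Lemma 4, a repeated string of length 2), otherwise 1 or 0.
    """
    seen = set()
    for u, v in edges:
        # Each undirected edge can be traversed in both directions.
        for x, y in ((u, v), (v, u)):
            pair = (labels[x], labels[y])
            if pair in seen:
                return math.inf
            seen.add(pair)
    # No repeated string of length 2: look for two vertices with equal labels.
    return 1 if len(set(labels)) < len(labels) else 0


if __name__ == "__main__":
    # Path a - b - a: walks (0, 1, 0, 1, ...) and (2, 1, 2, 1, ...) both spell (ab)^omega.
    print(lrsp_undirected_walks(["a", "b", "a"], [(0, 1), (1, 2)]))
```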

Conclusions and Future Work
In this paper we introduced the labeled direct product graph as a straightforward algorithmic tool, since it naturally encodes all pairs of walks in the original graphs having matching labels. Through simple applications, we developed optimal and predictable algorithms for existing problems on labeled graphs, namely string matching in labeled graphs (SMLG) and longest common substring (LCSP), and for extensions of string problems that we introduced, namely matching statistics (MSP) and longest repeated string (LRSP). For SMLG and LCSP this resulted in more efficient algorithms than the existing quadratic-time ones, since the product graph excludes all pairs of mismatching vertices and edges.
Regarding complexity, we extended the existing conditional quadratic-time lower bounds for SMLG of [9,14] to LRSP with a linear-time reduction, if the input graph of SMLG is a deterministic DAG. Since the SMLG lower bounds trivially hold for LCSP and MSP, this means that our algorithms (and the existing one for LCSP) are conditionally optimal. Moreover, we designed a single, more efficient reduction from the Orthogonal Vectors Problem (OVP) to LCSP, MSP and LRSP, proving that the three problems cannot be solved in truly sub-quadratic time under the Orthogonal Vectors Hypothesis (OVH), even if the input graphs are acyclic, deterministic (i.e. every vertex has at most one a-labeled out-neighbor, for every a ∈ Σ), labeled from a binary alphabet, and such that the maximum in-degree and out-degree of any vertex are at most 2. An interesting aspect of these results is that there is no known reduction of SMLG to LRSP when the problems are defined on strings, and that the OVP reduction holds also for the variant of MSP that matches only the walks starting from two given vertices, even when the graphs have the same restrictions as before.
Our algorithms are based on linear-time analyses of the labeled direct product graph corresponding to each problem, so we spent some effort in studying its construction. Indeed, if the sets of vertices and edges of the graphs are sorted in lexicographical order, then the construction of the product graph takes time and space linear in its size; thus, under the standard assumption of an integer alphabet, our algorithms globally reach this time and space complexity. This also means that the size of the labeled direct product graph is a tighter complexity upper bound for SMLG, LCSP, MSP and LRSP. Moreover, the size of the product graph can be precomputed in time linear in the size of the input graphs, making it possible to report the running time of our algorithms before executing them. If the alphabet has constant size, there is no need to store the edges of the product graph, whereas if the alphabet is an integer one then choosing not to store the edges of the graph leads to a version of the SetIntersection problem, and hence to a space and time trade-off.
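One way the size of the product graph could be precomputed without building it is sketched below. It assumes the usual definition of the labeled direct product: a product vertex is a pair of vertices with equal labels, and a product edge is a pair of edges whose sources have equal labels and whose targets have equal labels. Under that assumption the vertex count is the sum over labels a of n_1(a) · n_2(a), and the edge count is the sum over label pairs (a, b) of e_1(a, b) · e_2(a, b). The function name and input encoding are hypothetical, and hashing replaces the sorting by integer alphabet mentioned in the text.

```python
from collections import Counter


def product_size(labels1, edges1, labels2, edges2):
    """Return (#vertices, #edges) of the labeled direct product, without building it.

    labels1, labels2: sequences of vertex labels; edges1, edges2: lists of
    directed pairs (u, w) of vertex indices. Runs in expected linear time.
    """
    v1, v2 = Counter(labels1), Counter(labels2)
    e1 = Counter((labels1[u], labels1[w]) for u, w in edges1)
    e2 = Counter((labels2[u], labels2[w]) for u, w in edges2)
    # A Counter returns 0 for missing keys, so unmatched labels contribute nothing.
    n_vertices = sum(count * v2[a] for a, count in v1.items())
    n_edges = sum(count * e2[pair] for pair, count in e1.items())
    return n_vertices, n_edges


if __name__ == "__main__":
    # G1 is the path a -> b -> a, G2 is the single edge a -> b.
    print(product_size("aba", [(0, 1), (1, 2)], "ab", [(0, 1)]))
```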
Finally, we presented a complete complexity picture of LCSP and LRSP on different classes of directed graphs and we did the same for LRSP on undirected graphs.The only open case is LRSP defined on path occurrences (i.e. when there are no repeated vertices) in undirected trees, for which we obtained only a quadratic-time algorithm in Section 5, with no matching lower bound.Since the number of different paths of an undirected tree is only quadratic, we believe this problem cannot encode an OVP instance.Thus, we pose the open problem of finding a linear-time algorithm for this variant.
Recall that in the introduction we encoded a labeled graph as an NFA and we argued that SMLG, LCSP and LRSP (where we focus on finite strings) are special cases of similarly defined problems for finite-state automata (over finite words).The quadratic-time conditional lower bounds automatically carry over to these problems, and as the classical quadratic-size construction of an NFA recognizing each and every word accepted by two input NFAs solves LCSP, we deem that there is a quadratic-size NFA encoding all ambiguous words of any input NFA thus solving LRSP, and we leave this for future work.We also leave as future work to find more problems on labeled graphs solved by the labeled direct product graph, or that can be tackled with the same general strategy of precomputing a data structure to globally obtain time savings during the actual computation.

Figure 1: An example of an {a, b, c}-labeled graph made up of two components: the longest common string between the two components is ab, and the longest repeated string of the whole graph is ab as well, since all longer strings spelled by some walk have exactly one occurrence. Taking the right component as G_1 and the left component as G_2, we have that MS(6) = 1, MS(7) = 2 and MS(8) = 1.
the algorithm works by constructing a directed acyclic graph (DAG) G' having a vertex v_i for each vertex v ∈ V and for each position i ∈ {1, . . ., m}, such that there is an edge between two vertices v_i, w_{i+1} if L(v) = S[i] and (v, w) ∈ E: SMLG is then solved by finding and reporting a path of length |S| − 1 in G'. Instead, by treating the pattern S as a labeled path G_S, we can solve SMLG by simply finding a path of length |S| − 1 in G ⊗ G_S (Section 2). Since G_S is a path, G ⊗ G_S is a DAG, and thus such a path can be found in time linear in the size of G ⊗ G_S. Our labeled product is a subgraph of the DAG G' by Amir et al.: G' considers mismatching vertices v_i such that L(v) ≠ S[i] and it avoids computing the edges from mismatching vertices but not the (remaining) edges to mismatching vertices. Thus, their algorithm always takes time Ω(|V| · |S|), even when G ⊗ G_S has smaller size, and takes time Θ(|E| · |S|) for some families of inputs where G ⊗ G_S has smaller size. Moreover, LCSP on DAGs

Figure 2: An example of two {a, b, c}-labeled graphs G_1, G_2 and their labeled direct product graph G_1 ⊗ G_2 on the right. Since G_2 is a DAG, G_1 ⊗ G_2 is a DAG as well.

Figure 4: An example (left) of a non-deterministic graph G with two distinct cycles (1, 2, 3, 4, 5) and (1', 2, 3, 4, 5); their (uncountably many) infinite repetitions generate the same infinite string; and a non-deterministic graph (right) with no infinite repeated strings but with finite repeated strings of unbounded length, precisely of the form (ba)^k c, (ab)^k ac and c^k c, for every k ≥ 0.

Figure 5: Scheme for the reduction of SMLG to LRSP: H 1 and H 2 are two copies of the same gadget made of n + 1 levels and the edges of G are not shown.Note that the reduction holds only if G is a deterministic DAG.

Theorem 4. Given a string P ∈ Σ* and a Σ-labeled deterministic DAG G = (V, E, L), there exists a Σ'-labeled DAG G', with Σ' = Σ ∪ V, having O(|V| + |E| + |P|) vertices and edges, computable in linear time in the size of P and G, and such that P has an occurrence in G if and only if the longest repeated string of G' has length |V| + |P| + 1.
Proof. Given a deterministic graph G = (V, E, L) and a pattern P = P[1] · · · P[m] labeled on alphabet Σ, with V = {v_1, . . ., v_n} and m, n > 0, the reduction consists of transforming pattern P into a labeled graph G_P, that is a path of m vertices spelling P, and building a Σ'-labeled graph G' with Σ' = Σ ∪ V, assuming V ∩ Σ = ∅. Graph G', as seen in Figure 5, contains G, G_P, and two copies H_1, H_2 of a simple gadget H appropriately connected to them. Gadget H is made of a path of n vertices spelling a^n, with a ∈ Σ chosen arbitrarily, ending in a level of n vertices each labeled with a different vertex of V. In H_1 each of these final vertices is connected to the respective vertex of G and in H_2 they are all connected to the source of G_P. The resulting graph G' is a deterministic DAG made of two connected components each having exactly one source, and it is easy to see that the longest repeated string of G' has length at most |V| + |P| + 1. Each repeated string of this maximum length has one occurrence per component, starting at the respective source. If P has an occurrence (u_1, . . ., u_m) in G, then a^n u_1 P is a repeated string in G' of maximum length. Conversely, every repeated string in G' of maximum length |V| + |P| + 1 is of the form a^n u_i P, with u_i ∈ V, and its occurrence in the first component has as its suffix an occurrence of P in G. Graph G' has O(|V| + |E| + |P|) vertices and edges and its construction is straightforward, so the reduction takes linear time in the size of the starting SMLG instance. Two interesting aspects of Theorem 4 are as follows:
a path consisting of a starting c-labeled vertex followed by d vertices, where the i-th vertex is labeled with a[i] (Figure 6, left); • for each b = (b[1], . . ., b[d]) ∈ B, graph G b is a DAG with d + 1 levels such that: (i) the zeroth level consists of a single c-labeled source vertex; (ii) the i-th level has both a 0-labeled vertex and a 1-labeled vertex if b

Figure 7: Gadget T (left), a complete binary tree with k + 1 levels, and gadget U (right), reading every possible string of length k + 1 that can be read in T .

Figure 8: First scheme (left) for the OV reduction, made of two subgraphs G_A (left) and G_B (right): G_A is non-deterministic; second scheme (right) for the OV reduction: K_{G_{a_1}, ..., G_{a_n}} is the keyword tree (trie) of gadgets G_a, a ∈ A.

Table 1: Summary, for some variants of LCSP defined on walk occurrences on two graphs G_1 . . .
The answer to LCSP and LRSP is the empty string ε if and only if the sets of labels used in G_1 and G_2 do not intersect, or if each vertex of G is labeled with a different character (implying |Σ| ≥ |V|); this can be easily checked assuming we are working with an integer alphabet; if not, it still takes O(|V| log |V|) time, so we will assume that there is a common or repeated string of length 1, unless stated otherwise.
be a DAG even if both G_1 and G_2 contain cycles. Thanks to Remark 2, SMLG, LCSP, MSP and LRSP can be solved by finding paths of maximum length in the corresponding direct product graph: an occurrence of pattern S in graph G corresponds to a path of length |S| − 1 in G ⊗ G_S, where G_S is a labeled path of |S| vertices spelling S; a longest common string of G_1 and G_2 is spelled by a path of maximum length in G_1 ⊗ G_2; the matching statistics MS
as in the algorithm for the DAG case from Section 2.2. Note that in this second step we consider an acyclic subgraph of G_1 ⊗ G_2. The above algorithms correctly solve LCSP and MSP. Moreover, if G_1 ⊗ G_2 is given, they can be implemented to run in time linear in the size of G_1 ⊗ G_2. If not, they run in time O(|E_1| · |E_2|), where E_1 and E_2 are the edge sets of G_1 and G_2, respectively.
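The idea of solving LCSP by a longest-path computation on the product can be sketched for the acyclic case as follows. The sketch assumes the usual definition of the labeled direct product (pairs of equally labeled vertices, joined by an edge when both components have an edge and both endpoints match in label) and assumes that the product is a DAG, which holds for instance when one of the input graphs is a DAG. The product is built naively by scanning all pairs of vertices and of edges, not with the sorted-input construction of the paper, and the recursive longest-path routine is only suitable for small inputs. All function names are hypothetical.

```python
from functools import lru_cache


def labeled_product(labels1, edges1, labels2, edges2):
    """Adjacency lists of the labeled direct product (naive construction)."""
    succ = {
        (u, v): []
        for u in range(len(labels1))
        for v in range(len(labels2))
        if labels1[u] == labels2[v]
    }
    for u, u2 in edges1:
        for v, v2 in edges2:
            # Keep the edge pair only if both endpoints are matching vertices.
            if labels1[u] == labels2[v] and labels1[u2] == labels2[v2]:
                succ[(u, v)].append((u2, v2))
    return succ


def longest_common_string_length(labels1, edges1, labels2, edges2):
    """Length (in characters) of a longest common string; assumes an acyclic product."""
    succ = labeled_product(labels1, edges1, labels2, edges2)

    @lru_cache(maxsize=None)
    def longest_from(p):
        # Number of vertices on a longest path of the product starting at p.
        return 1 + max((longest_from(q) for q in succ[p]), default=0)

    return max((longest_from(p) for p in succ), default=0)


if __name__ == "__main__":
    # G1 spells "abc" and G2 spells "xab" along a path: the common string is "ab".
    print(longest_common_string_length("abc", [(0, 1), (1, 2)], "xab", [(0, 1), (1, 2)]))
```

A path with k edges in the product spells a string of k + 1 characters, which is why the routine counts vertices rather than edges.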