1 Introduction

A recent theme in the study of combinatorics on words has been the generalization of regularity properties from substrings to subsequences. For example, given a string T over an ordered alphabet, the longest increasing subsequence problem is to find the longest subsequence of increasing symbols in T [11, 33]. Several variants of this problem have been proposed [14, 28]. These problems generalize to the task of finding such a subsequence that is not only present in one string, but common to two given strings [21, 31, 34], which can also be viewed as a specialization of the longest common subsequence problem [23, 27, 35].

More recently, the problems of computing the longest square that is a subsequence [30], the longest palindrome that is a subsequence [9, 25], the lexicographically smallest absent subsequence [29], and longest rollercoasters [6, 16, 18] have been considered.

Here, we focus on subsequences that are Lyndon words, i.e., strings that are lexicographically smaller than all of their non-empty proper suffixes [32]. Lyndon words are objects of longstanding combinatorial interest (see, e.g., [19]), and they have also proved to be useful algorithmic tools in various contexts (see, e.g., [3]). The longest Lyndon substring of a string is the longest factor of the Lyndon factorization of the string [8], and it can be computed in linear time [13]. The longest Lyndon subsequence of a unary string is just one letter, which is also the only Lyndon subsequence of a unary string. A (naive) solution to find the longest Lyndon subsequence is to enumerate all distinct Lyndon subsequences and pick the longest one. However, the number of distinct Lyndon subsequences can be as large as \(2^n\), e.g., for a string of increasing numbers \(T = 1 \cdots n\). In fact, there are no bounds known (except when \(\sigma =1\)) that bring this number in a polynomial relation with the text length n and the alphabet size \(\sigma \) [22], and thus deriving the longest Lyndon subsequence from all distinct Lyndon subsequences can be infeasible. In this article, we focus on the algorithmic aspects of computing this longest Lyndon subsequence in polynomial time without the need to consider all Lyndon subsequences. Specifically, we study the problems of computing:

  1. the lexicographically smallest (common) subsequence of each length (in Sect. 3), and

  2. the longest Lyndon subsequence (in Sect. 4), with two variations considering online computation (in Sect. 4.3) and the restriction that this subsequence has to be common to two given strings (in Sect. 5).

The first problem serves as an appetizer. Although the notions of Lyndon and lexicographically smallest subsequences share common traits, our solutions to the two problems are mostly independent (except for some tools shared by the online algorithms for both problems).

Compared to an earlier conference version of this paper [4], we describe here an algorithm with significantly improved time complexity for the online setting. Additionally, we added more illustrations, examples, and the analysis of special cases with simpler algorithmic ideas to ease the understanding of the article. Last but not least (in Sect. 6), we evaluate the implementation of one of our proposed algorithms on commonly studied datasets.

2 Preliminaries

Let \(\Sigma \) denote a totally ordered set of symbols called the alphabet. An element of \(\Sigma ^*\) is called a string. The alphabet \(\Sigma \) induces the lexicographic order \(\prec \) on the set of strings \(\Sigma ^*\). We denote the empty string with \(\varepsilon \). Given a string \(S \in \Sigma ^*\), we denote its length with |S| and its i-th symbol with S[i] for \(i \in [1..|S|]\).Footnote 1 Further, for integers \(1\le i \le j \le |S|\), we write \(S[i..j] = S[i]\cdots S[j]\) to denote the substring of S starting at position i and ending at position j, and \(S[i..] = S[i..|S|]\) to denote the suffix of S starting at position i. The empty string is a substring of every string S and can be referred to as \(S[j+1..j]\) for any \(j\in [0..|S|]\). For \(\ell \in [0..|S|]\), a length-\(\ell \) subsequence of a string S is a string \(S[i_1] \cdots S[i_\ell ]\) with \(i_1< \cdots < i_\ell \). For a string V, we denote \(\textsf{pos}_{S}(V)=\min \{ i \in [0..|S|]: V\text { is a subsequence of }S[1..i]\}\); in particular, \(\textsf{pos}_{S}(V)=0\) if \(V=\varepsilon \) and, following the convention that \(\min \emptyset = \infty \), we assume \(\textsf{pos}_{S}(V)=\infty \) if V is not a subsequence of S.
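For illustration, \(\textsf{pos}_{S}(V)\) can be computed by greedily matching each character of V at its leftmost possible position; a minimal Python sketch (the function name is ours):

```python
import math

def pos(S, V):
    """Smallest i such that V is a subsequence of S[1..i] (1-based);
    0 for the empty string, math.inf if V is not a subsequence of S."""
    i = 0                    # number of characters of S consumed so far
    for c in V:
        j = S.find(c, i)     # leftmost match of c at position > i
        if j == -1:
            return math.inf  # convention min(emptyset) = infinity
        i = j + 1
    return i
```

Greedy leftmost matching is optimal here: postponing any match can only increase the final position.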

A non-empty string is a Lyndon word [32] if it is lexicographically smaller than all its non-empty proper suffixes. Equivalently, a string is a Lyndon word if and only if it is smaller than all its proper cyclic rotations.

The algorithms we present in the following assume that the input consists of strings of length at most n whose characters are drawn from an integer alphabet \(\Sigma := [1.. \sigma ]\) of size \(\sigma = \mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\).Footnote 2

3 Lexicographically Smallest Subsequence

As a starter, we propose a solution for the following related problem: Maintain, for each length \(\ell \), the lexicographically smallest length-\(\ell \) subsequence of T as the characters of T arrive online one at a time (in the left-to-right order).

3.1 Dynamic Programming Approach

The idea is to apply dynamic programming that computes, for all lengths \(0 \le \ell \le i \le n\), the lexicographically smallest length-\(\ell \) subsequence of T[1..i], denoted by \(D[i,\ell ]\). We observe that \(D[i,0]=\varepsilon \) is the empty word and \(D[i,i]=T[1..i]\). In the remaining cases, our algorithm considers \(D[i-1,\ell ]\) or \(D[i-1,\ell -1]\cdot T[i]\) as candidates for \(D[i,\ell ]\); see Algorithm 1 for a pseudocode and Fig. 1 for an example.

Algorithm 1
Fig. 1

The lexicographically smallest subsequences of the prefixes of the example string \(T = \texttt {bccadbaccbcd}\). We only show the output for \(\ell \in [1..5]\) and denote the undefined values \(D[i,\ell ]\) (for \(i<\ell \)) with \(\bot \)

Fig. 2

Sketch of the proof of Lemma 1. We can easily fill the fields shaded in blue (the 0-th row and the main diagonal). Further, the entries to the left of the diagonal are all undefined (denoted \(\bot \)). A cell to the right of it (red) is based on its left-preceding and diagonal-preceding cells (green) (Color figure online)

Lemma 1

For all \(0 \le \ell \le i \le n\), Algorithm 1 correctly computes \(D[i,\ell ]\), the lexicographically smallest subsequence of T[1..i] of length \(\ell \).

Proof

The proof is done by induction over the prefix length i. We first observe that \(D[i,0] = \varepsilon \) (the only length-0 subsequence of any string) and \(D[i,i] = T[1..i]\) (the only length-i subsequence of T[1..i]).

In what follows, we show that the claim also holds for \(D[i,\ell ]\) with \(0< \ell < i\), assuming that all entries \(D[i-1,\cdot ]\) have been computed correctly. Note that \(D[i-1,\ell ]\) and \(D[i-1,\ell -1]\cdot T[i]\) are both length-\(\ell \) subsequences of \(T[1..i]\). Hence, it suffices to prove that one of these two subsequences is the lexicographically smallest one. For a proof by contradiction, suppose that T[1..i] has a length-\(\ell \) subsequence L with \(L \prec D[i,\ell ]\).

If \(L[\ell ] \ne T[i]\), then L is a subsequence of \(T[1..i-1]\), and therefore \(D[i-1,\ell ] \preceq L\) according to the induction hypothesis. However, \(D[i,\ell ] \preceq D[i-1,\ell ]\); a contradiction.

If \(L[\ell ] = T[i]\), then \(L[1..\ell -1]\) is a subsequence of \(T[1..i-1]\), and therefore \(D[i-1,\ell -1] \preceq L[1..\ell -1]\) according to the induction hypothesis. However, \(D[i,\ell ] \preceq D[i-1,\ell -1]\cdot T[i] \preceq L[1..\ell -1]\cdot T[i] = L\); a contradiction. Hence, \(D[i,\ell ]\) is indeed the lexicographically smallest subsequence of T[1..i] of length \(\ell \). \(\square \)

Let us analyze the complexity of Algorithm 1. If we stored the subsequences explicitly, the entries of our two-dimensional table \(D[0..n,0..n]\) would occupy \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) space in total. However, in order to reduce the space consumption to \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\), we just store a flag that determines whether we built \(D[i,\ell ]\) from \(D[i-1,\ell ]\) or \(D[i-1,\ell -1] \cdot T[i]\). To restore the string represented by \(D[i,\ell ]\), we backtrack with the help of the stored flags while reading \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) cells and characters. In this setting, the initialization of the entries \(D[i,0]\) and \(D[i,i]\) costs \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\) time. Line 5, where we compute the lexicographical minimum of two subsequences, is executed \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\) times. If we perform this computation with naive character comparisons, for which we need to check \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) characters (which we first need to restore by reading \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) previous cells), we pay \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) time in total, which is the bottleneck of this algorithm.

Lemma 2

We can compute the lexicographically smallest subsequence of T for each length \(\ell \) online in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) time with \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\) space.

Unfortunately, the lexicographically smallest subsequence of a given length is not a Lyndon word in general, so this dynamic programming approach does not solve our problem of finding the longest Lyndon subsequence. In fact, if T has a longest Lyndon subsequence of length \(\ell \), then there can be a lexicographically smaller subsequence of the same length. For instance, \(T = \texttt {aba}\) has the longest Lyndon subsequence ab, while the lexicographically smallest length-2 subsequence is aa.

3.2 Speeding Up String Comparisons

Below, we improve the time bound of Lemma 2 by maintaining the entries of the D[0..n, 0..n] table in a trie [15]. Mathematically, the trie of a string family is defined as a rooted tree whose nodes represent all the prefixes of the strings in the family. (Multiple strings may share the same prefix.) The root represents the empty prefix and, for every non-empty prefix P, the parent of the node representing P is the node representing \(P[1..|P|-1]\) and the edge to the parent is labeled by the character P[|P|]; see Fig. 3 for an example. We develop a custom trie implementation that supports the following methods in constant time:

  • \({\textbf {insert}}(v,c)\): inserts a new leaf attached to a node v using an edge labeled with character c, and returns a handle to the created leaf; the node v cannot already have an outgoing edge labeled with c.

  • \({\textbf {parent}}(v)\): returns the handle to the parent of a node v (or \(\bot \) if v is the root).

  • \({\textbf {edge-label}}(v)\): returns the label of the incoming edge of a node v (or \(\bot \) if v is the root).

  • \({\textbf {precedes}}(u, v)\): decides whether the string represented by a node u is lexicographically smaller than the string represented by a node v.

Implementation of the trie For each node v, we explicitly store its parent \({\textbf {parent}}(v)\), label \({\textbf {edge-label}}(v)\), and depth \({\textbf {depth}}(v)\). We do not keep pointers from v to its children, and thus each node occupies constant space. Moreover, we maintain the underlying (unlabeled) tree using the dynamic data structures of [2, 10], answering lowest common ancestor (LCA) and level ancestor queries (\({\textbf {level-anc}}(u,d)\) returns the ancestor of a node u at depth d), respectively, in constant time. Both data structures support the insertion of leaves in constant time and, consequently, their space consumption is proportional to the tree size. In order to implement \({\textbf {precedes}}(u,v)\), we first compute the lowest common ancestor w of u and v. For the special case that u is an ancestor of v, or vice versa, we return false if \(v=w\) and true if \(u = w \ne v\). Otherwise, we use level ancestor queries \({\textbf {level-anc}}(u,{\textbf {depth}}(w)+1)\) and \({\textbf {level-anc}}(v,{\textbf {depth}}(w)+1)\) to select the children \(u'\) and \(v'\) of w on the paths towards u and v, respectively. In that case, we return true if \({\textbf {edge-label}}(u')\prec {\textbf {edge-label}}(v')\) and false otherwise; see Fig. 3.
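A minimal Python rendering of this interface (our sketch): nodes keep only parent, label, and depth, and \({\textbf {precedes}}\) walks to the LCA naively in O(depth) time, whereas the paper achieves O(1) with the dynamic LCA and level ancestor structures of [2, 10].

```python
class TrieNode:
    """Trie node storing only parent, incoming edge label, and depth
    (no child pointers), as in the custom trie of Sect. 3.2."""
    def __init__(self, parent=None, label=None):
        self.parent = parent   # parent(v), None for the root
        self.label = label     # edge-label(v), None for the root
        self.depth = 0 if parent is None else parent.depth + 1

def insert(v, c):
    """insert(v, c): attach and return a new leaf below v with label c."""
    return TrieNode(v, c)

def precedes(u, v):
    """Is u's string lexicographically smaller than v's?  Here we walk
    to the LCA explicitly; the paper answers this in constant time."""
    cu = cv = None                 # children of the meeting node on both paths
    while u.depth > v.depth:
        u, cu = u.parent, u
    while v.depth > u.depth:
        v, cv = v.parent, v
    while u is not v:              # ascend in lockstep until the LCA
        u, cu = u.parent, u
        v, cv = v.parent, v
    if cu is None:                 # u was an ancestor of v
        return cv is not None      # ...a proper one exactly if v moved
    if cv is None:                 # v was a (possibly improper) ancestor of u
        return False
    return cu.label < cv.label     # compare the two children of the LCA
```

On the snapshot of Fig. 3, comparing the nodes for badb and bcad reduces to comparing the labels a and c of the two children of their LCA.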

Fig. 3

The trie described in Sect. 3.2 for speeding up comparisons. The trie is a snapshot of the example computation shown in Fig. 1, where we are at \(i = 6\) and \(\ell = 4\). There, we want to decide whether \(D[5,4] = \texttt {bcad}\) or \(D[5,3]\cdot T[6] = \texttt {badb}\) is lexicographically smaller. For that, we take the LCA v of the nodes (highlighted by black circles) representing D[5, 4] and D[5, 3]; v is the child of the root with label b. Next, we compare the labels of v’s two children leading to D[5, 4] and D[5, 3], respectively (these two children are marked in yellow). It suffices to compare the labels of these children to determine that \(D[5,3]\cdot T[6]\) is lexicographically smaller than D[5, 4]

Application of the trie Instead of a flag, each cell of D now stores a handle to its respective trie node. The root node of the trie represents the empty string \(\varepsilon \), so we associate \(D[i,0] = \varepsilon \) with the root node for all i.

To implement Line 5 of Algorithm 1, we first retrieve the handles to the nodes u and v representing \(D[{i-1},\ell -1]\) and \(D[i-1,\ell ]\), respectively. We proceed as follows for deciding whether \(D[i-1,\ell -1]\cdot T[i] \prec D[i-1,\ell ]\): If \(u={\textbf {parent}}(v)\), we only have to compare T[i] with \({\textbf {edge-label}}(v)\), which is the last character of \(D[i-1,\ell ]\). Otherwise, we know that \(D[i-1,\ell -1]\) is not a prefix of \(D[i-1,\ell ]\), and hence \(D[i-1,\ell -1]\cdot T[i]\prec D[i-1,\ell ]\) holds if and only if \(D[i-1,\ell -1]\prec D[i-1,\ell ]\), which we determine using \({\textbf {precedes}}(u,v)\); see Fig. 3 for an example. If \(D[i,\ell ]= D[i-1,\ell ]\), we store the handle to v at \(D[i,\ell ]\). Otherwise, we call \({\textbf {insert}}(u, T[i])\) and store the resulting handle at \(D[i,\ell ]\). This insertion is valid (meaning that u has no outgoing edge with label T[i] yet) because all trie nodes at depth \(\ell \) correspond to \(D[j,\ell ]\) for \(j\in [\ell ..i-1]\), and all these subsequences are at least as large as \(D[i-1,\ell ]\) in the lexicographic order.

As for Line 6, we retrieve the handle to the node u representing \(D[i-1,i-1]\), call \({\textbf {insert}}(u,T[i])\), and store the resulting handle at \(D[i,i]\). This insertion is valid because the trie does not yet have any node at depth i.

Complexity Analysis The number of trie operations is \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\) (constantly many for each entry \(D[i,\ell ]\)), and each of them is implemented in constant time. Hence, the overall time and space complexities become \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\).

Theorem 3

We can compute the table D[0..n, 0..n] online in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\) time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\) space.

3.3 Most Competitive Subsequence

If we want to find only the lexicographically smallest subsequence of the whole string T for a fixed length \(\ell \), this problem is also called Find the Most Competitive Subsequence.Footnote 3 It admits a folklore linear-time solution that scans T from left to right and maintains, in a stack \(\textsf{S}\), a subsequence of T[1..i] of length between \(\ell +i-n\) and \(\ell \) chosen to minimize \(\textsf{S}\cdot \$ \) in the lexicographic order, where \(\$ \succ \max \Sigma \) is a sentinel character. Here, the lower bound \(\ell +i-n\) guarantees that, when we are near the end of the text, we have enough characters left to extend \(\textsf{S}\) to a length-\(\ell \) subsequence of T[1..n]. Let \(\textsf {top}\) denote the top element of \(\textsf{S}\). When processing text position i, we repeatedly pop \(\textsf {top}\) as long as (a) \(\textsf{S}\) is not empty, (b) \(\textsf {top} \succ T[i]\), and (c) \(|\textsf{S}| \ge \ell +i-n\). Finally, we push T[i] on top of \(\textsf{S}{}\) if \(|\textsf{S}{}| < \ell \). Since each character of T is pushed onto \(\textsf{S}\) and popped from \(\textsf{S}\) at most once, the algorithm runs in linear time.

Observe that we can repeatedly use this solution to compute the lexicographically smallest subsequences of T of multiple lengths. The overall running time for all lengths \(\ell \in [1..n]\) is \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\) and the algorithm uses \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) working space, but it does not produce intermediate answers for the prefixes of T (as online algorithms do).

Given \(T = \texttt {cba}\) as an example, for \(\ell = 3\), we push all three characters of T onto \(\textsf{S}\) and output \(\texttt {cba}\). For \(\ell = 2\), we first push \(T[1] = \texttt {c}\) onto \(\textsf{S}\), but then pop it and push \(\texttt {b}\) onto \(\textsf{S}\). Finally, although \(T[3] \prec T[2]\), we do not discard \(T[2] = \texttt {b}\) stored on \(\textsf{S}\) since we need to produce a subsequence of length \(\ell = 2\). A more elaborate execution on our running example is given in Fig. 4.

Fig. 4

Computing the most competitive subsequence of length \(\ell = 5\) of the example string \(T = \texttt {bccadbaccbcd}\). The stack is shown vertically below T for each step of the algorithm of Sect. 3.3. For \(\ell = 6\), our stack would first differ at text position \(i= 10\), where we would discard only the topmost c (instead of both of them). Then, the stack would store the subsequences aacb for \(i=10\), aacbc for \(i=11\), and aacbcd for \(i=12\)

3.4 Lexicographically Smallest Common Subsequence

Another variation is to ask for the lexicographically smallest subsequence of each distinct length that is common to two given strings X and Y. Luckily, our ideas of Sects. 3.1 and 3.2 can be straightforwardly translated. For that, our matrix D becomes a cube \(D_3[0..L,0..|X|,0..|Y|]\), where \(L:= \text {LCS}[|X|,|Y|]\) and \(\text {LCS}[x,y]\) denotes the length of a longest common subsequence of \(X[1..x]\) and \(Y[1..y]\). The entries \(D_3[\ell ,x,y]\) are well-defined for \(\ell \le \text {LCS}[x,y]\) and computed by taking the lexicographically smallest string among at most three candidates for \(\ell ,x,y\ge 1\):

$$\begin{aligned} D_3[\ell ,x,y] = \min {\left\{ \begin{array}{ll} D_3[\ell -1,x-1,y-1]\cdot X[x] &{} \text {if~} X[x] = Y[y], \\ D_3[\ell ,x-1,y] &{} \text {if } \ell \le \text {LCS}[x-1,y], \\ D_3[\ell ,x,y-1] &{} \text {if } \ell \le \text {LCS}[x,y-1]. \end{array}\right. } \end{aligned}$$

Moreover, \(D_3[0,x,y]=\varepsilon \) for all \(x\in [0.. |X|]\) and \(y\in [0.. |Y|]\), which gives us an induction basis similar to the one used in the proof of Lemma 1, so that we can use its induction step analogously. The table \(D_3\) has \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) cells, and filling each cell can be done in constant time by representing each cell as a handle to a node in the trie data structure proposed in Sect. 3.2. For that, we ensure that we never insert a subsequence of \(D_3\) into the trie twice. To see that this is possible, let \(L \in \Sigma ^+\) be a subsequence computed in \(D_3\), and let \(D_3[\ell ,x,y] = L\) be the entry at which we called \({\textbf {insert}}\) to create a trie node for L (for the first time). By monotonicity of \(D_3\) (that is, due to \(D_3[\ell ,x,y] = \min _{x' \in [0..x], y' \in [0..y]: \text {LCS}[x',y']\ge \ell } D_3[\ell ,x',y']\)) and since L is already a common subsequence of \(X[1..\textsf{pos}_{X}(L)]\) and \(Y[1..\textsf{pos}_{Y}(L)]\), we must have \(x=\textsf{pos}_{X}(L)\) and \(y=\textsf{pos}_{Y}(L)\). Moreover, the monotonicity of \(D_3\) further implies that all other entries \(D_3[\ell ,x',y']=L\) satisfy \(D_3[\ell ,x'-1,y']=L\) (if \(x'>x\)) or \(D_3[\ell ,x',y'-1]=L\) (if \(y'>y\)), so we copy the handle to the trie node representing L instead of calling \({\textbf {insert}}\) when filling out \(D_3[\ell ,x',y']=L\).

Theorem 4

Given two strings XY of length at most n, we can compute the lexicographically smallest common subsequence for each length \(\ell \in [1..n]\) in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) space.

4 Computing the Longest Lyndon Subsequence

Fig. 5

Longest Lyndon subsequences of selected prefixes of a text T. The i-th row of bars below T depicts the selection of characters forming a Lyndon subsequence. In particular, the i-th row corresponds to the longest Lyndon subsequence of T[1..9] for \(i=1\) (green), T[1..11] for \(i=2\) (blue), and of T[1..12] for \(i=3\) (red). The first row (green) also corresponds to a longest Lyndon subsequence of T[1..10] and T[1..11] (when extended with T[11]). Extending the second Lyndon subsequence (blue) with T[12] also gives a Lyndon subsequence, but shorter than the third Lyndon subsequence (red). Having only the information of the Lyndon subsequences in T[1..i] at hand seems not to give us a solution for \(T[1..i+1]\) (Color figure online)

In the following, we want to compute the longest Lyndon subsequence of T. See Fig. 5 for examples of longest Lyndon subsequences. As a starter, let us consider the following special case.

Theorem 5

Given a string of length n, in which each character only appears once, we can compute its longest Lyndon subsequence in (a) \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\) time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(1)\) space, or (b) \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n \sqrt{\lg n})\) time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) space.

Proof

For each text position \(i \in [1..n]\), we consider all characters in T[i..n] that are at least as large as T[i]. These characters form the longest Lyndon subsequence starting at T[i]. Our answer is the longest among these n candidates. We can compute the length of each candidate in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) time, and thus obtain our first solution. For the second solution, we use the offline orthogonal range counting procedure of Chan and Pătraşcu [7, Corollary 2.3]. Specifically, we apply it for points (jT[j]) for \(j\in [1..n]\) and rectangles \([i+1..n]\times [T[i]+1..\sigma ]\) for \(i \in [1..n]\). This call takes \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n \sqrt{\lg n})\) time and outputs the number of input points located in each rectangle, which is \(|\{j \in [i+1..n]: T[j] > T[i] \}|\) for the i-th rectangle. \(\square \)

For the general case, compared to the dynamic programming approach for the lexicographically smallest subsequences introduced above, we follow the sketched solution for the most competitive subsequence using a stack, which here simulates a traversal of the trie \(\tau \) storing all pre-Lyndon subsequences of T, where a word is pre-Lyndon if it is a prefix of a Lyndon word. The trie \(\tau \) is a subgraph of the trie storing all subsequences of T, sharing the same root. This subgraph is connected since, by definition, if S is a pre-Lyndon word, then all prefixes of S are also pre-Lyndon (if S is a prefix of a Lyndon word V, then all prefixes of S are also prefixes of V). We say that the string label of a node v is the string read from the edges on the path from the root to v. For every node v of \(\tau \), we store \(\textsf{pos}_{T}(V)\), where V is the string label of v. Observe that, unless v is the root, the label of the incoming edge, which is the last character of V, equals \(T[\textsf{pos}_{T}(V)]\).

4.1 Basic Trie Traversal

Problems already emerge when considering the construction of \(\tau \) since there are texts like \(T = 1 \cdots n\) for which \(\tau \) has \(\mathop {}\mathopen {}\Theta \mathopen {}(2^n)\) nodes. Instead of building \(\tau \), we simulate a preorder traversal on it. By simulation, we mean that we enumerate the pre-Lyndon subsequences of T in lexicographic order. For that, we maintain a stack \(\textsf{S}\) storing the text positions \((i_1, \ldots , i_\ell )\) associated with the path from the root to the node v we currently visit, i.e., if V is the string label of v, then \(i_j = \textsf{pos}_{T}(V[1..j])\) and thus \(V[j]=T[i_j]\). At each node v, we first check whether V is a Lyndon word (if so, it is considered as a candidate for the answer). Then, we recursively traverse the subtree of v. For this, we need to iterate, in the lexicographic order, over all characters c such that Vc is a pre-Lyndon word. For each such character, we determine \(\textsf{pos}_{T}(Vc)\), which is the smallest text position \(i_{\ell +1}>i_\ell \) with \(T[i_{\ell +1}]=c\). If there is such a position \(i_{\ell +1}\), we push it onto \(\textsf{S}{}\), recurse, and then pop \(i_{\ell +1}\). We apply the following facts to check whether a given subsequence is a Lyndon or a pre-Lyndon word.

Facts about Lyndon Words A Lyndon word cannot have a border, that is, a non-empty proper prefix that is also a suffix of the string [13, Prop. 1.1]. Given a string S of length n, an integer \(p \in [1..n]\) is a period of S if \(S[i] = S[i+p]\) for all \(i \in [1..n-p]\). We use the following facts:

  1. (Fact 1)

    The shortest period of a Lyndon word S is the length |S|.

  2. (Fact 2)

    The prefix S[1..p] of a pre-Lyndon word S with shortest period p is a Lyndon word. In particular, a pre-Lyndon word S is a Lyndon word if and only if its shortest period is |S|.

  3. (Fact 3)

    Consider a pre-Lyndon word S with shortest period p and a character \(c \in \Sigma \). Then:

    • If \(c \succ S[|S|-p+1]\), then Sc is a Lyndon word.

    • If \(c=S[|S|-p+1]\) and S is not the one-letter string consisting of the largest character of \(\Sigma \),Footnote 4 then Sc is a pre-Lyndon word with shortest period p.

    • Otherwise, Sc is not a pre-Lyndon word.

Proof

   

  1. Fact 1.

    If S has a period smaller than |S|, then S is bordered.

  2. Fact 2.

    If S[1..p] were not Lyndon, then there would be a suffix X of S with \(X \prec S[1..|X|]\); hence, \(X Z \prec S Z\) for every \(Z \in \Sigma ^*\), so S cannot be pre-Lyndon.

  3. Fact 3.

    Follows from Fact 2 and [13, Lemma 1.6].

\(\square \)

Checking pre-Lyndon Words Now, suppose that our stack \(\textsf{S}\) stores the text positions \((i_1, \ldots , i_\ell )\). To check whether \(T[i_1] \cdots T[i_\ell ]\cdot c\) for a character \(c \in \Sigma \) is a pre-Lyndon word or a Lyndon word, we augment each position \(i_j\) stored in \(\textsf{S}\) with the shortest period of \(T[i_1] \cdots T[i_j]\), for \(j \in [1..\ell ]\), so that we can apply Fact 3 to check whether \(T[i_1] \cdots T[i_j]\cdot c\) is a pre-Lyndon word and, if so, retrieve its shortest period, both in constant time for any \(c \in \Sigma \).
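The constant-time classification provided by Fact 3 can be written as a small helper (our naming; the corner case where S is the single largest character of the alphabet is ignored in this sketch):

```python
def classify_extension(S, p, c):
    """Fact 3: S is a pre-Lyndon word with shortest period p; classify
    the extension S + c.  Returns ('lyndon', |S|+1), ('pre-lyndon', p),
    or (None, None) when Sc is not pre-Lyndon."""
    t = S[len(S) - p]        # S[|S|-p+1] in the paper's 1-based notation
    if c > t:
        return ("lyndon", len(S) + 1)
    if c == t:
        return ("pre-lyndon", p)
    return (None, None)
```

For instance, extending the pre-Lyndon word aba (shortest period 2) with c yields the Lyndon word abac, while extending it with b keeps period 2 (abab), and extending it with a leaves the pre-Lyndon words altogether.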

Trie Navigation To find the next text position \(i_{\ell +1}\), we may need to scan \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) characters in the text, and hence need \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) time for walking down from a node to any of its children. However, for each text position i and each character \(c\in \Sigma \), we can store the leftmost occurrence \(i'\ge i\) of the smallest character \(c'\succeq c\) that occurs in T[i..n]. As a result, we can traverse the trie in constant time per node during our preorder traversal.

This already gives an algorithm that computes the longest Lyndon subsequence with \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n \sigma )\) space and time linear in the number of nodes in \(\tau \). However, since the number of nodes can be exponential in the text length, we develop ways to omit nodes that do not lead to the solution. Our aim is to find a rule to prune trie nodes that surely do not contribute to the longest Lyndon subsequence of T. For that, we use the following notion of irrelevance:

Definition 6

Consider a pre-Lyndon subsequence U of T. We say that U is irrelevant if T has a Lyndon subsequence V of length \(|V|=|U|\) such that \(V\prec U\) and \(\textsf{pos}_{T}(V)\le \textsf{pos}_{T}(U)\). Otherwise, U is relevant.

Fig. 6

Sketch of the second case in the proof of Lemma 7, where the suffix S is assumed to be longer than W

Lemma 7

If L is the lexicographically smallest length-\(\ell \) Lyndon subsequence of T (for some \(\ell \in [1..n]\)), then all prefixes of L are relevant.

Proof

For a proof by contradiction, suppose that \(L=UW\) for an irrelevant prefix U. Consider an integer i such that U is a subsequence of T[1..i] and W is a subsequence of \(T[i+1..n]\). By definition of irrelevance, T[1..i] contains a Lyndon subsequence \(V \prec U\) of length \(|V|=|U|\). These conditions imply that VW is a subsequence of T that satisfies \(VW \prec UW\). By definition of \(L=UW\), this means that VW is not a Lyndon word, i.e., it contains a proper suffix \(S\prec VW\). We consider two cases:

  • If S is a suffix of W, then S is also a suffix of the Lyndon word UW, and hence \(S \succ UW \succ VW\), a contradiction.

  • Otherwise (\(|S| > |W|\), see Fig. 6 for a visualization), S is of the form \(V'W\) for a proper suffix \(V'\) of V. Since V is a Lyndon word, we have \(V' \succ V\). Moreover, \(V'\) is not a prefix of V (every proper prefix of V is lexicographically smaller than V, which would contradict \(V' \succ V\)), so this implies \(S=V'W \succeq V' \succ VW\), a contradiction.

\(\square \)

Due to Lemma 7, we do not omit the solution if we skip the subtrees rooted at irrelevant nodes, i.e., nodes whose string labels are irrelevant. Algorithmically, we exploit this observation as follows: We maintain an array \(\textsf{L}[1..n]\), where \(\textsf{L}[\ell ]\) is the smallest position \(\textsf{pos}_{T}(V)\) among the length-\(\ell \) Lyndon subsequences V explored so far. We initialize all entries of \(\textsf{L}\) with \(\infty \). Now, whenever we visit a node u whose string label is a length-\(\ell \) pre-Lyndon subsequence U, then U is irrelevant if and only if \(\textsf{L}[\ell ] \le \textsf{pos}_{T}(U)\): indeed, since we traverse the trie in the lexicographic order, the condition \(\textsf{L}[\ell ] \le \textsf{pos}_{T}(U)\) is equivalent to the existence of a Lyndon subsequence \(V\prec U\) of length \(\ell \) with \(\textsf{pos}_{T}(V)\le \textsf{pos}_{T}(U)\).

Time Complexity Next, we analyze the complexity of this algorithm. For that, we say that a string is immature if it is pre-Lyndon but not Lyndon. Let us first bound the number of relevant Lyndon nodes visited. Whenever the algorithm processes a relevant Lyndon subsequence U of length \(\ell \), it decreases \(\textsf{L}[\ell ]\) from a value strictly larger than \(\textsf{pos}_{T}(U)\) (if \(\textsf{L}[\ell ]\le \textsf{pos}_{T}(U)\), then U would be irrelevant) to \(\textsf{pos}_{T}(U)\). We can decrease an individual entry of \(\textsf{L}\) at most n times, so there are at most \(n^2\) relevant Lyndon subsequences in total. While each node can have at most \(\sigma \) children, due to Fact 3, at most one child can be immature. Since the depth of the trie is at most n, we therefore visit at most \(n^3\) immature nodes, and \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) relevant nodes in total. All irrelevant nodes are leaves in the pruned tree, so the overall number of visited nodes is \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3\sigma )\). As noted above, our trie navigation infrastructure allows for traversing the pruned trie in constant time per node.

Theorem 8

We can compute the longest Lyndon subsequence of a string of length n in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3 \sigma )\) time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n \sigma )\) space.

Fig. 7

The trie \(\tau \) traversed by the algorithm of Theorem 8, with each node labeled by the value \(\textsf{pos}_{T}(V)\) computed for its string label V. Irrelevant nodes (whose subtrees are pruned) are drawn in gray and have a dashed incoming edge. For simplicity, we sometimes omit irrelevant nodes representing subsequences ending with the last character of the text (every node that does not yet have an outgoing edge with label d should have such an irrelevant child). Each immature node is surrounded by a rectangular box. Since immature nodes do not contribute to our pruning technique, we cannot prune bccdc, whose leftmost occurrence ends at text position 8: all formerly found subsequences of the same length ending at or before position 8 are immature. The relevant Lyndon nodes have the property that, for a fixed depth, reading the Lyndon nodes from left to right gives a decreasing sequence of text positions. When pruning the node with string label abab and label 10, we have \(\textsf{L}= [4,6,8, 9,10,11,12, \infty , \ldots ]\) and can prune this node because \(\textsf{L}[|\texttt {abab}|] = 9\) does not exceed the label 10

4.2 Improving Time Bounds

We further improve the time bounds by avoiding visiting irrelevant nodes. For that, we make use of the following queries:

  • Range maximum query: Given an interval \([i..j] \subseteq [1..n]\), retrieve the position of the largest character of the substring T[i..j], i.e., return \({{\,\mathrm{arg\,max}\,}}_{k \in [i..j]} T[k]\);

  • Range successor query: Given an interval \([i..j] \subseteq [1..n]\) and a character c, retrieve the position \(k \in [i..j]\) of the lexicographically smallest character T[k] in T[i..j] with \(T[k] \succeq c\), i.e., return \({{\,\mathrm{arg\,min}\,}}_{k \in [i..j] :\; T[k] \succeq c } T[k]\).

Each query returns a text position. In case of ties, they return the leftmost among candidate positions.
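For concreteness, both query types can be prototyped with naive linear scans (our own sketch, using 0-indexed inclusive bounds; the actual algorithm answers them with an RMQ data structure and a wavelet tree, and Sect. 6 notes that on short strings such scans are competitive anyway):

```python
def range_max(T: str, i: int, j: int) -> int:
    """Position of the largest character in T[i..j]; leftmost on ties."""
    # key (T[k], -k): maximize the character, and on ties prefer smaller k
    return max(range(i, j + 1), key=lambda k: (T[k], -k))

def range_successor(T: str, i: int, j: int, c: str):
    """Leftmost position k in [i..j] holding the smallest character T[k]
    with T[k] >= c, or None if every character in the range is below c."""
    candidates = [k for k in range(i, j + 1) if T[k] >= c]
    return min(candidates, key=lambda k: (T[k], k)) if candidates else None
```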

Now, suppose we are at a relevant node u with string label U of length \(\ell \) and period p. Then, we want to consider all characters c such that Uc is a relevant pre-Lyndon subsequence of T. By Fact 3, all these characters satisfy \(c \succeq U[\ell -p+1]\) (so that Uc is pre-Lyndon) and occur in \(T[\textsf{pos}_{T}(U)+1..\textsf{L}[\ell +1]-1]\) (so that Uc is relevant). In the context of our preorder traversal, each such child can be found iteratively using range successor queries: starting from \(b = U[\ell -p+1]\), we want to find the lexicographically smallest character \(c \succeq b\) such that c occurs in \(T[\textsf{pos}_{T}(U)+1..\textsf{L}[\ell +1]-1]\) and locate the leftmost such occurrence. This task can be accomplished using the wavelet tree [20] of T, which can be constructed in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n\log \sigma )\) time and answers range successor queries in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(\lg \sigma )\) time [17, Theorem 7]. In particular, we can use the wavelet tree instead of the \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n \sigma )\) pointers to the subsequent characters to arrive at \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) space. The time complexity reduces to \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3 \log \sigma )\).

In order to bring the time down to \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\), we do not want to query the wavelet tree each time, but only whenever we are sure that u has at least one relevant Lyndon child. For that, we build a data structure of [5], which can be constructed in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) time and answers range maximum queries (RMQ) on T in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(1)\) time. When we are at the relevant node u, we issue an RMQ to locate the leftmost occurrence of the largest character c in \(T[\textsf{pos}_{T}(U)+1..\textsf{L}[\ell +1]-1]\). Then, we analyze the sequence Uc using Fact 3:

  • If Uc is not pre-Lyndon, then u has no relevant children.

  • If Uc is immature, then u has no relevant Lyndon children. Moreover, \(\textsf{pos}_{T}(Uc)\) is the position reported by the range maximum query. Hence, we do not need to use the wavelet tree.

  • Finally, if Uc is Lyndon, we know that u has at least one relevant Lyndon child: while Uc might still be irrelevant if \(\textsf{L}[\ell +1]\) is decreased before we visit Uc, the only nodes that may decrease \(\textsf{L}[\ell +1]\) before we visit Uc are relevant Lyndon children of u.

This observation allows us to find all relevant children of u (including the single immature child, if any) by iteratively conducting \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(k)\) range successor queries, where k is the number of relevant Lyndon children of u. Thus, the total number of wavelet tree queries issued is \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\), and the overall runtime is \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3 + n^2 \lg \sigma ) = \mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\).
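The three-way case analysis on Uc above can be mirrored by a small classifier (our own sketch: direct string comparisons stand in for the constant-time period-based test of Fact 3):

```python
BIG = chr(0x10FFFF)  # sentinel character larger than every character of the input

def is_lyndon(s: str) -> bool:
    # Lyndon: lexicographically smaller than all non-empty proper suffixes
    return all(s < s[i:] for i in range(1, len(s)))

def classify(U: str, c: str) -> str:
    """Classify the extension U+c as 'lyndon', 'immature'
    (pre-Lyndon but not Lyndon), or 'not pre-Lyndon'."""
    s = U + c
    if is_lyndon(s):
        return "lyndon"
    if is_lyndon(s + BIG):       # s is a prefix of some Lyndon word
        return "immature"
    return "not pre-Lyndon"
```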

Theorem 9

We can compute the longest Lyndon subsequence of a string of length n in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) space.

We remark that, by Lemma 7, our algorithm can be easily modified to compute, for each length \(\ell \), the lexicographically smallest length-\(\ell \) Lyndon subsequence of T (if one exists). For this, it suffices to output, for each \(\ell \), the first visited Lyndon subsequence of length \(\ell \).

4.3 Online Computation

If we allow for more space to maintain the trie data structure introduced in Sect. 3.2, we can modify our \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3 \sigma )\)-time algorithm of Sect. 4.1 to perform the computation online, i.e., with T given as a text stream. To this end, let us recall the trie \(\tau \) of all pre-Lyndon subsequences introduced at the beginning of Sect. 4. In the online setting, when reading a new character c, for each subsequence S given by a path from \(\tau \)’s root (S may be empty), we add a new node for Sc if Sc is a pre-Lyndon subsequence that is not yet represented by such a path. Again, storing all nodes of \(\tau \) explicitly would cost us too much, so we prune irrelevant nodes, obtaining a trie \(\tau '\) of size \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3\sigma )\). The problem is that we can no longer perform the traversal in lexicographic order, so we instead keep multiple fingers in the trie \(\tau '\) constructed so far and use these fingers to advance the trie traversal in text order.

With a different traversal order, we need an updated definition of \(\textsf{L}[1..n]\): Now, once the algorithm starts processing T[i], the entry \(\textsf{L}[\ell ]\) stores the lexicographically smallest length-\(\ell \) Lyndon subsequence of \(T[1..i-1]\) (represented by a pointer to the corresponding node of \(\tau '\)) or is empty if no such subsequence exists. Further, we maintain \(\sigma \) lists \(\textsf{P}_{c}\) (\(c \in [1..\sigma ]\)) storing pointers to nodes of \(\tau '\). Once the algorithm starts processing T[i], the list \(\textsf{P}_{c}\) contains pointers to all relevant nodes with string label U such that Uc is a pre-Lyndon word that is not a subsequence of \(T[1..i-1]\) (i.e., \(\textsf{pos}_{T}(Uc)\ge i\)). Initially, \(\tau '\) consists only of the root node, and each list \(\textsf{P}_{c}\) stores only the root node. Whenever we read a new character T[i] from the text stream, for each node v with string label V in \(\textsf{P}_{T[i]}\), we insert a leaf with string label \(S:=V\cdot T[i]\) (as a child of v). The characterization of \(\textsf{P}_{T[i]}\) guarantees that \(\textsf{pos}_{T}(S)=i\), so such a node does not exist yet. In order to keep the table \(\textsf{L}[1..n]\) up-to-date, we also check whether S is a Lyndon word satisfying \(S\prec \textsf{L}[|S|]\) (which can be tested using the data structure of Sect. 3.2) and, if so, we further set \(\textsf{L}[|S|]:=S\). Next, we clear \(\textsf{P}_{T[i]}\) and iterate again over the newly created leaves. For each such leaf \(\lambda \) with label S, we check whether \(\lambda \) is relevant by performing a comparison \(S \preceq \textsf{L}[|S|]\). If \(\lambda \) is relevant, we put \(\lambda \) into \(\textsf{P}_{c}\) for each character \(c \in \Sigma \) such that Sc is a pre-Lyndon word. By doing so, we effectively create new events that trigger a call-back to the point where we stopped the trie traversal.
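The event mechanism can be modeled by a minimal Python sketch (our own illustration: trie nodes are represented by plain strings, the alphabet is passed explicitly, and all Lyndon tests are direct comparisons, so the sketch captures the lists \(\textsf{P}_{c}\) and the table \(\textsf{L}\) but not the stated complexity bounds):

```python
from collections import defaultdict

def longest_lyndon_online(stream, alphabet) -> str:
    """Event-driven online computation of a longest Lyndon subsequence."""
    BIG = chr(0x10FFFF)                      # sentinel above every input character

    def is_lyndon(s):
        return all(s < s[i:] for i in range(1, len(s)))

    L = {}                 # L[l]: lexicographically smallest length-l Lyndon subseq. so far
    P = defaultdict(list)  # P[c]: relevant nodes U with U+c pre-Lyndon and not yet a subsequence
    for c in alphabet:
        P[c].append("")    # initially, each list P_c contains only the root
    best = ""
    for t in stream:                         # read the text character by character
        new_leaves = [U + t for U in P.pop(t, [])]
        for S in new_leaves:                 # first pass: keep the table L up to date
            if is_lyndon(S) and (len(S) not in L or S < L[len(S)]):
                L[len(S)] = S
                if len(S) > len(best):
                    best = S
        for S in new_leaves:                 # second pass: issue call-back events
            if S <= L.get(len(S), S):        # relevance test  S <= L[|S|]
                for c in alphabet:
                    if is_lyndon(S + c + BIG):   # S+c is pre-Lyndon
                        P[c].append(S)
    return best
```

Note that this sketch outputs some longest Lyndon subsequence, not necessarily the lexicographically smallest one, since leaves are created in text order rather than lexicographic order.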

Overall, we generate exactly the nodes visited by the algorithm of Sect. 4.1 (although in a different order). In particular, there are \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) relevant nodes, and we issue \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(\sigma )\) events for each such node. The operations of Sect. 3.2 take constant time, so the total time and space complexity of the algorithm are \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3\sigma )\).

Theorem 10

We can compute the longest Lyndon subsequence online in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3 \sigma )\) time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3 \sigma )\) space.

Fig. 8

Online computation on the prefix bccadb of our running example. The trie on the left shows \(\tau \), where nodes are labeled by a rank reflecting the order in which they have been created. This rank is used in the lists \(\textsf{P}_{c}\) and the table \(\textsf{W}\) as a pointer to the trie nodes. Like before, nodes with rectangular boxes have immature string labels. On the right, we show the non-empty entries of \(\textsf{L}\) and \(\textsf{W}\), where each row corresponds to one length \(\ell \). On reading the first d from the text, we do not create a node for bd since we already have bc, which also needs a character larger than or equal to c to be extended to a Lyndon subsequence of length three

We can improve the space and time bounds by treating immature subsequences and Lyndon subsequences separately. First, we only add a leaf \(\lambda \) with string label S into \(\textsf{P}_{c}\) if Sc is immature (i.e., we no longer store \(\lambda \) in \(\textsf{P}_{c}\) if Sc is Lyndon). Second, we now handle the case that Sc is Lyndon differently, using a table \(\textsf{W}[0..n,1..\sigma ]\) of size \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n\sigma )\). Throughout the execution of the algorithm, \(\textsf{W}[0,c]\) stores (a pointer to) the root node for each \(c \in [1..\sigma ]\). For each length \(\ell \ge 1\) and character \(c\in [1..\sigma ]\), the entry \(\textsf{W}[\ell ,c]\) stores a pointer to a relevant node with string label S of length \(\ell \) such that Sc is immature. If there is no such node, the entry \(\textsf{W}[\ell ,c]\) remains empty. If there are multiple candidates, we pick the one with the lexicographically smallest string label S. This choice is dictated by the following corollary:

Corollary 11

(of Lemma 7) Consider two nodes u and v of \(\tau '\) with string labels U and V, respectively, such that \(|U| = |V|\) (u and v are at the same depth), \(V \prec U\), and Uc and Vc are immature. Assume that we construct, later on, a child of u whose string label is Lyndon. Then this child is actually irrelevant.

Proof

Suppose that we read a character \(T[i] \succ c\) such that we can create a child node \(u'\) of u with string label \(U \cdot T[i]\). Then \(V \cdot T[i]\) is also a Lyndon word, and we can apply Lemma 7. \(\square \)

When reading character T[i], for each length \(\ell \in [1..i]\), we might create at most one relevant Lyndon node of length \(\ell \) (the string label S of this node satisfies \(\textsf{pos}_{T}(S)=i\)). By Corollary 11, the parent of this Lyndon node is among the nodes \(\textsf{W}[\ell -1,1..T[i]-1]\). For each node v with string label V in \(\textsf{W}[\ell -1,1..T[i]-1]\), the string \(V\cdot T[i]\) is a Lyndon subsequence of T[1..i]. If there are multiple candidates, it suffices to consider one with the lexicographically smallest label V. If \(V\cdot T[i] \prec \textsf{L}[\ell ]\) holds for this label, then \(V\cdot T[i]\) is relevant and satisfies \(\textsf{pos}_{T}(V\cdot T[i])=i\). Hence, we create a new leaf linked with v using an edge with label T[i]. Moreover, we set \(\textsf{L}[\ell ]:=V\cdot T[i]\). This way, we add all the new relevant Lyndon nodes. As for the immature nodes, we scan \(\textsf{P}_{T[i]}\): given a node v in \(\textsf{P}_{T[i]}\) with string label V, if \(V\cdot T[i] \prec \textsf{L}[|V|+1]\), we create a new leaf linked with v using an edge with label T[i]. Before we complete processing T[i], we need to update the lists \(\textsf{P}_{c}\) and the table \(\textsf{W}\). Thus, we clear \(\textsf{P}_{T[i]}\) and, for each newly created leaf \(\lambda \) with string label S, we use Fact 3 to compute the character c such that Sc is immature. We then append \(\lambda \) to \(\textsf{P}_{c}\) and, if \(\textsf{W}[|S|,c]\) is empty or Sc is lexicographically smaller than the string label of \(\textsf{W}[|S|,c]\), we also store \(\lambda \) at \(\textsf{W}[|S|,c]\).

Per read character T[i], we scan \(\textsf{W}\) in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n\sigma )\) time, which results in insertion of \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) relevant Lyndon nodes. Moreover, we process \(\textsf{P}_{T[i]}\) (in time proportional to its length), which results in insertion of some relevant immature nodes. The total number of relevant nodes is \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) and, for each such node, we issue one event into the lists \(\textsf{P}_{}\). Hence, the total running time is \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\).

Theorem 12

We can compute the longest Lyndon subsequence online in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) space.

5 Longest Common Lyndon Subsequence

Given two strings X and Y, we want to compute the longest common subsequence (LCS) of X and Y that is Lyndon. In the special case that all characters in X and Y are unique, we can make use of an algorithm computing the LCS of two strings of length \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) with unique letters in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n\log \log n)\) time and \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) space [24, Theorem 2(b)]: Similar to Theorem 5, we scan X from left to right, and compute, for each visited character X[i], the LCS of \(\pi _{X[i]}(X[i+1..])\) and \(\pi _{X[i]}(Y[\textsf{pos}_{Y}(X[i])+1..])\), where \(\pi _{X[i]}\) discards all characters that are smaller than X[i]; if X[i] does not occur in Y, we omit text position i and directly continue with \(i+1\). If we take the maximum of all these at most n LCS lengths and increment this maximum by one for the matched character, we obtain the longest common Lyndon subsequence of X and Y in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2\log \log n)\) total time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) space.
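This special case can be sketched as follows (function names are ours, and a quadratic-time textbook LCS stands in for the \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n\log \log n)\)-time algorithm of [24]; the sketch returns the length only):

```python
def lcs_len(A, B) -> int:
    """Classic dynamic-programming LCS length; a quadratic stand-in
    for the faster algorithm used in the text."""
    dp = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a == b else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(A)][len(B)]

def longest_common_lyndon_unique(X: str, Y: str) -> int:
    """Length of the longest common Lyndon subsequence when all characters
    within X and within Y are unique: match each X[i] with its occurrence
    in Y, apply the projection pi (discard characters smaller than X[i]),
    and take an LCS of the remaining suffixes."""
    pos_y = {ch: j for j, ch in enumerate(Y)}      # unique characters: one position each
    best = 0
    for i, ch in enumerate(X):
        if ch not in pos_y:
            continue                               # X[i] does not occur in Y
        fx = [a for a in X[i + 1:] if a >= ch]     # pi_{X[i]} applied to X[i+1..]
        fy = [a for a in Y[pos_y[ch] + 1:] if a >= ch]
        best = max(best, 1 + lcs_len(fx, fy))      # +1 for the matched character
    return best
```

The correctness hinges on the uniqueness assumption: a subsequence with pairwise distinct characters is Lyndon exactly when it starts with its unique minimal character.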

For the general case, we can extend our algorithm finding the longest Lyndon subsequence of a single string as follows. First, we explore, in lexicographic order, the trie of all common pre-Lyndon subsequences of X and Y. A node with string label L of length \(\ell \) is represented by a stack \((x_1, y_1), \ldots , (x_\ell , y_\ell )\) with \(x_j=\textsf{pos}_{X}(L[1..j])\) and \(y_j=\textsf{pos}_{Y}(L[1..j])\). The depth-first search works like an exhaustive search in that it tries to extend L with subsequent characters \(c\in \Sigma \) such that Lc is pre-Lyndon and c occurs in both \(X[x_{\ell }+1..]\) and \(Y[y_{\ell }+1..]\). For each such character c, the pair \((x_{\ell +1},y_{\ell +1})\) consists of the positions of the leftmost occurrences of Lc in X and Y, respectively, which can be precomputed in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n\sigma )\) time and space.

The algorithm again uses the array \(\textsf{L}\) to check, while processing a pre-Lyndon subsequence U, whether we have already found a Lyndon subsequence V of the same length satisfying \(V\prec U\), \(\textsf{pos}_{X}(V)\le \textsf{pos}_{X}(U)\), and \(\textsf{pos}_{Y}(V)\le \textsf{pos}_{Y}(U)\). For that, \(\textsf{L}[\ell ]\) stores not just one position, but a list of position pairs (x, y) such that X[1..x] and Y[1..y] have a common Lyndon subsequence of length \(\ell \). Although there can be \(n^2\) such pairs of positions, we only store those that are pairwise non-dominated. A pair of positions \((x_1,y_1)\) is called dominated by a pair \((x_2,y_2) \ne (x_1,y_1)\) if \(x_2 \le x_1\) and \(y_2 \le y_1\). A set storing pairs in \([1..n] \times [1..n]\) can have at most n elements that are pairwise non-dominated, and hence \(|\textsf{L}[\ell ]| \le n\).

At the beginning, all lists of \(\textsf{L}\) are empty. Suppose that we visit a node v with pair \((x_\ell ,y_\ell )\) representing a common Lyndon subsequence of length \(\ell \). We then query whether \(\textsf{L}[\ell ]\) has a pair dominating \((x_\ell ,y_\ell )\). If so, we can skip v and its subtree. Otherwise, we insert \((x_\ell ,y_\ell )\) and remove the pairs in \(\textsf{L}[\ell ]\) that are dominated by \((x_\ell ,y_\ell )\). Such an insertion can happen at most \(n^2\) times. Since \(\textsf{L}[1..n]\) maintains n lists, we can update \(\textsf{L}\) at most \(n^3\) times in total. Checking for domination and inserting into \(\textsf{L}\) take \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) time each. The former can be accelerated to constant time by representing \(\textsf{L}[\ell ]\) as an array \(R_\ell \) that stores in \(R_\ell [i]\), for each \(i \in [1..n]\), the lowest value y over all tuples \((x,y) \in \textsf{L}[\ell ]\) with \(x \le i\). Then, a pair \((x,y) \not \in \textsf{L}[\ell ]\) is dominated if and only if \(R_\ell [x] \le y\).

Example 13

For \(n = 10\), let \(\textsf{L}[\ell ] = [(3,9), (5,4), (8,2)]\). Then, all elements in \(\textsf{L}[\ell ]\) are pairwise non-dominated, and \(R_\ell = [\infty ,\infty ,9,9,4,4,4,2,2,2]\). Inserting (3, 2) would remove all elements of \(\textsf{L}[\ell ]\) and decrease all finite entries of \(R_\ell \) to 2. Alternatively, inserting (7, 3) would only involve updating \(R_\ell [7] \leftarrow 3\); since the subsequent entry \(R_\ell [8] = 2\) is less than \(R_\ell [7]\), no further entries need to be updated.
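The maintenance of \(R_\ell \) sketched in this example can be written out directly (our own sketch; arrays are 1-indexed with an unused slot 0, and `INF` marks "no stored pair with \(x \le i\) yet"):

```python
INF = float("inf")

def insert_pair(R, x: int, y: int) -> None:
    """Insert (x, y): lower entries R[x..n] to y, stopping as soon as an
    entry is already <= y (R is non-increasing, so the rest are too)."""
    for i in range(x, len(R)):
        if R[i] <= y:
            break
        R[i] = y

def is_dominated(R, x: int, y: int) -> bool:
    """A pair (x, y) not in the list is dominated iff R[x] <= y."""
    return R[x] <= y

def build_R(n: int, pairs):
    """R[i] = smallest y over stored pairs (x, y) with x <= i (R[0] unused)."""
    R = [INF] * (n + 1)
    for x, y in pairs:
        insert_pair(R, x, y)
    return R
```

Running this on the data of Example 13 reproduces the stated array and updates.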

An update in \(\textsf{L}[\ell ]\) involves changing \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) entries of \(R_\ell \), but that cost is dwarfed by the cost of finding the next common Lyndon subsequence that updates \(\textsf{L}\). Such a subsequence can be found while visiting \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n \sigma )\) irrelevant nodes during a naive depth-first search (cf. the solution of Sect. 3.1 computing the longest Lyndon subsequence of a single string). Hence, the total time is \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^4 \sigma )\). The space complexity is dominated by the representation of the array \(\textsf{L}\) with the arrays \(R_\ell \). Since each \(R_\ell \) uses \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n)\) space for \(\ell \in [1..n]\), the total space is bounded by \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\).

Theorem 14

We can compute the longest common Lyndon subsequence of two strings of length n in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^4 \sigma )\) time using \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^2)\) space.

Fig. 9

Computing the longest Lyndon subsequence on prefixes of various texts. Left: Computation time. Right: Lengths of the longest Lyndon subsequence

6 Experiments

We implemented our algorithm of Theorem 8 computing the longest Lyndon subsequence and benchmarked this implementation on various texts. Our implementation, written in Rust, is publicly available. We evaluated it on a server with an Intel i3-9100 CPU, running Debian 11. Since the time complexity of our algorithm is far from linear, we only benchmarked the computation on prefixes of a few hundred characters. For such short strings, the tasks of the RMQ data structure and the wavelet tree can be performed by a linear scan on the text without degrading the performance too much (in fact, a linear scan on such short strings is particularly fast due to its cache-friendly data locality). We tested our algorithm on prefixes of the datasets of the Pizza & Chili corpus, and on artificial random datasets random.x, where \(x \in \{2,4,16,32\}\) denotes the alphabet size. Additionally, we took the Thue–Morse, Fibonacci, and period-doubling sequences. Figure 9 depicts the evaluation results. We observe that the running time is super-linear on all instances. The time also depends on the alphabet size, since we need more time for random.x than for random.y with \(x > y\). Another observation is that the lengths of the longest Lyndon subsequences we output grow linearly with the input size; here, the dataset pitches has a slightly shorter output. Although the length-300 prefixes of pitches and english have 126 and 44 distinct characters, respectively, the running times on these two prefixes are roughly the same. We therefore conclude that the distribution of the characters also has an impact on the running time.

7 Conclusion

This article has shed light, for the first time, on the computation of the longest Lyndon subsequence. We began by studying the lexicographically smallest subsequence and the most competitive subsequence. Both problems are related to Lyndon subsequences in that they are all based on the lexicographic order. In the main part of this article, we focused on the computation of the longest Lyndon subsequence, for which we proposed algorithms for the offline and the online setting. Finally, we extended our offline algorithm to compute the longest common Lyndon subsequence of two strings. Different but much easier solutions can be obtained in the special case that all characters are unique. Table 1 summarizes the algorithmic complexities we obtained or observed during the analysis of our algorithms computing the subsequences we studied.

Table 1 Algorithmic complexities for computing subsequences of various kinds studied in this article

7.1 Open Problems

It is known that the longest common subsequence of two strings of length n cannot be computed in \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^{2-\epsilon })\) time for any \(\epsilon > 0\) unless the strong exponential time hypothesis (SETH) is false [1]. This conditional lower bound has been translated to other variations like finding the longest square subsequence [26, Section 4]. Unfortunately, we do not see how to obtain similar (conditional) lower bounds for the problems studied in this article. Such lower bounds would either justify our time and space complexities or give hope for finding better algorithms.

For the online computation studied in Sect. 4.3, the current bottleneck is the trie representation used, which represents \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) nodes explicitly, and therefore needs \(\mathop {}\mathopen {}\mathcal {O}\mathopen {}(n^3)\) time and space. We wonder whether we can find an implicit representation for the immature and irrelevant nodes that improves both complexities.

On the practical side, it is possible to enhance our implementation of Sect. 6 to also cover the algorithmic improvements described in Sect. 4.2. To be competitive with the current implementation, efficient implementations of range maximum queries and range successor queries need to be used. However, we are not aware of any optimized implementation of range successor queries.

Finally, we remark that we can extend our techniques to a special case of so-called Galois words [12, Section 6]. Galois words are defined in the setting of the alternating order \(\prec _{alt}\), which, when comparing two strings character by character, ranks odd positions by the classic lexicographic order but even positions by the opposite order. For instance, \(\texttt{ab} \prec _{alt}\texttt{aa} \prec _{alt}\texttt{bb} \prec _{alt}\texttt{ba}\). A Galois word is then a word that is strictly smaller than all of its cyclic rotations. A major difference to the lexicographic order is that a prefix of a string S is smaller than S only if its length is even, e.g., \(\texttt{ab} \prec _{alt}\texttt{a}\). Now, if we stipulate that a prefix P of a string S always satisfies \(P \prec _{alt}S\) (so we slightly modify the standard definition), then we can directly translate our techniques to compute the longest non-bordered Galois subsequence. This is because a non-bordered string is Galois if all its proper suffixes are \(\prec _{alt}\)-larger than itself. However, it is not clear to us how to find the longest bordered one, because our modified definition of \(\prec _{alt}\) for prefixes does not make sense when regarding bordered Galois subsequences. For instance, aba is a bordered Galois word under the standard definition of the \(\prec _{alt}\)-order.
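The alternating order with the standard prefix rule (a proper prefix is smaller exactly when its length is even) can be made precise with a small comparator (our own sketch):

```python
def alt_less(u: str, v: str) -> bool:
    """True iff u precedes v in the alternating order: the first mismatch is
    decided by the usual order at odd positions (1st, 3rd, ...) and by the
    reversed order at even positions; a proper prefix is smaller iff its
    length is even."""
    for i, (a, b) in enumerate(zip(u, v)):
        if a != b:
            return a < b if i % 2 == 0 else a > b
    if len(u) == len(v):
        return False             # equal strings
    if len(u) < len(v):
        return len(u) % 2 == 0   # u is a proper prefix of v
    return len(v) % 2 == 1       # v is a proper prefix of u
```

This reproduces the chain \(\texttt{ab} \prec _{alt}\texttt{aa} \prec _{alt}\texttt{bb} \prec _{alt}\texttt{ba}\) as well as \(\texttt{ab} \prec _{alt}\texttt{a}\).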