1 Introduction

In biology, duplication is an important part of evolution. There are two kinds of duplications: arbitrary segmental duplications (i.e., selecting a segment and pasting it somewhere else) and tandem duplications (which are of the form \(X\rightarrow XX\), where X is any segment of the input sequence). It is known that the former occur frequently in cancer genomes [4, 16, 20]. The latter, on the other hand, are common under different scenarios; for example, it is known that the tandem duplication of the 3 nucleotides CAG is closely related to Huntington's disease [15]. In addition, tandem duplications can occur at the genome level (across different genes) for certain types of cancer [17]. In fact, as early as 1980, Szostak and Wu provided evidence that gene duplication is the main driving force behind evolution, and that the majority of duplications are tandem [21]. Consequently, it was no surprise that in the first sequenced human genome around 3% of the genetic content is in the form of tandem repeats [13].

Independently, tandem duplications were also studied in copying systems [7], as well as in formal languages [2, 5, 22]. In 2004, Leupold et al. posed a fundamental question regarding tandem duplications: what is the complexity of computing the minimum tandem duplication distance between two sequences A and B (i.e., the minimum number of tandem duplications needed to convert A to B)? In 2020, Lafond et al. [9] answered this open question by proving that the problem is NP-hard for an unbounded alphabet. Lafond et al. later proved that the problem remains NP-hard even if \(|\Sigma |\ge 4\), by encoding each letter in the unbounded-alphabet proof with a square-free string over a new alphabet of size 4 (modified from Leech's construction [14]); this covers the case most relevant to biology, i.e., \(\Sigma =\{\texttt {A},\texttt {C},\texttt {G},\texttt {T}\}\) (for DNA sequences) or \(\Sigma =\{\texttt {A},\texttt {C},\texttt {G},\texttt {U}\}\) (for RNA sequences) [11]. Independently, Cicalese and Pilati showed that the problem is NP-hard for \(|\Sigma |=5\) using a different encoding method [3].

Motivated by the above applications (especially when some mutations occur after the duplications), some new problems related to duplications are proposed and studied in this paper. Given a sequence S of length n, a letter-duplicated subsequence (LDS) of S is a subsequence of S of the form \(x_1^{d_1}x_2^{d_2}\ldots x_k^{d_k}\) with \(x_i\in \Sigma \), where \(x_j\ne x_{j+1}\) and \(d_i\ge 2\) for all i in [k] and j in \([k-1]\). (Each \(x_i^{d_i}\) is called an LD-block.) Naturally, the problem of computing a longest letter-duplicated subsequence (LLDS) of S can be defined, and a simple linear-time algorithm can be obtained. An example shows the idea behind this problem: let \(B=AACACAGATGAT\); due to local mutations, insertions and deletions it becomes \(S=AACACGTCGAT\), but a longest letter-duplicated subsequence \(X_1=AACCGG\) or \(X_2=AACCTT\) would still give us the skeleton of the initial sequence B. (Recently, Lafond et al. [10] have considered a slightly more complex version, but the corresponding running times are significantly higher. In the conclusion section, we will discuss that perspective a little more.)

We remark that a similar problem, called longest run subsequence, was recently studied by Schrinner et al. [18, 19]. It differs from our problem in that each letter appears consecutively at most once in the solution as a run (which is a substring containing one or more repetitions of the same letter); the goal is the same, i.e., the length of such a subsequence is to be maximized. For this problem, additional results on FPT intractability can be found in [6] and additional approximation results can be found in [1].

In this paper, we focus on some important variants of the LLDS problem, namely the constrained and weighted cases. The constraint is to demand that all letters of \(\Sigma \) appear in a resulting LDS, which models the question of how to compute the maximum duplicated pattern in a genome with duplicated genes while including all the genes. We then have two problems: feasibility testing (FT for short), which decides whether an LDS of S containing all letters of \(\Sigma \) exists, and the problem of maximizing the length of a resulting LDS in which all letters of the alphabet appear, which we call LLDS+. It turns out that the status of these two problems changes quite a bit when d, the maximum number of times a letter can appear in S, varies. We denote the corresponding problems by FT(d) and LLDS+(d) respectively. Let \(|S|=n\); we summarize our main results in this paper as follows:

  1.

    We show that when \(d\ge 4\), both FT(d) and (the decision version of) LLDS+(d) are NP-complete, which implies that LLDS+(d) does not have a polynomial-time approximation algorithm when \(d\ge 4\).

  2.

    We show that when \(d=3\), both FT(d) and LLDS+(d) admit an O(n) time algorithm, by exploiting a new property of the problem.

  3.

    When the weight of an LD-block is an arbitrary positive function (i.e., it does not even have to grow with the block's length), we present a non-trivial \(O(n^2)\) time dynamic programming solution for this Weighted-LDS problem.

Note that the parameter d, i.e., the maximum duplication number, is of practical interest in bioinformatics, since in many genomes duplication is a rare event and the number of duplicates is usually a small constant. For example, it is known that plants have undergone up to three rounds of whole genome duplications, resulting in a number of duplicates bounded by 8 [23].

An earlier version of this paper appeared in [12], where the NP-completeness results were only shown for \(d\ge 6\) and some partial results were shown for \(d=3\) (a huge gap was left open for \(d=4,5\)). In this paper, we close this gap completely.

This paper is organized as follows. In Sect. 2 we give necessary definitions. In Sect. 3 we focus on showing that the LLDS+ and FT problems are NP-complete when \(d\ge 4\). In Sect. 4 we present the linear time algorithms for both FT and LLDS+ when \(d=3\). In Sect. 5 we give polynomial-time algorithms for Weighted-LDS. We conclude the paper in Sect. 6, where we summarize our results and also discuss some related recent research.

2 Preliminaries

Let \(\mathbb {N}\) be the set of natural numbers. For \(q\in \mathbb {N}\), we use [q] to represent the set \(\{1,2,...,q\}\). Throughout this paper, a sequence S is over a finite alphabet \(\Sigma \). We use S[i] to denote the i-th letter of S and S[i..j] to denote the substring of S starting and ending at indices i and j respectively. (Sometimes we also use (S[i], S[j]) as an interval representing the substring S[i..j].) With the standard run-length representation, S can be written as \(y_1^{a_1}y_2^{a_2}\ldots y_q^{a_q}\), with \(y_i\in \Sigma ,y_j\ne y_{j+1}\) and \(a_i\ge 1\), for \(i\in [q],j\in [q-1]\). If a letter x appears multiple times in S, we use \(x^{(i)}\) to denote the i-th copy of it (reading from left to right). Finally, a subsequence of S is a string obtained by deleting some letters of S.
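For concreteness, the run-length representation can be computed in one pass; a minimal sketch (the helper name `run_length` is ours, not the paper's):

```python
from itertools import groupby

def run_length(s):
    """Run-length representation of s: the pairs (y_i, a_i) with
    s = y_1^{a_1} y_2^{a_2} ... y_q^{a_q} and y_i != y_{i+1}."""
    return [(ch, len(list(g))) for ch, g in groupby(s)]
```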

2.1 The LLDS problem

A subsequence \(S'\) of S is a letter-duplicated subsequence (LDS) of S if it is in the form of \(x_1^{d_1}x_2^{d_2}\ldots x_k^{d_k}\), with \(x_i\in \Sigma ,x_j\ne x_{j+1}\) and \(d_i\ge 2\), for \(i\in [k],j\in [k-1]\). We call each \(x_i^{d_i}\) in \(S'\) a letter-duplicated block (LD-block, for short). For instance, let \(S=abcacabcb\), then \(S_1=aaabb\), \(S_2=ccbb\) and \(S_3=ccc\) are all letter-duplicated subsequences of S, where aaa and bb in \(S_1\), cc and bb in \(S_2\), and ccc in \(S_3\) all form the corresponding LD-blocks. Certainly, we are interested in the longest ones — which gives us the longest letter-duplicated subsequence (LLDS) problem.

As a warm-up, we solve this problem by dynamic programming. We first have the following observation.

Observation 1

Suppose that there is an optimal LLDS solution for a given sequence S of length n, in the form of \(x_1^{d_1} x_2^{d_2} \ldots x_k^{d_k}\). Then it is possible to decompose it into a generalized LD-subsequence \(y_1^{e_1} y_2^{e_2} \ldots y_p^{e_p}\), which has the following properties:

  • \( 2 \le e_i \le 3\), for \(i\in [p]\),

  • \(p \ge k\),

  • \(y_j\) does not have to be different from \(y_{j+1}\), for \(j\in [p-1]\).

The proof is straightforward: any natural number \(\ell \ge 2\) can be written as \(\ell = \ell _1 + \ell _2 + \ldots + \ell _z\) such that \(2 \le \ell _j \le 3\) for \(1 \le j \le z\). Consequently, every \(d_i>3\) can be decomposed into a sum of 2's and 3's. Conversely, given a generalized LD-subsequence, we can easily obtain the corresponding LD-subsequence by combining \(y_i^{e_i}y_{i+1}^{e_{i+1}}\) whenever \(y_i=y_{i+1}\).
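The decomposition of \(\ell \ge 2\) into 2's and 3's used in this proof can be made concrete with a simple greedy sketch (the function name is ours):

```python
def decompose_2_3(length):
    """Split an integer length >= 2 into parts equal to 2 or 3:
    take 2's while more than 3 remains, then finish with
    a single 2 or 3."""
    assert length >= 2
    parts = []
    while length > 3:
        parts.append(2)
        length -= 2
    parts.append(length)   # the remainder is exactly 2 or 3
    return parts
```

For instance, \(d_i=7\) becomes \(2+2+3\), so \(x^7\) is read as the generalized blocks \(x^2x^2x^3\).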

We now design a dynamic programming algorithm for LLDS. Let L(i) be the length of the optimal LLDS solution for S[1..i]. The recurrence for L(i) is as follows.

$$\begin{aligned} L(0)&= 0, \\ L(1)&= 0, \\ L(i)&= \max {\left\{ \begin{array}{ll} L(i-x-1) + 2 &{} x = \min \{x| S[i-x] = S[i] \}, x\in (0,i-1] \\ L(i-x) + 1 &{} x = \min \{x| S[i-x] = S[i] \}, x\in (0,i-1] \\ L(i-1) &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Note that the step involving \(L(i-x)+1\) is essentially a way to handle a block of size 3 in a generalized LD-subsequence (by keeping \(S[i-x]\) for the next level of the computation) and, by the above observation, it cannot be omitted. For instance, if \(S=dabcdd\), then without that step we would miss the optimal solution ddd.

The value of the optimal LLDS solution for S can be found in L(n). As for the running time, for each position i we just need the closest previous position holding the same letter as S[i], and these values can be computed for all i in one scan of S. With this information, the table L can be filled in linear time. With a simple augmentation, the actual sequence corresponding to L(n) can also be recovered in linear time. Hence LLDS can be solved in O(n) time.
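The recurrence above can be sketched as follows (a Python sketch under 0-based indexing; `prev[i]` holds the closest earlier position with the same letter):

```python
def llds_length(s):
    """O(n) DP for the longest letter-duplicated subsequence.

    L[i] is the optimum for the prefix s[0..i-1]; prev[i] is the
    closest j < i with s[j] == s[i], or -1 if none exists."""
    n = len(s)
    last, prev = {}, [-1] * n
    for i, ch in enumerate(s):
        prev[i] = last.get(ch, -1)
        last[ch] = i
    L = [0] * (n + 1)
    for i in range(1, n + 1):
        best = L[i - 1]               # drop s[i-1]
        p = prev[i - 1]
        if p >= 0:
            best = max(best,
                       L[p] + 2,      # pair s[p], s[i-1] as a new 2-block
                       L[p + 1] + 1)  # extend a block ending at s[p]
        L[i] = best
    return L[n]
```

On the example from the text, `llds_length("dabcdd")` returns 3 (the block ddd).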

2.2 The variants of LLDS

In this paper, we focus on the following variations of the LLDS problem.

Definition 1

Constrained Longest Letter-Duplicated Subsequence (\(LLDS+\) for short)

Input: A sequence S with length n over an alphabet \(\Sigma \) and an integer \(\ell \).

Question: Does S contain a letter-duplicated subsequence \(S'\) with length at least \(\ell \) such that all letters in \(\Sigma \) appear in \(S'\)?

Definition 2

Feasibility Testing (FT for short)

Input: A sequence S with length n over an alphabet \(\Sigma \).

Question: Does S contain a letter-duplicated subsequence \(S''\) such that all letters in \(\Sigma \) appear in \(S''\)?

For LLDS+ we are really interested in the optimization version, i.e., to maximize \(\ell \). Note that, though looking similar, FT and the decision version of LLDS+ are different: if there is no feasible solution for FT, certainly there is no solution for LLDS+; but even if there is a feasible solution for FT, computing an optimal solution for LLDS+ could still be non-trivial.

Finally, let d be the maximum number of times a letter in \(\Sigma \) appears in S. Then, we can represent the corresponding versions for LLDS+ and FT as LLDS+(d) and FT(d) respectively.

It turns out that (the decision version of) LLDS+(d) and FT(d) are both NP-complete when \(d\ge 4\). When \(d=3\), FT(3) can be decided in O(n) time using a simple yet interesting property of the problem; on top of that, we can solve LLDS+(3) in O(n) time as well, using dynamic programming. We present the details in the next two sections. In Sect. 5, we will consider one more variant of LLDS, Weighted-LDS, where the weight of an LD-block is an arbitrary positive function.

3 Hardness for LLDS+(d) and FT(d) when \(d\ge 4\)

3.1 FT(4) is NP-complete

The idea is to reduce a special version of SAT, which we call \((3^+,1,2^-)\)-SAT, to FT(4). We first show that this version is NP-complete. In it, all variables appear positively in 3-CNF clauses (i.e., clauses containing exactly 3 positive literals), each variable appearing positively exactly once; moreover, the negations of the variables appear in 2-CNF clauses (i.e., clauses containing 2 negative literals), possibly many times (in fact, at least 5 times when the graph has a leaf node, as can be seen next). A valid truth assignment for a \((3^+,1,2^-)\)-SAT instance \(\phi \) is one which makes \(\phi \) True and in which, moreover, each 3-CNF clause has exactly one true literal.

The reduction is folklore and a sketch of the proof was posted on the internet (https://cs.stackexchange.com/questions/16634); here we give a formal proof.

Theorem 1

\((3^+,1,2^-)\)-SAT is NP-complete.

Proof

As the problem is easily seen to be in NP, let us focus on the reduction from 3-COLORING. In 3-COLORING, given a graph \(G=(V,E)\), one needs to assign one of 3 colors to each vertex \(u\in V\) such that for any edge \((u,v)\in E\), u and v are given different colors.

For each vertex u, we use \(u_1,u_2\) and \(u_3\) to denote the 3 colors, then, obviously, we have the 3-CNF clause \((u_1\vee u_2\vee u_3)\). Therefore, the conjunction of positive 3-CNF clauses is

$$\begin{aligned} C^+=\bigwedge _{u\in V}(u_1\vee u_2 \vee u_3). \end{aligned}$$

We have two kinds of 2-CNF clauses. First, for each \(u\in V\), we have a type-1 2-CNF clause (which means that color-i and color-j cannot both be used to color u):

$$\begin{aligned} \overline{u_i\wedge u_j}=(\bar{u}_i\vee \bar{u}_j), \end{aligned}$$

for \(1\le i\ne j\le 3\). Then, for each edge \((u,v)\in E\), we have a type-2 2-CNF clause (which means that one cannot color u and v with the same color-i):

$$\begin{aligned} \overline{u_i\wedge v_i}=(\bar{u}_i\vee \bar{v}_i), \end{aligned}$$

for \(i=1,2,3\).

Let \(C^-\) be the conjunction of these 2-CNF clauses and let \(\phi =C^+\wedge C^-\). It is clear that G has a 3-coloring if and only if \(\phi \) has a valid truth assignment; we proceed to prove this statement.

If G has a 3-coloring and u is colored with color-i, then we assign \(u_i\leftarrow \) TRUE and \(u_j,u_k\leftarrow \) FALSE, where \(1\le i\ne j\ne k\le 3\). The 3-CNF clause on u in \(C^+\) is then certainly satisfied, with exactly one true literal in it. For a type-1 2-CNF clause \((\bar{u}_i\vee \bar{u}_j), i\ne j\), since \(u_i\) is assigned TRUE, \(u_j\) must be assigned FALSE (and so is \(u_k\)), so the clause is satisfied; likewise, a type-1 2-CNF clause \((\bar{u}_j\vee \bar{u}_k), i\ne j\ne k\), is certainly satisfied since both \(u_j\) and \(u_k\) are assigned FALSE. Finally, for an edge \((u,v)\in E\), since u is colored with color-i, v must be colored with a different color-j. Hence \(u_i\leftarrow \) TRUE, \(v_i\leftarrow \) FALSE and the type-2 2-CNF clause \(\bar{u}_i\vee \bar{v}_i\) is certainly satisfied.

For the reverse direction, if \(\phi \) has a truth assignment, for a 3-CNF clause \((u_1\vee u_2\vee u_3)\) in \(C^+\), the type-1 2-CNF clauses \((\bar{u}_i\vee \bar{u}_j)\) in \(C^-\), with \(1\le i\ne j\le 3\), would enforce that exactly one of \(u_1,u_2\) and \(u_3\) is assigned TRUE. On the other hand, the type-2 2-CNF clauses \((\bar{u}_i\vee \bar{v}_i)\) in \(C^-\), with \(1\le i\le 3\), would enforce that for an edge \((u,v)\in E\), \(u_i\) and \(v_i\) cannot both be assigned TRUE. Hence, if for each \(u_i\) assigned TRUE we color \(u\in V\) with color-i, then we have a valid 3-coloring for G.

The reduction obviously takes linear time. Hence the theorem is proven. \(\square \)

Fig. 1

A simple example for the reduction from 3-COLORING to \((3^+,1,2^-)\)-SAT. The conjunction of positive 3-CNF clauses is \(C^+= F^+_1\wedge F^+_{2}\wedge F^+_{3}\wedge F^+_{4}=(u_1\vee u_2\vee u_3)\wedge (v_1\vee v_2\vee v_3)\wedge (w_1\vee w_2\vee w_3) \wedge (z_1\vee z_2\vee z_3)\)

In Fig. 1, we show an example of the above reduction. According to the given graph, the type-1 2-CNF formula involving u is \(F^-_{1}\wedge F^-_{2}\wedge F^-_{3}=(\bar{u}_1\vee \bar{u}_2)\wedge (\bar{u}_1\vee \bar{u}_3)\wedge (\bar{u}_2\vee \bar{u}_3)\); the type-2 2-CNF formula corresponding to the edge (u, v) is \((\bar{u}_1\vee \bar{v}_1)\wedge (\bar{u}_2\vee \bar{v}_2)\wedge (\bar{u}_3\vee \bar{v}_3)\), and the type-2 2-CNF formula corresponding to the edge (u, w) is \((\bar{u}_1\vee \bar{w}_1)\wedge (\bar{u}_2\vee \bar{w}_2)\wedge (\bar{u}_3\vee \bar{w}_3)\). For convenience, we assume that these clauses are labelled \(F^-_{4}\) to \(F^-_{9}\) respectively.

We next reduce \((3^+,1,2^-)\)-SAT to FT(4). Given an input \(\phi \) for \((3^+,1,2^-)\)-SAT over 3n variables \(x_1,x_2,\ldots ,x_{3n}\), we label its 3-CNF clauses as \(F^+_1\), \(F^+_2\), \(\ldots \), \(F^+_n\) and its 2-CNF clauses as \(F^-_1,F^-_2,\ldots ,F^-_m\) (for the 2-CNF clauses we always list all type-1 clauses before the type-2 ones). (In Fig. 1, the vertices are labelled alphabetically, which can be interpreted as \(u_1=x_1,u_2=x_2,u_3=x_3,v_1=x_4\), etc.) For each variable \(x_i\), let \(L(x_i)\) be the list of type-1 2-CNF clauses containing \(\bar{x}_i\), each repeated twice consecutively, followed by the type-2 2-CNF clauses containing \(\bar{x}_i\), again each repeated twice. Clearly, each 2-CNF clause appears exactly 4 times over all these lists (see the arguments below). As an example, following Fig. 1, \(L(u_1)= F^-_{1}F^-_{1}\cdot F^-_{2}F^-_{2}\cdot F^-_{4}F^-_{4}\cdot F^-_{7}F^-_{7}\); \(L(u_2)= F^-_{1}F^-_{1}\cdot F^-_{3}F^-_{3}\cdot F^-_{5}F^-_{5}\cdot F^-_{8}F^-_{8}\); and \(L(u_3)= F^-_{2}F^-_{2}\cdot F^-_{3}F^-_{3}\cdot F^-_{6}F^-_{6}\cdot F^-_{9}F^-_{9}\).

Let \(F^+_i=(x_{i,1}\vee x_{i,2} \vee x_{i,3})\). We construct \(H_i=F^+_i\cdot L(x_{i,1})\cdot F^+_i\cdot L(x_{i,2})\cdot F^+_i\cdot L(x_{i,3})\cdot F^+_i\), where \(F^+_i\) appears 4 times as a letter in \(H_i\). Finally we construct a sequence H as

$$\begin{aligned} H=H_1\cdot g_1g_1\cdot H_2\cdot g_2g_2\cdot H_3 \ldots g_{n-1}g_{n-1}\cdot H_n, \end{aligned}$$

where each separator \(g_\ell \) (\(1\le \ell \le n-1\)) appears exactly twice. Note that each \(F_i^{+}\) appears only in \(H_i\), exactly 4 times; each type-1 2-CNF clause \(F^-_{k}\) containing \(\bar{x}_i\) appears 4 times (twice consecutively in \(L(x_{i,j})\) and twice consecutively in \(L(x_{i,j'})\), with \(1\le j\ne j'\le 3\)); moreover, each type-2 2-CNF clause \(F^-_{k}\) involving an edge \((x_i,x_l)\) also appears 4 times in H: twice consecutively in \(H_i\) and twice consecutively in \(H_l\). Hence, except for the \(g_\ell \), every letter in H appears exactly 4 times.
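The construction of H can be illustrated in code; the following sketch (the clause/letter encoding `F+i`, `F-k`, `g...` is our own, not the paper's notation) builds H from a graph and lets one check the occurrence counts:

```python
from collections import Counter
from itertools import combinations

def build_h(vertices, edges):
    """Sketch of the reduction's sequence H for a 3-COLORING instance.

    Letters are strings: 'F+i' (positive 3-CNF clause of vertex i),
    'F-k' (2-CNF clauses, type-1 listed before type-2), and the
    separators 'g1', 'g2', ..., each of which appears twice."""
    neg = []                                   # 2-CNF clauses as pairs of variables (v, color)
    for v in vertices:                         # type-1: no two colors on the same vertex
        for c1, c2 in combinations(range(1, 4), 2):
            neg.append(((v, c1), (v, c2)))
    for u, v in edges:                         # type-2: no shared color across an edge
        for c in range(1, 4):
            neg.append(((u, c), (v, c)))
    def L(var):                                # list L(var): its 2-CNF clauses, each doubled
        return [f'F-{k}' for k, cl in enumerate(neg) if var in cl for _ in (0, 1)]
    H = []
    for i, v in enumerate(vertices):
        if i > 0:
            H += [f'g{i}', f'g{i}']            # separator between H_{i-1} and H_i
        Hi = [f'F+{i}']
        for c in range(1, 4):                  # F+ L(x_{i,1}) F+ L(x_{i,2}) F+ L(x_{i,3}) F+
            Hi += L((v, c)) + [f'F+{i}']
        H += Hi
    return H
```

On a triangle graph, every non-separator letter indeed occurs exactly 4 times, matching the counting argument above.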

We claim that \(\phi \) has a valid truth assignment if and only if H induces a feasible LDS which contains all \(F^+_j\) (\(j=1..n\)), all \(F^-_k\) (\(k=1..m\)) and all \(g_{\ell }g_{\ell }\) (\(\ell =1..n-1\)).

The forward direction, i.e., when \(\phi \) has a valid truth assignment, is straightforward and can be proved as follows. In this case, exactly one of \(x_{i,1}\), \(x_{i,2}\) and \(x_{i,3}\) (say \(x_{i,j}\), \(1\le j\le 3\)) is assigned TRUE. Then \(H'_i\) is obtained from \(H_i\) by deleting \(L(x_{i,j})\) together with the two copies of \(F_i^{+}\) which are not adjacent to \(L(x_{i,j})\); the two remaining copies then form an LDS-block \(F^+_iF^+_i\). Recall that all the 2-CNF clauses appear 4 times in H: either two each in \(L(x_{i,j})\) and \(L(x_{i,j'})\), with \(1\le j\ne j'\le 3\), or two each in \(H_i\) and \(H_l\) where \((x_i,x_l)\in E\). Since \(L(x_{i,j})\) and \(L(x_{l,j})\) cannot both be deleted (that would imply that \(x_i\) and \(x_l\) receive the same color in G), at least two copies of each 2-CNF clause would appear as an LDS-block of size 2 in a feasible LDS solution \(H'\), where

$$\begin{aligned} H'=H'_1\cdot g_1g_1\cdot H'_2\cdot g_2g_2\cdot H'_3 \ldots g_{n-1}g_{n-1}\cdot H'_n. \end{aligned}$$

As an example, suppose that \(x_{1,1}\) is assigned TRUE, corresponding to u being colored with color-1 in G; then we have:

$$\begin{aligned} H_1 =~&F_1^{+}\cdot L(u_1) \cdot F_1^{+} \cdot L(u_2) \cdot F_1^{+}\cdot L(u_3) \cdot F_1^{+}\\ =~&F_1^{+}\cdot F^-_{1}F^-_{1}\cdot F^-_{2}F^-_{2}\cdot F^-_{4}F^-_{4}\cdot F^-_{7}F^-_{7} \\&F_1^{+}\cdot F^-_{1}F^-_{1}\cdot F^-_{3}F^-_{3}\cdot F^-_{5}F^-_{5}\cdot F^-_{8}F^-_{8}\\&F_1^{+}\cdot F^-_{2}F^-_{2}\cdot F^-_{3}F^-_{3}\cdot F^-_{6}F^-_{6}\cdot F^-_{9}F^-_{9}\cdot F_1^{+}. \end{aligned}$$

And corresponding to \(x_{1,1}\) (i.e., \(u_1\)) being assigned TRUE, \(L(u_1)\) and the two copies of \(F_1^{+}\) not adjacent to it are deleted to have

$$\begin{aligned} H'_1 =~&F_1^{+}\cdot F_1^{+} \cdot L(u_2) \cdot L(u_3)\\ =~&F_1^{+}F_1^{+}\cdot F^-_{1}F^-_{1}\cdot F^-_{3}F^-_{3}\cdot F^-_{5}F^-_{5}\cdot F^-_{8}F^-_{8}\\&\cdot F^-_{2}F^-_{2}\cdot F^-_{3}F^-_{3}\cdot F^-_{6}F^-_{6}\cdot F^-_{9}F^-_{9}. \end{aligned}$$

The reverse direction is slightly more tricky. We first show the following lemma.

Lemma 1

If H admits a feasible solution, then in \(H_i\) exactly two non-empty subsequences of \(L(x_{i,1}), L(x_{i,2})\) and \(L(x_{i,3})\) appear in the feasible solution \(H'_i\) (or, exactly one of the three is deleted from \(H_i\)).

Proof

By construction, the type-1 2-CNF clauses constructed over \(x_{i,1}\), \(x_{i,2}\) and \(x_{i,3}\) in \(H_i\) have a special property due to the connection with 3-coloring: such a clause \(F^-_{k}\) appears consecutively in exactly two of the three substrings \(L(x_{i,1}), L(x_{i,2})\) and \(L(x_{i,3})\); moreover, if \(F^-_{k}\) appears in \(L(x_{i,j})\) and \(L(x_{i,l})\) then it does not appear in \(L(x_{i,\ell })\), where \(1\le j\ne l\ne \ell \le 3\). If exactly one of the three non-empty subsequences of \(L(x_{i,1})\), \(L(x_{i,2})\) and \(L(x_{i,3})\), say of \(L(x_{i,1})\), appears in a feasible LDS solution \(H'_i\), then the type-1 2-CNF clause involving \(x_{i,2}\) and \(x_{i,3}\), say \(F^-_{k}\), would be missing in \(H'_i\), contradicting the assumption that H (also \(H_i\)) has a feasible solution.

On the other hand, due to the construction of \(H_i\), one cannot leave all the three non-empty subsequences of \(L(x_{i,1})\), \(L(x_{i,2})\) and \(L(x_{i,3})\) in a feasible LDS solution \(H'\). The reason is that \(F^+_i\) would not be forming an LDS-block if all the three non-empty subsequences are kept in a feasible LDS solution. \(\square \)

It remains to show how to identify, in the feasible solution \(H'_i\), which subsequence comes from which of \(L(x_{i,1})\), \(L(x_{i,2})\) and \(L(x_{i,3})\). This can easily be done by looking at the type-1 2-CNF clauses (which are always put at the beginning of these 3 lists). Following the example in Fig. 1, we discuss only the case when \(L(u_{1})\) is completely deleted; the other two cases are symmetric. When \(L(u_{1})\) is completely deleted (and the two redundant copies of \(F^+_1\) are also deleted), without further deletion the complete form of \(H'_1\) must be \(F^+_{1}F^+_{1}\cdot F^-_1F^-_1F^-_3F^-_3\ldots F^-_{2}F^-_{2}F^-_{3}F^-_{3}\ldots \). Consequently, even if only one copy of \(F^-_{3}F^-_{3}\) remains in a feasible \(H'_1\), we would still know that \(F^+_{1}F^+_{1}\) is obtained by deleting \(L(u_1)\) (or \(L(x_{i,1})\) in the general case).

With Lemma 1, the reverse direction can be proved as follows. If the LDS-block \(F^+_iF^+_i\) is formed by deleting \(L(x_{i,j})\), then we assign \(x_{i,j}\leftarrow \) TRUE (and the other two variables in \(F^+_i\) are assigned FALSE). Clearly, this gives a valid truth assignment for \(\phi \). We thus have the following theorem.

Theorem 2

FT(4) is NP-complete.

Since FT(4) is NP-complete, the optimization problem LLDS+(4) is certainly NP-hard.

Corollary 1

The optimization version of LLDS+(4) is NP-hard.

3.2 Inapproximability results

The results in the previous subsection essentially imply that the optimization version of LLDS+(d), \(d\ge 4\), does not admit any polynomial-time approximation algorithm (regardless of the approximation factor), since any such approximation would have to return a feasible solution. A natural direction to approach LLDS+ is to design a bicriteria approximation, where a factor-\((\alpha ,\beta )\) bicriteria approximation algorithm is a polynomial-time algorithm which returns a solution of length at least \({OPT}/\alpha \) that includes at least \(N/\beta \) letters, where \(N=|\Sigma |\) and OPT is the optimal solution value of LLDS+. We show that obtaining a bicriteria approximation algorithm for LLDS+ is no easier than approximating LLDS+ itself.

Theorem 3

If LLDS+\((d), d\ge 4\), admitted a factor-\((\alpha ,N^{1-\epsilon })\) bicriteria approximation for any \(\epsilon <1\), then LLDS+\((d), d\ge 4\), would also admit a factor-\(\alpha \) approximation, where N is the alphabet size.

Proof

Suppose that a factor-\((\alpha ,N^{1-\epsilon })\) bicriteria approximation algorithm \(\mathcal{A}\) exists. We construct an instance \(S^*\) for LLDS+(4) as follows. (Recall that S is the sequence, there called H, that we constructed from a \((3^+,1,2^-)\)-SAT instance \(\phi \) in the proof of Theorem 2.) In addition to \(\{F_i|i=1..m\}\cup \{g_j|j=1..n+1\}\) in the alphabet, we use a set of integers \(\{1,2,...,(m+n+1)^x-(m+n+1)\}\), where x is some integer to be determined. Hence,

$$\begin{aligned} \Sigma =\{F_i|i=1..m\}\cup \{g_j|j=1..n+1\}\cup \{1,2,...,(m+n+1)^x-(m+n+1)\}. \end{aligned}$$

We now construct \(S^*\) as

$$\begin{aligned} S^*&=1\cdot 2\ldots ((m+n+1)^x-(m+n+1))\cdot S\cdot ((m+n+1)^x-(m+n+1)) \\&\cdot ((m+n+1)^x-(m+n+1)-1)\ldots 2\cdot 1. \end{aligned}$$

Clearly, any bicriteria approximation for \(S^*\) would return an approximate solution for S as including any number in \(\{1,2,...,(m+n+1)^x-(m+n+1)\}\) would result in a solution of size only 2.

Notice that we have \(N=m+(n+1)+(m+n+1)^x-(m+n+1)=(m+n+1)^x\). In this case, the fraction of letters in \(\Sigma \) that is used to form such an approximate solution satisfies

$$\begin{aligned} \frac{m+(n+1)}{(m+n+1)^x}\le \frac{1}{N^{1-\epsilon }}, \end{aligned}$$

which means it suffices to choose \(x\ge \lceil 2-\epsilon \rceil =2\). \(\square \)

4 Linear time algorithms for LLDS+(d) and FT(d) when \(d=3\)

4.1 Solving the feasibility testing version for \(d=3\)

For the feasibility testing version, as covered earlier, Theorem 2 implies that the problem is NP-complete when \(d\ge 4\). We next show that if \(d=3\), then the problem can be decided in linear time. For convenience, we also call an LD-block of length j a j-block. (We only focus on \(j=2,3\) in the following.) We first prove that the solution has an implicit linear structure.

Lemma 2

Given a string S over \(\Sigma \) such that each letter in S appears at most 3 times, if a feasible solution for FT(3) contains a 3-block then there is a feasible solution for FT(3) which only uses 2-blocks; moreover, for each letter a, its second occurrence \(a^{(2)}\) must be in the solution.

Proof

Suppose that \(S=\ldots a^{(1)}\ldots a^{(2)}\ldots a^{(3)}\ldots \), and \(a^{(1)}a^{(2)}a^{(3)}\) is a 3-block in a feasible solution for FT(3). (Recall that the superscript only indicates the appearance order of letter a.) Then we could replace \(a^{(1)}a^{(2)}a^{(3)}\) by either \(a^{(1)}a^{(2)}\) or \(a^{(2)}a^{(3)}\). The resulting solution is still a feasible solution for FT(3). In both cases, \(a^{(2)}\) appears in the feasible solution. \(\square \)

Lemma 2 implies that the FT(3) problem can be solved in O(n) time as follows. (Note that in the conference version [12], we did not have this lemma; hence we could only solve the problem in \(O(n^2)\) time using 2-SAT.) We first re-number the letters in S by their second (or middle) occurrence (from left to right) as \(c_1,c_2,...,c_n\). Let \(B_1(i)=c_i^{(1)}c_i^{(2)}\) and \(B_2(i)=c_i^{(2)}c_i^{(3)}\). Let F be a feasible solution (which is a subsequence of S).

Starting with \(i=1\), we put \(F:=B_1(1)=c_1^{(1)}c_1^{(2)}\) as the first 2-block. Then we loop through \(j=2\) to n as follows:

  • If F overlaps \(B_1(j)\) and \(B_2(j)\), then report ‘no solution’ and exit;

  • If F overlaps \(B_1(j)\), then \(F\leftarrow F\cdot B_2(j)\); otherwise, \(F\leftarrow F \cdot B_1(j)\).

By construction, if the above algorithm does not exit with 'no solution', the returned F is a feasible solution for FT(3): when F is returned, it already contains all the n 2-blocks, one per letter.
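The greedy test above can be sketched as follows (a sketch slightly generalized to letters with only two occurrences, which contribute their single candidate block; the function name is ours):

```python
def ft3(s):
    """Feasibility test FT(3): does s admit a letter-duplicated
    subsequence in which every letter of the alphabet appears?

    Assumes each letter occurs at most 3 times.  Letters are
    processed in order of their second occurrence; the earlier-ending
    block B1 is always preferred, as in the text."""
    pos = {}
    for i, ch in enumerate(s):
        pos.setdefault(ch, []).append(i)
    if any(len(p) < 2 for p in pos.values()):
        return False                      # a letter occurring once cannot form a block
    letters = sorted(pos, key=lambda ch: pos[ch][1])
    end = -1                              # rightmost position used by chosen blocks
    for ch in letters:
        p = pos[ch]
        cands = [(p[0], p[1])]            # B1 = c^(1) c^(2)
        if len(p) == 3:
            cands.append((p[1], p[2]))    # B2 = c^(2) c^(3)
        for lo, hi in cands:
            if lo > end:                  # block does not overlap F built so far
                end = hi
                break
        else:
            return False                  # both candidate blocks overlap F
    return True
```

For example, `ft3("ababab")` returns True (take aa from the first two a's and bb from the last two b's), while `ft3("abab")` returns False.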

Theorem 4

Let S be a string of length n. FT(3) can be decided in O(n) time.

Theorem 4 immediately implies that LLDS+(3) has a factor-1.5 approximation, as any feasible solution for FT(3) is a factor-1.5 approximation for LLDS+(3). In the following, we show that LLDS+(3) can in fact be solved in linear time as well, after FT(3) is decided to have a solution.

4.2 The optimization version when \(d=3\)

We now extend the solution in the previous subsection to solve LLDS+(3), where the input is a sequence S over an alphabet \(\Sigma \) of size n; moreover, each letter appears exactly three times in S (i.e., \(|S|=3n\)). (This assumption is valid: if a letter appears only once, then there is no feasible solution; and if a letter appears exactly twice, both of its occurrences must be put in a solution and all the other letters between them must be deleted.) Recall that we assume that there is always a feasible solution (which can be checked by Theorem 4).

Our idea is to use dynamic programming. Recall that the letters in S are re-numbered by their second (or, middle) occurrence (from left to right) as \(c_1,c_2,\ldots , c_n\).

Following the previous lemma, an LLDS+(3) solution has a natural structure ordered according to the \(c_i^{(2)}\)'s, under the assumption that there is a feasible solution; this enables us to solve LLDS+(3) by dynamic programming.

For each character \(c_i, 1\le i\le n\), let \(B_1(i):=c_i^{(1)}c_i^{(2)}\), \(B_2(i):=c_i^{(2)}c_i^{(3)}\), and \(B_3(i):=c_i^{(1)}c_i^{(2)}c_i^{(3)}\). We define D[i, j] as the maximum value of an LDS for the sequence of characters \((c_1,\ldots ,c_i)\) in which the last LD-block, containing \(c_i\), is \(B_j(i)\), for \(1\le j\le 3\); if \(B_j(i)\) does not lead to a feasible solution, then \(D[i,j]\leftarrow -\infty \). Similarly, we define D[i] as the maximum value of an LDS for the sequence of characters \((c_1,\ldots ,c_i)\), with

$$\begin{aligned} D[i]=\max _{j=1..3}D[i,j]. \end{aligned}$$

The tables \(D[-]\) and \(D[-,-]\) can be calculated in linear time. In fact, for D[i, j] with \(1\le j\le 3\), we have

$$\begin{aligned} D[i,j]&= \max _{1\le k\le 3} {\left\{ \begin{array}{ll} D[i-1,k]+|B_j(i)| &{} \hbox { if}~B_k(i-1)~\hbox {doesn't~overlap}~B_j(i),\\ -\infty &{} \hbox { if}~B_k(i-1)~\hbox {overlaps}~B_j(i). \end{array}\right. } \end{aligned}$$

Note that the value of \(|B_j(i)|\) is either two or three. The optimal solution value for LLDS+(3) is D[n]. The actual solution can be easily retrieved in O(n) time as well.

Theorem 5

Given a string S of length n, where each letter appears exactly three times, the problem of LLDS+(3) can be solved in O(n) time.

Proof

Assume that FT(3) has a feasible solution (which can be checked by Theorem 4); then an optimal solution for LLDS+(3) is obtained by the above dynamic programming algorithm. The correctness follows from the fact that the algorithm scans all (middle) letters from left to right and the final solution must include all the letters. The O(n) running time is obvious: each D[i, j] can be updated in O(1) time, hence each D[i] can also be updated in O(1) time; moreover, we only have O(n) table entries D[i] and D[i, j] to maintain and update. \(\square \)

As a simple example, let \(S=112132323\), then \(D[1,1]=2\), \(D[1,2]=2\), \(D[1,3]=3\). Updating \(i=2\), we have \(D[2,1]=4\), \(D[2,2]=5\), \(D[2,3]=5\). Finally, after updating \(i=3\), we have \(D[3,1]=-\infty \), \(D[3,2]=6\), \(D[3,3]=-\infty \). The optimal solution 112233 is obtained according to \(D[3]=6\).
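The D[i, j] recurrence, together with the renumbering by middle occurrences, can be sketched compactly (a sketch assuming every letter occurs exactly three times, as in this subsection; the function name is ours):

```python
import math

def llds_plus_3(s):
    """Value of an optimal LLDS+(3) solution for s, where every
    letter occurs exactly three times; returns -inf if infeasible.

    Candidate blocks of c_i (ordered by middle occurrence) are
    B1 = c^(1)c^(2), B2 = c^(2)c^(3), B3 = c^(1)c^(2)c^(3);
    consecutive letters' blocks must not overlap as intervals."""
    pos = {}
    for i, ch in enumerate(s):
        pos.setdefault(ch, []).append(i)
    letters = sorted(pos, key=lambda ch: pos[ch][1])   # order by middle occurrence
    blocks = []                                        # (start, end, size) per candidate
    for ch in letters:
        p = pos[ch]
        blocks.append([(p[0], p[1], 2), (p[1], p[2], 2), (p[0], p[2], 3)])
    D = [size for (_, _, size) in blocks[0]]           # D[1, j] = |B_j(1)|
    for i in range(1, len(blocks)):
        D = [max((D[k] + size
                  for k, (_, phi, _) in enumerate(blocks[i - 1])
                  if phi < lo),                        # B_k(i-1) must end before B_j(i)
                 default=-math.inf)
             for (lo, hi, size) in blocks[i]]
    return max(D)
```

On the example above, `llds_plus_3("112132323")` returns 6, matching D[3] = 6.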

In the next section, we show that if the LD-blocks are arbitrarily positively weighted, then the problem can be solved in \(O(n^2)\) time. Note that the O(n) time algorithm in Sect. 2.1 assumes that the weight of any LD-block is its length, which has the property that \(\ell (s)=\ell (s_1)+\ell (s_2)\), where \(s=s_1s_2\), \(s_1\) and \(s_2\) are LD-blocks on the same letter x, and \(\ell (s)\) is the length of s (or the total number of letters of x in \(s_1\) and \(s_2\)).

5 A dynamic programming algorithm for weighted-LDS

Given the input string \(S=S[1...n]\), let \(w_x(\ell )\) be the weight of the LD-block \(x^\ell \), \(x\in \Sigma \), \(2\le \ell \le d\), where d is the maximum number of times a letter appears in S. Here, the weight can be thought of as a positive function of x and \(\ell \), and it does not even have to be increasing in \(\ell \). For example, it could be that \(w(aaa)=w_a(3)=8\) and \(w(aaaa)=w_a(4)=5\). Given \(w_x(\ell )\) for all \(x\in \Sigma \) and \(\ell \), we aim to compute the maximum weight letter-duplicated subsequence (Weighted-LDS) of S using dynamic programming.

Define T(i) as the value of an optimal solution for S[1...i] that contains the character S[i]. Define \(w[i,j]\) as the maximum weight of an LD-block \(S[j]^\ell \) (\(\ell \ge 2\)) lying within the range [i, j] and ending at position j; if no such LD-block exists, then \(w[i,j]=0\). Notice that such an LD-block does not necessarily contain S[i], but it must contain S[j]. We have the following recurrence relation.

$$\begin{aligned} T(0)&= 0, \\ T(i)&= \max _{0\le y<i,\, S[y] \ne S[i]} {\left\{ \begin{array}{ll} T(y) + w[y+1, i] & \text {if } w[y+1, i]>0,\\ 0 & \text {otherwise.} \end{array}\right. } \end{aligned}$$

The final solution value is \(\max _{1\le i\le n} T(i)\). This algorithm clearly takes \(O(n^2)\) time, assuming \(w[i,j]\) is given. We compute the table \(w[-,-]\) next.
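The recurrence for T can be sketched as follows; this is a minimal Python sketch in which the toy string S = "aabb", the assumed weights \(w_a(2)=3\) and \(w_b(2)=4\), and hence the nonzero entries of \(w[i,j]\), are all hypothetical and supplied by hand rather than computed by the procedure described next.

```python
# Minimal sketch of the recurrence for T, assuming the table w[i, j] is
# already available. The toy string S = "aabb" and the weights
# w_a(2) = 3, w_b(2) = 4 (hence the nonzero entries of w) are hypothetical.
S = " aabb"                 # 1-indexed; S[0] plays the role of epsilon
n = len(S) - 1
w = {(1, 2): 3, (1, 4): 4, (2, 4): 4, (3, 4): 4}  # all other w[i, j] are 0

T = [0] * (n + 1)           # T[0] = 0
for i in range(1, n + 1):
    best = 0
    for y in range(i):      # 0 <= y < i with S[y] != S[i]
        if S[y] != S[i] and w.get((y + 1, i), 0) > 0:
            best = max(best, T[y] + w[(y + 1, i)])
    T[i] = best

print(max(T))               # 7 = w_a(2) + w_b(2), i.e., the LD-string "aabb"
```

Restricting y to positions with \(S[y] \ne S[i]\) ensures that the LD-block ending at i starts a new letter, rather than splitting a block of the same letter in two.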

  1.

    For each letter x and each \(\ell \) (bounded by d, the maximum number of times a letter appears in S), compute

    $$\begin{aligned} w_x'(\ell ) = \max \{ w_x'(\ell -1),\ w_x(\ell ) \}, \end{aligned}$$

    with \(w_x'(1) = w_x(1)\). This can be done in \(O(d|\Sigma |)=O(n^2)\) time.

  2.

    Compute \(N[i,j]\), the number of occurrences of S[j] in the range [i, j]. Notice that \(i\le j\), and for the base case we set \(S[0] = \varepsilon \).

    $$\begin{aligned} N(0,0)&= 0, \\ N(0,j)&= N(0,k) + 1, \quad \text {where } k = \max \bigl ( \{y \mid S[y]=S[j],\, 1\le y < j\} \cup \{0\} \bigr ). \end{aligned}$$

    And,

    $$\begin{aligned} N(i,j) = {\left\{ \begin{array}{ll} N(i-1, j), & \text {if } S[i-1]\ne S[j],\\ N(i-1,j)-1, & \text {if } S[i-1] = S[j]. \end{array}\right. } \end{aligned}$$

    This step takes \(O(n^2)\) time. Note that, in this step, the number of times S[j] appears in [i, j] is computed from its number of appearances in \([i-1, j]\); hence \(N(i,j)\) is filled in a bottom-up manner.

  3.

    Finally, we compute

    $$\begin{aligned} w[i, j] = {\left\{ \begin{array}{ll} w'_{S[j]} (N(i,j)), & \text {if } N(i,j) \ge 2, \\ 0, & \text {otherwise.} \end{array}\right. } \end{aligned}$$

    This step also takes \(O(n^2)\) time. We thus have the following theorem.
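Putting the three steps together, the whole \(O(n^2)\) procedure can be sketched in Python as follows. The weight table w_x below is hypothetical (the paper's Table 1 is not reproduced here); its values are merely chosen to be consistent with the worked example for \(S=ababbaca\) discussed below.

```python
# A sketch of the full O(n^2) algorithm for Weighted-LDS. The weight
# table w_x is a hypothetical instance, chosen to be consistent with the
# worked example S = ababbaca; it is not the paper's Table 1.
S = " ababbaca"             # 1-indexed; S[0] plays the role of epsilon
n = len(S) - 1
d = 4                       # max number of times a letter appears in S
w_x = {("a", 2): 10, ("a", 3): 20, ("a", 4): 20,
       ("b", 2): 16, ("b", 3): 16}          # assumed weights w_x(l)

# Step 1: prefix maxima w'_x(l) = max(w'_x(l-1), w_x(l)). We take
# w'_x(1) = 0 here, an assumption justified by l >= 2 for LD-blocks.
alphabet = set(S[1:])
wp = {(x, 1): 0 for x in alphabet}
for l in range(2, d + 1):
    for x in alphabet:
        wp[(x, l)] = max(wp[(x, l - 1)], w_x.get((x, l), 0))

# Step 2: N[i][j] = number of occurrences of S[j] in S[i..j];
# the base row i = 0 is filled via the previous occurrence of S[j].
N = [[0] * (n + 1) for _ in range(n + 1)]
for j in range(1, n + 1):
    k = max((y for y in range(1, j) if S[y] == S[j]), default=0)
    N[0][j] = N[0][k] + 1
for i in range(1, n + 1):
    for j in range(i, n + 1):
        N[i][j] = N[i - 1][j] - (1 if S[i - 1] == S[j] else 0)

# Step 3: w[i][j] = w'_{S[j]}(N[i][j]) if N[i][j] >= 2, else 0.
w = [[0] * (n + 1) for _ in range(n + 1)]
for i in range(1, n + 1):
    for j in range(i, n + 1):
        if N[i][j] >= 2:
            w[i][j] = wp[(S[j], N[i][j])]

# The recurrence for T on top of the table w[-,-].
T = [0] * (n + 1)
for i in range(1, n + 1):
    T[i] = max((T[y] + w[y + 1][i]
                for y in range(i)
                if S[y] != S[i] and w[y + 1][i] > 0), default=0)

print(w[1][1:])             # [0, 0, 10, 16, 16, 20, 0, 20]
print(max(T))               # 36, attained by T[8]
```

With these assumed weights the sketch reproduces the values reported in the worked example: the row \(w[1,-]\) and the optimal value \(T(8)=36\) for the solution aabbaa.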

Theorem 6

Let S be a string of length n over an alphabet \(\Sigma \) and d be the maximum number of times a letter appears in S. Given the weight function \(w_x(\ell )\) for \(x\in \Sigma \) and \(\ell \le d\), the maximum weight letter-duplicated subsequence (Weighted-LDS) of S can be computed in \(O(n^2)\) time.

We can run a simple example as follows. Let \(S=ababbaca\). Suppose the table \(w_x(\ell )\) is given as Table 1.

Table 1 Input table for \(w_x(\ell )\), with \(S=ababbaca\) and \(d=4\)

In the first step, \(w_x'(\ell )\) is computed as the maximum weight of an LD-block made of the letter x with length at most \(\ell \). The corresponding table \(w_x'(\ell )\) is shown as Table 2.

Table 2 Table \(w_x'(\ell )\), with \(S=ababbaca\) and \(d=4\)

At the end of the second step, we have Table 3 computed.

Table 3 Part of the table N[ij], with \(S=ababbaca\) and \(d=4\)

From Table 3, the table \(w[-,-]\) can be easily computed and we omit the details. For instance, \(w[1,-]=[0,0,10,16,16,20,0,20]\). With that, the optimal solution value can be computed as \(T(8)=36\), which corresponds to the optimal solution aabbaa.

6 Concluding remarks

Starting with the longest letter-duplicated subsequence problem (LLDS), which is polynomially solvable, in this paper we consider the constrained longest letter-duplicated subsequence (LLDS+) and the corresponding feasibility testing (FT) problems, where all letters of the alphabet must occur in the solution. We parameterize the problems by d, the maximum number of times a letter appears in the input sequence. For convenience, we summarize the results once more in Table 4.

Table 4 Summary of results on LLDS+ and FT

We also consider the weighted version (without the ‘full-appearance’ constraint), for which we give a non-trivial \(O(n^2)\) time dynamic programming solution.

If we stick with the ‘full-appearance’ constraint, one direction is to consider a further variant of the problem where the solution must be a subsequence of S of the form \(x_1^{d_1}x_2^{d_2}\ldots x_k^{d_k}\), with each \(x_i\) being a subsequence of S of length at least 2, \(x_j\ne x_{j+1}\) and \(d_i\ge 2\) for all i in [k] and j in \([k-1]\). This variant was formally studied by Lafond et al. [10] recently. Intuitively, in many cases such variants could better capture the duplicated patterns in S. The NP-completeness results (similar to Theorem 2 and Corollary 1) would still hold, though with substantial modifications to the proofs in this paper. However, the running times of the corresponding dynamic programming algorithms are much higher (in the range of \(O(n^4)\) to \(O(n^6)\)), which could be a burden for real applications. Note that, without the ‘full-appearance’ constraint, when each \(x_i\) is a subsequence of S the problem is a generalization of Kosowski’s longest square subsequence problem [8], so it is not surprising that it can be solved in polynomial time, even though the running time is much higher.