1 Introduction

We study two well-known representations of uncertain texts: weighted sequences and profiles. A weighted sequence (also known as a position weight matrix, PWM) specifies, for every position and every letter of the alphabet, the probability of occurrence of this letter at this position; see Table 1 for an example. A weighted sequence represents many different strings, each with the probability of occurrence equal to the product of probabilities of its letters at subsequent positions of the weighted sequence. Usually a threshold \(\frac {1}{z}\) is specified, and one considers only strings that match the weighted sequence with probability at least \(\frac {1}{z}\). A scoring matrix (or a profile) of length m is a matrix with m columns indexed by positions 1,…,m and σ rows corresponding to the letters of the alphabet. The score of a string of length m is the sum of the scores in the scoring matrix of the subsequent letters of the string at the respective positions. A string is said to match a scoring matrix if its matching score is above a specified threshold Z.

Table 1 A weighted sequence X of length 4 over the alphabet Σ = {,}

1.1 Weighted Pattern Matching and Profile Matching

First of all, we study the standard variants of pattern matching problems on weighted sequences and profiles, in which only the pattern or only the text is an uncertain sequence. In the most popular formulation of the Weighted Pattern Matching (WPM) problem, we are given a weighted sequence of length n, called a text, a solid (standard) string of length m, called a pattern, both over an alphabet of size σ, and a threshold probability \(\frac {1}{z}\). We are asked to find all positions in the text where the fragment of length m represents the pattern with probability at least \(\frac {1}{z}\). Each such position is called an occurrence of the pattern in the text; we also say that the fragment of the text and the pattern match. The Weighted Pattern Matching problem can be solved in \(O(\sigma n \log m)\) time via the Fast Fourier Transform [7]. The average-case complexity of the WPM problem has also been studied, and a number of fast algorithms have been presented for certain values of the weight ratio \(\frac {z}{m}\) [4, 5]. An indexing variant of the problem has also been considered [1, 2, 13, 14, 16, 19]; here, one is to preprocess a weighted text to efficiently answer pattern matching queries. The most efficient index [2] for a constant-sized alphabet uses O(nz) space, takes O(nz) time to construct, and answers queries in optimal O(m + occ) time, where occ is the number of occurrences reported. A more general indexing data structure, which assumes z = O(1), was presented in [6]. A streaming variant of the Weighted Pattern Matching problem was considered very recently in [23].

In the classic Profile Matching problem, the pattern is an m × σ profile, the text is a solid string of length n, and our task is to find all positions in the text where the fragment of length m has score at least Z. A naïve approach to the Profile Matching problem works in O(nm + mσ) time. A broad spectrum of heuristics improving this algorithm in practice is known; for a survey, see [22]. However, all these algorithms have the same worst-case running time. One of the principal heuristic techniques, coming in different flavours, is lookahead scoring that consists in checking if a partial match could possibly be completed by the highest scoring letters in the remaining positions of the scoring matrix and, if not, pruning the naïve search. The Profile Matching problem can also be solved in \(O(\sigma n \log m)\) time via the Fast Fourier Transform [24].
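To make the lookahead-scoring pruning concrete, the following sketch combines the naïve scan with precomputed suffix maxima of the scoring matrix. It is only an illustrative implementation (profiles are stored as per-position dictionaries from letters to scores, indexing is 0-based) and not one of the algorithms analysed in this paper.

```python
def profile_matching_lookahead(P, T, Z):
    """Naive profile matching with lookahead-scoring pruning (illustrative sketch).

    P: scoring matrix as a list of dicts, P[i][letter] = integer score.
    T: text string. Z: score threshold. Returns 0-based occurrence positions.
    """
    m, n = len(P), len(T)
    # best_suffix[i] = maximum score attainable over positions i, ..., m-1
    best_suffix = [0] * (m + 1)
    for i in range(m - 1, -1, -1):
        best_suffix[i] = best_suffix[i + 1] + max(P[i].values())
    occ = []
    for p in range(n - m + 1):
        score = 0
        for i in range(m):
            score += P[i][T[p + i]]
            if score + best_suffix[i + 1] < Z:  # threshold Z is out of reach: prune
                break
        else:
            if score >= Z:
                occ.append(p)
    return occ
```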

Our results

As the first result, we show how the lookahead scoring technique combined with a data structure for answering longest common extension (LCE) queries in a string can be applied to obtain simple and efficient algorithms for the standard pattern matching problems on uncertain sequences. For a weighted sequence, by R we denote the size of its list representation. In the case that σ = O(1), which often occurs in molecular biology applications, we have R = O(n). In the Profile Matching problem, we denote by M the number of strings that match the scoring matrix with score above Z. In general, \(M \le \sigma^{m}\); however, we may assume that for practical data this number is actually much smaller. We obtain the following results:

Theorem 1.1

Profile Matching can be solved in \(O(m\sigma + n \log M)\) time.

Theorem 1.2

Weighted Pattern Matching can be solved in \(O(R+n \log z)\) time.

1.2 Profile Consensus and Multichoice Knapsack

Along the way to our most involved contribution, we study Profile Consensus, a consensus problem on uncertain sequences. Specifically, we are to check for the existence of a string that matches two scoring matrices, each above threshold Z. The Profile Consensus problem is essentially equivalent to the well-known Multichoice Knapsack problem (also known as the Multiple Choice Knapsack problem). In this problem, we are given n classes C1,…,Cn of at most λ items each—N items in total—each item c characterised by a value v(c) and a weight w(c). The goal is to select one item from each class so that the sums of values and of weights of the items are at most two specified thresholds, V and W. (In the more intuitive formulation of the problem, we require the sum of values to be above a specified threshold, but here we consider an equivalent variant in which both parameters are symmetric.) This problem generalises the (binary) Knapsack problem, in which we have λ = 2. The Multichoice Knapsack problem is widely used in practice, but most research concerns approximation or heuristic solutions; see [17] and references therein. As far as exact solutions are concerned, the classic meet-in-the-middle approach by Horowitz and Sahni [12], originally designed for the (binary) Knapsack problem, immediately generalises to an \(O^{*}(\lambda^{\lceil n/2 \rceil})\)-time solution for Multichoice Knapsack.

Several important problems can be expressed as special cases of the Multichoice Knapsack problem using folklore reductions (see [17]). This includes the Subset Sum problem, which, for a set of n integers, asks whether there is a subset summing up to a given integer Q, and the k-Sum problem, which, for k classes of λ integers, asks to choose one element from each class so that the selected integers sum up to zero. These reductions give immediate hardness results for the Multichoice Knapsack problem and thus yield the same consequences for Profile Consensus. For the Subset Sum problem, as shown in [9, 11], the existence of an \(O(2^{\varepsilon n})\)-time solution for every ε > 0 would violate the Exponential Time Hypothesis (ETH) [15, 20]. Moreover, the \(O(2^{n/2})\) running time, achieved in [12], has not been improved yet despite much effort. The 3-Sum conjecture [10] and the more general k-Sum conjecture state that the 3-Sum and k-Sum problems cannot be solved in \(O(\lambda^{2-\varepsilon})\) time and \(O(\lambda^{\lceil \frac{k}{2} \rceil (1-\varepsilon)})\) time, respectively, for any ε > 0.
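For concreteness, the folklore reductions mentioned above can be sketched as follows; items are represented as (value, weight) pairs, and an exact-sum constraint is encoded by a pair of opposite inequalities. The function names are illustrative.

```python
def subset_sum_to_mck(xs, Q):
    """Subset Sum (is there a subset of xs summing to Q?) as Multichoice Knapsack.
    Each integer x yields a class {take x, skip x}; a choice is feasible iff the
    taken integers sum to exactly Q (sum <= Q and -sum <= -Q). Here lambda = 2."""
    classes = [[(x, -x), (0, 0)] for x in xs]
    return classes, Q, -Q  # classes, value threshold V, weight threshold W


def k_sum_to_mck(groups):
    """k-Sum (choose one integer per group so that they sum to zero) as
    Multichoice Knapsack with thresholds V = W = 0."""
    classes = [[(x, -x) for x in group] for group in groups]
    return classes, 0, 0
```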

Our results

In the complexities of our algorithms, the instance size of Multichoice Knapsack is described by the number of classes n, the total number of items N = |C1| + ⋯ + |Cn|, and the maximum size of a class \(\lambda =\max \{|C_{1}|,\ldots ,|C_{n}|\}\). We also introduce additional parameters based on the number of solutions with feasible weight or value:

$$A_{V} = \left|\left\{(c_{1},\ldots,c_{n}):c_{i} \in C_{i}\text{ for all } i = 1,\ldots,n,\sum\limits_{i} v(c_{i}) \le V\right\}\right|, $$

that is, the number of choices of one element from each class that satisfy the value threshold,

$$A_{W} = \left|\left\{(c_{1},\ldots,c_{n}):c_{i} \in C_{i}\text{ for all } i = 1,\ldots,n,\sum\limits_{i} w(c_{i}) \le W\right\}\right|, $$

\(A = \max (A_{V},A_{W})\), and \(a=\min (A_{V},A_{W})\). We obtain the following result.

Theorem 1.3

Multichoice Knapsack can be solved in \(O(N+\sqrt {a\lambda }\log A)\) time.

Note that \(a \le A \le \lambda^{n}\) and thus the running time of our algorithm for Multichoice Knapsack is bounded by \(O(N+n\lambda ^{(n + 1)/2}\log \lambda )\). Up to lower order terms (i.e., the factor \(n\log \lambda =(\lambda ^{(n + 1)/2})^{o(1)}\)), this matches the time complexities of the fastest known solutions for both Subset Sum (also binary Knapsack) and 3-Sum. Our parameters identify a new measure of difficulty for the Multichoice Knapsack problem. The main novel part of our algorithm for Multichoice Knapsack is an appropriate (yet intuitive) notion of ranks of partial solutions.

1.3 Weighted Consensus and General Weighted Pattern Matching

Analogously to the Profile Consensus problem, we define the Weighted Consensus problem. In the Weighted Consensus problem, given two weighted sequences of the same length, we are to check if there is a string that matches each of them with probability at least \(\frac {1}{z}\). A routine to compare user-entered weighted sequences with existing weighted sequences in the database is used, e.g., in JASPAR, a well-known database of PWMs. Finally, we study a general variant of pattern matching on weighted sequences. In the General Weighted Pattern Matching (GWPM) problem, both the pattern and the text are weighted. In the most common definition of the problem (see [3, 13]), we are to find all fragments of the text that give a positive answer to the Weighted Consensus problem with the pattern. The authors of [3] proposed an algorithm for the GWPM problem based on the weighted prefix table that works in \(O(n z^{2} \log z + n\sigma )\) time. Solutions to these problems can be applied in transcriptional regulation: motif and regulatory module finding, and annotation of regulatory genomic regions.

Our results

For a weighted sequence, by λ let us denote the maximal number of letters with probability at least \(\frac {1}{z}\) at a single position (thus \(\lambda \le \min (\sigma ,z)\)). Our algorithm for the Multichoice Knapsack problem (covered in Section 1.2) yields time complexities \(O(R+\sqrt {z\lambda }\log z)\) and \(O(n\sqrt {z\lambda }\log z)\) for Weighted Consensus and GWPM, respectively. Using a tailor-made solution based on the same scheme, we obtain faster procedures, as specified below.

Theorem 1.4

The General Weighted Pattern Matching problem can be solved in \(O(n\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time, and the Weighted Consensus problem can be solved in \(O(R +\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time.

In particular, we obtain the following result for the practical case of σ = O(1).

Corollary 1.5

General Weighted Pattern Matching over a constant-sized alphabet can be solved in \(O(n \sqrt {z} \log \log z)\) time.

We also provide a simple reduction from Multichoice Knapsack to Weighted Consensus, which lets us transfer the negative results to the GWPM problem.

Theorem 1.6

Weighted Consensus is NP-hard and cannot be solved in:

  1.

    \(O(z^{\varepsilon})\) time for every ε > 0, unless the Exponential Time Hypothesis (ETH) fails;

  2.

    \(O(z^{0.5-\varepsilon})\) time for some ε > 0, unless there is an \(O(2^{(0.5-\varepsilon)n})\)-time algorithm for the Subset Sum problem;

  3.

    \(\tilde {O}(R+z^{0.5}\lambda ^{0.5-\varepsilon })\) time for some ε > 0 and for n = O(1), unless the 3-Sum conjecture fails.

In their leading terms, our complexities match the conditional lower bounds; therefore, in the proofs of Theorems 1.3 and 1.4 we put significant effort into keeping the lower-order terms of the complexities as small as possible.

Finally, we analyse the complexity of the Multichoice Knapsack and General Weighted Pattern Matching problems in the case of large λ. This is a theoretical study that shows the possibility of improving the complexity for instances that do not originate from the Subset Sum and k-Sum problems.

Theorem 1.7

For every positive integer k = O(1), the Multichoice Knapsack problem can be solved in \(O(N+ {(a^{\frac {k + 1}{2k + 1}}+\lambda ^{k})}\log A (\frac {\log A}{\log \lambda })^{k})\) time.

Theorem 1.8

If \(\lambda^{2k-1} \le z \le \lambda^{2k+1}\) for some positive integer k = O(1), then the Weighted Consensus problem can be solved in \(O(R+(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time, and the GWPM problem can be solved in \(O(n(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time.

A preliminary version of this research appeared as [18].

1.4 Structure of the Paper

We start with Preliminaries, where we recall basic notions on classic strings and formalise the model of computation. The following four sections describe our algorithms: in Section 3 for Profile Matching; in Section 4 for Weighted Pattern Matching; in Section 5 for Profile Consensus; and in Section 6 for Weighted Consensus and GWPM. In Section 7 we present conditional lower bounds for the GWPM problem based on the special cases of Multichoice Knapsack. Finally, in Section 8 we perform a multivariate analysis of Profile Consensus and GWPM and present improved solutions in the case that \(\frac {\log a}{\log \lambda }\) is a constant other than an odd integer.

2 Preliminaries

Let Σ = {1,…,σ} be an alphabet. A string S over Σ is a finite sequence of letters from Σ. By \({\Sigma}^{m}\) we denote the set of strings of length m over Σ. We denote the length of S by |S| and, for 1 ≤ i ≤|S|, the i-th letter of S by S[i]. By S[i..j] we denote the string S[i]⋯S[j] called a factor of S (if i > j, then the factor is an empty string). A factor is called a prefix if i = 1 and a suffix if j = |S|. For two strings S and T, we denote their concatenation by \(S \cdot T\) (ST in short).

For a string S of length n, by LCE(i,j) = lcp(S[i..n],S[j..n]) we denote the length of the longest common prefix of suffixes S[i..n] and S[j..n]. This value lets us easily determine the longest common prefix \(\mathit {lcp}(S[i\mathinner {..} i^{\prime }],S[j\mathinner {..} j^{\prime }])\) of any two factors starting at positions i and j, respectively. The following fact specifies a well-known efficient data structure answering LCE queries; see [8] for details.

Fact 2.1

Let S be a string of length n over an integer alphabet of size \(\sigma = n^{O(1)}\). After O(n)-time preprocessing, in O(1) time one can compute LCE(i,j) for any indices i,j.
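Fact 2.1 is usually realised with a suffix array, an LCP array, and range-minimum queries. As a simpler self-contained substitute (with \(O(\log n)\)-time Monte Carlo queries rather than O(1)-time deterministic ones), one can combine prefix hashing with binary search; the class below is only an illustrative sketch and not the structure used in the paper.

```python
class LCE:
    """LCE(i, j) queries over a fixed string via prefix hashing + binary search.

    O(n) preprocessing, O(log n) per query; answers are correct with high
    probability (hash collisions are ignored in this sketch).
    """

    def __init__(self, s, base=911382323, mod=(1 << 61) - 1):
        self.n, self.mod = len(s), mod
        self.h = [0] * (self.n + 1)
        self.pw = [1] * (self.n + 1)
        for i, ch in enumerate(s):
            self.h[i + 1] = (self.h[i] * base + ord(ch)) % mod
            self.pw[i + 1] = (self.pw[i] * base) % mod

    def _hash(self, i, length):  # hash of s[i : i + length]
        return (self.h[i + length] - self.h[i] * self.pw[length]) % self.mod

    def lce(self, i, j):
        """Length of the longest common prefix of s[i:] and s[j:] (0-based)."""
        lo, hi = 0, self.n - max(i, j)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if self._hash(i, mid) == self._hash(j, mid):
                lo = mid
            else:
                hi = mid - 1
        return lo
```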

The Hamming distance between two strings X and Y of the same length, denoted by dH(X,Y ), is the number of positions where the strings differ.

2.1 Model of Computations

For problems on weighted sequences, we assume the word-RAM model with word size \(w = {\Omega }(\log n + \log z)\) and an integer alphabet of size \(\sigma = n^{O(1)}\). We consider the log-probability model of representations of weighted sequences, that is, we assume that probabilities in the weighted sequences and the threshold probability \(\frac {1}{z}\) are all of the form \(c^{\frac {p}{2^{dw}}}\), where c and d are constants and p is an integer that fits in a constant number of machine words. Additionally, the probability 0 has a special representation. The only operations on probabilities in our algorithms are multiplications and divisions, which can be performed exactly in O(1) time in this model. Our solutions to the Multichoice Knapsack problem only assume the word-RAM model with word size \(w={\Omega }(\log S+\log a)\), where S is the sum of integers in the input instance; this does not affect the stated running times.

3 Profile Matching

In the Profile Matching problem, we consider a scoring matrix (a profile) P of size m × σ. For i ∈{1,…,m} and j ∈{1,…,σ}, we denote the integer score of the letter j at the position i by P[i,j]. The matching score of a string S of length m with the matrix P is

$$\text{Score}(S,P) = {\sum}_{i = 1}^{m} P[i,S[i]]. $$

If Score(S,P) ≥ Z for an integer threshold Z, then we say that the string S matches the matrix P above threshold Z. We denote the family of strings S that match P above threshold Z by MZ(P).

For a string T and a scoring matrix P, we say that P occurs in T at position i with threshold Z if T[i..i + m − 1] matches P above threshold Z. Then OccZ(P,T) is the set of all positions where P occurs in T. These notions let us define Profile Matching:

Profile Matching
Input: A text string T of length n, an m × σ scoring matrix P, both over an alphabet of size σ, and an integer threshold Z.
Output: The set OccZ(P,T) of all positions where P occurs in T with threshold Z.

3.1 Solution to Profile Matching

For a scoring matrix P, the heavy string of P, denoted H(P), is constructed by choosing at each position the heaviest letter, that is, the letter with the maximum score (breaking ties arbitrarily). In other words, H(P) is a string that matches P with the maximum score.

Observation 3.1

If Score(S,P) ≥ Z for a string S of length m and an m × σ scoring matrix P, then \(d_H(\textbf {H}(P),S) \le \left \lfloor {\log |\textbf {M}_{Z}(P)|} \right \rfloor \).

Proof

Let d = dH(H(P),S). We can construct 2d strings of length |S| that match P with a score above Z by taking either of the letters S[j] or H(P)[j] at each position j such that S[j]≠H(P)[j]. Hence, \(2^{d} \le |\textbf {M}_{Z}(P)|\), which concludes the proof. □

Our solution for the Profile Matching problem works as follows. We first construct \(P^{\prime } = \textbf {H}(P)\) and the data structure for lcp-queries in \(P^{\prime }T\). Let the variable s store the matching score of \(P^{\prime }\). In the p-th step, we calculate the matching score \(s^{\prime }\) of T[p..p + m − 1], starting from s and iterating through the subsequent mismatches between \(P^{\prime }\) and T[p..p + m − 1], making adequate updates to \(s^{\prime }\). The mismatches are found using lcp-queries: if \(P^{\prime }[i]\) is aligned against T[j], we compute \({\Delta } = \mathit {lcp}(P^{\prime }[i\mathinner {..} m],T[j\mathinner {..} n])\). Then, \(P^{\prime }[i\mathinner {..} i+{\Delta }-1]=T[j\mathinner {..} j+{\Delta }-1]\), but \(P^{\prime }[i+{\Delta }]\ne T[j+{\Delta }]\) yields a mismatch (assuming i + Δ ≤ m and j + Δ ≤ n). To locate the next mismatch, we repeat the procedure with i and j increased by Δ + 1. This process terminates when the score drops below Z or when all the mismatches have been found. In the end, we include p in OccZ(P,T) if the final matching score is at least Z. A pseudocode is given in the ProfileMatching(P, T, Z) procedure.

figure b
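Since the pseudocode box is not reproduced above, the following minimal Python sketch follows the description of the procedure; the profile is assumed to be a list of dictionaries mapping every alphabet letter to its score, indexing is 0-based, and lcp is an assumed LCE oracle over the concatenation H(P)T (for instance, the structure from Section 2).

```python
def profile_matching(P, T, Z, lcp):
    """Sketch of ProfileMatching(P, T, Z) as described above.

    P: list of dicts, P[i][letter] = score; T: text string; Z: integer threshold;
    lcp(i, j): length of the longest common prefix of (H(P) + T)[i:] and
    (H(P) + T)[j:]. Returns the 0-based positions of occurrences.
    """
    m, n = len(P), len(T)
    heavy = ''.join(max(col, key=col.get) for col in P)  # heavy string H(P)
    s = sum(P[i][heavy[i]] for i in range(m))            # maximum matching score
    occ = []
    for p in range(n - m + 1):
        score, i = s, 0
        while i < m and score >= Z:
            d = min(lcp(i, m + p + i), m - i)            # jump to the next mismatch
            i += d
            if i < m:                                    # mismatch at pattern position i
                score += P[i][T[p + i]] - P[i][heavy[i]]
                i += 1
        if score >= Z:
            occ.append(p)
    return occ
```

By Observation 3.1, the while loop performs at most \(\left \lfloor {\log M} \right \rfloor + 1\) iterations for each position p.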

We obtain the following result.

Theorem 1.1

Profile Matching can be solved in \(O(m\sigma + n \log M)\) time.

Proof

Let us bound the time complexity of the presented algorithm. The heavy string \(P^{\prime }\) can be computed in O(mσ) time. The data structure for lcp-queries in \(P^{\prime }T\) can be constructed in O(n + m) time by Fact 2.1. Finally, for each position p in the text T we will consider at most \(\left \lfloor {\log M} \right \rfloor + 1\) mismatches between \(P^{\prime }\) and T, as afterwards the score \(s^{\prime }\) drops below Z due to Observation 3.1. □

4 Weighted Pattern Matching

A weighted sequenceX = X[1]⋯X[n] of length |X| = n over alphabet Σ is a sequence of sets of pairs of the form \(X[i] = \{(j,\ \pi ^{(X)}_{i}(j))\ :\ j \in {\Sigma }\}\). Here, \(\pi _{i}^{(X)}(j)\) is the occurrence probability of the letter j at the position i ∈{1,…,n}. These values are non-negative and sum up to 1 for a given i. For all our algorithms, it is sufficient that the probabilities sum up to at most 1 for each position. Also, the algorithms sometimes produce auxiliary weighted sequences with sum of probabilities being smaller than 1 on some positions.

We denote the maximum number of letters occurring at a single position of the weighted sequence (with non-zero probability) by λ and the total size of the representation of a weighted sequence by R. The standard representation consists of n lists with up to λ elements each, so R = O(nλ). However, the lists can be shorter in general. Also, if the threshold probability \(\frac {1}{z}\) is specified, at each position of a weighted sequence it suffices to store letters with probability at least \(\frac {1}{z}\), and clearly there are at most z such letters for each position. This reduction can be performed in linear time, so we shall always assume that \(\lambda \le z\). Moreover, the assumption that Σ is an integer alphabet of size \(\sigma = n^{O(1)}\) lets us assume without loss of generality that the entries \((j,\pi ^{(X)}_{i}(j))\) in the lists representing X[i] are ordered by increasing j: if this is not the case, we can simultaneously sort these lists in linear time.

The probability of matching of a string S with a weighted sequence X, |S| = |X| = n, is

$$\P(S,X) = {\prod}_{i = 1}^{n} \pi^{(X)}_{i}(S[i]).$$

We say that a string S matches a weighted sequence X with probability at least \(\frac {1}{z}\), denoted by \(S \approx _{\frac {1}{z}} X\), if \(\P (S,X) \!\ge \! \frac {1}{z}\). We also denote \(\textbf {M}_{z}(X)=\{S\in {\Sigma }^{n} : \P (S,X)\ge \frac {1}{z}\}\).
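A direct transcription of these definitions (with probabilities stored as floating-point numbers in per-position dictionaries; the exact log-probability representation is discussed in Section 2.1) reads as follows.

```python
def match_probability(S, X):
    """P(S, X): product of the occurrence probabilities of the letters of S in X."""
    p = 1.0
    for i, letter in enumerate(S):
        p *= X[i].get(letter, 0.0)  # absent letters have probability 0
    return p


def matches(S, X, z):
    """True iff S matches X with probability at least 1/z."""
    return match_probability(S, X) >= 1.0 / z
```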

Given a weighted sequence T, by T[i..j] we denote a weighted sequence, called a factor of T, equal to T[i]⋯T[j] (if i > j, then the factor is empty). We say that a string P of length m occurs in T at position i if P matches the factor T[i..i + m − 1]. The set of positions where P occurs in T is denoted by \(\mathit {Occ}_{\frac {1}{z}}(P,T)\).

Weighted Pattern Matching
Input: A solid pattern P of length m, a weighted text T of length n, and a threshold probability \(\frac {1}{z}\).
Output: The set \(\mathit {Occ}_{\frac {1}{z}}(P,T)\) of all positions where P occurs in T.

4.1 Weighted Sequences versus Profiles

As shown below, profiles and weighted sequences are essentially equivalent objects.

Fact 4.1

  1.

    Given a weighted sequence X of length n over an alphabet of size σ and a probability \(\frac {1}{z}\), one can construct in O(nσ) time an n × σ profile P and a threshold Z such that MZ(P) = Mz(X).

  2.

    Given an m × σ profile P and a threshold Z, one can construct in O(mσ) time a weighted sequence X and a probability \(\frac {1}{z}\) such that Mz(X) = MZ(P).

Proof

Given a weighted sequence X, one can construct an equivalent profile P setting \(P[i,s]=-\log \pi ^{(X)}_{i}(s)\) for each position i and character s. If \(\pi ^{(X)}_{i}(s)= 0\), we set \(P[i,s]=\infty \) (which can be replaced by a sufficiently large finite value after we fix the threshold Z). The profile P satisfies MZ(P) = Mz(X) for \(Z = \log z\).

To construct an inverse mapping, we need to normalise the scores first. For this, we construct a normalised profile \(P^{\prime }\) setting \(P^{\prime }[i,s] := P[i,s]+\log ({\sum }_{s \in {\Sigma }} 2^{-P[i,s]})\). As a result, we have \(\textbf {M}_{Z}(P)=\textbf {M}_{Z^{\prime }}(P^{\prime })\) for \(Z^{\prime } = Z + {\sum }_{i = 1}^{m} \log ({\sum }_{s \in {\Sigma }} 2^{-P[i,s]})\). Now, we can build an equivalent weighted sequence X by setting \(\pi ^{(X)}_{i}(s)= 2^{-P^{\prime }[i,s]}\). Note that

$$\sum\limits_{s\in {\Sigma}}\pi^{(X)}_{i}(s) = \sum\limits_{s\in {\Sigma}} 2^{-P[i,s]-\log({\sum}_{s^{\prime}\in {\Sigma}} 2^{-P[i,s^{\prime}]})}=\left( \sum\limits_{s\in {\Sigma}} 2^{-P[i,s]}\right)\cdot 2^{-\log({\sum}_{s\in {\Sigma}} 2^{-P[i,s]})}= 1$$

holds as required. Moreover, \(\textbf {M}_{z}(X)=\textbf {M}_{Z^{\prime }}(P^{\prime })=\textbf {M}_{Z}(P)\) for \(z = 2^{Z^{\prime }}\). □

In the light of Fact 4.1, it may seem that the results for profiles and weighted sequences should coincide. However, we use different parameters to study the complexity of the algorithmic problems in these models: for profiles, this is the number |MZ(P)| of matching strings, while for weighted sequences it is the inverse z of the threshold probability \(\frac {1}{z}\). These parameters are related by the following observation:

Observation 4.2

A weighted sequence X satisfies |Mz(X)|≤ z for every threshold.

However, the bound |Mz(X)|≤ z is not tight in general, which gives more power to algorithms parameterised by z. Moreover, z is a part of the input (as opposed to |MZ(P)| for profiles). Furthermore, it is natural to consider a common threshold probability \(\frac {1}{z}\) for multiple weighted sequences, e.g., factors of a weighted text T as in Weighted Pattern Matching.

A more technical difference lies in the representation of profiles and weighted sequences, which we have chosen consistently with the literature. A profile is stored as a dense m × σ matrix, while in a weighted sequence of the same length we do not explicitly keep entries with \(\pi ^{(X)}_{i}(s)= 0\), so the input size R can be smaller than mσ. This allows for faster algorithms—because reading the input takes less time—but at the same time poses some challenges—because \(\pi ^{(X)}_{i}(s)\) cannot be accessed in constant time, unless σ = O(1) or we allow randomisation. This is illustrated below in case of the Weighted Pattern Matching problem and also in Section 6.

4.2 Solution to Weighted Pattern Matching

The approach from our solution to Profile Matching can be used for Weighted Pattern Matching. In a natural way, we extend the notion of a heavy string to weighted sequences. This lets us restate Observation 3.1 in the language of probabilities instead of scores.

Observation 4.3

If a string P matches a weighted sequence X of the same length with probability at least \(\frac {1}{z}\), then \(d_H(\textbf {H}(X),P) \le \left \lfloor {\log z} \right \rfloor \).

Compared to the solution to Profile Matching, we compute the heavy string of the text instead of the pattern. An auxiliary variable α stores the matching probability between a factor of H(T) and the corresponding factor of T; it is updated when we move to the next position of the text. The rest of the algorithm is basically the same as previously; see the pseudocode of WeightedPatternMatching(P, T, \(\frac {1}{z}\)).

figure d
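As with Profile Matching, the pseudocode box is not reproduced here, so below is a minimal sketch of the procedure just described. It uses floating-point probabilities instead of the exact log-probability representation, assumes that the heaviest letter has non-zero probability at every position of the text, and takes an LCE oracle lcp over the concatenation P H(T) as a parameter.

```python
def weighted_pattern_matching(P, T, z, lcp):
    """Sketch of WeightedPatternMatching(P, T, 1/z) as described above.

    P: solid pattern (string); T: weighted text, a list of dicts letter -> probability;
    lcp(i, j): LCE query in the concatenation P + H(T) (0-based indexing).
    Returns the 0-based positions where P occurs in T with probability >= 1/z.
    """
    m, n = len(P), len(T)
    heavy = ''.join(max(col, key=col.get) for col in T)  # heavy string H(T)
    hp = [col[heavy[j]] for j, col in enumerate(T)]      # probabilities of the heaviest letters
    alpha = 1.0                                          # matching probability of H(T)[0..m-1] with T[0..m-1]
    for j in range(m):
        alpha *= hp[j]
    occ = []
    for p in range(n - m + 1):
        a, i = alpha, 0
        while i < m and a >= 1.0 / z:
            d = min(lcp(i, m + p + i), m - i)            # jump to the next mismatch
            i += d
            if i < m:                                    # P[i] differs from H(T)[p + i]
                a *= T[p + i].get(P[i], 0.0) / hp[p + i]
                i += 1
        if a >= 1.0 / z:
            occ.append(p)
        if p + m < n:                                    # slide the window of alpha
            alpha = alpha / hp[p] * hp[p + m]
    return occ
```

Analogously to Profile Matching, Observation 4.3 bounds the number of iterations of the while loop by \(\left \lfloor {\log z} \right \rfloor + 1\) per position.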

Implementation for large alphabets

The algorithm above takes \(O(n\log z)\) time for σ = O(1). In the general case, we need to efficiently implement the following operations on the weighted sequence:

  • finding the letter with the maximum probability at a given position,

  • computing the probability of a given letter at a given position.

For a weighted sequence in the standard list representation, we can compute the maximum-probability letter at each position in O(R) time which lets us perform the former operation in O(1) time. We also explicitly store the probabilities of the heaviest letters so that \(\pi ^{(T)}_{j}(T^{\prime }[j])\) can be retrieved in constant time for any index j.

To implement the latter operation for an arbitrary character, we store each T[j] in a weight-balanced binary tree [21], with the weight of \((s,\pi ^{(T)}_{j}(s))\) equal to \(\pi ^{(T)}_{j}(s)\). As a result, any \(\pi ^{(T)}_{j}(s)\) can be retrieved in \(O(-\log \pi ^{(T)}_{j}(s))=O(\log z)\) time. During the course of the p-th step of the algorithm, \(\alpha ^{\prime }\) is a product of some probabilities including all the retrieved probabilities \(\pi ^{(T)}_{j}(s)\) with \(s \ne T^{\prime }[j]\). The while loop is executed only when \(\alpha ^{\prime }\ge \frac {1}{z}\), so the product of these probabilities (excluding the one retrieved in the final iteration) is at least \(\frac {1}{z}\). Consequently, the overall retrieval time in the p-th step is \(O(\log z)\).

This way, we can implement the algorithm in \(O(R+n \log z)\) time.

Theorem 1.2

Weighted Pattern Matching can be solved in \(O(R+n \log z)\) time.

Remark 4.4

In the same complexity one can solve GWPM with a solid text.

5 Profile Consensus and Multichoice Knapsack

Let us start with a precise statement of the Multichoice Knapsack problem.

Multichoice Knapsack
Input: Classes C1,…,Cn containing N items in total, each item c with a value v(c) and a weight w(c), and thresholds V and W.
Output: A choice \(c_{1} \in C_{1},\ldots ,c_{n} \in C_{n}\) such that \({\sum }_{i} v(c_{i}) \le V\) and \({\sum }_{i} w(c_{i}) \le W\), or a statement that no such choice exists.

For a fixed instance of Multichoice Knapsack, we say that S is a partial choice if \(|S \cap C_{i}|\le 1\) for each class. The set \(D = \{i : |S \cap C_{i}| = 1\}\) is called its domain. For a partial choice S, we define \(v(S) = {\sum }_{c \in S} v(c)\) and \(w(S) = {\sum }_{c \in S} w(c)\).

5.1 Profile Consensus versus Multichoice Knapsack

As shown below, Profile Consensus and Multichoice Knapsack are essentially equivalent problems.

Fact 5.1

  1.

    Consider an instance of Profile Consensus with two m × σ profiles P, Q and a common threshold Z. In O(mσ) time one can construct an equivalent instance of Multichoice Knapsack with m classes of σ items each, AV = |MZ(P)|, and AW = |MZ(Q)|.

  2.

    Consider an instance of Multichoice Knapsack with n classes of at most λ items each. In O(nλ) time one can construct an equivalent instance of Profile Consensus with two n × λ profiles P, Q and a common threshold Z such that |MZ(P)| = AV and |MZ(Q)| = AW.

Proof

Given an instance (P,Q,Z) of the Profile Consensus problem, we construct an equivalent instance of Multichoice Knapsack with m classes of σ items each, denoted ci,j for 1 ≤ i ≤ m and 1 ≤ j ≤ σ, each with value v(ci,j) = −P[i,j] and weight w(ci,j) = −Q[i,j]. We set both thresholds to V = W = −Z. It is straightforward to verify that the constructed instance satisfies the required conditions.

This construction is easily reversible if V = W and the size of each class is λ. In general, we add dummy items (with infinite or very large weight and value), decrease the weight of each item by \(\frac {1}{n}(W-V)\), and decrease the weight threshold to V . □

The only technical difference between Multichoice Knapsack and Profile Consensus is that the profiles are stored as dense m × σ matrices while the classes in Multichoice Knapsack can be of different size so the input size N can be smaller than the number of classes n times the bound λ on the class size.

Below, we formulate our results in the more established language of Multichoice Knapsack.

5.2 Overview of the Solution

The classic \(O(2^{n/2})\)-time solution to the Knapsack problem [12] is based on a meet-in-the-middle approach. The set D = {1,…,n} is partitioned into two domains D1,D2 of size roughly n/2, and for each Di, all partial choices S are generated and ordered by v(S). This reduces the problem to an instance of Multichoice Knapsack with two classes, which is solved using a folklore linear-time solution (described for completeness in Section 5.5).

The meet-in-the-middle approach to Knapsack generalises directly to a solution to Multichoice Knapsack. The partition may be chosen so as to balance the number of partial choices in each domain, and so the worst-case time complexity is \(O(\sqrt {Q\lambda })\), where \(Q={\prod }_{i = 1}^{n} |C_{i}|\) is the number of choices.

Our aim in this section is to replace Q with the parameter a (which never exceeds Q). The overall running time is going to be \(O(N+\sqrt {a\lambda }\log A)\).

Again, we will partition the set of classes into two groups, for each group we will generate a subset of all partial choices, and then we will check if two partial choices can be joined into a feasible solution. However, several questions arise with this approach in order to obtain the desired complexity:

  (1)

    How to partition the set of classes?

  (2)

    In what order should the partial choices be generated?

  (3)

    How many partial choices should be generated, given that the value of the parameter a is not known in advance?

As for question (1), we consider all partitions of the form D = {1,…,j}∪{j + 1,…,n} for 1 ≤ j ≤ n. This results in an extra O(n) factor in the time complexity. However, in Section 5.7 we introduce preprocessing which reduces the general case to the case when \(n=O\left (\frac {\log A}{\log \lambda }\right )\).

A natural idea to deal with question (2) is to consider only partial choices with small values v(S) or w(S). This is close to our actual solution, which is based on a notion of ranks of partial choices that we introduce in Section 5.3.

Finally, to tackle question (3), we generate the partial choices batch-wise until either a solution is found or we can certify that it does not exist. The idea of this step is presented also in Section 5.3, while the generation procedure is detailed in Section 5.4. While dealing with these issues, a careful implementation is required to avoid several further extra factors in the running time.

In the end, we show that the number of partial choices that need to be generated is indeed \(O(\sqrt {a\lambda })\). Our final solution to Multichoice Knapsack is presented in Section 5.6 without the instance size reduction and in Section 5.8 using the reduction.

5.3 Ranks of Partial Choices

For a partial choice S, we define rankv(S) as the number of partial choices \(S^{\prime }\) with the same domain for which \(v(S^{\prime })\le v(S)\). We symmetrically define rankw(S). For simplicity, if \(c \in C_{i}\), we denote rankv(c) = rankv({c}) and rankw(c) = rankw({c}). Ranks are introduced as an analogue of match probabilities in weighted sequences. Probabilities are multiplicative, while for ranks we only have the following supermultiplicativity property:

Fact 5.2

If \(S = S_{1} \cup S_{2}\) is a decomposition of a partial choice S into two disjoint subsets, then rankv(S1)rankv(S2) ≤ rankv(S) (and the same for rankw).

Proof

Let D1 and D2 be the domains of S1 and S2, respectively. For all partial choices \(S^{\prime }_{1}\) over D1 and \(S^{\prime }_{2}\) over D2 such that \(v(S^{\prime }_{1}) \le v(S_{1})\) and \(v(S^{\prime }_{2}) \le v(S_{2})\), we have \(v(S^{\prime }_{1} \cup S^{\prime }_{2})=v(S^{\prime }_{1})+v(S^{\prime }_{2})\le v(S)\). Hence, \(S^{\prime }_{1}\cup S^{\prime }_{2}\) must be counted while determining rankv(S). □

For 0 ≤ j ≤ n, let Lj be the list of partial choices with domain {1,…,j} ordered by value v(S), and for ℓ > 0 let Lj[ℓ] be the ℓ-th element of Lj. Analogously, for 1 ≤ j ≤ n + 1, we define Rj as the list of partial choices over {j,…,n} ordered by v(S), and for r > 0, Rj[r] as the r-th element of Rj. If any of the partial choices Lj[ℓ], Rj[r] does not exist, we assume that its value is \(\infty \).

The following two observations yield a decomposition of each choice into a single item and two partial solutions of a small rank. Observe that we do not need to know AV in order to check if the ranks are sufficiently large.

Lemma 5.3

Let ℓ and r be positive integers such that v(Lj[ℓ]) + v(Rj+1[r]) > V for each 0 ≤ j ≤ n. For every choice S with v(S) ≤ V, there is an index j ∈{1,…,n} and a decomposition S = L ∪{c}∪ R such that v(L) < v(Lj−1[ℓ]), \(c \in C_{j}\), and v(R) < v(Rj+1[r]).

Proof

Let S = {c1,…,cn} with \(c_{i} \in C_{i}\) and, for 0 ≤ i ≤ n, let Si = {c1,…,ci}. If v(Sn−1) < v(Ln−1[ℓ]), we set L = Sn−1, c = cn, and R = ∅, satisfying the claimed conditions.

Otherwise, we define j as the smallest index i such that v(Si) ≥ v(Li[ℓ]), and we set L = Sj−1, c = cj, and R = S ∖ Sj. The definition of j implies v(L) < v(Lj−1[ℓ]) and v(L ∪{c}) ≥ v(Lj[ℓ]). Moreover, we have v(L ∪{c}) + v(R) = v(S) ≤ V < v(Lj[ℓ]) + v(Rj+1[r]), and thus v(R) < v(Rj+1[r]). □

Fact 5.4

Let ℓ, r > 0. If v(Lj[ℓ]) + v(Rj+1[r]) ≤ V for some j ∈{0,…,n}, then ℓr ≤ AV.

Proof

Let L and R be the ℓ-th and r-th entry in Lj and Rj+1, respectively. Note that v(L ∪ R) ≤ V implies rankv(L ∪ R) ≤ AV by the definition of AV. Moreover, rankv(L) ≥ ℓ and rankv(R) ≥ r (the inequalities may be strict due to ties). Now, Fact 5.2 yields the claimed bound. □

5.4 Generating Partial Choices of Small Rank

Note that Lj can be obtained by interleaving |Cj| copies of Lj− 1, where each copy corresponds to extending the choices from Lj− 1 with a different item. If we were to construct Lj having access to the whole Lj− 1, we could apply the following standard procedure. For each \(c \in C_{j}\), we maintain an iterator on Lj− 1 pointing to the first element S on Lj− 1 for which S ∪{c} has not yet been added to Lj. The associated value is v(S ∪{c}). All iterators initially point at the first element of Lj− 1. Then the next element to append to Lj is always S ∪{c} corresponding to the iterator with minimum value. Having processed this partial choice, we advance the iterator (or remove it, once it has already scanned the whole Lj− 1). This process can be implemented using a binary heap Hj as a priority queue, so that initialisation requires O(|Cj|) time and outputting a single element takes \(O(\log |C_{j}|)\) time. Each partial choice \(S \in \textbf {L}_{j}\) is stored in O(1) space using a pointer to a partial choice \(S^{\prime } \in \textbf {L}_{j-1}\) such that \(S=S^{\prime } \cup \{c\}\) for some \(c \in C_{j}\).
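As an illustration of this merging step, here is an offline sketch in Python (Lemma 5.5 below interleaves such steps across all classes to produce the prefixes online); the representation of partial choices as explicit lists is for readability only.

```python
import heapq

def next_level(prev_level, cls):
    """Offline construction of L_j from L_{j-1} (illustrative sketch).

    prev_level: non-empty list of (value, choice) pairs sorted by value (L_{j-1});
    cls: list of item values of class C_j. Returns L_j, i.e. all extensions
    choice + [c] for c in cls, sorted by value, merged with a heap of |C_j| iterators.
    """
    # one iterator per item c of C_j, initially at the first element of L_{j-1}
    heap = [(prev_level[0][0] + c, c, 0) for c in cls]
    heapq.heapify(heap)
    out = []
    while heap:
        value, c, pos = heapq.heappop(heap)
        out.append((value, prev_level[pos][1] + [c]))
        if pos + 1 < len(prev_level):                    # advance this iterator
            heapq.heappush(heap, (prev_level[pos + 1][0] + c, c, pos + 1))
    return out

# Starting from L_0 = [(0, [])], repeated calls produce L_1, L_2, and so on.
```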

For i ≥ 0, let \(\textbf {L}^{(i)}_{j}\) be the prefix of Lj of length \(\min (i,|\textbf {L}_{j}|)\) and \(\textbf {R}^{(i)}_{j}\) be the prefix of Rj of length \(\min (i, |\textbf {R}_{j}|)\). A technical transformation of the procedure stated above leads to an online algorithm that constructs the prefixes \(\textbf {L}^{(i)}_{j}\) and \(\textbf {R}^{(i)}_{j}\), as shown in the following lemma. Along with each reported partial choice S, the algorithm also computes w(S).

Lemma 5.5

After O(N)-time initialisation, one can compute L1[i],…,Ln[i] knowing \(\textbf {L}^{(i-1)}_{1},\ldots ,\textbf {L}^{(i-1)}_{n}\) in \(O(n\log \lambda )\) time. Symmetrically, one can construct R1[i],…,Rn[i] from \(\textbf {R}^{(i-1)}_{1},\ldots ,\textbf {R}^{(i-1)}_{n}\) in the same time complexity.

Proof

Our online algorithm is going to use the same approach as the offline computation of lists \(\textbf {L}_{j}^{(i)}\). The order of computations will be different, though.

At each step, for j = 1 to n we shall extend lists \(\textbf {L}^{(i-1)}_{j}\) with a single element (unless the whole Lj has already been generated) from the top of the heap Hj. We keep an invariant that each iterator in Hj always points to an element that is already in \(\textbf {L}^{(i-1)}_{j-1}\) or to Lj−1[i]: the first element that has not yet been added to the generated prefix of Lj−1, which is represented by the top of the heap Hj−1.

We initialise the heaps as follows: we introduce H0 which represents the empty choice with v() = 0. Next, for j = 1,…,n we build the heap Hj representing |Cj| iterators initially pointing to the top of Hj− 1. The initialisation takes O(N) time in total since a binary heap can be constructed in time linear in its size.

At each step, the lists \(\textbf {L}^{(i-1)}_{j}\) are extended for consecutive values j from 1 to n. Since \(\textbf {L}^{(i-1)}_{j-1}\) is extended before \(\textbf {L}^{(i-1)}_{j}\), by the invariant, all iterators in Hj point to the elements of \(\textbf {L}^{(i)}_{j-1}\) while we compute Lj[i]. We take the top of Hj and move it to \(\textbf {L}^{(i)}_{j}\). Next, we advance the corresponding iterator and update its position in the heap Hj. After this operation, the iterator might point to the top of Hj− 1. If Hj− 1 is empty, this means that the whole list Lj− 1 has already been generated and traversed by the iterator. In this case, we remove the iterator.

This way we indeed simulate the previous offline solution. A single phase makes O(1) operations on each heap Hj. The running time is bounded by \(O({\sum }_{j} \log |C_{j}|)=O(n\log \lambda )\) at each step of the algorithm. □

5.5 Multichoice Knapsack for n = 2 Classes

Let us recall the final processing of the meet-in-the-middle solution to the Knapsack problem [12]. We formulate it in terms of Multichoice Knapsack with two classes.

An item \(c \in C_{j}\) is irrelevant if there is another item \(c^{\prime }\in C_{j}\) that dominates c, i.e., such that \(v(c) \ge v(c^{\prime })\) and \(w(c) \ge w(c^{\prime })\). Observe that removing an irrelevant item leads to an equivalent instance of the Multichoice Knapsack problem, and it may only decrease the parameters AV and AW.

Lemma 5.6

The Multichoice Knapsack problem can be solved in O(N) time if n = 2 and the elements c of C1 and C2 are sorted by v(c).

Proof

Since the items of C1 and C2 are sorted by v(c), a single scan through these items lets us remove all irrelevant elements. Next, for each \(c_{1} \in C_{1}\) we compute \(c_{2} \in C_{2}\) such that v(c2) ≤ V − v(c1) but otherwise v(c2) is largest possible. As we have removed irrelevant elements from C2, this item also minimises w(c2) among all elements satisfying v(c2) ≤ V − v(c1). Hence, if there is a feasible solution containing c1, then {c1,c2} is feasible. If we process elements c1 by non-decreasing values v(c1), the values v(c2) do not increase, and thus the items c2 can be computed in O(N) time in total. □
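The scan from the proof of Lemma 5.6 can be sketched as follows; items are (value, weight) pairs already sorted by value, and the helper names are illustrative.

```python
def two_class_knapsack(C1, C2, V, W):
    """Lemma 5.6 (sketch): Multichoice Knapsack with n = 2 classes sorted by value.
    Returns a pair (c1, c2) with value sum <= V and weight sum <= W, or None."""
    def remove_irrelevant(C):
        # one scan by non-decreasing value; kept items have strictly decreasing weight
        kept = []
        for v, w in C:
            if not kept or w < kept[-1][1]:
                kept.append((v, w))
        return kept

    C1, C2 = remove_irrelevant(C1), remove_irrelevant(C2)
    j = len(C2) - 1
    for v1, w1 in C1:                   # v1 is non-decreasing, so j only moves left
        while j >= 0 and C2[j][0] > V - v1:
            j -= 1
        if j < 0:
            break                       # no item of C2 fits with this (or any larger) v1
        v2, w2 = C2[j]                  # largest feasible value, hence minimal weight
        if w1 + w2 <= W:
            return (v1, w1), (v2, w2)
    return None
```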

5.6 Multichoice Knapsack Parameterised by a

Combining the procedures of Lemmas 5.5 and 5.6 with the combinatorial results of Section 5.3, we obtain the first algorithm for Multichoice Knapsack parameterised by a.

Proposition 5.7

Multichoice Knapsack can be solved in \(O(n(\lambda +\sqrt {a\lambda })\log \lambda )\) time.

Proof

Below, we give an algorithm working in \(O(n(\lambda +\sqrt {A_{V}\lambda })\log \lambda )\) time. The final solution runs it in parallel on the original instance and on the instance with v and V swapped with w and W , waiting until at least one of them terminates.

We increment an integer r starting from 1, maintaining \(\ell =\left \lceil {\frac {r}{\lambda }} \right \rceil \) and the lists \(\textbf {L}_{j}^{(\ell )}\) and \(\textbf {R}_{j + 1}^{(r)}\) for 0 ≤ j ≤ n, as long as v(Lj[ℓ]) + v(Rj+1[r]) ≤ V for some j (or until all the lists have been completely generated). By Fact 5.4, we stop at \(r=O(\sqrt {A_{V} \lambda })\) and due to Lemma 5.5, the process takes \(O(n\sqrt {A_{V} \lambda }\log \lambda )\) time.

According to Lemma 5.3, every feasible solution S admits a decomposition S = L ∪{c}∪ R with \(L\in \textbf {L}_{j-1}^{(\ell )}\), \(c\in C_{j}\), and \(R\in \textbf {R}_{j + 1}^{(r)}\) for some index j. We consider all possibilities for j. For each of them we will reduce searching for S to an instance of the Multichoice Knapsack problem with 2 classes of \(O(\sqrt {A_{V}\lambda })\) items. By Lemma 5.6, these instances can be solved in \(O(n\sqrt {A_{V}\lambda })\) time in total.

The items of the j-th instance are going to belong to classes \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) and \(\textbf {R}_{j + 1}^{(r)}\), where \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j} = \{L\cup \{c\} : L\in \textbf {L}_{j-1}^{(\ell )} , c\in C_{j}\}\). The set \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) is constructed by merging |Cj|≤ λ sorted lists, each of size \(\ell =O(1+\sqrt {A_{V}/\lambda })\). This takes \(O((\lambda +\sqrt {A_{V}\lambda })\log \lambda )\) time, which results in \(O(n(\lambda +\sqrt {A_{V} \lambda })\log \lambda )\) time over all indices j.

Clearly, each feasible solution of the constructed instances represents a feasible solution of the initial instance, and by Lemma 5.3, every feasible solution of the initial instance has its counterpart in one of the constructed instances. □

5.7 Preprocessing to Reduce Instance Size

In order to improve the running time for Multichoice Knapsack, we develop two reductions and run them as preprocessing to the procedure of Proposition 5.7. First, we observe that items c with rankv(c) > AV or rankw(c) > AW cannot belong to any feasible solution. Moreover, their removal results in \(\lambda \le a\), which lets us hide the \(O(n\lambda \log \lambda )\) term in the running time. Our second reduction decreases the number of classes n to \(O\left (\frac {\log A}{\log \lambda }\right )\). For this, we repeatedly remove irrelevant items (as defined in Section 5.5) and merge small classes into their Cartesian product (so that the class sizes are more balanced).

For each class Ci, let \(v_{\min }(i) = \min \{v(c) : c\in C_{i}\}\). Also, let \(V_{\min } = {\sum }_{i = 1}^{n} v_{\min }(i)\); note that \(V_{\min }\) is the smallest possible value v(S) of a choice S. We symmetrically define \(w_{\min }(i)\) and \(W_{\min }\).

Lemma 5.8

Given an instance I of theMultichoice Knapsackproblem, one can compute inO(N) timean equivalent instance\(I^{\prime }\)with\(N^{\prime } \le N\),\(n^{\prime }=n\),\(A_{V}^{\prime }= A_{V}\),\(A_{W}^{\prime }= A_{W}\),and\(\lambda ^{\prime }\le \min (\lambda ,a)\).

Proof

From each class Ci we remove all items c such that \(V_{\min }+v(c)-v_{\min }(i)>V\) or \(W_{\min }+w(c)-w_{\min }(i)>W\). Afterwards, for each item \(c \in C_{i}\) one can obtain a choice S such that \(c \in S\) and v(S) ≤ V (or w(S) ≤ W) by choosing the elements with the minimal value (minimal weight, respectively) in all the remaining classes. □
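A direct transcription of this reduction (an illustrative sketch; items are (value, weight) pairs):

```python
def remove_unusable_items(classes, V, W):
    """Lemma 5.8 (sketch): drop every item that cannot occur in a choice of
    value at most V or cannot occur in a choice of weight at most W."""
    vmin = [min(v for v, _ in C) for C in classes]
    wmin = [min(w for _, w in C) for C in classes]
    Vmin, Wmin = sum(vmin), sum(wmin)
    return [
        [(v, w) for (v, w) in C
         if Vmin - vmin[i] + v <= V and Wmin - wmin[i] + w <= W]
        for i, C in enumerate(classes)
    ]
```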

Our second preprocessing consists of several steps. First, we quickly reduce the number of classes to \(n=O(\log A)\).

Lemma 5.9

Given an instance I of the Multichoice Knapsack problem, one can compute in linear time an equivalent instance \(I^{\prime }\) with \(N^{\prime }\le N\), \(A_{V}^{\prime }\le A_{V}\), \(A_{W}^{\prime }\le A_{W}\), \(\lambda ^{\prime }\le \lambda \), and \(n^{\prime } \le 2\log A\).

Proof

Observe that if a class Ci contains an item c for which both \(v(c)=v_{\min }(i)\) and \(w(c)=w_{\min }(i)\), then we can greedily include it in the solution S. Hence, we can remove such a class, setting \(V := V- v_{\min }(i)\) and \(W := W- w_{\min }(i)\). We execute this reduction rule exhaustively, which clearly takes O(N) time in total and may only decrease the parameters AV and AW. After the reduction, the minima \(v_{\min }(i)\) and \(w_{\min }(i)\) must be attained by distinct items of every class Ci.

We shall prove that now we can either find out that \(A \ge 2^{n/2}\) or that we are dealing with a NO-instance. To decide which case holds, let us define ΔV(i) as the difference between the second smallest value in the multiset \(\{v(c) : c\in C_{i}\}\) and \(v_{\min }(i)\). We set \({\Delta }_{V}^{\text {mid}}\) as the sum of the \(\left \lceil {\frac {n}{2}} \right \rceil \) smallest values ΔV(i) for 1 ≤ i ≤ n; we define \({\Delta }_{W}^{\text {mid}}\) analogously.

Claim 1

If \(V_{\min } + {\Delta }_{V}^{\text {mid}} \le V\), then \(A_{V} \ge 2^{n/2}\); if \(W_{\min } + {\Delta }_{W}^{\text {mid}} \le W\), then \(A_{W} \ge 2^{n/2}\); otherwise, we are dealing with a NO-instance.

Proof

First, assume that \(V_{\min } + {\Delta }_{V}^{\text {mid}} \le V\). This means that there is a choice S with v(S) ≤ V containing at least \(\frac {n}{2}\) items c such that rankv(c) ≥ 2. Hence, Fact 5.2 yields \(\text{rank} _{v}(S)\ge 2^{\left \lceil {n/2} \right \rceil }\) and consequently \(A_{V} \ge 2^{n/2}\), as claimed. Symmetrically, if \(W_{\min } + {\Delta }_{W}^{\text {mid}} \le W\), then \(A_{W} \ge 2^{n/2}\).

Now, suppose that there is a feasible solution S. As no class contains a single item minimising both v(c) and w(c), there are at least \(\left \lceil {\frac {n}{2}} \right \rceil \) classes for which S contains an item not minimising v(c), or at least \(\left \lceil {\frac {n}{2}} \right \rceil \) classes for which S contains an item not minimising w(c). Without loss of generality, we assume that the former holds. Let D be the set of at least \(\left \lceil {\frac {n}{2}} \right \rceil \) classes i satisfying the condition. If \(c \in C_{i}\) does not minimise v(c), then \(v(c)\ge v_{\min }(i)+{\Delta }_{V}(i)\). Consequently, \(V\ge v(S) \ge V_{\min } + {\sum }_{i\in D} {\Delta }_{V}(i)\). However, observe that \( {\sum }_{i\in D} {\Delta }_{V}(i) \ge {\Delta }_{V}^{\text {mid}}\), so \(V \ge V_{\min } + {\Delta }_{V}^{\text {mid}}\), as claimed. □

The conditions from the claim can be verified in O(N) time using a linear-time selection algorithm to compute \({\Delta }_{V}^{\text {mid}}\) and \({\Delta }_{W}^{\text {mid}}\). If any of the first two conditions holds, we return the instance obtained using our reduction. Otherwise, we output a dummy NO-instance. □

In the improved reduction we use two basic steps. The first one is expressed in the following lemma.

Lemma 5.10

Consider a class of items in an instance of the Multichoice Knapsack problem. In linear time, we can remove some irrelevant items from the class so that the resulting class C satisfies \(\max (\text{rank} _{v}(c),\text{rank} _{w}(c)) > \frac {1}{3} |C|\) for each item \(c \in C\).

Proof

First, note that using a linear-time selection algorithm, we can determine for each item c whether \(\text{rank} _{v}(c)\le \frac {1}{3}|C|\) and whether \(\text{rank} _{w}(c)\le \frac {1}{3}|C|\). If there is no item satisfying both conditions, we keep C unaltered. Otherwise, we have an item c which dominates at least \(|C|-\text{rank} _{v}(c)-\text{rank} _{w}(c) \ge \frac {1}{3}|C|\) other items. We scan through all items in C and remove those dominated by c. Next, we repeat the algorithm. The running time of a single phase is clearly linear, and since |C| decreases geometrically, the total running time is also linear. □

The second reduction step decreases the number of classes by replacing two distinct classes Ci, Cj with their Cartesian product Ci × Cj, assuming that the value (weight) of a pair (ci,cj) is the sum of values (weights) of ci and cj. This clearly leads to an equivalent instance of the Multichoice Knapsack problem, does not alter the parameters AV, AW, and decreases n. On the other hand, N and λ may increase; the latter happens only if |Ci|⋅|Cj| > λ.

These two reduction rules let us implement our preprocessing procedure.

Lemma 5.11

Given an instance I of the Multichoice Knapsack problem, one can compute in \(O(N+\lambda \log A)\) time an equivalent instance \(I^{\prime }\) with \(A_{V}^{\prime }\le A_{V}\), \(A_{W}^{\prime }\le A_{W}\), \(\lambda ^{\prime }\le \lambda \), and \(n^{\prime }=O\left (\frac {\log A}{\log \lambda }\right )\).

Proof

First, we apply Lemma 5.9 to make sure that \(n\le 2\log A\) and \(N = O(\lambda \log A)\). We may now assume that \(\lambda \ge 3^{6}\), as otherwise we already have \(n = O\left (\frac {\log A}{\log \lambda }\right )\).

Throughout the algorithm, whenever there are two distinct classes of size at most \(\sqrt {\lambda }\), we shall replace them with their Cartesian product. This may happen only n − 1 times, and a single execution takes O(λ) time, so the total running time needed for this part is \(O(\lambda \log A)\).

Furthermore, for every class that we get in the input instance or obtain as a Cartesian product, we apply Lemma 5.10. The total running time spent on this is also \(O(\lambda \log A)\).

Having exhaustively applied these reduction rules, we are guaranteed that we have \(\max (\text{rank} _{v}(c),\text{rank} _{w}(c))>\frac {1}{3}\sqrt {\lambda }\ge \lambda ^{\frac {1}{3}}\) for items c from all but one class. Without loss of generality, we assume that the classes satisfying this condition are C1,…,Ck.

Recall that \(v_{\min }(i)\) and \(w_{\min }(i)\) are defined as minimum values and weights of items in class Ci and that \(V_{\min }\) and \(W_{\min }\) are their sums over all classes. For 1 ≤ i ≤ k, we define ΔV(i) as the difference between the \(\left \lceil {\lambda ^{\frac {1}{3}}}\right \rceil \)-th smallest value in the multiset \(\{v(c) : c\in C_{i}\}\) and \(v_{\min }(i)\). Next, we define \({\Delta }_{V}^{\text {mid}}\) as the sum of the \(\left \lceil {\frac {k}{2}} \right \rceil \) smallest values ΔV(i). Symmetrically, we define ΔW(i) and \({\Delta }_{W}^{\text {mid}}\). We shall prove a claim analogous to that in the proof of Lemma 5.9.

Claim 2

If \(V_{\min } + {\Delta }_{V}^{\text {mid}}\le V\), then \(A_{V} \ge \lambda ^{\frac {1}{6} k}\); if \(W_{\min } + {\Delta }_{W}^{\text {mid}}\le W\), then \(A_{W} \ge \lambda ^{\frac {1}{6} k}\); otherwise, we are dealing with a NO-instance.

Proof

First, suppose that \(V_{\min } + {\Delta }_{V}^{\text {mid}}\le V\). This means that there is a choice S with v(S) ≤ V which contains at least \(\frac {k}{2}\) items c with \(\text{rank} _{v}(c)\ge \lambda ^{\frac {1}{3}}\). By Fact 5.2, the rank of this choice is at least \(\lambda ^{\frac {1}{6} k}\), so \(A_{V} \ge \lambda ^{\frac {1}{6} k}\), as claimed. The proof of the second case is analogous.

Now, suppose that there is a feasible solution S = {c1,…,cn}. For 1 ≤ i ≤ k, we have \(\text{rank} _{v}(c_{i})\ge \lambda ^{\frac {1}{3}}\) or \(\text{rank} _{w}(c_{i}) \ge \lambda ^{\frac {1}{3}}\). Consequently, \(\text{rank} _{v}(c_{i})\ge \lambda ^{\frac {1}{3}}\) holds for at least \(\left \lceil {\frac {k}{2}} \right \rceil \) classes or \(\text{rank} _{w}(c_{i})\ge \lambda ^{\frac {1}{3}}\) holds for at least \(\left \lceil {\frac {k}{2}} \right \rceil \) classes. Without loss of generality, we assume that the former holds. Let D be the set of (at least \(\left \lceil {\frac {k}{2}} \right \rceil \)) classes i satisfying the condition. For each \(i \in D\), we clearly have \(v(c_{i})\ge v_{\min }(i)+{\Delta }_{V}(i)\), while for each \(i \notin D\), we have \(v(c_{i})\ge v_{\min }(i)\). Consequently, \(V\ge v(S) \ge V_{\min } + {\sum }_{i\in D} {\Delta }_{V}(i) \ge V_{\min } + {\Delta }_{V}^{\text {mid}}\). Hence, \(V \ge V_{\min } + {\Delta }_{V}^{\text {mid}}\), which concludes the proof. □

The condition from the claim can be verified using a linear-time selection algorithm: first, we apply it for each class to compute ΔV(i) and ΔW(i), and then, globally, to determine \({\Delta }_{V}^{\text {mid}}\) and \({\Delta }_{W}^{\text {mid}}\). If one of the first two conditions holds, we return the instance obtained through the reduction. It satisfies \(A \ge \lambda ^{\frac {1}{6} k}\), i.e., \(n \le 1+k \le 1 + 6\frac {\log A}{\log \lambda }\). Otherwise, we construct a dummy NO-instance. □

5.8 Main Result

We apply the preprocessing of the previous section to arrive at our final algorithm.

Theorem 1.3

Multichoice Knapsack can be solved in \(O(N+\sqrt {a\lambda }\log A)\) time.

Proof

Before running the algorithm of Proposition 5.7, we apply the reductions of Lemmas 5.8 and 5.11. With this order of reductions, we already have \(\lambda \le a\) during the execution of Lemma 5.11, so the \(O(\lambda \log A)\) term is dominated by \(O(\sqrt {a\lambda }\log A)\). □

6 Weighted Consensus and General Weighted Pattern Matching

The Weighted Consensus problem is formally defined as follows.

Weighted Consensus
Input: Two weighted sequences X and Y of the same length and a threshold probability \(\frac {1}{z}\).
Output: A string S such that \(S \approx _{\frac {1}{z}} X\) and \(S \approx _{\frac {1}{z}} Y\), or a statement that no such string exists.

Due to Facts 4.1 and 5.1, the Weighted Consensus problem is essentially equivalent to Multichoice Knapsack. The only difference is that we study Multichoice Knapsack with respect to unknown parameters a and A, whereas in Weighted Consensus we know the parameter z. By Observation 4.2, these values for equivalent instances satisfy \(a \le A \le z\), so Theorem 1.3 immediately yields:

Proposition 6.1

Weighted Consensus can be solved in \(O(R+\sqrt {z\lambda }\log z)\) time.

In Sections 6.2 and 6.3 we show that the \(O(\log z)\) term can be reduced to \(O(\log \lambda + \log \log z)\). Such an improvement is possible because the bound \(a \le A \le z\) is not tight in general.

If two weighted sequences admit a consensus, we write \(X \approx _{\frac {1}{z}} Y\) and say that X matches Y with probability at least \(\frac {1}{z}\). With this definition of a match, we extend the notion of an occurrence and the notation \(\mathit {Occ}_{\frac {1}{z}}(P,T)\) to arbitrary weighted sequences.

General Weighted Pattern Matching (GWPM)
Input: A weighted pattern P of length m, a weighted text T of length n, and a threshold probability \(\frac {1}{z}\).
Output: The set \(\mathit {Occ}_{\frac {1}{z}}(P,T)\) of all positions i such that \(P \approx _{\frac {1}{z}} T[i\mathinner {..} i+m-1]\).

In the case of the GWPM problem, it is more useful to provide an oracle that finds witness strings that correspond to the respective occurrences of the pattern. Such an oracle, given \(i \in \mathit {Occ}_{\frac {1}{z}}(P,T)\), computes a string that matches both P and T[i..i + m − 1].

6.1 Reduction to Weighted Consensus on Short Sequences

The GWPM problem clearly can be reduced to n − m + 1 instances of Weighted Consensus. This leads to a naïve \(O(nR + n\sqrt {z\lambda }\log z)\)-time algorithm. In this subsection, we remove the first term in this complexity.

Our solution applies the tools developed in Section 4 for Weighted Pattern Matching and uses an observation that is a consequence of Observation 4.3.

Observation 6.2

If X and Y are weighted sequences that match with threshold \(\frac {1}{z}\), then \(d_H(\textbf {H}(X),\textbf {H}(Y)) \le 2\left \lfloor {\log z} \right \rfloor \). Moreover, there exists a consensus string S such that S[i] = H(X)[i] = H(Y )[i] unless H(X)[i]≠H(Y )[i].

Proof

The fact that \(X \approx _{\frac {1}{z}} Y\) means that there exists a string P such that \(P \approx _{\frac {1}{z}} X\) and \(P \approx _{\frac {1}{z}} Y\). Let the set A1 represent the positions of mismatches between H(X) and P and the set A2 represent the positions of mismatches between H(Y ) and P. By Observation 4.3, \(|A_{1}|,|A_{2}| \le \left \lfloor {\log z} \right \rfloor \). Let A be the set of mismatches between H(X) and H(Y ). We have \(A \subseteq A_{1} \cup A_{2}\) and thus \(|A| \le 2\left \lfloor {\log z} \right \rfloor \). Finally, observe that for each \(i \in (A_{1} \cup A_{2}) \setminus A\) we may replace P[i] with H(X)[i] = H(Y )[i], which does not decrease the matching probability with either X or Y; this yields a string S such that \(S \approx _{\frac {1}{z}} X\) and \(S \approx _{\frac {1}{z}} Y\) and S[i] = H(X)[i] = H(Y )[i] unless \(i \in A\). □

The algorithm starts by computing \(P^{\prime }=\textbf {H}(P)\) and \(T^{\prime }=\textbf {H}(T)\) and the data structure for lcp-queries in \(P^{\prime }T^{\prime }\). We try to match P against every factor T[p..p + m − 1] of the text. Following Observation 6.2, we check if \(d_H(T^{\prime }[p\mathinner {..} p+m-1], P^{\prime })\)\( \le 2\left \lfloor {\log z} \right \rfloor \). If not, then we know that no match is possible. Otherwise, let D be the set of positions of mismatches between \(T^{\prime }[p\mathinner {..} p+m-1]\) and \(P^{\prime }\). Assume that we store \(\alpha = {\prod }_{j = 1}^{m} \pi ^{(T)}_{p+j-1}(T^{\prime }[p+j-1])\) and \(\beta = {\prod }_{j = 1}^{m} \pi ^{(P)}_{j}(P^{\prime }[j])\). Now, we only need to check what happens at the positions in D. If D = , it suffices to check if \(\alpha \ge \frac {1}{z}\) and \(\beta \ge \frac {1}{z}\).

Otherwise, we construct two weighted sequences X and Y by selecting only the positions from D in T[p..p + m − 1] and in P. In O(|D|) time we can compute \(\alpha ^{\prime }={\prod }_{j\notin D} \pi ^{(T)}_{p+j-1}(T^{\prime }[p+j-1])\) and \(\beta ^{\prime } = {\prod }_{j \notin D} \pi ^{(P)}_{j}(P^{\prime }[j])\). We multiply the probabilities of all letters at the first position in X by \(\alpha ^{\prime }\) and in Y by \(\beta ^{\prime }\). It is clear that \(X\approx _{\frac {1}{z}} Y\) if and only if \(T[p\mathinner {..} p+m-1]\approx _{\frac {1}{z}} P\).

Thus, we reduced the GWPM problem to at most n − m + 1 instances of the problem of Weighted Consensus for sequences of length \(O(\log z)\). If we memorise the solutions to all those instances together with the underlying sets of mismatches D, we can also implement the oracle for the GWPM problem with O(m)-time queries. We obtain the following reduction.

Lemma 6.3

The GWPM problem and the computation of its oracle can be reduced in \(O(R + (n-m + 1)\log z)\) time to at most n − m + 1 instances of the Weighted Consensus problem for weighted sequences of length \(O(\log z)\).

By Proposition 6.1, each of the resulting instances of Weighted Consensus can be solved in \(O(\lambda \log z + \sqrt {z\lambda }\log z)=O(\sqrt {z\lambda }\log z)\) time (due to z ≥ λ).

Proposition 6.4

The GWPM problem can be solved in \(O(n\sqrt {z\lambda }\log z)\) time. An oracle for the GWPM problem using \(O(n \log z)\) space and supporting queries in O(m) time can be computed within the same time complexity.

In the remainder of this section, we design a tailor-made solution which lets us improve the \(O(\log z)\) factors in Propositions 6.1 and 6.4 to \(O(\log \log z + \log \lambda )\).

6.2 Reduction to Short Dissimilar Weighted Consensus

Let us notice that in the previous subsection we actually reduced GWPM to instances of Weighted Consensus that satisfy an additional dissimilarity requirement, as stated in the following problem.

[Problem definition box: Short Dissimilar Weighted Consensus (SDWC).]

In the SDWC problem, we further require an ordering of letters according to their probabilities. This assumption is trivial if σ = O(1); otherwise, we use the preprocessing of Section 5.7 to expedite sorting. The following result refines Lemma 6.3.

Lemma 6.5

The GWPM problem and the computation of its oracle can be reduced in \(O(R + (n-m + 1)\lambda \log z)\) time to at most n − m + 1 instances of SDWC.

Proof

The reduction of Section 6.1 in \(O(R + (n-m + 1)\log z)\) time results in n − m + 1 dissimilar instances of length at most \(2\log z\). However, the characters are not ordered by non-increasing probabilities. Before we sort them, we apply Lemma 5.11 in order to reduce the length to \(O(\frac {\log z}{\log \lambda })\); this takes \(O(\lambda \log z)\) time. Note that both removing irrelevant characters and merging two positions into their Cartesian product preserve the property that the probabilities at each position sum up to at most one, so the resulting instance of Multichoice Knapsack can be interpreted back as an instance of Weighted Consensus. Finally, we sort the probabilities in \(O(\lambda \log \lambda )\) time per position, i.e., in \(O(\lambda \log z)\) time per instance of SDWC. □

6.3 Solving Short Dissimilar Weighted Consensus

6.3.1 Overview

We follow the same general meet-in-the-middle scheme as the algorithm for Multichoice Knapsack presented in Proposition 5.7. The latter relies on Lemma 5.3, whose analogue in terms of weighted sequences and probabilities is much simpler.

Observation 6.6

Consider weighted sequences X and Y of length n and \(z,z_{\ell },z_{r}\in \mathbb {R}_{+}\) such that \(z_{\ell } z_{r}\ge z\). Any \(S\in \textbf {M}_{z}(X) \cap \textbf {M}_{z}(Y)\) admits a decomposition S = LcR, where:

  • \(\P (L, X[1\mathinner {..} |L|])\ge \frac {1}{z_{\ell }}\),

  • c is a single letter,

  • \(\P (R, X[n-|R|+ 1\mathinner {..} n])\ge \frac {1}{z_{r}}\).

Motivated by this formulation, we employ a notion of \(\frac {1}{z}\)-solid prefixes of a weighted sequence X—strings S such that \(S \approx _{\frac {1}{z}} X[1\mathinner {..} |S|]\)—and a symmetric notion of \(\frac {1}{z}\)-solid suffixes. By Observation 4.2, the number of \(\frac {1}{z}\)-solid prefixes of a weighted sequence X of length n is at most nz. A direct application of the approach of Proposition 5.7, using solid prefixes and suffixes as partial choices, would result in generating up to \(nz_{\ell }\) solid prefixes and \(nz_{r}\) solid suffixes of X. Recall that, in the case of SDWC, \(n = O(\log z)\).

However, \(\frac {1}{z}\)-solid prefixes have more structure than prefix partial choices of rank at most z. We exploit this structure by introducing a notion of light \(\frac {1}{z}\)-solid prefixes, that is, \(\frac {1}{z}\)-solid prefixes that end with a non-heavy letter in X; these are the key ingredient in our solution. We show that the number of light \(\frac {1}{z}\)-solid prefixes of X is at most z. Our algorithm for SDWC applies this fact to limit the number of generated \(\frac {1}{z_{\ell }}\)-solid prefixes and \(\frac {1}{z_{r}}\)-solid suffixes to \(z_{\ell }\) and \(z_{r}\), respectively.

The following subsections correspond to subsequent subsections of Section 5:

  • In Section 6.3.2 (corresponds to Section 5.3) we show the O(z) bound on the number of light \(\frac {1}{z}\)-solid prefixes (or suffixes) and prove a decomposition property for them that is similar to Observation 6.6 (but more complex).

  • Section 6.3.3 (corresponds to Section 5.4) contains an algorithm for generating light \(\frac {1}{z^{\prime }}\)-solid prefixes of X that are simultaneously \(\frac {1}{z}\)-solid prefixes of Y. Intuitively, light solid prefixes of a given length k ≤ n can be obtained from light solid prefixes of any length smaller than k by extending them with any character. This gives O(nλ) lists of solid prefixes to be merged by probabilities, which multiplies the complexity by \(O(\log (n\lambda )) = O(\log \log z + \log \lambda )\); a toy illustration of such a heap-based merge follows this list.

  • Section 6.3.4 (corresponds to Section 5.5) shows how to compute a solution based on sorted lists of common solid prefixes and suffixes of lengths summing up to n.

  • Section 6.3.5 (corresponds to Section 5.6) implements the meet-in-the-middle approach. Because of the more complicated decomposition property this part of the algorithm is the most complex. It consists of \(O(\log n)=O(\log \log z)\) phases.
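
The heap-based merging mentioned above can be illustrated in a few lines of Python; heapq.merge is only a stand-in for the merging step of the algorithm (which operates on packed representations), not an implementation of it.

import heapq

# Merging t sorted lists with a binary heap costs O(log t) per produced
# element; with t = O(n*lambda) lists this is the
# O(log(n*lambda)) = O(log log z + log lambda) factor mentioned above.
def merge_sorted_lists(lists, key=lambda e: e):
    return list(heapq.merge(*lists, key=key))

# merge_sorted_lists([[1, 4], [2, 3], [0, 5]]) == [0, 1, 2, 3, 4, 5]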

6.3.2 Combinatorics of Light Solid Prefixes (Counterpart of Section 5.3)

We define a light \(\frac {1}{z}\)-solid prefix of a weighted sequence X as a \(\frac {1}{z}\)-solid prefix S of length k such that k = 0 or S[k]≠H(X)[k].

We say that a string P is a maximal \(\frac {1}{z}\)-solid prefix of a weighted sequence X if P is a \(\frac {1}{z}\)-solid prefix of X and no string \(P^{\prime } = Ps\), for s ∈Σ, is a \(\frac {1}{z}\)-solid prefix of X. Maximal solid prefixes have the following simple property, originally due to Amir et al. [1].

Fact 6.7 ([1])

A weighted sequence has at most z maximal \(\frac {1}{z}\)-solid prefixes, that is, \(\frac {1}{z}\)-solid prefixes which cannot be extended to any longer \(\frac {1}{z}\)-solid prefix.

Fact 6.7 lets us bound the number of light solid prefixes.

Fact 6.8

A weighted sequence has at most z different light \(\frac {1}{z}\)-solid prefixes.

Proof

We show a pair of inverse mappings between the set of maximal \(\frac {1}{z}\)-solid prefixes of a weighted sequence X and the set of light \(\frac {1}{z}\)-solid prefixes of X. If P is a maximal \(\frac {1}{z}\)-solid prefix of X, then we obtain a light \(\frac {1}{z}\)-solid prefix by removing all trailing letters of P that are heavy letters at the corresponding positions in X. For the inverse mapping, we extend each light \(\frac {1}{z}\)-solid prefix by heavy letters as long as the resulting prefix is \(\frac {1}{z}\)-solid; the outcome is maximal because if the heavy letter cannot extend a \(\frac {1}{z}\)-solid prefix, then no other letter can. The two mappings are mutually inverse, so the claim follows from Fact 6.7. □
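
For illustration, the two mappings can be written out in a few lines of Python (the function names are ours; X is a list of dictionaries mapping letters to probabilities, and heavy is its precomputed heavy string H(X)):

def prob(S, X):
    """Probability that the weighted sequence X matches the string S
    on its first |S| positions."""
    p = 1.0
    for i, c in enumerate(S):
        p *= X[i].get(c, 0.0)
    return p

def light_from_maximal(P, heavy):
    """Strip trailing letters of P that are heavy at their positions in X."""
    k = len(P)
    while k > 0 and P[k - 1] == heavy[k - 1]:
        k -= 1
    return P[:k]

def maximal_from_light(P, X, heavy, z):
    """Extend a light (1/z)-solid prefix by heavy letters while it stays
    (1/z)-solid; if the heavy letter fails, every letter fails, so the
    result is a maximal (1/z)-solid prefix."""
    S = P
    while len(S) < len(X) and prob(S + heavy[len(S)], X) >= 1 / z:
        S += heavy[len(S)]
    return S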

With this notion and its symmetric counterpart, light \(\frac {1}{z}\)-solid suffixes, we can state a stronger version of Observation 6.6. Note that this is where the dissimilarity is crucial.

Lemma 6.9

Consider an instance \((X,Y,\frac {1}{z})\) of the SDWC problem, and let \(z_{\ell }, z_{r} \ge 1\) be real numbers such that \(z_{\ell } z_{r}\ge z\). If \(X \approx _{\frac {1}{z}} Y\), then every consensus string S can be decomposed into S = LcCR such that the following conditions hold for some U,V ∈{X,Y }:

  • L is a light \(\frac {1}{z_{\ell }}\)-solid prefix of U,

  • c is a single letter,

  • all letters of C are heavy in V,

  • R is a light \(\frac {1}{z_{r}}\)-solid suffix of V.

Proof

We set L as the longest proper prefix of S which is a \(\frac {1}{z_{\ell }}\)-solid prefix of both X and Y, and we define k := |L|. Note that L is a light \(\frac {1}{z_{\ell }}\)-solid prefix of X or Y, because H(X) and H(Y ) are dissimilar. If k = n − 1, we conclude the proof by setting c = S[n] and letting C and R be empty strings.

Otherwise, we have \(\P (S[1\mathinner {..} k + 1],V[1\mathinner {..} k + 1])<\frac {1}{z_{\ell }}\) for V = X or V = Y. Since \(\P (S,V)\ge \frac {1}{z}\) and \(z_{\ell } z_{r}\ge z\), this implies \(\P (S[k + 2\mathinner {..} n],V[k + 2\mathinner {..} n])\ge \frac {1}{z_{r}}\), i.e., that S[k + 2..n] is a \(\frac {1}{z_{r}}\)-solid suffix of V. We set c = S[k + 1], C as the longest prefix of S[k + 2..n] composed of letters heavy in V, and R as the remaining suffix of S[k + 2..n]. Then R is clearly a light \(\frac {1}{z_{r}}\)-solid suffix of V. □

6.3.3 Generating Solid Prefixes (Counterpart of Section 5.4)

We say that a string P is a common \(\frac {1}{z}\)-solid prefix (suffix) of weighted sequences X and Y if it is a \(\frac {1}{z}\)-solid prefix (suffix) of both X and Y. Let \((X,Y,\frac {1}{z})\) be an instance of the SDWC problem. A standard representation of a common \(\frac {1}{z}\)-solid prefix P of length k of X and Y is a triple (P,p1,p2) such that p1 and p2 are the probabilities p1 = P(P,X[1..k]) and p2 = P(P,Y [1..k]).

If σ is constant, the string P can be directly represented using \(O(\log z)\) bits due to \(|P|=O(\log z)\). Otherwise, P is written using a variable-length encoding so that a letter that occurs at a given position with probability p in X has a representation consisting of \(O(\log \frac {1}{p})\) bits. For every position i, the encoding can be constructed by assigning consecutive integer identifiers to letters according to the non-increasing order of \(\pi _{i}^{(X)}(c)\). Note that an instance of the SDWC problem provides us with the desired sorted order of letters. This lets us store a \(\frac {1}{z}\)-solid prefix using \(O(\log z)\) bits: we concatenate the variable-length representations of its letters and store a bit mask of size \(O(\log z)\) that marks the delimiters between the representations of single letters.

In either case, our assumptions on the model of computation imply that the standard representation takes constant space. Moreover, constant time is sufficient to extend a common \(\frac {1}{z}\)-solid prefix by a given letter. An analogous representation can also be used to store common \(\frac {1}{z}\)-solid suffixes.
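
The following Python sketch illustrates the rank-based encoding (an illustration only; a column is assumed to be a dictionary from letters to probabilities, and the names are ours). Since the probabilities at a position sum to at most one, the letter of rank r has probability at most 1/r, so its binary identifier uses O(log(1/p)) bits.

def letter_ranks(column):
    """Integer identifiers assigned by non-increasing probability; the
    letter of rank r has probability at most 1/r."""
    order = sorted(column, key=column.get, reverse=True)
    return {c: r for r, c in enumerate(order, start=1)}

def pack_prefix(S, X):
    """Concatenate the variable-length codes of the letters of a solid prefix S
    of X, plus a bit mask marking where each code starts; for a (1/z)-solid
    prefix both strings have O(log z) bits."""
    bits, mask = "", ""
    for i, c in enumerate(S):
        code = format(letter_ranks(X[i])[c], "b")
        bits += code
        mask += "1" + "0" * (len(code) - 1)
    return bits, mask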

The following observation describes longer light solid prefixes in terms of shorter ones.

Observation 6.10

Let P be a non-empty light \(\frac {1}{z}\)-solid prefix of X. If one removes its last letter and then removes all the trailing letters which are heavy at the respective positions in X, then a shorter light \(\frac {1}{z}\)-solid prefix of X is obtained.

We build upon Observation 6.10 to derive an efficient algorithm for generating light solid prefixes.

Lemma 6.11

Let \((X,Y,\frac {1}{z})\) be an instance of the SDWC problem and let \(z^{\prime }\le z\). The standard representations of all common \(\frac {1}{z}\)-solid prefixes of X and Y that are light \(\frac {1}{z^{\prime }}\)-solid prefixes of X, sorted first by their length and then by the probabilities in X, can be generated in \(O(z^{\prime } (\log \log z+\log \lambda )+\log ^{2} z)\) time.

Proof

For k ∈{0,…,n}, let Bk be a list of the requested solid prefixes of length k sorted by their probabilities p1 in X. Fact 6.8 guarantees that \({\sum }_{k = 0}^{n} |\textbf {B}_{k}| \le z^{\prime }\).

We compute the lists Bk for subsequent lengths k. We start with B0 containing the empty string with its probabilities p1 = p2 = 1. To compute Bk for k > 0, we use Observation 6.10. For a given i ∈{0,…,k − 1}, we iterate over all elements (P,p1,p2) of Bi ordered by the non-increasing probabilities p1 and try to extend each of them by the heavy letters in X at positions i + 1,…,k − 1 and by a letter s at position k. We process the letters s in the order of non-increasing \({\pi }_{k}^{(X)}(s)\), ignoring the first one (H(X)[k]) and stopping as soon as we do not get a \(\frac {1}{z^{\prime }}\)-solid prefix of X.

More precisely, with \(X^{\prime }=\textbf {H}(X)\), we compute

$$p^{\prime}_{1}:=p_{1} \cdot \overset{k-1}{\underset{j=i + 1}{\prod}} \pi^{(X)}_{j}(X^{\prime}[j]) \cdot \pi^{(X)}_{k}(s)\quad\text{and}\quad p^{\prime}_{2}:=p_{2} \cdot \overset{k-1}{\underset{j=i + 1}{\prod}} \pi^{(Y)}_{j}(X^{\prime}[j]) \cdot \pi^{(Y)}_{k}(s),$$

check if \(p^{\prime }_{1} \ge \frac {1}{z^{\prime }}\) and \(p^{\prime }_{2} \ge \frac {1}{z}\), and, if so, insert \((P \cdot X^{\prime }[i + 1\mathinner {..} k-1] \cdot s,p^{\prime }_{1},p^{\prime }_{2})\) at the beginning of a new list Li,s, indexed both by the letter s and by the length i of the shorter light \(\frac {1}{z^{\prime }}\)-solid prefix. When we encounter an element (P,p1,p2) of Bi and a letter s for which \(p^{\prime }_{1} < \frac {1}{z^{\prime }}\), we proceed to the next element of Bi. If this happens for the heaviest letter s ≠ H(X)[k], we stop considering the current list Bi and proceed to Bi− 1. The final step consists in merging all the kλ lists Li,s in the order of probabilities in X; the result is Bk.

Let us analyse the time complexity of the k-th step of the algorithm. If an element (P,p1,p2) and letter s that we consider satisfy \(p^{\prime }_{1} \ge \frac {1}{z^{\prime }}\), this accounts for a new light \(\frac {1}{z^{\prime }}\)-solid prefix of X. Hence, in total (over all steps) we consider \(O(z^{\prime })\) such elements. Note that some of these elements may be discarded due to the condition on \(p^{\prime }_{2}\).

For each inspected element (P,p1,p2), we also consider at most one letter s for which \(p^{\prime }_{1}\) is not sufficiently large. If this is not the only letter considered for this element, such a candidate can be charged to the previously considered letter. The opposite situation may happen once for each list Bi, which may give O(k) additional operations in the k-th step, \(O(\log ^{2} z)\) in total.

Thanks to the order in which the lists are considered, we can store products of probabilities \({\prod }_{j=i + 1}^{k-1} \pi ^{(X)}_{j}(X^{\prime }[j])\), \({\prod }_{j=i + 1}^{k-1}\pi ^{(Y)}_{j}(X^{\prime }[j])\) and factors \(X^{\prime }[i + 1\mathinner {..} k-1]\) so that the representation of each subsequent light \(\frac {1}{z^{\prime }}\)-solid prefix of length k is computed in O(1) time. Finally, the merging step in the k-th phase takes \(O(|\textbf {B}_{k}|\log (k\lambda )) = O(|\textbf {B}_{k}| (\log \log z+\log \lambda ))\) time if a binary heap of O(kλ) elements is used.

The time complexity of the whole algorithm is

$$O\left( \log^{2} z + {\sum}_{k = 1}^{n}|\textbf{B}_{k}| (\log \log z+\log \lambda)\right).$$

By the already mentioned Fact 6.8, this is \(O(\log ^{2} z+z^{\prime } (\log \log z+\log \lambda ))\). □
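
For illustration, a much-simplified Python version of this generation procedure is sketched below; it follows the extensions of Observation 6.10 but drops the heap-based merging, the packed representations and the charging argument, so it does not achieve the running time of Lemma 6.11 (all names and the input format, lists of letter-to-probability dictionaries, are ours).

def light_common_prefixes(X, Y, z_prime, z):
    """B[k] collects (string, p1, p2) for every common (1/z)-solid prefix of
    X and Y of length k that is a light (1/z')-solid prefix of X.
    A plain BFS over the extensions of Observation 6.10."""
    n = len(X)
    heavy = [max(col, key=col.get) for col in X]
    B = [[("", 1.0, 1.0)]] + [[] for _ in range(n)]
    for k in range(1, n + 1):
        for i in range(k):
            for (P, p1, p2) in B[i]:
                # extend by heavy letters of X at positions i+1, ..., k-1
                q1, q2, mid = p1, p2, ""
                for j in range(i, k - 1):
                    q1 *= X[j][heavy[j]]
                    q2 *= Y[j].get(heavy[j], 0.0)
                    mid += heavy[j]
                # try non-heavy letters s at position k, heaviest first
                for s in sorted(X[k - 1], key=X[k - 1].get, reverse=True):
                    if s == heavy[k - 1]:
                        continue        # a light prefix must not end with H(X)[k]
                    r1 = q1 * X[k - 1][s]
                    if r1 < 1 / z_prime:
                        break           # letters are sorted; no later s can succeed
                    r2 = q2 * Y[k - 1].get(s, 0.0)
                    if r2 >= 1 / z:
                        B[k].append((P + mid + s, r1, r2))
    return B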

6.3.4 Merging Solid Prefixes with Suffixes (Counterpart of Section 5.5)

Next, we provide an analogue of Lemma 5.6.

Lemma 6.12

Let L and R be lists containing, for some k ∈{0,…,n}, standard representations of common \(\frac {1}{z}\)-solid prefixes of length k and common \(\frac {1}{z}\)-solid suffixes of length n − k of X and Y, respectively. If the elements of the lists are sorted according to non-decreasing probabilities in X and Y, respectively, one can check in O(|L| + |R|) time whether the concatenation of any \(\frac {1}{z}\)-solid prefix from L and \(\frac {1}{z}\)-solid suffix from R yields a consensus string S for X and Y.

Proof

First, we filter out dominated elements of the lists, i.e., elements (P,p1,p2) such that there exists another element \((P^{\prime },p_{1}^{\prime },p_{2}^{\prime })\) with \(p_{1}^{\prime } \ge p_{1}\) and \(p_{2}^{\prime } \ge p_{2}\). This can be done in linear time. After this operation, the list R is ordered according to non-increasing probabilities in X, so we reverse the list so that now both lists are ordered with respect to the non-decreasing probabilities in X.

For every element (P,p1,p2) of L, we compute the leftmost element \((P^{\prime },p^{\prime }_{1},p^{\prime }_{2})\) of R such that \(p_{1} p^{\prime }_{1} \ge \frac {1}{z}\). This element maximises \(p^{\prime }_{2}\) among all elements satisfying the latter condition. Hence, it suffices to check if \(p_{2} p^{\prime }_{2} \ge \frac {1}{z}\), and if so, report the result \(S=PP^{\prime }\). As the lists are ordered by p1 and \(p^{\prime }_{1}\), respectively, all such elements can be computed in O(|L| + |R|) total time. □
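
A Python sketch of this check is given below (an illustration only; entries are assumed to be triples (string, p1, p2) with p1 the probability in X and p2 the probability in Y, and the function names are ours).

def can_join(L, R, z):
    """Decide whether some prefix from L and some suffix from R concatenate
    into a consensus string.  L is sorted by non-decreasing p1,
    R by non-decreasing p2, as in Lemma 6.12."""
    def pareto(items, other):
        # Remove entries dominated in (p1, p2): scan from the strongest end
        # of the sorted coordinate and keep an entry only if it improves the
        # other coordinate.
        kept, best = [], float("-inf")
        for e in reversed(items):
            if e[other] > best:
                kept.append(e)
                best = e[other]
        kept.reverse()
        return kept

    L = pareto(L, other=2)
    R = pareto(R, other=1)
    R.reverse()                      # now R is sorted by non-decreasing p1

    j = len(R) - 1                   # pointer into R, moves only leftwards
    for P, p1, p2 in L:              # p1 only grows along L
        while j >= 0 and p1 * R[j][1] >= 1.0 / z:
            j -= 1                   # afterwards R[j+1..] are the feasible entries
        if j + 1 < len(R):
            Pp, q1, q2 = R[j + 1]    # feasible entry maximising p2
            if p2 * q2 >= 1.0 / z:
                return P + Pp        # consensus string found
    return None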

6.3.5 Merge-in-the-Middle Implementation (Counterpart of Section 5.6)

In this section, we solve the SDWC problem based on Lemma 6.9. We generate all candidates for Lc and R using Lemma 6.11, and we apply a divide-and-conquer procedure to fill in the middle part C. Our procedure works for fixed U,V ∈{X,Y }; the algorithm repeats it for all four choices.

Let Li denote a list of all common \(\frac {1}{z}\)-solid prefixes of X and Y obtained by extending a light \(\frac {\sqrt {\lambda }}{\sqrt {z}}\)-solid prefix of U of length i − 1 by a single letter s at position i, and let Ri denote a list of all common \(\frac {1}{z}\)-solid suffixes of X and Y of length n − i + 1 that are light \(\frac {1}{\sqrt {z\lambda }}\)-solid suffixes of V. We assume that the lists Li and Ri are sorted according to the probabilities in U and V, respectively. We assume that Ln+1 = ∅, whereas Rn+1 contains only a representation of an empty string.

The following lemma shows how to compute the lists Li and Ri and bounds their total size. In the case of σ = O(1), it is a direct consequence of Lemma 6.11. Otherwise, one needs to exercise caution when computing the lists Li.

Lemma 6.13

The total size of the lists Li and Ri for i ∈{1,…,n + 1} is \(O(\sqrt {z \lambda })\); they can be computed in \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\) time.

Proof

The \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\)-time computation of the lists Ri follows directly from Lemma 6.11. As for the lists Li, we first compute in \(O\left (\frac {\sqrt {z}}{\sqrt {\lambda }}(\log \log z+\log \lambda )\right )\) time the lists of all light \(\frac {\sqrt {\lambda }}{\sqrt {z}}\)-solid prefixes of U, sorted by the lengths of strings and then by the probabilities in U, again using Lemma 6.11. Then for each length i − 1 and for each letter s at the i-th position, we extend all these prefixes by a single letter. This way we obtain λ lists for a given i − 1 that can be merged according to the probabilities in U to form the list Li. Generation of the auxiliary lists takes \(O\left (\frac {\sqrt {z}}{\sqrt {\lambda }}\cdot \lambda \right )=O(\sqrt {z\lambda })\) time in total, and merging them using a binary heap takes \(O(\sqrt {z\lambda }\log \lambda )\) time. Altogether, this yields an \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\)-time algorithm. □

Let \(\textbf {L}^{*}_{a,b}\) be a list of common \(\frac {1}{z}\)-solid prefixes of X and Y of length b obtained by taking a common \(\frac {1}{z}\)-solid prefix from Li for some i ∈{a,…,b} and extending it by b − i letters that are heavy at the respective positions in V. Similarly, \(\textbf {R}^{*}_{a,b}\) is a list of common \(\frac {1}{z}\)-solid suffixes of length n − a + 1 obtained by taking a common \(\frac {1}{z}\)-solid suffix from Ri for some i ∈{a,…,b} and prepending it by i − a letters that are heavy in V. Again, we assume that each of the lists \(\textbf {L}^{*}_{a,b}\) and \(\textbf {R}^{*}_{a,b}\) is sorted according to the probabilities in U and V, respectively.

A basic interval is an interval [a,b] represented by its endpoints 1 ≤ a ≤ b ≤ n + 1 such that \(2^{j}\) divides a − 1 and \(b=\min (n + 1,a + 2^{j}-1)\) for some integer j called the layer of the interval. For every \(j = 0,\ldots ,\left \lceil {\log (n + 1)} \right \rceil \), there are \({\Theta }\left (\frac {n}{2^{j}}\right )\) basic intervals in the j-th layer and they are pairwise disjoint.

Example 6.14

For n = 7, the basic intervals are [1,1], …, [8,8], [1,2], [3,4], [5,6], [7,8], [1,4], [5,8], [1,8].
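
A few lines of Python enumerate the basic intervals for any n and reproduce this example (a toy illustration only; the function name is ours):

from math import ceil, log2

def basic_intervals(n):
    """All basic intervals [a, b]: 2**j divides a - 1 and
    b = min(n + 1, a + 2**j - 1), for layers j = 0, ..., ceil(log2(n + 1))."""
    out = []
    for j in range(ceil(log2(n + 1)) + 1):
        step = 1 << j
        out.extend((a, min(n + 1, a + step - 1)) for a in range(1, n + 2, step))
    return out

# basic_intervals(7) lists [1,1], ..., [8,8], [1,2], [3,4], [5,6], [7,8],
# [1,4], [5,8] and finally [1,8], matching Example 6.14.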

Lemma 6.15

The total size of the lists \(\textbf {L}^{*}_{a,b}\) and \(\textbf {R}^{*}_{a,b}\) for all basic intervals [a,b] is \(O(\sqrt {z\lambda }\log \log z)\) and they can all be constructed in \(O(\sqrt {z\lambda }(\log \log z+\log \lambda ))\) time.

Proof

We compute all the lists \(\textbf {L}^{*}_{a,b}\) and \(\textbf {R}^{*}_{a,b}\) for basic intervals [a,b] of subsequent layers \(j = 0,\ldots ,\left \lceil {\log (n + 1)} \right \rceil \). For j = 0, we have \(\textbf {L}^{*}_{a,a} = \textbf {L}_{a}\) and \(\textbf {R}^{*}_{a,a} = \textbf {R}_{a}\). All these lists can be computed in \(O(\sqrt {z\lambda }(\log \log z+\log \lambda ))\) time via Lemma 6.13.

Suppose that we wish to compute \(\textbf {L}^{*}_{a,b}\) for a < b at layer j (the computation of \(\textbf {R}^{*}_{a,b}\) is symmetric). Take \(c = a + 2^{j-1} - 1\). Let us iterate through all the elements (P,p1,p2) of the list \(\textbf {L}^{*}_{a,c}\), extend each string P by H(V )[c + 1..b], and multiply the probabilities p1 and p2 by

$$\overset{b}{\underset{i=c + 1}{\prod}} \pi^{(X)}_{i}(\textbf{H}(V)[i]) \quad\text{and}\quad \overset{b}{\underset{i=c + 1}{\prod}} \pi^{(Y)}_{i}(\textbf{H} (V)[i]),$$

respectively. If a common \(\frac {1}{z}\)-solid prefix is obtained, it is inserted at the end of an auxiliary list L. The resulting list L is merged with \(\textbf {L}^{*}_{c + 1,b}\) according to the probabilities in U; the result is \(\textbf {L}^{*}_{a,b}\).

Thus, we can compute \(\textbf {L}^{*}_{a,b}\) in time proportional to the sum of lengths of \(\textbf {L}^{*}_{a,c}\) and \(\textbf {L}^{*}_{c + 1,b}\). (Note that the necessary products of probabilities can be computed in \(O(n) = O(\log z)\) total time.) For every \(j = 1,\ldots ,\left \lceil {\log n} \right \rceil \), the total length of the lists from the j-th layer does not exceed the total length of the lists from the (j − 1)-th layer. By Lemma 6.13, the lists at the 0-th layer have size \(O(\sqrt {z\lambda })\). The conclusion follows from the fact that \(\log n = O(\log \log z)\). □

Finally, we are ready to apply a divide-and-conquer approach to solve the SDWC problem:

Lemma 6.16

The SDWC problem can be solved in \(O(\sqrt {z\lambda } (\log \log z + \log \lambda ))\) time.

Proof

The algorithm follows Lemma 6.9, considering all choices of U and V. For each of them, we proceed as follows.

First, we compute the lists Li, Ri for all i = 1,…,n and \(\textbf {L}^{*}_{a,b}\), \(\textbf {R}^{*}_{a,b}\) for all basic intervals. By Lemmas 6.13 and 6.15, this takes \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\) time.

Note that, in order to find out if there is a feasible solution, it suffices to attempt joining a common \(\frac {1}{z}\)-solid prefix from Lj with a common \(\frac {1}{z}\)-solid suffix from Rk, for some indices 1 ≤ j < k ≤ n + 1, by heavy letters of V at positions j + 1,…,k − 1. We use a recursive routine that searches for such a pair of indices j,k within a basic interval [a,b]; every basic interval of positive length can be decomposed into two basic subintervals [a,c] and [c + 1,b]. Then either j ≤ c < k, or both indices j, k belong to the same subinterval [a,c] or [c + 1,b]. To check the first case, we apply the algorithm of Lemma 6.12 to \(L = \textbf {L}^{*}_{a,c}\) and \(R = \textbf {R}^{*}_{c + 1,b}\). The remaining two cases are solved by recursive calls for the subintervals. The recursive routine is called first for the basic interval [1,n + 1].

The computations performed by the routine for the basic intervals at the j-th level take time at most proportional to the total size of the lists \(\textbf {L}^{*}_{a,b}\), \(\textbf {R}^{*}_{a,b}\) at the (j − 1)-th level. Lemma 6.15 shows that the total size of the lists at all levels is \(O(\sqrt {z\lambda } \log \log z)\). Consequently, the whole recursive procedure works in \(O(\sqrt {z\lambda } \log \log z)\) time. Together with the computation of the lists, this gives \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\) time in total. □

Lemma 6.16 combined with Lemma 6.5 provides an efficient solution for General Weighted Pattern Matching. It also gives a solution to Weighted Consensus (which is a special case of GWPM with n = m). Note that \(\lambda \log z = O(\sqrt {z\lambda } \log z)\) due to z ≥ λ.

Theorem 1.4

The General Weighted Pattern Matching problem can be solved in \(O(n\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time, and the Weighted Consensus problem can be solved in \(O(R +\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time.

7 Conditional Hardness of GWPM

The following reduction from Multichoice Knapsack to Weighted Consensus immediately yields that any significant improvement in the dependence on z and λ in the running time of our algorithm would lead to breaking long-standing barriers for special cases of Multichoice Knapsack.

Lemma 7.1

Given an instance I of the Multichoice Knapsack problem with n classes C1,…,Cn of maximum size λ, in linear time one can construct an equivalent instance of the Weighted Consensus problem with \(z=O({\prod }_{i = 1}^{n}|C_{i}|)\) and sequences of length O(n) over an alphabet of size λ.

Proof

We construct a pair of weighted sequences X,Y of length n over alphabet Σ = {1,…,λ}. Let \(C_{i} = \{c_{i,1},\ldots ,c_{i,|C_{i}|}\}\). Intuitively, choosing letter j at position i will correspond to taking ci,j into the solution S.

Without loss of generality, we assume that weights and values are non-negative. Otherwise, we may subtract \(v_{\min }(i)\) from v(ci,j) and \(w_{\min }(i)\) from w(ci,j) for each item ci,j, as well as \(V_{\min }\) from V and \(W_{\min }\) from W.

We set M to the smallest power of two such that \(M\ge \max (n, V, W)\). For j ∈{1,…,|Ci|}, we set:

$$p_{i}^{(X)}(j) = -\frac{\left\lceil {M\log|C_{i}|} \right\rceil + v(c_{i,j})}{M}, \quad p_{i}^{(Y)}(j)=-\frac{\left\lceil {M\log|C_{i}|} \right\rceil +w(c_{i,j})}{M}.$$

We then define \(\log \pi _{i}^{(X)}(j) = p_{i}^{(X)}(j)\) and \(\log \pi _{i}^{(Y)}(j) = p_{i}^{(Y)}(j)\) for j ∈Σ. Moreover, we set

$$\log z_{X} = \frac{1}{M} \left( V + \sum\limits_{i = 1}^{n}\left\lceil {M\log|C_{i}|} \right\rceil\right), \quad \log z_{Y} = \frac{1}{M}\left( W + \sum\limits_{i = 1}^{n}\left\lceil {M\log|C_{i}|} \right\rceil\right).$$

The following claim holds.

Claim 3

\({\sum }_{j = 1}^{|C_{i}|} \pi _{i}^{(X)}(j)\le 1\), \({\sum }_{j = 1}^{|C_{i}|} \pi _{i}^{(Y)}(j)\le 1\), and \(\max (z_{X},z_{Y}) \le 4{\prod }_{i = 1}^{n}|C_{i}|\).

Proof

As for the first inequality, we have:

$$\sum\limits_{j = 1}^{|C_{i}|} \pi_{i}^{(X)}(j)\ =\ \sum\limits_{j = 1}^{|C_{i}|} 2^{-\left\lceil {M\log|C_{i}|} \right\rceil/M} 2^{-v(c_{i,j})/M}\ \le\ \sum\limits_{j = 1}^{|C_{i}|} 2^{-\log |C_{i}|}\ =\ \sum\limits_{j = 1}^{|C_{i}|} \frac{1}{|C_{i}|} \le 1. $$

The second inequality is analogous. Finally, by the choice of M, we have

$$\max(z_{X},z_{Y})\ \le\ 2^{\frac{1}{M}(\max(V,W)+n)}\overset{n}{\underset{i = 1}{\prod}}|C_{i}|\ \le\ 4\overset{n}{\underset{i = 1}{\prod}}|C_{i}|.$$

This way, for a string P of length n, we have

$$\begin{array}{@{}rcl@{}} \log \P(P,X)=-\frac{1}{M}\left( \sum\limits_{i = 1}^{n}\left\lceil {M\log|C_{i}|} \right\rceil+\sum\limits_{i = 1}^{n} v(c_{i,P[i]})\right) \ge -\log z_{X} \\ \Longleftrightarrow \sum\limits_{i = 1}^{n} v(c_{i,P[i]}) \le V, \end{array} $$
$$\begin{array}{@{}rcl@{}} \log \P(P,Y)=-\frac{1}{M}\left( \sum\limits_{i = 1}^{n}\left\lceil {M\log|C_{i}|} \right\rceil+\sum\limits_{i = 1}^{n} w(c_{i,P[i]})\right) \ge -\log z_{Y} \\ \Longleftrightarrow \sum\limits_{i = 1}^{n} w(c_{i,P[i]}) \le W. \end{array} $$

Thus, P is a solution to the constructed instance of the Weighted Consensus problem with two threshold probabilities, \(\frac {1}{z_{X}}\) and \(\frac {1}{z_{Y}}\), if and only if S = {ci,j : P[i] = j} is a solution to the underlying instance of the Multichoice Knapsack problem. To have a single threshold \(z=\max (z_{X},z_{Y})\), we append an additional position n + 1 with symbol 1 only, with \(p_{n + 1}^{(X)}(1)= 0\) and \(p_{n + 1}^{(Y)}(1)=\log z_{Y} - \log z_{X}\) provided that zXzY, and symmetrically otherwise.

If one wants to make sure that the probabilities at each position sum up to exactly one, two further letters can be introduced, one of which gathers the remaining probability in X and has probability 0 in Y, and the other gathers the remaining probability in Y, and has probability 0 in X. □
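
A compact Python sketch of the construction follows; it assumes that the classes are given as lists of (value, weight) pairs already shifted to be non-negative, and it stops at the two-threshold instance, omitting the final single-threshold and padding steps described above (all names are ours).

from math import ceil, log2

def knapsack_to_weighted_consensus(classes, V, W):
    """Position i of X and Y gets one letter per item of C_i; choosing letter j
    corresponds to taking the j-th item of the class."""
    n = len(classes)
    M = 1
    while M < max(n, V, W):
        M *= 2                                  # smallest power of two >= max(n, V, W)
    X, Y = [], []
    log_zX, log_zY = V / M, W / M
    for C in classes:
        t = ceil(M * log2(len(C)))              # per-position normalising term
        X.append({j: 2 ** (-(t + v) / M) for j, (v, w) in enumerate(C)})
        Y.append({j: 2 ** (-(t + w) / M) for j, (v, w) in enumerate(C)})
        log_zX += t / M
        log_zY += t / M
    return X, Y, 2 ** log_zX, 2 ** log_zY       # thresholds are 1/z_X and 1/z_Y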

For completeness, let us recall the folklore reductions that show that Subset Sum and 3-Sum are special cases of Multichoice Knapsack. To express an instance of Subset Sum with integers a1,…,an and threshold R as an instance of Multichoice Knapsack, we introduce n classes of two items each, which correspond to taking and omitting the respective elements. The first item has value ai and weight − ai, while for the other these are both 0. The thresholds are V = R and W = −R.

Similarly, given an instance of 3-Sum with classes a1,1,…,a1,λ, a2,1,…,a2,λ, and a3,1,…,a3,λ, we can create an instance of Multichoice Knapsack with the same three classes of items with values ai,j and weights − ai,j. The thresholds are V = W = 0.
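
In code, both folklore reductions are one-liners; the encoding of a Multichoice Knapsack instance as a list of (value, weight) classes together with thresholds V and W is our own illustrative convention.

def subset_sum_as_knapsack(a, R):
    """Subset Sum (integers a_1..a_n, target R): class i holds 'take a_i'
    = (a_i, -a_i) and 'skip a_i' = (0, 0); thresholds V = R and W = -R
    force the selected values to sum to exactly R."""
    return [[(ai, -ai), (0, 0)] for ai in a], R, -R

def three_sum_as_knapsack(A1, A2, A3):
    """3-Sum: three classes with items (x, -x) and thresholds V = W = 0,
    so the chosen triple must sum to exactly 0."""
    return [[(x, -x) for x in A] for A in (A1, A2, A3)], 0, 0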

Theorem 1.6

Weighted Consensus is NP-hard and cannot be solved in:

  1. \(O(z^{\varepsilon })\) time for every ε > 0, unless the exponential time hypothesis (ETH) fails;

  2. \(O(z^{0.5-\varepsilon })\) time for some ε > 0, unless there is an \(O(2^{(0.5-\varepsilon )n})\)-time algorithm for the Subset Sum problem;

  3. \(\tilde {O}(R+z^{0.5}\lambda ^{0.5-\varepsilon })\) time for some ε > 0 and for n = O(1), unless the 3-Sum conjecture fails.

Proof

We use Lemma 7.1 to derive algorithms for the Multichoice Knapsack problem based on hypothetical solutions for Weighted Consensus. Subset Sum is a special case of Multichoice Knapsack with λ = 2, i.e., \({\prod }_{i}|C_{i}|= 2^{n}\). Hence, an \(O(z^{o(1)})\)-time solution for Weighted Consensus would yield an \(O(2^{o(n)})\)-time algorithm for Subset Sum, which contradicts ETH by the results of Etscheid et al. [9] and Gurari [11]. Similarly, an \(O(z^{0.5-\varepsilon })\)-time solution for Weighted Consensus would yield an \(O(2^{(0.5-\varepsilon )n})\)-time algorithm for Subset Sum. Moreover, 3-Sum is a special case of Multichoice Knapsack with n = 3 and \({\prod }_{i}|C_{i}|=\lambda ^{3}\). Hence, an \(\tilde {O}(R+z^{0.5}\lambda ^{0.5-\varepsilon })\)-time solution for Weighted Consensus with n = O(1) yields an \(\tilde {O}(\lambda + \lambda ^{1.5 + 0.5-\varepsilon })=\tilde {O}(\lambda ^{2-\varepsilon })\)-time algorithm for 3-Sum. □

Nevertheless, it might still be possible to improve the dependence on n in the GWPM problem. For example, one may hope to achieve \(\tilde {O}(nz^{0.5-\varepsilon }+z^{0.5})\) time for λ = O(1).

8 Multivariate Analysis of Multichoice Knapsack and GWPM

In Section 5, we gave an \(O(N+a^{0.5}\lambda ^{0.5}\log A)\)-time algorithm for the Multichoice Knapsack problem. Improvement of either exponent to 0.5 − ε would result in a breakthrough for the Subset Sum and 3-Sum problems, respectively. Nevertheless, this does not refute the existence of faster algorithms for some particular values (a,λ) other than those emerging from instances of Subset Sum or 3-Sum. Indeed, in this section we show an algorithm that is superior if \(\frac {\log a}{\log \lambda }\) is a constant other than an odd integer. We also argue that it is optimal (up to lower order terms) for every constant \(\frac {\log a}{\log \lambda }\) unless the k-Sum conjecture fails.

We analyse the running times of algorithms for the Multichoice Knapsack problem expressed as \(O(n^{O(1)}T(a,\lambda ))\) for some function T monotone with respect to both arguments. The algorithm of Theorem 1.3 proves that achieving \(T(a,\lambda )=\sqrt {a\lambda }\) is possible. On the other hand, if we assume that Subset Sum does not admit an \(O(2^{(0.5-\varepsilon )n})\)-time solution, then we immediately get that we cannot have \(T(a,2) = O(a^{0.5-\varepsilon })\) for any ε > 0. Similarly, the 3-Sum conjecture implies that \(T(\lambda ^{3},\lambda ) = O(\lambda ^{2-\varepsilon })\) is impossible. While this already refutes the possibility of having \(T(a,\lambda ) = O(a^{0.5}\lambda ^{0.5-\varepsilon })\) across all arguments (a,λ), such a bound may still hold for some special cases covering an infinite number of arguments. For example, we may potentially achieve \(T(a,\lambda ) = O((a\lambda )^{0.5-\varepsilon }) = O(\lambda ^{1.5-\varepsilon })\) for \(a = \lambda ^{2}\).

Before we prove that this is indeed possible, let us see the consequences of the conjectured hardness of 3-Sum and, in general, (2k − 1)-Sum. For a positive integer k, the (2k − 1)-Sum conjecture refutes \(T(\lambda ^{2k-1},\lambda ) = O(\lambda ^{k-\varepsilon })\). By monotonicity of T with respect to the first argument, we conclude that \(T(\lambda ^{c},\lambda ) = O(\lambda ^{k-\varepsilon })\) is impossible for c ≥ 2k − 1. On the other hand, monotonicity with respect to the second argument shows that \(T(\lambda ^{c},\lambda )=O(\lambda ^{c\frac {k}{2k-1}-\varepsilon })\) is impossible for c ≤ 2k − 1. The lower bounds following from (2k − 1)-Sum and (2k + 1)-Sum turn out to meet at \(c = 2k-1+\frac {1}{k + 1}\); see Figure 1.
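
For concreteness, the meeting point can be obtained by equating the two exponents: the constant exponent k supplied by (2k − 1)-Sum for c ≥ 2k − 1 and the linear exponent \(c\frac {k + 1}{2k + 1}\) supplied by (2k + 1)-Sum for c ≤ 2k + 1:

$$\frac{c(k + 1)}{2k + 1} = k \quad\Longleftrightarrow\quad c = \frac{k(2k + 1)}{k + 1} = \frac{2k^{2}+k}{k + 1} = 2k-1+\frac{1}{k + 1}.$$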

Fig. 1 Illustration of the upper bound (dotted) and lower bound (solid) on \(\log _{\lambda }T(\lambda ^{c},\lambda )\)

Consequently, we have some room between the lower and the upper bound of \(\sqrt {a \lambda }\). In the aforementioned case of \(a = \lambda ^{2}\), the upper bound is \(\lambda ^{\frac {3}{2}}\), compared to the lower bound of \(\lambda ^{\frac {4}{3}-\varepsilon }\). Below, we show that the upper bound can be improved to meet the lower bound. More precisely, we show an algorithm whose running time is \(O(N + (a^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\log \lambda \cdot n^{k})\) for every positive integer k. Note that \(a^{\frac {k + 1}{2k + 1}}+\lambda ^{k} = \lambda ^{c\frac {k + 1}{2k + 1}}+ \lambda ^{k}\), so for 2k − 1 ≤ c ≤ 2k + 1 the running time indeed matches the lower bounds up to the \(n^{k}\) term.

Due to Lemma 5.11, the extra \(n^{k}\) term reduces to \(O((\frac {\log A}{\log \lambda })^{k})\). Finally, we study the complexity of the GWPM problem.

8.1 Algorithm for Multichoice Knapsack

Let us start by discussing the bottleneck of the algorithm of Theorem 1.3 for large λ. The problem is that the size of the classes does not let us partition every choice S into a prefix L and a suffix R with ranks both \(O(\sqrt {A_{V}})\). Lemma 5.3 leaves us with an extra item c between L and R, and in the algorithm we append it to the prefix (while generating \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\)).

We provide a workaround based on reordering of classes. Our goal is to make sure that items with large rank appear only in a few leftmost classes. For this, we guess the classes of the k items with largest rank (in a feasible solution) and move them to the front. Since this depends on the sought feasible solution, we shall actually verify all \(\binom {n}{k}\) possibilities.

Now, our solution considers two cases: For j > k, the reordering lets us assume \(\text{rank} _{v}(c)< \ell ^{\frac {1}{k}}\), so we do not need to consider all items from Cj. For j ≤ k, on the other hand, we exploit the fact that \(|\textbf {L}_{j-1}^{(\ell )}\odot C_{j}|\le \lambda ^{j}\), which is at most \(\lambda ^{k}\).

The underlying combinatorial foundation is formalised as a variant of Lemma 5.3:

Lemma 8.1

Let ℓ and r be positive integers such that \(v(\textbf {L}_{j}[\ell ]) + v(\textbf {R}_{j + 1}[r]) > V\) for every 0 ≤ j ≤ n. Let k ∈{1,…,n} and suppose that S is a choice with v(S) ≤ V such that \(\text{rank} _{v}(S\cap C_{i}) \ge \text{rank} _{v}(S\cap C_{j})\) for i ≤ k < j. There is an index j ∈{1,…,n} and a decomposition S = L ∪{c}∪ R such that \(L\in \textbf {L}_{j-1}^{(\ell )}\), \(R\in \textbf {R}_{j + 1}^{(r)}\), \(c\in C_{j}\), and either \(\text{rank} _{v}(c) < \ell ^{\frac {1}{k}}\) or j ≤ k.

Proof

We claim that the decomposition constructed in the proof of Lemma 5.3 satisfies the extra condition on \(\text{rank} _{v}(c)\) if j > k. Let S = {c1,…,cn} and Si = {c1,…,ci}. Obviously \(\text{rank} _{v}(c_{i}) \ge 1\) for k < i < j and, by the extra assumption, \(\text{rank} _{v}(c_{i}) \ge \text{rank} _{v}(c)\) for 1 ≤ i ≤ k. Hence, Fact 5.2 yields \(\text{rank} _{v}(S_{j-1}) \ge \text{rank} _{v}(c)^{k}\). Simultaneously, we have \(v(S_{j-1}) < v(\textbf {L}_{j-1}[\ell ])\), so \(\text{rank} _{v}(S_{j-1}) < \ell \). Combining these inequalities, we immediately get the claimed bound. □

Theorem 1.7

For every positive integer k = O(1), the Multichoice Knapsack problem can be solved in \(O(N+ {(a^{\frac {k + 1}{2k + 1}}+\lambda ^{k})}\log A (\frac {\log A}{\log \lambda })^{k})\) time.

Proof

As in the proof of Theorem 1.3, we actually provide an algorithm whose running time depends on \(A_{V}\) rather than a. Moreover, Lemmas 5.8 and 5.11 let us assume that \(n=O(\frac {\log A}{\log \lambda })\).

We first guess the k positions where items with the largest ranks \(\text{rank} _{v}\) are present in the solution S and move these positions to the front. This gives \(\binom {n}{k}=O((\frac {\log A}{\log \lambda })^{k})\) possible selections. For each of them, we proceed as follows.

We increment an integer r starting from 1, maintaining \(\ell =\lceil r^{\frac {k}{k + 1}}\rceil \) and all the lists \(\textbf {L}_{j}^{(\ell )}\) and \(\textbf {R}_{j + 1}^{(r)}\) for 0 ≤ j ≤ n, as long as \(v(\textbf {L}_{j}[\ell ]) + v(\textbf {R}_{j + 1}[r]) \le V\) for some j. By Fact 5.4, we stop with \(r=O(A_{V}^{\frac {k + 1}{2k + 1}})\) and thus the total time of this phase is \(O(A_{V}^{\frac {k + 1}{2k + 1}}\log A)\) due to the online procedure of Lemma 5.5.

By Lemma 8.1, every feasible solution S admits, for some j, a decomposition S = L ∪{c}∪ R, where \(L\in \textbf {L}_{j-1}^{(\ell )}\), \(R\in \textbf {R}_{j + 1}^{(r)}\), \(c\in C_{j}\), and either \(\text{rank} _{v}(c) < \ell ^{\frac {1}{k}}\) or j ≤ k; we consider all possibilities for j. For each of them, we shall reduce searching for S to an instance of the Multichoice Knapsack problem with \(N^{\prime }=O(A_{V}^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\) and \(n^{\prime }= 2\). By Lemma 5.6, these instances can be solved in \(O((A_{V}^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\frac {\log A}{\log \lambda })\) time in total.

For j ≤ k, the items of the j-th instance are going to belong to classes \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) and \(\textbf {R}_{j + 1}^{(r)}\). The set \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) can be sorted by merging |Cj| sorted lists of size at most \(\lambda ^{j-1}\) each, i.e., in \(O(\lambda ^{k} \log \lambda )\) time. On the other hand, for j > k, we take \(\{L\cup \{c\} : L\in \textbf {L}_{j-1}^{(\ell )} , c\in C_{j}, \text{rank} _{v}(c)\le \ell ^{\frac {1}{k}}\}\) and \(\textbf {R}_{j + 1}^{(r)}\). The former set can be constructed by merging at most \(\min (\ell ^{\frac {1}{k}},\lambda )=\min (O(r^{\frac {1}{k + 1}}),\lambda )\) sorted lists of size \(\ell =O(r^{\frac {k}{k + 1}})\) each, i.e., in \(O(r\log \lambda )=O(A_{V}^{\frac {k + 1}{2k + 1}}\log \lambda )\) time.

Summing up over all indices j, this gives \(O((A_{V}^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log A)\) time for a single selection of the k positions with largest ranks, and \(O((A_{V}^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log A (\frac {\log A}{\log \lambda })^{k})\) in total.

Clearly, each solution of the constructed instances represents a solution of the initial instance, and by Lemma 8.1, every feasible solution of the initial instance has its counterpart in one of the constructed instances.

Before we conclude the proof, we need to note that the optimal k does not need to be known in advance. To deal with this issue, we try consecutive integers k and stop the procedure if Fact 5.4 yields that \(A_{V} > \lambda ^{2k + 1}\), i.e., if r is incremented beyond \(\lambda ^{k + 1}\). If the same happens for the other instance of the algorithm (operating on \(\text{rank} _{w}\) instead of \(\text{rank} _{v}\)), we conclude that \(a > \lambda ^{2k + 1}\), and thus we had better use a larger k. The running time until this point is \(O(\lambda ^{k + 1}\log \lambda (\frac {\log A}{\log \lambda })^{k})\) due to Lemma 5.5. On the other hand, if \(r \le \lambda ^{k + 1}\), the algorithm behaves as if \(a \le \lambda ^{2k + 1}\), i.e., runs in \(O(\lambda ^{k + 1}\log \lambda (\frac {\log A}{\log \lambda })^{k})\) time. This workaround (considering all smaller values k) adds extra \(O(\lambda ^{k}\log \lambda (\frac {\log A}{\log \lambda })^{k-1})\) to the time complexity for the optimal value k, which is less than the upper bound on the running time we have for this value k. □

8.2 Algorithm for General Weighted Pattern Matching

If we are to bound the complexity in terms of A only, the running time becomes

$${O(N+ {(A^{\frac{k + 1}{2k + 1}}+\lambda^{k})}\log A (\tfrac{\log A}{\log \lambda})^{k})}.$$

Assumptions that \(A\le \lambda ^{2k + 1}\) and k = O(1) let us get rid of the \((\frac {\log A}{\log \lambda })^{k}\) term, which can be bounded by \((2k + 1)^{k} = O(1)\).

Corollary 8.2

Let k = O(1) be a positive integer such that \(A\le \lambda ^{2k + 1}\). The Multichoice Knapsack problem can be solved in \(O(N+ {(A^{\frac {k + 1}{2k + 1}}+\lambda ^{k})}\log \lambda )\) time.

This leads to the following result for General Weighted Pattern Matching:

Theorem 1.8

If \(\lambda ^{2k-1}\le z\le \lambda ^{2k + 1}\) for some positive integer k = O(1), then the Weighted Consensus problem can be solved in \(O(R+(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time, and the GWPM problem can be solved in \(O(n(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time.

As we noted at the beginning of this section, Lemma 7.1 implies that any improvement of the dependence of the running time on z or λ by \(z^{\varepsilon }\) (equivalently, by \(\lambda ^{\varepsilon }\)) would contradict the k-Sum conjecture.