Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with provable running times using a variation of the lookahead scoring technique. We also consider a general variant of the pattern matching problems in which both the pattern and the text are uncertain. Central to our solution is a special case where the sequences have equal length, called the consensus problem. We propose algorithms for the consensus problem parameterized by the number of strings that match one of the sequences. Our basic approach is a careful adaptation of the classic meet-in-the-middle algorithm for the knapsack problem. On the lower bound side, we prove that our dependence on the parameter is optimal up to lower-order terms, conditional on the optimality of the original algorithm for the knapsack problem.


Introduction
We study two well-known representations of uncertain texts: weighted sequences and profiles. A weighted sequence (also known as an uncertain sequence or a position weight matrix, PWM) specifies, for every position and every letter of the alphabet, the probability of occurrence of this letter at this position; see Fig. 1 for an example. A weighted sequence represents many different strings, each with probability of occurrence equal to the product of the probabilities of its letters at the subsequent positions of the weighted sequence. Usually a threshold 1/z is specified, and one considers only strings that match the weighted sequence with probability at least 1/z. A scoring matrix (or a profile) of length m is an m × σ matrix. The score of a string of length m is the sum of the scores in the scoring matrix of the subsequent letters of the string at the respective positions. A string is said to match a scoring matrix if its matching score is above a specified threshold Z.
Weighted Pattern Matching and Profile Matching First of all, we study the standard variants of pattern matching problems on weighted sequences and profiles, in which only the pattern or the text is an uncertain sequence. In the best-known formulation of the Weighted Pattern Matching problem, we are given a weighted sequence of length n, called a text, a solid (standard) string of length m, called a pattern, both over an alphabet of size σ, and a threshold probability 1/z. We are asked to find all positions in the text where the fragment of length m represents the pattern with probability at least 1/z. Each such position is called an occurrence of the pattern in the text; we also say that the fragment of the text and the pattern match. The Weighted Pattern Matching problem can be solved in O(σn log m) time via the Fast Fourier Transform [5]. In a more general indexing variant of the problem, considered in [1,12], one can preprocess a weighted text in O(nz^2 log z) time to report all occ occurrences of a given solid pattern of length m in O(m + occ) time. (A similar indexing data structure, which assumes z = O(1), was presented in [4].) Very recently, the index construction time was reduced to O(nz) for constant-sized alphabets [2].
In the classic Profile Matching problem, the pattern is an m × σ profile, the text is a solid string of length n, and our task is to find all positions in the text where the fragment of length m has a score above a specified threshold Z. A naive approach to the Profile Matching problem works in O(nm + mσ) time. A broad spectrum of heuristics improving this algorithm in practice is known; for a survey see [16]. One of the principal techniques, coming in different flavours, is lookahead scoring, which consists of checking whether a partial match could possibly be completed by the following highest-scoring letters in the scoring matrix and, if not, pruning the naive search. The Profile Matching problem can also be solved in O(σn log m) time via the Fast Fourier Transform [17].
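To illustrate, the pruning idea behind lookahead scoring can be sketched as follows (an illustrative Python sketch, not one of the cited heuristics; the profile is assumed to be a list of letter-to-score mappings):

```python
def profile_occurrences(text, P, Z):
    """Naive Profile Matching with lookahead pruning."""
    m, n = len(P), len(text)
    # best[i] = maximum score achievable on positions i..m-1
    best = [0] * (m + 1)
    for i in range(m - 1, -1, -1):
        best[i] = best[i + 1] + max(P[i].values())
    occ = []
    for p in range(n - m + 1):
        score = 0
        for i in range(m):
            score += P[i].get(text[p + i], float('-inf'))
            if score + best[i + 1] < Z:   # lookahead: Z is out of reach
                break
        else:                             # no break: final score is >= Z
            occ.append(p)
    return occ
```

The suffix maxima best[i] are exactly the "highest scoring letters" used to decide whether a partial match can still be completed.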
Weighted Consensus and Profile Consensus As our most involved contribution, we study a general variant of pattern matching on weighted sequences and the consensus problems on uncertain sequences, which are closely related to the Multichoice Knapsack problem. In the Weighted Consensus problem, given two weighted sequences of the same length, we are to check if there is a string that matches each of them with probability at least 1/z. A routine to compare user-entered weighted sequences with existing weighted sequences in the database is used, e.g., in JASPAR, a well-known database of PWMs [19]. In the General Weighted Pattern Matching (GWPM) problem, both the pattern and the text are weighted. In the most common definition of the problem (see [3,12]), we are to find all fragments of the text that give a positive answer to the Weighted Consensus problem with the pattern. The authors of [3] proposed an algorithm for the GWPM problem based on the weighted prefix table that works in O(nz^2 log z + nσ) time.
In an analogous way to the Weighted Consensus problem, we define the Profile Consensus problem. Here we are to check for the existence of a string that matches both scoring matrices above threshold Z. The Profile Consensus problem is actually a special case of the well-known (especially in practice) Multichoice Knapsack problem (also known as the Multiple Choice Knapsack problem). In this problem, we are given n classes C_1, ..., C_n of at most λ items each (N items in total), with each item c characterized by a value v(c) and a weight w(c). The goal is to select one item from each class so that the sums of values and of weights of the items are below two specified thresholds, V and W. (In the more intuitive formulation of the problem, we require the sum of values to be above a specified threshold, but here we consider an equivalent variant in which both parameters are symmetric.) The Multichoice Knapsack problem is widely used in practice, but most research concerns approximation or heuristic solutions; see [14] and references therein. As far as exact solutions are concerned, the classic meet-in-the-middle approach by Horowitz and Sahni [11], originally designed for the (binary) Knapsack problem, immediately generalizes to an O*(λ^⌈n/2⌉)-time solution for Multichoice Knapsack. Several important problems can be expressed as special cases of the Multichoice Knapsack problem using folklore reductions (see [14]). This includes the Subset Sum problem, which for a set of n integers asks whether there is a subset summing up to a given integer Q, and the k-Sum problem which, for k = O(1) classes of λ integers, asks to choose one element from each class so that the selected integers sum up to zero. These reductions give immediate hardness results for the Multichoice Knapsack problem, and they can be adjusted to yield the same consequences for Profile Consensus. For the Subset Sum problem, as shown in [7,10], the existence of an O*(2^(εn))-time solution for every ε > 0 would violate the Exponential Time Hypothesis (ETH) [13,15]. Moreover, the O*(2^(n/2)) running time, achieved in [11], has not been improved yet despite much effort. The 3-Sum conjecture [9] and the more general k-Sum conjecture state that the 3-Sum and k-Sum problems cannot be solved in O(λ^(2−ε)) time and O(λ^(⌈k/2⌉(1−ε))) time, respectively, for any ε > 0.
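The meet-in-the-middle approach discussed above can be sketched in Python (an illustrative sketch under simplified assumptions: items are (value, weight) pairs, classes are split into two halves whose choices are fully enumerated, and only feasibility is decided):

```python
import bisect
from itertools import product

def multichoice_knapsack(classes, V, W):
    """True iff one item per class can be chosen with total value <= V
    and total weight <= W (meet-in-the-middle sketch)."""
    half = len(classes) // 2

    def combos(cls):
        # enumerate all partial choices over the given classes
        return [(sum(v for v, _ in pick), sum(w for _, w in pick))
                for pick in product(*cls)]

    left = combos(classes[:half])
    right = sorted(combos(classes[half:]))        # sorted by value
    # pref[j] = minimum weight among the j smallest-value right combos
    pref = [float('inf')]
    for _, w in right:
        pref.append(min(pref[-1], w))
    for v, w in left:
        # count right combos whose value fits under V - v
        j = bisect.bisect_right(right, (V - v, float('inf')))
        if j > 0 and w + pref[j] <= W:
            return True
    return False
```

With balanced halves this enumerates O(λ^⌈n/2⌉) partial choices per side, matching the bound quoted above.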

Our Results
As the first result, we show how the lookahead scoring technique combined with a data structure for answering longest common prefix queries in a string can be applied to obtain simple and efficient algorithms for the standard pattern matching problems on uncertain sequences. For a weighted sequence, by R we denote the size of its list representation, and by λ the maximal number of letters with probability at least 1/z at a single position (thus λ ≤ min(σ, z)). In the Profile Matching problem, we denote by M the number of strings that match the scoring matrix with threshold Z. In general M ≤ σ^m; however, we may assume that for practical data this number is actually much smaller. We obtain the following running times:
• O(mσ + n log M) for Profile Matching;
• O(R log^2(log λ) + n log z) deterministic and O(R + n log z) randomized (Las Vegas, failure with probability R^(−c) for any given constant c) for Weighted Pattern Matching.
The more complex part of our study is related to the consensus problems and to the GWPM problem. Instead of considering Profile Consensus, we study the more general Multichoice Knapsack. We introduce parameters based on the number of solutions with feasible value or weight: A_V, the number of choices of one element from each class that satisfy the value threshold; A_W, defined symmetrically for the weight threshold; A = max(A_V, A_W); and a = min(A_V, A_W). We obtain algorithms with the following complexities: O(N + √(aλ) log A) for Multichoice Knapsack, O(R + √(zλ)(log log z + log λ)) for Weighted Consensus, and O(n√(zλ)(log log z + log λ)) for General Weighted Pattern Matching.
Since a ≤ A ≤ λ n , our running time for Multichoice Knapsack in the worst case matches (up to lower order terms) the time complexities of the fastest known solutions for both Subset Sum (also binary Knapsack) and 3-Sum.The main novel part of our algorithm for Multichoice Knapsack is an appropriate (yet intuitive) notion of ranks of partial solutions.We also provide a simple reduction from Multichoice Knapsack to Weighted Consensus, which lets us transfer the negative results to the GWPM problem.
• The existence of an O*(z^ε)-time solution for Weighted Consensus for every ε > 0 would violate the Exponential Time Hypothesis.
In the higher-order terms, our complexities match the conditional lower bounds; therefore, we put significant effort into keeping the lower-order terms of the complexities as small as possible.
Model of Computations For problems on weighted sequences, we assume the word RAM model with word size w = Ω(log n + log z) and σ = n^O(1). We consider the log-probability model of representation of weighted sequences, that is, we assume that the probabilities in the weighted sequences and the threshold probability 1/z are all of the form c^(p/2^(dw)), where c and d are constants and p is an integer that fits in a constant number of machine words. Additionally, the probability 0 has a special representation. The only operations on probabilities in our algorithms are multiplications and divisions, which can be performed exactly in O(1) time in this model. Our solutions to the Multichoice Knapsack problem only assume the word RAM model with word size w = Ω(log S + log a), where S is the sum of the integers in the input instance; this does not affect the O* running time.
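As an illustration of the model, probabilities can be manipulated exactly via their fixed-point exponents (the names and the precision constant DW below are illustrative, not taken from the paper):

```python
DW = 32                     # stand-in for the d*w fixed-point precision

def encode(neg_log_prob):
    """Encode a probability c**(-x) by the integer round(x * 2**DW)."""
    return round(neg_log_prob * (1 << DW))

def multiply(p, q):
    """Multiplying probabilities adds their (negated) log exponents; exact."""
    return p + q

def at_least(p, threshold_code):
    """Probability comparison: a smaller negated exponent is a larger probability."""
    return p <= threshold_code
```

All arithmetic is on integers, so products and quotients of probabilities incur no rounding error, as required by the model.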
Structure of the Paper We start with Preliminaries, where we formally introduce the problems and the main notions used throughout the paper. The following three sections describe our algorithms: in Section 3 for Profile Matching and Weighted Pattern Matching; in Section 4 for Profile Consensus; and in Section 5 for Weighted Consensus and General Weighted Pattern Matching. A tailor-made, yet more efficient algorithm for General Weighted Pattern Matching is presented in Section 6. We conclude with Section 7, where we introduce faster algorithms and matching lower bounds for Multichoice Knapsack and GWPM in the case that λ is large.

Preliminaries
Let Σ = {s_1, s_2, ..., s_σ} be an alphabet of size σ. A string S over Σ is a finite sequence of letters from Σ. We denote the length of S by |S| and, for 1 ≤ i ≤ |S|, the i-th letter of S by S[i]. By S[i..j] we denote the string S[i] ... S[j], called a factor of S (if i > j, then the factor is an empty string). A factor is called a prefix if i = 1 and a suffix if j = |S|. For two strings S and T, we denote their concatenation by S • T (ST in short).
For a string S of length n, by lcp(i, j) we denote the length of the longest common prefix of the factors S[i..n] and S[j..n]. The following fact specifies a known efficient data structure answering such queries. It consists of the suffix array with its inverse, the LCP table, and a data structure for range minimum queries on the LCP table; see [6] for details.

Fact 2.1. Let S be a string of length n over an alphabet of size σ = n^O(1). After O(n)-time preprocessing, given indices i and j (1 ≤ i, j ≤ n), one can compute lcp(i, j) in O(1) time.
The Hamming distance between two strings X and Y of the same length, denoted by d H (X, Y ), is the number of positions where the strings have different letters.

Profiles
In the Profile Matching problem, we consider a scoring matrix (a profile) P of size m × σ. For i ∈ {1, ..., m} and j ∈ {1, ..., σ}, we denote the integer score of the letter s_j at the position i by P[i, s_j]. The matching score of a string S of length m with the matrix P is Score(S, P) = Σ_{i=1}^{m} P[i, S[i]]. If Score(S, P) ≥ Z for an integer threshold Z, then we say that the string S matches the matrix P with threshold Z. We denote the number of strings S that match P with threshold Z by NumStrings_Z(P).
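In Python-like notation, the score and the matching predicate read as follows (an illustrative transcription, with the profile represented as a list of letter-to-score mappings):

```python
def score(S, P):
    """Score(S, P): sum of the scores of the letters of S at their positions."""
    return sum(P[i][c] for i, c in enumerate(S))

def matches(S, P, Z):
    """S matches the scoring matrix P with threshold Z iff Score(S, P) >= Z."""
    return score(S, P) >= Z
```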
For a string T and a scoring matrix P, we say that P occurs in T at position i with threshold Z if T[i..i + m − 1] matches P with threshold Z. We denote the set of all positions where P occurs in T by Occ_Z(P, T). These notions let us define the Profile Matching problem:

Profile Matching Problem
Input: A string T of length n, a scoring matrix P of size m × σ, and a threshold Z. Output: The set Occ_Z(P, T). Parameters: M = NumStrings_Z(P).

Weighted Sequences
A weighted sequence X = X[1] ... X[n] of length |X| = n over alphabet Σ = {s_1, s_2, ..., s_σ} is a sequence of sets of pairs of the form X[i] = {(s_j, π_i^(X)(s_j)) : j ∈ {1, ..., σ}}. Here, π_i^(X)(s_j) is the occurrence probability of the letter s_j at the position i ∈ {1, ..., n}. These values are non-negative and sum up to 1 for a given i.
For all our algorithms, it is sufficient that the probabilities sum up to at most 1 for each position. Also, the algorithms sometimes produce auxiliary weighted sequences with the sum of probabilities smaller than 1 at some positions.
We denote the maximum number of letters occurring at a single position of the weighted sequence (with non-zero probability) by λ and the total size of the representation of a weighted sequence by R. The standard representation consists of n lists with up to λ elements each, so R = O(nλ). However, the lists can be shorter in general. Also, if the threshold probability 1/z is specified, at each position of a weighted sequence it suffices to store letters with probability at least 1/z, and clearly there are at most z such letters for each position. This reduction can be performed in linear time, so we shall always assume that λ ≤ z.
The probability of matching of a string S with a weighted sequence X is P(S, X) = Π_{i=1}^{n} π_i^(X)(S[i]). We say that a string S matches a weighted sequence X with probability at least 1/z, denoted by S ≈_{1/z} X, if P(S, X) ≥ 1/z. Given a weighted sequence T, by T[i..j] we denote the weighted sequence T[i] ... T[j], called a factor of T (if i > j, then the factor is empty). We then say that a string P occurs in T at position i if P matches the factor T[i..i + m − 1]. We also say that P is a (1/z)-solid factor of T at position i (a (1/z)-solid prefix if i = 1 and a (1/z)-solid suffix if i + m − 1 = n). We denote the set of all positions where P occurs in T by Occ_{1/z}(P, T).
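The matching probability can be transcribed directly (an illustrative Python sketch; a weighted sequence is assumed to be represented as a list of letter-to-probability mappings):

```python
def match_probability(S, X):
    """P(S, X): product of the occurrence probabilities of the letters of S."""
    p = 1.0
    for i, c in enumerate(S):
        p *= X[i].get(c, 0.0)   # a letter absent at position i has probability 0
    return p

def matches(S, X, z):
    """S matches X with probability at least 1/z."""
    return match_probability(S, X) >= 1 / z
```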

Weighted Pattern Matching Problem
Input: A string P of length m and a weighted sequence T of length n with at most λ letters at each position and R in total, and a threshold probability 1/z. Output: The set Occ_{1/z}(P, T).

Profile Matching and Weighted Pattern Matching
In this section we present a solution to the Profile Matching problem. Afterwards, we show that it can be applied to Weighted Pattern Matching as well.
For a scoring matrix P, the heavy string of P, denoted H(P), is constructed by choosing at each position the heaviest letter, that is, the letter with the maximum score (breaking ties arbitrarily). Intuitively, H(P) is a string that matches P with the maximum score.

Observation 3.1. If a string S matches a scoring matrix P with threshold Z, then d_H(H(P), S) ≤ log_2 M.

Proof. Let d = d_H(H(P), S). We can construct 2^d strings of length |S| that match P with a score above Z by taking either of the letters S[j] or H(P)[j] at each position j such that S[j] ≠ H(P)[j]. Hence, 2^d ≤ M, which concludes the proof.
Our solution for the Profile Matching problem works as follows. We first construct P′ = H(P) and the data structure for finding lcp values between suffixes of P′ and T. Let the variable s store the matching score of P′. In the p-th step, we calculate the matching score of T[p..p + m − 1] by iterating through the subsequent mismatches between P′ and T[p..p + m − 1] and updating the matching score s′ accordingly, starting from s′ = s. The mismatches are found using lcp-queries. This process terminates when the score s′ drops below Z or when all the mismatches have been found. In the end, we include p in Occ_Z(P, T) if s′ ≥ Z. A pseudocode of this approach is given below for completeness.
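The approach can be rendered in Python as follows (an illustrative sketch: the constant-time lcp data structure of Fact 2.1 is replaced by a naive scan, so this version does not achieve the stated running time, but the mismatch-jumping logic is the same):

```python
def heavy_string(P):
    """H(P): the heaviest letter at each position, ties broken arbitrarily."""
    return ''.join(max(col, key=col.get) for col in P)

def profile_matching(T, P, Z):
    m = len(P)
    Pp = heavy_string(P)                        # P' = H(P)
    s = sum(P[i][Pp[i]] for i in range(m))      # score of the heavy string

    def lcp(i, window):                         # naive stand-in for lcp queries
        j = i
        while j < m and Pp[j] == window[j]:
            j += 1
        return j - i

    occ = []
    for p in range(len(T) - m + 1):
        window = T[p:p + m]
        sp, i = s, 0
        while sp >= Z:
            i += lcp(i, window)                 # jump to the next mismatch
            if i == m:
                break
            sp += P[i].get(window[i], float('-inf')) - P[i][Pp[i]]
            i += 1
        if sp >= Z:
            occ.append(p)
    return occ
```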
Basically the same approach can be used for Weighted Pattern Matching. In a natural way, we extend the notion of a heavy string to weighted sequences. Now we can restate Observation 3.1 in the language of probabilities instead of scores:

Observation 3.3. If a string P matches a weighted sequence X of the same length with probability at least 1/z, then d_H(H(X), P) ≤ log_2 z.

Compared to the solution to Profile Matching, we compute the heavy string of the text instead of the pattern. An auxiliary variable α stores the matching probability between a factor of H(T) and the corresponding factor of T; it needs to be updated when we move to the next position of the text. In the implementation, we perform the following operations on a weighted sequence:
• computing the probability of a given letter at a given position,
• finding the letter with the maximum probability at a given position.
In the standard list representation, the latter can be performed on a single weighted sequence in O(1) time after O(R)-time preprocessing. We can perform the former in constant time if, in addition to the list representation, we store the letter probabilities in a dictionary implemented using perfect hashing [8]. This way, we can implement the algorithm in O(n log z + R) time w.h.p. Alternatively, a deterministic dictionary [18] can be used to obtain a deterministic solution in O(R log^2(log λ) + n log z) time. We arrive at the following result.

Profile Consensus as Multichoice Knapsack
Let us start with a precise statement of the Multichoice Knapsack problem.

Multichoice Knapsack Problem
Input: A set C of N items partitioned into n disjoint classes C_i, each of size at most λ, two integers v(c) and w(c) for each item c ∈ C, and two thresholds V and W. Question: Does there exist a choice S (a set S ⊆ C such that |S ∩ C_i| = 1 for each i) satisfying both Σ_{c∈S} v(c) ≤ V and Σ_{c∈S} w(c) ≤ W? Parameters: A_V and A_W: the number of choices S satisfying Σ_{c∈S} v(c) ≤ V and Σ_{c∈S} w(c) ≤ W, respectively, as well as A = max(A_V, A_W) and a = min(A_V, A_W).
Indeed, we see that the Profile Consensus problem reduces to the Multichoice Knapsack problem. For two m × σ scoring matrices, we construct n = m classes of λ = σ items each, with values equal to the scores of the letters in the first matrix and weights equal to the scores in the second matrix; both thresholds V and W are equal to Z.
For a fixed instance of Multichoice Knapsack, we say that S is a partial choice over a domain D ⊆ {1, ..., n} if S contains exactly one item from each class C_i with i ∈ D; we write v(S) = Σ_{c∈S} v(c) and w(S) = Σ_{c∈S} w(c). The classic O*(2^(n/2))-time solution to the Knapsack problem [11] partitions D = {1, ..., n} into two domains D_i of size roughly n/2, and for each D_i it generates all partial choices S ordered by v(S). Hence, it reduces the problem to an instance of Multichoice Knapsack with 2 classes. It is solved using the following lemma, proved below for completeness.

Lemma 4.1. An instance of the Multichoice Knapsack problem with two classes whose items are sorted by values can be solved in O(N) time.

Proof. Since the items of C_1 and C_2 are sorted by v(c), a single scan through these items lets us remove all irrelevant elements, that is, elements dominated by other elements in their class. Next, for each c_1 ∈ C_1, we find the item c_2 ∈ C_2 such that v(c_2) ≤ V − v(c_1) and v(c_2) is largest possible. As we have removed irrelevant elements from C_2, this item also minimizes w(c_2) among all elements satisfying v(c_2) ≤ V − v(c_1). Hence, if there is a feasible solution containing c_1, then {c_1, c_2} is feasible. If we process the elements c_1 by non-decreasing values v(c_1), the values v(c_2) do not increase, and thus the items c_2 can be computed in O(N) time in total.
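The scan from the proof can be sketched as follows (illustrative Python; classes are lists of (value, weight) pairs, and the Pareto-front pruning plays the role of removing irrelevant elements):

```python
def two_class_knapsack(C1, C2, V, W):
    """One (value, weight) item from each class with total value <= V and
    total weight <= W?  Linear after sorting."""
    # Pareto front of C2 by increasing value: weights strictly decrease,
    # so among front items with value <= t the LAST one has minimum weight.
    front = []
    for v, w in sorted(C2):
        if not front or w < front[-1][1]:
            front.append((v, w))
    j = len(front) - 1
    for v1, w1 in sorted(C1):                   # v1 non-decreasing
        while j >= 0 and front[j][0] > V - v1:  # shrink the feasible prefix
            j -= 1
        if j >= 0 and w1 + front[j][1] <= W:
            return True
    return False
```

Since v1 only grows, the pointer j only moves left, giving the claimed single scan.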
The same approach generalizes to Multichoice Knapsack. The partition is chosen to balance the number of partial choices in each domain, and the worst-case time complexity is O*(√Q), where Q = Π_{i=1}^{n} |C_i| ≤ λ^n denotes the total number of choices. Our aim in this section is to replace Q with the parameter a (which never exceeds Q). The overall running time is going to be O(N + √(aλ) log A): an overhead of O(log A) appears. Two challenges arise once we adapt the meet-in-the-middle approach: how to restrict the set of partial choices to be generated so that a feasible solution is not missed, and how to define a partition D = D_1 ∪ D_2 to balance the number of partial choices generated for D_1 and D_2. A natural idea to deal with the first issue is to consider only partial choices with small values v(S) or w(S). This is close to our actual solution, which is based on the notion of ranks of partial choices. Our approach to the second problem is to consider multiple partitions: those of the form D = {1, ..., j} ∪ {j + 1, ..., n} for 1 ≤ j ≤ n. This results in an extra O(n) factor in the time complexity. However, in Section 4.1 we introduce preprocessing that lets us assume that n = O(log A / log λ). While dealing with these two issues, some further effort is required to avoid a few other extra terms in the running time. In the case of our algorithm, the only remaining overhead is O(log λ), which stems from the fact that we need to keep partial solutions ordered by v(S).
For a partial choice S, we define rank_V(S) as the number of partial choices S′ with the same domain for which v(S′) ≤ v(S). We symmetrically define rank_W(S). Ranks are introduced as an analogue of probabilities in weighted sequences. Probabilities are multiplicative, while for ranks we have submultiplicativity:

Fact 4.2. Assume that S = S_1 ∪ S_2 is a decomposition of a partial choice S into two disjoint subsets. Then rank_V(S_1) · rank_V(S_2) ≤ rank_V(S) (and similarly for rank_W).
Proof. Let D_1 and D_2 be the domains of S_1 and S_2, respectively. For every partial choices S′_1 over D_1 with v(S′_1) ≤ v(S_1) and S′_2 over D_2 with v(S′_2) ≤ v(S_2), the partial choice S′_1 ∪ S′_2 satisfies v(S′_1 ∪ S′_2) ≤ v(S) and must be counted while determining rank_V(S).
For 0 ≤ j ≤ n, let L_j be the list of partial choices with domain {1, ..., j} ordered by value v(S), and for ℓ > 0, let V^(ℓ)_{L_j} denote the value of the ℓ-th element of L_j (or ∞ if ℓ > |L_j|). Analogously, for 1 ≤ j ≤ n + 1, we define R_j as the list of partial choices over {j, ..., n} ordered by v(S), and the values V^(r)_{R_j} for r > 0. The following two observations yield a decomposition of each choice into a single item and two partial solutions of a small rank. In particular, we do not need to know A_V in order to check if the ranks are sufficiently large.

Lemma 4.3. Let ℓ and r be positive integers such that V^(ℓ)_{L_j} + V^(r)_{R_{j+1}} > V for each 0 ≤ j ≤ n. For every choice S with v(S) ≤ V, there is an index j ∈ {1, ..., n} and a decomposition S = L ∪ {c_j} ∪ R, where c_j ∈ C_j, L is a partial choice over {1, ..., j − 1} with rank_V(L) ≤ ℓ, and R is a partial choice over {j + 1, ..., n} with rank_V(R) ≤ r.

Proof. For 0 ≤ i ≤ n, let S_i denote the restriction of S to the domain {1, ..., i}, and let c_i be the item of S in class C_i. If v(S_{n−1}) < V^(ℓ)_{L_{n−1}}, we set L = S_{n−1}, c = c_n, and R = ∅, satisfying the claimed conditions. Otherwise, we define j as the smallest index i such that v(S_i) ≥ V^(ℓ)_{L_i}, and we set L = S_{j−1}, c = c_j, and R = S \ S_j. The definition of j implies v(L) < V^(ℓ)_{L_{j−1}}, so rank_V(L) ≤ ℓ; moreover, v(R) ≤ V − v(S_j) ≤ V − V^(ℓ)_{L_j} < V^(r)_{R_{j+1}}, so rank_V(R) ≤ r.

Lemma 4.4. If V^(ℓ)_{L_j} + V^(r)_{R_{j+1}} ≤ V for some j ∈ {0, ..., n}, then ℓ · r ≤ A_V.

Proof. Let L and R be the ℓ-th and the r-th entry in L_j and R_{j+1}, respectively. Note that v(L ∪ R) ≤ V implies rank_V(L ∪ R) ≤ A_V by the definition of A_V. Moreover, rank_V(L) ≥ ℓ and rank_V(R) ≥ r (the inequalities may be strict due to ties). Now, Fact 4.2 yields the claimed bound.
Note that L_j can be obtained by interleaving |C_j| copies of L_{j−1}, where each copy corresponds to extending the choices from L_{j−1} with a different item. If we were to construct L_j having access to the whole L_{j−1}, we could proceed as follows. For each c ∈ C_j, we maintain an iterator on L_{j−1} pointing to the first element S of L_{j−1} for which S ∪ {c} has not yet been added to L_j. The associated value is v(S ∪ {c}). All iterators initially point at the first element of L_{j−1}. Then the next element to append to L_j is always the S ∪ {c} corresponding to the iterator with minimum associated value. Having processed this partial choice, we advance the iterator (or remove it, once it has already scanned the whole L_{j−1}). This process can be implemented using a binary heap H_j as a priority queue, so that initialization requires O(|C_j|) time and outputting a single element takes O(log |C_j|) time.
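The offline merge described above can be sketched with a binary heap (illustrative Python operating on value lists only; the real algorithm also carries the weights and the choices themselves):

```python
import heapq

def extend_sorted(L_prev, Cj):
    """Merge |Cj| shifted copies of L_prev (values of partial choices over
    classes 1..j-1, sorted) into the sorted value list of L_j."""
    # heap entries: (candidate value, item index in Cj, position in L_prev)
    heap = [(L_prev[0] + c, k, 0) for k, c in enumerate(Cj)]
    heapq.heapify(heap)
    out = []
    while heap:
        val, k, pos = heapq.heappop(heap)
        out.append(val)
        if pos + 1 < len(L_prev):               # advance this iterator
            heapq.heappush(heap, (L_prev[pos + 1] + Cj[k], k, pos + 1))
    return out
```

Initialization is linear in |Cj| (heapify), and each output element costs one heap operation, matching the stated bounds.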
For all r ≥ 0, let L^(r)_j be the prefix of L_j of length min(r, |L_j|) and R^(r)_j be the prefix of R_j of length min(r, |R_j|). A technical transformation of the procedure stated above leads to an online algorithm that constructs the prefixes L^(r)_j and R^(r)_j. Along with each reported partial choice S, the algorithm also computes w(S).

Proof. Our online algorithm is going to use the same approach as the offline computation of the lists L^(r)_j. The order of computations is going to be changed, though.
At each step, for j = 1 to n, we shall extend the list L^(i−1)_j with a single element (unless the whole L_j has already been generated) taken from the top of the heap H_j. Note that this way each iterator in H_j always points either to an element that is already in L^(i−1)_{j−1} or to the first element that has not yet been added to L_{j−1}, which is represented by the top of the heap H_{j−1}.
We initialize the heaps as follows: we introduce H_0, which represents the empty choice ∅ with v(∅) = 0. Next, for j = 1, ..., n, we build the heap H_j representing |C_j| iterators initially pointing to the top of H_{j−1}. The initialization takes O(N) time in total since a binary heap can be constructed in time linear in its size.
At each step, the lists L^(i−1)_j are extended for consecutive values of j from 1 to n. Since L^(i)_{j−1} is extended before L^(i)_j, all iterators in H_j point to elements of L^(i)_{j−1} while we compute L^(i)_j. We take the top of H_j and move it to L^(i)_j. Next, we advance the corresponding iterator and update its position in the heap H_j. After this operation, the iterator might point to the top of H_{j−1}. If H_{j−1} is empty, this means that the whole list L_{j−1} has already been generated and traversed by the iterator. In this case, we remove the iterator.
It is not hard to see that this way we indeed simulate the previous offline solution. A single phase makes O(1) operations on each heap H_j. The running time is bounded by O(Σ_j log |C_j|) = O(n log λ) at each step of the algorithm.
The reduction of the following lemma is presented in Section 4.1. Note that we may always assume that λ ≤ a ≤ A. Indeed, if we order the items c ∈ C_i according to v(c), then only the first A_V of them might belong to a choice S with v(S) ≤ V.
Proof. Below, we give an algorithm running in O(N + √(A_V λ) log A) time. The final solution runs it in parallel on the original instance and on the instance with v and V swapped with w and W, waiting until at least one of them terminates.
We increment an integer r starting from 1, maintaining ℓ = rλ and the lists L^(ℓ)_j and R^(r)_{j+1} for each index j. We consider all possibilities for j. For each of them, we reduce searching for S to a small instance of the Multichoice Knapsack problem: the items of the j-th instance belong to the classes L^(ℓ)_{j−1}, C_j, and R^(r)_{j+1}. Clearly, each feasible solution of the constructed instances represents a feasible solution of the initial instance, and by Lemma 4.3, every feasible solution of the initial instance has its counterpart in one of the constructed instances.

Proof of Lemma 4.6
Our reduction consists of two steps. Its implementation uses the following notions: for each class C_i, let v_min(i) = min{v(c) : c ∈ C_i}. Also, let V_min = Σ_{i=1}^{n} v_min(i); note that V_min is the smallest possible value v(S) of a choice S. We symmetrically define w_min(i) and W_min. First, we make sure that n = O(log A).
Observe that if some class C_i contains a single item c for which both v(c) = v_min(i) and w(c) = w_min(i), then we can greedily include it in the solution S. Hence, we can remove such a class, setting V := V − v_min(i) and W := W − w_min(i). We execute this reduction rule exhaustively, which clearly takes O(N) time in total and may only decrease the parameters A_V and A_W. After the reduction, each class contains at least two items.
We shall prove that now we can either find out that A ≥ 2^(⌈n/2⌉) or that we are dealing with a NO-instance. To decide which case holds, let us define ∆_V(i) as the difference between the second smallest value in the multiset {v(c) : c ∈ C_i} and v_min(i). We define ∆^mid_V as the sum of the ⌈n/2⌉ smallest values ∆_V(i), and symmetrically ∆_W(i) and ∆^mid_W.

Claim. If V_min + ∆^mid_V ≤ V or W_min + ∆^mid_W ≤ W, then A ≥ 2^(⌈n/2⌉); otherwise, we are dealing with a NO-instance.
Proof. First, assume that V_min + ∆^mid_V ≤ V. This means that there is a choice S with v(S) ≤ V containing at least ⌈n/2⌉ items c such that rank_V(c) ≥ 2. Fact 4.2 yields rank_V(S) ≥ 2^(⌈n/2⌉) and consequently A_V ≥ 2^(⌈n/2⌉). The case W_min + ∆^mid_W ≤ W is symmetric.

Now, suppose that there is a feasible solution S. As no class contains a single item minimizing both v(c) and w(c), there are at least ⌈n/2⌉ classes for which S contains an item not minimizing v(c), or at least ⌈n/2⌉ classes for which S contains an item not minimizing w(c). Without loss of generality, we assume that the former holds. Let D be the set of at least ⌈n/2⌉ classes i satisfying the condition. For each i ∈ D, we have v(c_i) ≥ v_min(i) + ∆_V(i), and therefore V ≥ v(S) ≥ V_min + Σ_{i∈D} ∆_V(i) ≥ V_min + ∆^mid_V, as claimed.

The conditions from the claim can be verified in O(N) time using a linear-time selection algorithm to compute ∆^mid_V and ∆^mid_W. If any of the first two conditions holds, we return the instance obtained using our reduction. Otherwise, we output a dummy NO-instance.
Before we proceed with the second reduction, let us introduce an auxiliary notion. An item c ∈ C_j is irrelevant if there is another item c′ ∈ C_j that dominates c, i.e., such that v(c) > v(c′) and w(c) > w(c′). Removing irrelevant items leads to an equivalent instance of the Multichoice Knapsack problem, and it may only decrease the parameters.

Lemma 4.9. Consider a class of items in an instance of the Multichoice Knapsack problem. In linear time, we can remove some irrelevant items from the class so that the resulting class C satisfies max(rank_V(c), rank_W(c)) > (1/3)|C| for each item c ∈ C.

Proof. First, note that using a linear-time selection algorithm, we can determine for each item c whether rank_V(c) ≤ (1/3)|C| and whether rank_W(c) ≤ (1/3)|C|. If there is no item satisfying both conditions, we keep C unaltered. Otherwise, we have an item c which dominates at least |C| − rank_V(c) − rank_W(c) ≥ (1/3)|C| other items. We scan through all the items in C and remove those dominated by c. Next, we repeat the algorithm. The running time of a single phase is clearly linear, and since |C| decreases geometrically, the total running time is also linear.
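The effect of removing irrelevant items can be illustrated as follows (a simplified Python sketch that sorts, and therefore runs in O(|C| log |C|) time rather than the linear time of the lemma; it also drops value-ties in favour of the lighter item, which preserves feasibility):

```python
def remove_dominated(cls):
    """Keep a value-sorted Pareto front of (value, weight) items: every
    removed item is dominated, or tied in value with a no-heavier item."""
    kept = []
    for v, w in sorted(cls):        # by value, then by weight
        if not kept or w < kept[-1][1]:
            kept.append((v, w))
    return kept
```

In the surviving class, weights strictly decrease as values increase, so no item dominates another.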
A straightforward way to decrease the number of classes is to replace two distinct classes C_i, C_j with their Cartesian product C_i × C_j, where the value (weight) of a pair (c_i, c_j) is the sum of the values (weights) of c_i and c_j. This clearly leads to an equivalent instance of the Multichoice Knapsack problem, does not alter the parameters A_V and A_W, and decreases n. On the other hand, N and λ may increase; the latter happens only if the product |C_i| · |C_j| exceeds λ. These two reduction rules let us implement our reduction procedure, which constitutes the proof of Lemma 4.6.
Proof. First, we apply Lemma 4.8 to make sure that n ≤ 2 log A and N = O(λ log A). We may now assume that λ ≥ 3^6, as otherwise we already have n = O(log A / log λ). Throughout the algorithm, whenever there are two distinct classes of size at most √λ, we replace them with their Cartesian product. This may happen only n − 1 times, and a single execution takes O(λ) time, so the total running time needed for this part is O(λ log A).
Furthermore, for every class that we get in the input instance or obtain as a Cartesian product, we apply Lemma 4.9. The total running time spent on this is also O(λ log A).
Having exhaustively applied these reduction rules, we are guaranteed that max(rank_V(c), rank_W(c)) > (1/3)√λ ≥ λ^(1/3) (using λ ≥ 3^6) for items c from all but one class. Without loss of generality, we assume that the classes satisfying this condition are C_1, ..., C_k. Recall that v_min(i) and w_min(i) are defined as the minimum values and weights of items in class C_i and that V_min and W_min are their sums over all classes. For 1 ≤ i ≤ k, we define ∆_V(i) as the difference between the ⌈λ^(1/3)⌉-th smallest value in the multiset {v(c) : c ∈ C_i} and v_min(i). Next, we define ∆^mid_V as the sum of the ⌈k/2⌉ smallest values ∆_V(i). Symmetrically, we define ∆_W(i) and ∆^mid_W. We shall prove a claim analogous to that in the proof of Lemma 4.8.
Claim. If V_min + ∆^mid_V ≤ V or W_min + ∆^mid_W ≤ W, then A ≥ λ^{k/6}; otherwise, we are dealing with a NO-instance.
Proof. First, suppose that V_min + ∆^mid_V ≤ V. This means that there is a choice S with v(S) ≤ V which contains at least k/2 items c with rank_V(c) ≥ λ^{1/3}. By Fact 4.2, the rank of this choice is at least λ^{k/6}, so A_V ≥ λ^{k/6}, as claimed. The proof of the second case is analogous.

Now, suppose that there is a feasible solution S = {c_1, ..., c_n}. For each 1 ≤ i ≤ k, we have max(rank_V(c_i), rank_W(c_i)) > λ^{1/3}, so rank_V(c_i) ≥ λ^{1/3} or rank_W(c_i) ≥ λ^{1/3}, and one of these two conditions holds for at least k/2 classes. Without loss of generality, we assume that the former holds. Let D be the set of (at least k/2) classes i satisfying the condition. For each i ∈ D, we clearly have v(c_i) ≥ v_min(i) + ∆_V(i). Consequently, V ≥ v(S) ≥ V_min + Σ_{i∈D} ∆_V(i) ≥ V_min + ∆^mid_V, which concludes the proof.

The condition from the claim can be verified using a linear-time selection algorithm: first, we apply it for each class to compute ∆_V(i) and ∆_W(i), and then, globally, to determine ∆^mid_V and ∆^mid_W. If one of the first two conditions holds, we return the instance obtained through the reduction. It satisfies A ≥ λ^{k/6}, i.e., n ≤ 1 + k ≤ 1 + 6 log A / log λ. Otherwise, we construct a dummy NO-instance.

Weighted Consensus and General Weighted Pattern Matching
The Weighted Consensus problem is formally defined as follows.
Weighted Consensus Problem
Input: Two weighted sequences X and Y of length n, with at most λ letters at each position and R letters in total, and a threshold probability 1/z.
Output: A string S such that S ≈_{1/z} X and S ≈_{1/z} Y, or NONE if no such string exists.
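For concreteness, a match of a solid string against a weighted sequence can be tested directly from the definition. Below is a small illustrative Python check (representation and function names are ours), with each position of a weighted sequence given as a letter-to-probability dictionary; the brute-force consensus check is exponential and serves only to illustrate the problem statement.

```python
from math import prod
from itertools import product

def matches(S, X, z):
    """Test S matches X with threshold 1/z: the occurrence probability of S,
    i.e., the product of the letter probabilities at subsequent positions,
    must be at least 1/z."""
    assert len(S) == len(X)
    return prod(col.get(s, 0.0) for s, col in zip(S, X)) >= 1.0 / z

def consensus_exists(X, Y, z):
    """Brute-force Weighted Consensus check (exponential; intuition only):
    does some string match both X and Y with probability at least 1/z?"""
    alphabet = sorted(set().union(*X, *Y))
    return any(matches(S, X, z) and matches(S, Y, z)
               for S in product(alphabet, repeat=len(X)))
```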
If two weighted sequences admit a consensus string, we write X ≈_{1/z} Y and say that X matches Y with probability 1/z. With this definition of a match, we extend the notion of an occurrence and the notation Occ_{1/z}(P, T) to arbitrary weighted sequences.
General Weighted Pattern Matching (GWPM) Problem
Input: Two weighted sequences P and T of length m and n, respectively, with at most λ letters at each position and R letters in total, and a threshold probability 1/z.
Output: The set Occ_{1/z}(P, T).
In the case of the GWPM problem, it is more useful to provide an oracle that finds solid factors corresponding to particular occurrences of the pattern. Such an oracle, given i ∈ Occ_{1/z}(P, T), computes a string that matches both P and T[i..i + m − 1].
We say that a string P is a maximal 1/z-solid prefix of a weighted sequence X if P is a 1/z-solid prefix of X and no string P′ = Ps, for s ∈ Σ, is a 1/z-solid prefix of X. Our algorithms rely on the following simple combinatorial observation, originally due to Amir et al. [1].

Fact 5.1 ([1]). A weighted sequence has at most z different maximal 1/z-solid prefixes.

The Weighted Consensus problem is actually a special case of Multichoice Knapsack. Namely, given an instance of the former, we can create an instance of the latter with n classes C_i, each containing an item c_{i,s} for every letter s which has non-zero probability at position i both in X and in Y. We set v(c_{i,s}) = −log π^{(X)}_i(s) and w(c_{i,s}) = −log π^{(Y)}_i(s), whereas the thresholds are V = log z_X and W = log z_Y. This way, for a string P of length n and the corresponding choice S = {c_{i,s} : P[i] = s}, the value v(S) and the weight w(S) are the negated logarithms of the probabilities that P matches X and Y, respectively. Thus, P is a solution to the instance of the Weighted Consensus problem with two threshold probabilities 1/z_X and 1/z_Y if and only if S is a solution to the constructed instance of the Multichoice Knapsack problem. To have a single threshold z = max(z_X, z_Y), we append an additional position n + 1 with a single symbol whose probabilities in X and Y compensate for the difference between z_X and z_Y. If one wants to make sure that the probabilities at each position sum up to exactly one, two further letters can be introduced, one of which gathers the remaining probability in X and has probability 0 in Y, while the other gathers the remaining probability in Y and has probability 0 in X. Nevertheless, it might still be possible to improve the dependence on n in the GWPM problem. For example, one may hope to achieve Õ(n z^{0.5−ε} + z^{0.5}) time for λ = O(1).
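The reduction can be sketched as follows: an illustrative Python fragment (names ours) that builds the classes and thresholds for the two-threshold variant, before the extra position is appended, together with a brute-force feasibility check for intuition.

```python
import math
from itertools import product

def consensus_to_knapsack(X, Y, zX, zY):
    """Build a Multichoice Knapsack instance from a Weighted Consensus
    instance: one class per position, one item (v, w, s) per letter s with
    non-zero probability at that position in both X and Y, where
    v = -log pi_i^(X)(s) and w = -log pi_i^(Y)(s); the thresholds are
    V = log zX and W = log zY.  Returns None for an immediate NO-instance."""
    classes = []
    for px, py in zip(X, Y):
        cls = [(-math.log2(px[s]), -math.log2(py[s]), s)
               for s in px if px[s] > 0.0 and py.get(s, 0.0) > 0.0]
        if not cls:
            return None          # some position admits no common letter
        classes.append(cls)
    return classes, math.log2(zX), math.log2(zY)

def feasible(instance):
    """Brute-force feasibility check (exponential; for intuition only)."""
    if instance is None:
        return False
    classes, V, W = instance
    eps = 1e-9                   # tolerate floating-point rounding
    return any(sum(v for v, _, _ in choice) <= V + eps and
               sum(w for _, w, _ in choice) <= W + eps
               for choice in product(*classes))
```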

Faster GWPM via Short Dissimilar Weighted Consensus
This section provides a faster solution for the General Weighted Pattern Matching problem. The key ingredient is an improved solution for the following Short Dissimilar Weighted Consensus problem:

Short Dissimilar Weighted Consensus (SDWC) Problem
Input: A threshold probability 1/z and two weighted sequences X and Y of length n ≤ 2⌊log z⌋, with at most λ ≤ z letters at each position, and such that H(X) and H(Y) are dissimilar, i.e., H(X)[i] ≠ H(Y)[i] for each position i.
Output: A string S such that S ≈_{1/z} X and S ≈_{1/z} Y, or NONE if no such string exists.

Merge-in-the-Middle Implementation
In this section we apply Lemma 6.3 to solve the SDWC problem. We use Lemma 6.5 to generate all candidates for L, c, and R, and we apply a divide-and-conquer procedure to combine them. Our procedure works for fixed U, V ∈ {X, Y}; the algorithm repeats it for all four choices. Let L_i denote a list of all common 1/z-solid prefixes of X and Y obtained by extending a light √λ/√z-solid prefix of U of length i − 1 by a single letter s at position i, and let R_i denote a list of all common 1/z-solid suffixes of X and Y of length n − i + 1 that are light 1/√(zλ)-solid suffixes of V. We assume that the lists L_i and R_i are sorted according to the probabilities in U and V, respectively.

Lemma 6.6. The lists L_i and R_i for i ∈ {1, ..., n + 1} can be computed in O(√(zλ)(log log z + log λ)) time. Their total size is O(√(zλ)).
Proof. The O(√(zλ)(log log z + log λ))-time computation of the lists R_i is directly due to Lemma 6.5. As for the lists L_i, we first compute in O(√z/√λ · (log log z + log λ)) time the lists of all light √λ/√z-solid prefixes of U, sorted by the lengths of the strings and then by the probabilities in U, again using Lemma 6.5. Then, for each length i − 1 and for each letter s at the i-th position, we extend all these prefixes by a single letter. This way we obtain λ lists for a given i − 1 that can be merged according to the probabilities in U to form the list L_i. Generating and merging the auxiliary lists takes O(√(zλ) log λ) time in total.

Let L*_{a,b} be a list of common 1/z-solid prefixes of X and Y of length b obtained by taking a common 1/z-solid prefix from L_i for some i ∈ {a, ..., b} and extending it by b − i letters that are heavy at the respective positions in V. Similarly, R*_{a,b} is a list of common 1/z-solid suffixes of length n − a + 1 obtained by taking a common 1/z-solid suffix from R_i for some i ∈ {a, ..., b} and prepending it with i − a letters that are heavy in V. Again, we assume that each of the lists L*_{a,b} and R*_{a,b} is sorted according to the probabilities in U and V, respectively.

A basic interval is an interval [a, b], represented by its endpoints 1 ≤ a ≤ b ≤ n + 1, such that 2^j | a − 1 and b = min(n + 1, a + 2^j − 1) for some integer j called the layer of the interval. For every j = 0, ..., ⌈log n⌉, there are Θ(n/2^j) basic intervals, and they are pairwise disjoint.

Example 6.7. For n = 7, the basic intervals are [1,1], ..., [8,8], [1,2], [3,4], [5,6], [7,8], [1,4], [5,8], [1,8].

(Note that the necessary products of probabilities can be computed in O(n) = O(log z) total time.) For every j = 1, ..., ⌈log n⌉, the total length of the lists from the j-th layer does not exceed the total length of the lists from the (j − 1)-th layer. By Lemma 6.6, the lists at the 0-th layer have size O(√(zλ)). The conclusion follows from the fact that log n = O(log log z).
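The basic intervals are easy to enumerate; the following illustrative sketch (function name ours) reproduces Example 6.7.

```python
def basic_intervals(n):
    """List all basic intervals [a, b] with 1 <= a <= b <= n + 1: for each
    layer j, an interval starts at a position a with 2^j dividing a - 1 and
    ends at b = min(n + 1, a + 2^j - 1)."""
    out = []
    j = 0
    while (1 << j) < 2 * (n + 1):        # layers j = 0, 1, ..., ceil(log(n+1))
        a = 1
        while a <= n + 1:
            out.append((a, min(n + 1, a + (1 << j) - 1)))
            a += 1 << j
        j += 1
    return out
```

For n = 7 this yields the 15 intervals of Example 6.7, layer by layer.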
Next, we provide an analogue of Lemma 4.1.
O(λ^{k−ε}). By monotonicity of T with respect to the first argument, we conclude that T(λ^c, λ) = O(λ^{k−ε}) is impossible for c ≥ 2k − 1. On the other hand, monotonicity with respect to the second argument shows that T(λ^c, λ) = O(λ^{ck/(2k−1)−ε}) is impossible for c ≤ 2k − 1. The lower bounds following from (2k − 1)-Sum and (2k + 1)-Sum turn out to meet at c = 2k − 1 + 1/(k + 1); see Fig. Consequently, we have some room between the lower bound and the upper bound of √(aλ). In the aforementioned case of a = λ^2, the upper bound is λ^{3/2}. Due to Lemma 4.6, the extra n^k term reduces to O((log A / log λ)^k), and if we measure the running time using A instead of a, it becomes a constant (k^{O(k)}). In particular, this lets us prove that the GWPM problem can be solved in O(n(z^{(k+1)/(2k+1)} + λ^k) log λ) time for any integer k = O(1), improving upon the solution of Section 5 unless z = λ^{ω(1)} or z = λ^{c±o(1)} for an odd integer c.

Algorithm for Multichoice Knapsack
Let us start by discussing the bottleneck of the algorithm of Theorem 4.7 for large λ. The problem is that the size of the classes does not let us partition every choice S into a prefix L and a suffix R whose ranks are both O(√(A_V)). Lemma 4.3 leaves us with an extra item c between L and R, and in the algorithm we append it to the prefix (while generating L^{(ℓ)}_{j−1} ⊙ C_j). We provide a workaround based on reordering the classes. Our goal is to make sure that items with large rank appear only in a few leftmost classes. For this, we guess the classes of the k items with the largest rank (in a feasible solution) and move them to the front. Since this depends on the sought feasible solution, we shall actually verify all O(n^k) possibilities. Now, our solution considers two cases. For j > k, the reordering lets us assume rank_V(c) ≤ ℓ^{1/k}, so we do not need to consider all items from C_j. For j ≤ k, on the other hand, we exploit the fact that |C_1 ⊙ · · · ⊙ C_j| ≤ λ^j ≤ λ^k.

Lemma 7.1. Let ℓ and r be positive integers such that v(L^{(ℓ)}_j) + v(R^{(r)}_{j+1}) > V for every 0 ≤ j ≤ n. Let k ∈ {1, ..., n} and suppose that S is a choice with v(S) ≤ V such that rank_V(S ∩ C_i) ≥ rank_V(S ∩ C_j) for i ≤ k < j. Then there is an index j ∈ {1, ..., n} and a decomposition S = L ∪ {c} ∪ R such that L ∈ L^{(ℓ)}_{j−1}, c ∈ C_j, and R ∈ R^{(r)}_{j+1}.

Figure 1:
A weighted sequence X of length 4 over the alphabet Σ = {a, b}.

* Work supported by the Polish Ministry of Science and Higher Education under the 'Iuventus Plus' program in 2015-2016, grant no. 0392/IP3/2015/73.
† The author is a Newton International Fellow.

Observation 3.1.
If Score(S, P) ≥ Z for a string S of length m and an m × σ scoring matrix P, then d_H(H(P), S) ≤ ⌊log M⌋, where M = NumStrings_Z(P).

Theorem 3.2.
The Profile Matching problem can be solved in O(mσ + n log M) time.

Proof. Let us bound the time complexity of the presented algorithm. The heavy string P′ can be computed in O(mσ) time. The data structure for lcp-queries in P′T can be constructed in O(n + m) time by Fact 2.1. Each query for lcp(P′[i..m], T[j..n]) can then be answered in constant time by a corresponding lcp-query in P′T, potentially truncated to the end of P′. Finally, for each position p in the text T, we consider at most ⌊log M⌋ + 1 mismatches between P′ and T, as afterwards the score s′ drops below Z due to Observation 3.1.

Procedure ProfileMatching(P, T, Z)
  m := |P|; n := |T|; Occ := ∅;
  P′ := H(P);
  Compute the data structure for lcp-queries in P′T;
  s := Σ_{j=1}^{m} P[j, P′[j]];
  for p := 1 to n − m + 1 do
    s′ := s; i := 1; j := p;
    while s′ ≥ Z and i ≤ m do
      ∆ := lcp(P′[i..m], T[j..n]);
      if i + ∆ ≤ m then s′ := s′ − P[i + ∆, P′[i + ∆]] + P[i + ∆, T[j + ∆]];
      i := i + ∆ + 1; j := j + ∆ + 1;
    if s′ ≥ Z then Occ := Occ ∪ {p};
  return Occ;
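The same lookahead-scoring idea can be illustrated in Python. The sketch below (representation and names ours) stores the profile as a list of letter-to-score dictionaries and, for simplicity, compares letters one by one instead of answering lcp-queries in O(1) time, so it demonstrates the pruning rule rather than the stated running time.

```python
def profile_matching(P, T, Z):
    """Report all positions p such that T[p..p+m-1] scores at least Z
    against the profile P.  Scoring starts from the score of the heavy
    string H(P) and is corrected at mismatches; scanning a window stops
    as soon as the score drops below Z (lookahead scoring)."""
    m = len(P)
    heavy = [max(col, key=col.get) for col in P]   # heavy string H(P)
    s = sum(col[h] for col, h in zip(P, heavy))    # score of H(P) itself
    occ = []
    for p in range(len(T) - m + 1):
        s2 = s
        for i in range(m):
            if T[p + i] != heavy[i]:               # mismatch with H(P)
                s2 += P[i].get(T[p + i], float('-inf')) - P[i][heavy[i]]
                if s2 < Z:
                    break                          # cannot recover: prune
        if s2 >= Z:
            occ.append(p)
    return occ
```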

Theorem 3.4.
The Weighted Pattern Matching problem can be solved in O(R + n log z) time with high probability by a Las-Vegas algorithm, or in O(R log² log λ + n log z) time deterministically.

Remark 3.5. Within the same complexity, one can solve the GWPM problem with a solid text.

Lemma 4.1.
The Multichoice Knapsack problem can be solved in O(N) time if n = 2 and the elements c of C_1 and C_2 are sorted by v(c).
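A linear-time two-pointer sweep behind such a statement can be sketched as follows (illustrative Python, names ours): scan C_2 by decreasing value while growing the prefix of C_1 that still fits the value budget, keeping the minimum weight seen in that prefix.

```python
def knapsack_two_classes(C1, C2, V, W):
    """Feasibility for Multichoice Knapsack with n = 2 in O(|C1| + |C2|)
    time, assuming both classes are lists of (value, weight) pairs sorted
    by value.  A pair is feasible iff some c1 with v(c1) <= V - v(c2)
    has w(c1) <= W - w(c2); keeping the minimum prefix weight suffices."""
    i, best_w1 = 0, float('inf')
    for v2, w2 in reversed(C2):            # values of C2 in decreasing order
        while i < len(C1) and C1[i][0] + v2 <= V:
            best_w1 = min(best_w1, C1[i][1])
            i += 1                         # prefix only grows as v2 decreases
        if best_w1 + w2 <= W:
            return True
    return False
```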

Lemma 4.6.
Given an instance I of the Multichoice Knapsack problem, one can compute in O(N + λ log A) time an equivalent instance with n = O(log A / log λ).

The lists are generated while v(L^{(ℓ)}_j) + v(R^{(r)}_{j+1}) ≤ V for some j (or until all the lists have been completely generated). By Fact 4.4, we stop at r = O(√(A_V λ)). Lemma 4.6 lets us assume that n = O(log A / log λ), so the running time of this phase is O(N + √(A_V λ) log A) due to Lemma 4.5. Due to Lemma 4.3, every feasible solution S admits a decomposition S = L ∪ {c} ∪ R with L ∈ L^{(ℓ)}_{j−1}, c ∈ C_j, and R ∈ R^{(r)}_{j+1}.

Lemma 4.8.
Given an instance I of the Multichoice Knapsack problem, one can compute in linear time an equivalent instance with n ≤ 2 log A and N = O(λ log A).

In the reduction from Weighted Consensus, we set v(c_{i,s}) = −log π^{(X)}_i(s) and w(c_{i,s}) = −log π^{(Y)}_i(s) for each item, whereas the thresholds are V = W = log z. It is easy to see that this reduction indeed yields an equivalent instance and that it can be implemented in linear time. By Fact 5.1, we have A ≤ z for this instance, so Theorem 4.7 yields the following result:

Corollary 5.2. The Weighted Consensus problem can be solved in O(R + √(zλ) log z) time.

The probability of the single letter at the extra position n + 1 satisfies log π^{(Y)}_{n+1} = log z_Y − log z_X provided that z_X ≥ z_Y, and symmetrically otherwise.

Lemma 6.8.
The lists L*_{a,b} and R*_{a,b} for all basic intervals [a, b] use O(√(zλ) log log z) space and can be constructed in O(√(zλ)(log log z + log λ)) time.

Proof. We compute all the lists L*_{a,b} and R*_{a,b} for consecutive layers j = 0, ..., ⌈log n⌉ of basic intervals [a, b]. For j = 0, we have L*_{a,a} = L_a and R*_{a,a} = R_a. Suppose that we wish to compute L*_{a,b} for a < b at layer j (the computation of R*_{a,b} is symmetric). Take c = a + 2^{j−1} − 1. Let us iterate through all the elements (P, p_1, p_2) of the list L*_{a,c}, extend each string P by H(V)[c+1..b], and multiply the probabilities p_1 and p_2 by Π_{i=c+1}^{b} π^{(X)}_i(H(V)[i]) and Π_{i=c+1}^{b} π^{(Y)}_i(H(V)[i]), respectively. If a common 1/z-solid prefix is obtained, it is inserted at the end of an auxiliary list L. The resulting list L is merged with L*_{c+1,b} according to the probabilities in U; the result is L*_{a,b}. Thus, we can compute L*_{a,b} in time proportional to the sum of the lengths of L*_{a,c} and L*_{c+1,b}.

For a = λ^2, the upper bound is λ^{3/2}, compared to the lower bound of λ^{4/3−ε}. Below, we show that the upper bound can be improved to meet the lower bound. More precisely, we show an algorithm whose running time is O(N + (a^{(k+1)/(2k+1)} + λ^k) log λ · n^k) for every positive integer k. Note that a^{(k+1)/(2k+1)} + λ^k = λ^{c(k+1)/(2k+1)} + λ^k, so for 2k − 1 ≤ c ≤ 2k + 1 the running time indeed matches the lower bounds up to the n^k term.

Lemma 7.1.
|C_1 ⊙ · · · ⊙ C_j| ≤ λ^j, which is at most λ^k. The combinatorial foundation of this intuition is formalized as a variant of Lemma 4.3. Lemma 7.1. Let ℓ and r be positive integers such that v(L^{(ℓ)}_j) + v(R^{(r)}_{j+1}) > V for every 0 ≤ j ≤ n.