Partially Local Multi-way Alignments

Multiple sequence alignments are an essential tool in bioinformatics and computational biology, where they are used to represent the mutual evolutionary relationships and similarities between a set of DNA, RNA, or protein sequences. More recently they have also received considerable interest in other application domains, in particular in comparative linguistics. Multiple sequence alignments can be seen as a generalization of the string-to-string edit problem to more than two strings. With the increase in the power of computational equipment, exact, dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Here we introduce a general formal framework that gives raise to a classification of partially local alignment problems. This leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.


Introduction
Intuitively, a multiple string alignment (MSA) is a rectangular arrangement of the characters that preserves the order of characters in each string, see Fig. 1 for an example. The order of the strings (rows) is arbitrary, while the columns preserve the order of the characters in the input strings. Each sequence is stretched to a common length by inserting gap characters in such a way that similarities between sequences are maximized. The concept of (multiple) alignment generalizes from biological sequences and string in general to sequences of arbitrary objects.
Global multiple alignments frequently appear as a key intermediate result in computational biology. Figure 1 shows an alignment of short peptide sequences from eleven different animal species. Algorithmic approaches to computing MSAs typically are phrased as optimization problems: given a scoring scheme for the similarities observed across an alignment column, usually evaluated as the sum of pairwise similarities of all pairs of characters (including gaps) in a column, the objective is to maximize total similarity or minimize total dissimilarity. Several variants of this problem have been shown to be NP-hard [15,27,40,42,51,77]. Multiple sequence alignments therefore are usually computed with the help of heuristic approximation algorithms, see [9] for a recent review.
Many of these methods rely on the exact solution of the 2-way alignment problems for some or all of the N (N −1)/2 pairs of input sequences, which can be computed by means of a simple dynamic programming recursions known as the Needleman-Wunsch algorithm [58] or its variants. Several conceptually different approaches are used in practise to combine the pairwise alignments to MSAs: Progressive approaches reuse pair wise alignment algorithms to combine pairwise alignments usually in a tree-like fashion [37]. Many of the software packages that are widely used in bioinformatics belong to this class, such as clustal [71], mafft [41], t-coffee [59]. Iterative methods employ additional optimization steps to improve partial alignments usually interleaved with a progressive scheme to add sequences or partial alignments [35,79]. An unusual member of this class is dialign [55], which uses near-perfect local seed alignments [16]. Consensus methods such as M-COFFEE [3] or MergeAlign [21] combine many-usually partially contradictory-alignments using voting-like procedures.
The maximum weighted trace problem [42] provides a combinatorial way to describe MSAs: given weights for pairs of positions i and j from distinct input sequences, the problem is to extract a subset with maximal total Below: semi-global/overhang alignments. The variables p 1 , p 2 and q 1 , q 2 denote the local start and end positions, respectively. They 1 ≤ p i ≤ q i ≤ n i for i = 1, 2 weight that forms a so-called trace. In essence, a trace [68] represents the matches and mismatches in the MSA. The maximum weighted trace problem serves as a convenient framework for consensus methods. The trace framework is limited to relatively simple scoring models and gaps usually are not scored at all. Practical solutions are often based on integer linear programming [44], which also makes it possible to include arbitrary gap scores. Many authors have used general optimization methods such as genetic algorithms or simulated annealing [20], which allow essentially arbitrary scoring models. Exact solutions for N -way alignments with a wide range of scoring models can be obtained by a straightforward generalization of the Needleman-Wunsch algorithm, albeit with space and time requirements scaling exponentially with the number sequences N [19]. In this contribution we focus on this class, mostly for ease of presentation.
Different variants of pairwise alignments are used in computational biology. The distinction between global, local, and semi-local pairwise alignments is standard material for introductory courses in algorithmic bioinformatics, Fig. 2. It is well known that all these problems are solved by slight variations of the same basis dynamic programming algorithms: the Needleman-Wunsch recursion [58] for the global problem and Smith-Waterman algorithm [72] for the local version. The key recursion step compares the scores of extensions of shorter alignments by a (mis)match, insertion, or deletion, N(i, j) := max{S i−1, j−1 + m(i, j); S i−1, j + γ ; S i, j−1 + γ }, where m(i, j) returns the match score for the positions i and j and γ is the gap score. In the local case, a character in one input sequence may not only be (mis)matched with a character in the other sequence, but it may also belong to a prefix or suffix that remains entirely unaligned. One say that is in the aligned or unaligned state, respectively. Since only prefixes or suffixes may remain unaligned, the transition from unaligned to aligned can occur only once. In the dynamic programming recursion this translates to an extra choice with score 0 signifying that the prefixes up to positions i and j remain unaligned [72]; thus one computes max{N(i, j); 0}. In addition, the two algorithms differ in the initialization and the entry of the S-matrix that harbours the final result, i.e., the score of the optimal global of local alignment: To account for the possibility of returning to the unaligned state, one seeks the index pair (i * , j * ) that maximizes and the S-matrix and regards the suffixes beyond i * and j * as unaligned.
Local alignments are of key practical importance in computational biology in a variety of different application scenarios. Most commonly they are not computed exactly but one relies on fast heuristics, in particular blast [6] or blat [38]. In high throughput sequencing [28], where huge numbers of short fragments are compared to a long reference with few differences expected in the best matches, index data structures such as suffix arrays are employed, as in our segemehl suite [36]. Local alignments also form the basis for whole genome alignments, for which a diverse set of integrated workflows are available, see e.g., [12,61].
The identification of conserved sequence elements, such as clusters of transcription factor binding sites or splice enhancers, has become a key task in comparative genomics, often referred to as phylogenetic footprinting. It can be formalized as a local multiple alignment problem [52,84]. In [60,63] we considered the problem of combining pre-computed local pairwise alignments to a single local MSA using a consistency approach similar to [22,56]. Alternatively, phylogenetic footprinting can be viewed as a motif discovery task, as in meme [10], which required motives to match without gaps. The substring parsimony problem [13,14,70] formalizes a more general model that also allows gaps. Schematic representation of a breakpoint alignment for a reference sequence (1), a prefix (2), and a suffix (3). Sequences (2) and (3) may or may not overlap Several variations of alignment problems share the same basic structure of the recursion but do not score some or all of the end-gaps. The simplest form of semi-global alignment is used primarily for homology search. It asks for a complete match of a small sequence in a potentially large database. The associated dynamic programming algorithm only differs in the initialization, setting scores S 0, j = 0 to allow cost-neutral deletions of the prefixes of the long, second sequence. Correspondingly, the optimal match is found as max k S n 1 ,k . Dedicated implementations exists for this task, e.g., gotohscan [34]. More general overlap alignments [39] allow free end gaps on all sequences and play a role e.g., in sequence assembly [65].
With the advances in available computing power, the exact dynamic programming algorithms have become feasible beyond pairwise alignments. The basic recursion for the simultaneous alignment of N sequences is straightforward generalization of recursion N and simply enumerates all 2 N − 1 possible patterns of gaps in the last column of a alignment ending at position i p in the pth sequences [19]. Despite the extra effort, 3-way alignments have at least occasionally found practical applications in computational biology [24,32,45,48]. In computational linguistics, 4-way alignments [73] have been used to align words from related natural languages, an approach that is feasible owing to the short sequence length.
Despite the importance of alignment problems in computational biology, and the need to distinguish global and local versions of the problem, there does not seem to be an accessible theory of partially local alignments beyond the pairwise case. The notion of "local multiple sequence alignment" typically found in the literature refers to the identification of short, conserved, usually gap free patterns [14,50]. This problem is also known as phylogenetic footprinting in applications to DNA sequences. In [74] the local multiple sequence alignment problem is phrased as finding substrings (with given minimal length in a minimal number of input sequences) to that the global alignment score of these subsequences is maximized. There are, furthermore, alignment problems of practical interest that involve complex mixtures of local and global alignments. Mitochondrial genome rearrangements, for example, can be accompanied by the duplication or loss of sequence in the vicinity of a breakpoint. A natural model for this problem considers the alignment of a reference (the ancestral state, 1 in Fig. 3) and the two sequences close to the breakpoint in the derived state (2 and 3 in Fig. 3) [4]. Beyond the breakpoint, the derived sequences are unrelated to the reference. Thus they are naturally treated as local on one side and global (anchored) at the other. Regions in which both derived fragments overlap are rewarded to increase the sensitivity for the detection of a potentially duplicated stretch of sequence. As it turns out, the corresponding alignment algorithm has been quite efficient in detecting previously undescribed tandem duplication random loss events in mitochondrial evolution [4].
In a variety of applications it is useful to consider a probabilistic version of alignments. In the pairwise case, the relation between score minimization and a corresponding "partition function version" is well understood [17,54,57,83] and in a more general context explained within the framework of Algebraic Dynamic Programming [30]. More recently probabilistic models were also studied systematically for local pairwise alignments [26]. Exact probabilistic algorithms beyond pairwise alignments seem to have received very little attention so far.
Most 3-way and 4-way alignment algorithms were designed with very specific application in mind and made no attempt to map the world of mixed global and local alignment problems. The alignment problem appearing in [4], covers just a very specific special case and pertains to a specialized question: how does mitochondrial DNA evolve close to genomic breakpoints? Mixtures of local and global alignments also arise naturally in the context of historical linguistics: Words from different but related languages can be compared by means of alignments [23,46,47,49,73]. However, some languages have specific morphological patterns such as prefixes or suffixes marking grammatical features, or use an inherited root only in a composition. In such cases, cross-language comparisons could naturally resort to alignment techniques that determine for each input sequence, whether it is to be used globally or weather  4 Example of a partially local alignment of words with the Indo-European root for "studeō". This form of alignment, we might be able to directly infer a cross-linguistic kind of stemming serving as the first step to infer a proto form. We show unaligned prefixes and suffixes in brackets. Depending on prior knowledge of the inflection structure or other peculiarities of a language, it is desirable to decide for each word individually whether it is to be aligned locally or globally. As common in graphical displays of alignments,signifies a gap only a substring is expected to match with the cognates from other languages. A discussion of this issue in terms of pairwise alignments can be found in [47]. Figure 4 gives a conceptual example.
The purpose of this contribution is to develop a concise formal framework in which partially local N -way multiple sequence alignments can be studied systematically. To this end we generalize the systematic differences and similarities of local and global alignment problems and show that they can be represented in a manner that already implies exact DP solutions for them. In our presentation we proceed stepwisely. First we introduce a compact notation for the global alignment problem and argue that (the generalizations of) semi-global and "overhang" alignments are better viewed as global alignments with modified scoring functions at the ends of the input sequences. On this basis we then give a complete classification of partially local problems and derive a DP algorithm to solve them. Subsequently we consider the problem from the point of grammars, deriving an unambiguous version that, albeit computationally even more expensive, is suitable for a full probabilistic treatment. We close with some comments on possible future developments.

Notation and Basic Properties
Let us start with a formal definition. For completeness of the presentation we also consider some of the basic properties of multiple alignments that are folklore. We include proofs where we could not find a convenient reference.

Definition 2.1
A global multiple sequence alignment (gMSA) of N sequences of objects X p of finite length p for 1 ≤ p ≤ N is a set of N strictly monotonic maps ϕ p : Proof The lower bound follows directly from strict monotonicity. The longest possible well-formed gMSAs correspond to the concatenation of the N sequences.
The index ϒ : Note that ϒ p (k) = 0 for 0 ≤ k < ϕ p (1) and thus in particular ϒ p (0) = 0. We write π here as a function of the row index in the alignment; hence it can be viewed as a vector of length L.

Lemma 2.4
Letπ and ϒ be the column pattern and index of a gMSA ϕ.
Proof For k = 1 we haveπ p (1) = 1 if and only if ϕ p (1) = 1 and thus if and only if ϕ −1 p (1) = ϒ p (1) = 1. Now consider an arbitrary column k. By definition we have The key implication of Lemma 2.4 is that bothπ and ϒ completely determine the gMSA. In the following it will be convenient to use the symbol X to denote an index set that identifies the sequences X p , i.e., p ∈ {1, . . . , |X |}. Furthermore, we denote by A(X ) the set of gMSAs that can be formed by the sequence X p . Note that A(X ) only depends on the lengths of the sequences.

Scoring Functions and Optimization
Given sequences (X p ) of input objects, not all MSAs are of equal interest. To distinguish different MSAs of the same sequences, a scoring function is used. We consider here only the case that the total score is the sum of contributions s(k) of the alignment column. Still there are many different ways how the column score s(k) depends on ϕ and X . Consider the score of the partial gMSA ϕ up to column k. This amounts to scoring an alignment of the prefixes X 1 [1, ϒ 1 (k)], X 2 [1, ϒ 2 (k)], . . . , X N [1, ϒ N (k)], or, in other words, the restriction of ϕ p to the intervals [1, ϒ p (k)] for 1 ≤ p ≤ N . We write S(k, ϒ) for this value, suppressing the explicit reference to the dependence of the score on the input sequences. Since columns are scored independently, it satisfies the recursion S(k, ϒ) = S(k − 1, ϒ) + score(k, ϒ, X ).
Similar to the exposition in [69], we fix a set of indices I ∈ N p=1 [0, p ] and a pattern π and define S π I as the best possible score of a partial alignment with its last column k constrained to the prescribed values ϒ(k) = I and π(k) = π , i.e., Let us assume that the column score depends only on the pattern π in the current column and the indices i 1 , i 2 , . . . , i p of the current column. Usually, the scoring function will only evaluate the characters and gaps observed in the current column. Subject to this constraint, the function score(π, I, X ) thus can be tailored freely to match the application at hand. For instance, there is no reason that prevents us from using a scoring function that also considers the adjacent positions in the input sequence, e.g., to score two columns with the identical character and gap patterns differently. Such scoring models are indeed often used in the context of protein alignments to incorporate e.g., knowledge on structural features [7,53,62]. There is also no need to require that the column score satisfies any other constraints, such as being composed as sum or average of pairwise comparisons, although this is often used in practical applications. There is also no need for the scoring function to satisfy any symmetries. Scores that are sensitive to permutations of the input order are e.g., used in comparative linguistics, where sound shifts between languages are directional [73], or in the analysis of ancient DNA to account for degradation by cytosine deamination in the ancient sequence only [64].
Since the last column has index I by assumption, the previous column has index I − π , where π is the pattern of the last column. Thus Here we write o for the pattern of an all-gap column, which does not occur in well-formed gMSAs. Before we proceed, it is instructive to consider the more complicated case of alignments with the naïve affine gap cost model. This scoring model cannot deal exactly with pairwisely affine gap costs [43]. It is, however, an approach that is frequently employed for N -way alignments. In this model, the score of a column depends not only on its own pattern, but also on the pattern in the previous column, since the continuation of runs of gaps is scored differently from initiating a new run of gaps. It does not account for the fact that pairs of gaps between two input sequences are not present in the corresponding pairwise alignment at all. We will briefly return to this issue below. Formally, the score of the optimal alignment is Using the same reasoning as above, yields the recursion The explicit reference to the pattern τ =π(k − 1) in the preceding column makes it possible to distinguish gap opening (π p = 0, τ p = 1) from gap extension (π p = τ p = 0). Equation (2.4) is the obvious generalization of Gotoh's algorithm [31] to more than two sequences. In general, the score contributions take the form score(τ, π, I, X ) depending on the gap pattern π of the current column, the gap pattern τ of the previous column, the multi-index I , and the actual values of the input X . We formally make the score explicitly dependent on I (rather than only the corresponding characters X I that are aligned) to allow for context-dependent scoring, which has been shown to have large benefits in particular for protein alignments [7,53,62]. We shall see below, furthermore, that the explicit dependence on I is also important when dealing with semi-global alignments. In the following we will, however, suppress the explicit reference to X to somewhat simplify the notation.

Sets of Global Alignments and Formal Grammars
Instead of directly focusing on the scores it is useful to consider the sets of gMSAs. First we observe that each index set I , by virtue of determining prefixes of sequences, also specifies the set A(X |I ) of gMSAs. The column pattern π , furthermore determines a unique alignment column. Therefore where ++ denotes the component-wise concatenation operation and c(X, I, π) denotes the alignment column with entries X p,I p if π p = 1 and gap otherwise. The base case A(I |X, o), for I = o the zero vector, is the empty alignment. More precisely, given a gMSA ϕ : N p=1 [1, J p ] → [1, L] and π p ∈ {0, 1}, we explainφ = ϕ ++ π as the functionφ : N p=1 [1, J p + π p ] → [1, L + 1] such that ϕ p (i) =φ p (i) for all i ≤ I p and ϕ p (I p + 1) = L + 1 for all p with π p = 1. Thus A(X |I ) = {ϕ ++ π |ϕ ∈ A(X |I − π)}.
In the case of alignment with naïve affine gap costs, the scoring distinguishes between opening and extension of gaps. This makes it necessary to consider for the extending column, not only its own pattern π but also the pattern τ at the end of the gMSA that is extended. This amounts to partitioning each set A(X |I ) of gMSAs in terms of the end pattern and yields the recursion (2.6) with base case A(X | o, 1), which again denotes the empty alignment. The scoring of the column c may now explicitly depend on the patterns τ of penultimate column and π of the last column. The empty alignment is assigned the end pattern 1 to enforce that leading gaps are score as gap-open. This partitioning is sufficient for affine gap costs in two species but does not allow exact affine gaps costs in general. The well-known problem is that after a double gap in pair of sequences the information about their previous alignment state is lost. We will return to exact affine gap cost model at the end of this section. In the same approximation, this partitioning supports also scoring functions that depend on overlapping bigrams, as used e.g., in blastR for alignments of RNA sequences [18] and in word alignments in phylolinguistics [11,66].
Note that in alignment model above it is not logically necessary to include the reference to τ also in the term c(X, I, τ → π). It is convenient, however, to use this expanded notation because it emphasizes the differences, e.g., between gap extension (τ i = π i = 0) and gap opening (τ i = 1, π i = 0) already in the symbol c(X, I, τ → π). Using the set-based notation here we obtain the recursions for the optimal alignment scores by optimizing of the scores of the alignment set A(X |I, π) and identifying the score of c(X, I, τ → π) with the scoring function score(τ, π, I, X ). A complete, formal proof of the correctness of Eq. (2.6) will be given in the following section in a more general setting.
This reasoning matches the formalism of algebraic dynamic programming (ADP) [30], which separates the recursion of the search space (here explicitly spelled out in terms of sets of gMSAs), the scoring algebra (here a simple addition of score contributions of prefix alignments and the score of the last column), and the choice function (optimization of the score in our case). The theory of ADP emphasizes that this separation of concerns is possible for a large class of dynamic programming problems. An important advantage for practical applications is that the ADP framework makes it possible to use the same implementation for forward recursion and backtracking steps. A recent extension even provides a generic way to construct outside algorithms and thus posterior probabilities [86]. For our purposes, the recursive construction of the state space is the most important issue.
A key idea of ADP is to describe the recursive construction of the search space in terms of a formal grammar G = (N , , P, S) comprising a finite set N of non-terminals, a finite alphabet disjoint from N usually called the terminals, a start symbol S ∈ N , and set P of production rules. Here we are only concerned with context-free grammars, where the production rules p ∈ P are of the form p : Q → α with Q ∈ N and α ∈ (N ∪ ) * . In our case, the set of terminals are the possible patterns of alignment columns π (or τ → π ) and the empty alignment ∅. For emphasis, we use angular brackets to denote terminals of the grammar that correspond to alignment columns. Note that each pattern corresponds to a different terminal. The non-terminals are formal variables that designate different types of alignments.
In the simplest case of additive scoring functions, there is only a single alignment type A and the formal language of gMSAs is produced by the simple grammar A → A π for all end patterns π as well as the termination step A → ∅. Since there is only a single non-terminal, the initialization step is trivial, i.e., A also serves as the start symbol.
In the case of naïve affine gap costs we need a more elaborate grammar that distinguishes the end patterns of the alignments. Hence we need to distinguish non-terminals A π referring to alignments with different end patterns. Again, a terminal corresponding to the empty alignment ∅ is needed to represent termination. The production are of the form A π → A τ τ → π and A π → ∅ o → π . Here we also need a start rule that accounts for the fact that the gMSA may have every π as an end pattern. This is encoded by S → A π for all patterns π .
More details on the ADP formalism for alignment problems can be found in [85]. We note that alignments correspond to regular expressions or linear grammars in the Chomsky hierarchy since terminals are emitted only on one side of a single non-terminal, here to the right. The computational complexity of the multiple alignment problem arises from the exponential increase in the number of non-terminals and thus the exponential number of production rules (in terms of the number of input sequences) that are necessary to specify the problem. The compact notation above, which uses the end patterns to index terminals and non-terminals will be very convenient to investigate the grammars and dynamic programming recursions associated with partially local alignment problems that are the main topic of this contribution. Before we proceed, however, we briefly consider the most commonly used scoring function for multiple alignment problems in some more detail.
Equation (2.6) is a good starting point to investigate just how general scoring functions may be. Let A ∈ A(X |I, λ), where λ now is an index that identifies a "subtype" of alignment. The gap pattern π is one possible choice that is relevant for naïve affine gap cost alignments. A more complex construction that supports arbitrary gap scores, is briefly discussed in the appendix. Denote the last column of A by c. Structurally, then A = B ++ c. The previous alignment B ∈ A(X |I − π, λ ) is of type λ and has sequence end points I − π determined by (I, λ) and the pattern π of the column c. The class of scoring functions we are interested in here assumes additivity There are, however, very few restrictions on the column score score(c, I, λ , λ ), which in particular can model score increments for extending a particular gap pattern or a pattern of shapes. In particular, thus, the formalism includes affine gap scores in both the naïve version and the exact form used e.g., in the alignment of alignments problem [43]. In order for dynamic programming to be applicable, we only need to ensure global subadditivity: Concatenating A = B ++ C from two shorter alignments, we require which in particular specializes to the usual condition that gap penalties must be subadditive [78]. The condition that scores are of the form Eq. (2.7) and alignments can be constructed by stepwisely appending individual columns, i.e., that the recursion conforms to a regular grammar, excludes models such as Sankoff's algorithm [67] for the simultaneous alignment and structure prediction of RNAs. The reason is that RNA structure prediction is inherently non-local, giving rise to recursive steps of the form A = pA p, where p and p correspond to alignment columns that are base-paired with each other. We refer to [25] for an introduction to this kind of models.

Sum-of-Pairs Scores
By far the most widely used scoring model for multiple sequence alignments is the sum-of-pairs score [5,33] defined as Pairwise score components expressed in terms of π and τ can be interpreted in the following way: A term of the form corresponds to a (mis)match of the letters X p,i p at position i p of sequence X p and X q,i q at position i q of sequence X q . The terms score 0 0 τ q 1 i p i q and score 1 0 1 1 i p i q evaluate as gap extension and gap open scores, respectively. Since projects of multiple sequence alignments to pairwise alignments may contain gap columns, terms of the form score The end patterns π and τ can be more complex that just the gap patterns in the last and penultimate column. The strict definition of the sum-of-pairs score requires that columns consisting of two gap are removed before scoring a pairwise subalignment. The following two examples show that this cannot be done based only on the gap pattern of the current and the previous column.
The naïve generalization of affine gap cost score given above treats − d as gap extension, while it should be scored as opening a new gap. These issues are discussed at length in the context of "aligning alignments" and shown to be resolved by characterizing alignment columns by more elaborate "shapes" that describe, for each sequence, the relative position of the immediately preceding character [43]. Another variant that requires more elaborate end patterns is the use of distinct gap types as in the case of piece-wise linear gap cost functions [82]. These problems are interesting because they emphasize that the NP-completeness of alignment problems is not just related to the exponential number of gap/non-gap patterns required in the last column. In the case of pairwise alignments of multiple alignments, there are only three types of extensions. Nevertheless the problems is NP-complete [43], because exact solutions need to keep track of an exponential number of "shapes". We briefly sketch a related encoding in "Appendix B".
The term I − π in Eq. (2.6) already accounts for situations with different types of gaps. The recursion can be generalized further to more elaborate descriptions of the subtypes of alignments, as in the example given in "Appendix B". Suppose the subtypes of alignments are characterized by some variable λ and the extension of such an alignment by a column with gap pattern π gives rise to an alignment of type λ uniquely determined by λ and π . Writing λ = λ ⊕π to express this fact, we recast the recursion in the form A(I, λ) = π λ :λ=λ ⊕π A(I − π, λ ) ++ c(π ) and score it with an arbitrary column score depending on π, λ, λ , I and the sequence data X .

End Gaps and Overhangs
In pairwise semi-global and overhang alignments it is customary to use the global alignment recursions unaltered. Overhangs on the left hand side are taken care of with the initialization of the DP matrix: one simply sets S i,0 = 0 if the deletion of a prefix of length i from the first sequences is supposed to be score-neutral. The right hand end of the alignment is handled differently. Here the "free" deletion is handled by searching for the maximal entry max i S i,n 2 to determine the best position beyond which to delete the suffix of the first sequence. While this is convenient for score maximization, this trick does not carry over to probabilistic frameworks. The problem is that S n 1 ,n 2 does not correctly score the best alignment but includes gap scores for the deletions from the first sequence also beyond the last character of the second sequence. The easy remedy is to explicitly treat the cost-neutral end gap in the scoring function. Conceptually, this is what is done in the initialization step as well.
For N -way alignment problems this opens a can of worms: in principle one might want to be able to specify for any pair of sequences whether an overhang should be cost-neutral or not. Furthermore, this choice of scoring can be made independently on the left hand end and the right hand end of the alignment. In the case of a sum-of-pairs scoring model, these choices are conveniently represented by graphs, see Fig. 5. The pairwise scores now depend explicitly on the sequences to which they apply: For instance, we have s 13 (X 1,i 1 , ∅) = 0 if i 3 = 0 or i 3 = n 3 , while a regular gap score is used for all other values of i 3 .

General Framework
In a gMSA, each input sequence is contained in its entirety. The idea of local alignments is to relax this condition, and to consider also-in the most general case-multiple alignments of arbitrary infixes of the input sequences. In order to model this case, we extend the notion of an alignment column, ϕ −1 (k) in the following manner: •} such that the following properties are satisfied: Furthermore, we say that the MSA is well-formed if, for each k ∈ [1, L] there is at least one p ∈ [1, N ] such that The three distinct gap types •, •, and ∅ are introduced to distinguish, for each alignment column k, whether (an infix of) an input sequence X p "participates" in the alignment, the aligned state μ(k) ∈ [1, p ] ∪ {∅}, or not. The unaligned state is represented by μ(k) ∈ {•, •}. The distinction between aligned and an unaligned state was the key idea behind the dynamic programming algorithms for pairwise local alignments [72]. The distinction between initial and terminal unaligned states, • and •, resp., is a technical convenience that could be omitted. Condition 3 ensures that the aligned state forms a contiguous interval of alignment columns for each input sequence X p . As a special case we also allow that sequences have no aligned states at all. In order to enforce a unique representation we stipulate that μ(0) = • to μ(i) = • for i ∈ [1, L]. In practise, this case refers to an input sequence that is so different from all others that it does not significantly match anywhere. The purpose of the "empty" alignment column 0 is the encode a definite state at the beginning of the alignment. Thus, if X p is in the aligned state from the beginning, its first element must appear in the MSA. This is stipulated by Condition 4. The final Condition 5 establishes a convention for representing in a unique manner the case that sequence X p is never aligned.
The notion of a patternπ naturally generalized from gMSAs to arbitrary MSAs: The pattern of a MSA of length L is function [1, L] → {0, 1} N such thatπ p (k) = 1 iff μ p (k) ∈ [1, p ] andπ p (k) = 0 otherwise. Again, the pattern just distinguishes cells in the alignment that contain elements of the input sequences from cells that contain a gap.
Finally, we extend the index function ϒ to MSAs. [1, p ] of an MSA of length L is given by

Definition 3.3 The index function
The value ϒ p (0) is the length of the prefix of X p that remains unaligned. Definition 3.3 does not unambiguously settle the special case that a sequence is completely deleted. In principle, this could be represented by the deletion of any prefix followed by the deletion of the remaining suffix. To avoid ambiguities, we elect-without loss of generality-to interpret empty alignment rows as the deletion of the full-length suffix. Hence, ϒ p (0) = 0 is this case and thus, consistent with Definition 3.3, ϒ p (i) = 0 for all i ∈ [0, L + 1].
One easily verifies i.e., the index for sequence X p at alignment column k is the sequence position up to which X p is represented in the alignment; this includes also unaligned prefixes that are omitted. Since ϒ p (0) = 0 whenever sequences X p starts out in the aligned state, the definition is consistent with definition of the index of gMSAs. The index function completely determines the pattern function of a MSA by virtue ofπ Thus there is a one-to-one correspondence between MSAs μ and pairs (ϒ, ) that are derived from μ.

Lemma 3.4 For fixed L, there is a bijection between MSAs μ and pairs (ϒ, ) of state and index functions that
We note for later reference that a MSA μ is gMSA if and only if the state function is zero everywhere. All resulted proved below for sets of MSAs thus also hold for gMSAs when restricted to everywhere zero state functions.
Let us now turn to the recursive construction of set of all local alignments. For this purpose we also need to formally define the extension of a MSA μ by an additional column. •}) N be a MSA of length L and let (π, ) be a pair consisting of a pattern π and a state vector . Then the concatenationμ = μ ++ (π, ) is the MSA of length L + 1 with μ(i) = μ(i) for 0 ≤ i ≤ L andμ p (L + 1) = 1 + max 1≤ j≤L μ p ( j) if π p = 1, andμ p (L + 1) = •, ∅, • depending on whether p = −∞, 0, or +∞ whenever π p = 0.
Let us write A(X |I, π, ) for the set of MSAs of prefixes of X p with length I p for p ∈ [1, N ] whose last column has pattern π and state function irrespective of their length. Instead of a single empty global alignment we now have a large set of empty local alignment that differ in length of the prefixed that have been omitted. These MSAs of length L = 0 are of the form A(X |I, o, o ), where the special pattern o features no aligned element in the zero-th column. The values of I p encode the lengths of the prefixes of the input sequences X p that are omitted. The state vector therefore must satisfies o p = −∞ whenever I p > 0, i.e., for all rows in which a prefix if X p is omitted. By definition we have o p ∈ {−∞, 0} whenever I p = 0. In the following we use the notation c(τ → π ; → ) for an alignment column with pattern π and state vector that succeeds a column with pattern τ and state vector . While it is not strictly necessary to include the information on the preceding column, it is a convenient notational device to restrict transitions between patterns and to make scoring functions context dependent. Formally, we think of c(τ → π ; → ) as a set that either contains the unique column (π, ) if the transition is allowed, or as the empty set. Note that the set-wise concatenation with the empty set yields the empty set. Depending on the scoring function not all transitions τ → π ; → might be allowed. For instance, the state function cannot change from +∞ ("omitted suffix") to 0 ("aligned") or −∞ ("omitted prefix"). The user may want to further restrict transitions. In Sect. 4, for instance, we briefly consider a model that allows changes in the state function only simultaneously for all input sequences. A(X |I, π, ) the set of all MSAs of prefixes of X p with length I p for p ∈ [1, N ] whose last column has pattern π and state function . Then 1. A(X |I, π, ) ∩ A(X |I , π , ) = ∅ implies I = I , π = π and = .

I π
A(X |I, π, ) is the set of all MSAs of X 3. The sets can be constructed recursively using 3) (2) Every alignment of prefixed ends with an index 0 ≤ I p ≤ p , and corresponding pattern π p (defined by the difference of the index function between the last two columns) and a value of the state function; hence every MSA of X is contained in one of the sets A(X |I, π, ).

The alignments of the form
(3) Let us partition the sets A(X |I, π, ) w.r.t. to the alignment length L. We first show that the length-stratified version of Eq. (3.4) is true. To this end, observe that every alignment (consisting) A with L ≥ 2 columns is the concatenation of an alignment A that is one column shorter and the final column. A ends in (I, π, ) while A has a last column characterized by (J, τ, ). From J p +π p = I p we conclude that J = I −π . Any potential restrictions on transitions between patterns and/or state function are encoded in the context dependent columns c. Hence the recursion is correct for L ≥ 2. The first column follows an empty alignment, which features the special pattern o (nothing is aligned yet) and an index J specifying the length of prefixes that are omitted in unaligned states, thus the state of the zero-th column must be p = −∞ whenever J p > 0. If J p = 0, our definition leaves us the choice between p = −∞, a "deletion" of an empty prefix, or p = 0, enforcing that the first column is treated as a regular part of an alignment. Thus Eq. The correctness of Eq. (2.6) for gMSAs now follows as an immediate corollary: gMSA are characterized as the MSAs whose state function is 0 everywhere, thus the base case reduces to alignment with pattern o and state function o p = 0 for all p, whence I p = 0. In fact, the set A(0, o, 0) contains a simple empty alignment. From a modelling point of view, it seems desirable in most contexts to exclude alignments in which left-local sequences begin with an aligned gap and right-local sequences end with an aligned gap. For this purpose we have to limit transitions (τ, → π, ) in such a way that (i) p = −∞ and p = 0 implies π p = 1; and (ii) p = 0 and p = +∞ implies τ p = 1. This is achieved by setting c(τ → π ; → ) = ∅ whenever these conditions are violated. Different restrictions may also be useful, as in the case of block alignments, see Sect. 4 below.
As in the case of gMSAs, this set-wise recursion can be re-cast in the form of a linear grammar of the form The formally simple recursion now also contains rules that stepwisely remove prefixed and suffixes in productions with = = ±∞. We note in passing that the base cases A(X |I, o, o ) can, in principle, also be generated recursively by step-wisely deleting longer prefixes. This is irrelevant for our present discussion, but relevant for convenient ADP implementations in practise [85].
Instead of using the state vector ∈ {−∞, 0, +∞} will be more convenient in the following to encode the state as two subsets A, D ⊆ {1, . . . , |X |} such that p ∈ A for p = 0 and p ∈ D whenever p = +∞.

Classification of Partially Local Alignment Problems
Partially local alignment problems are naturally classified in terms of constraints on the individual sequences X p . For each sequence we can therefore specify whether the set of alignments must contain X p in its entirety, whether a prefix or a suffix is required, or whether an arbitrary, possibly empty infix also constitutes an acceptable solution. In the language of the previous section, this amounts to specifying the state for the zero-th and the last alignment column.
It is convenient to associate these constraints directly with each input sequence X p . We say that X p is rightglobal or left-global if its first or last position must be included in every valid MSA. If X p is not right-global, it is right-local, implying that valid MSAs may exclude suffixes of X p . Similarly, prefixes my be omitted for left-local sequences. A sequence is global if it is both right-and left-global. Global sequences must be completely contained in the MSA. For local sequences, i.e., those that are both right-and left-local, any infix suffices. In the following we write R and L for the set of right-global and left-global sequences, respectively. See Fig. 6 for a more complicated example.
The sets of valid alignments for given set L and R are obtained from the general case by constraining the state of the initial and the final alignment. Valid initial conditions, i.e., alignments of length 0, are of the form A (I, o, (A, D)) and consist of all MSAs that satisfy A = L and I p = 0 for all p ∈ L. On the other hand, solutions of the partially local alignment problem are of the form A (I, π, (A, D)) such that A = R and I p = p for all p ∈ R.
Conditions 3 and 4 imply that state changes from one alignment column to the next are only possible from the unaligned • state, i.e., p ∈ X \ A ∪ D, to the aligned state A (or in the special case of empty rows directly In order to construct the supremum, consider The smallest sets satisfying both conditions are D = Since (A , D ) is a valid state, it is the unique supremum.
Given R, L ⊆ X , there is a unique sublattice with minimal element (L , ∅) and maximal element (R, X \R) that defines all allowed states (X \L , ∅) (A, D) (R, X \R) that may appear in a solution of a partially local alignment problem with left-global set L and right-global set L. The pair R, L therefore characterizes the partially local alignment problem and completely specifies the set of relevant MSAs. For an example see Fig. 7.

Explicit Recursions for Optimal Alignments
Let us briefly consider the problem of computing the optimal alignment score for a partially local alignment problem of type R, L. Denote by  ,2,3}{}   {1,2,3,4}{}   {1}{2}  {1,4}{2}   {1,2}{4}   {1,2,3}{4}   {1,3,4}{2}   {1}{2,4}   {1,3}{2}  {1,3}{2,4}  {1,2}{} {1,2,4}{} In the scoring models that are usually employed for sequence alignments, one usually assumes that scores associated with state changes (if they are not omitted altogether) are at least independent of details of the emitted alignment column, i.e., the score( . ) consists of additive terms score(X, I |τ → π) and score (A , D → A, D). In this scenario, it possible to separate the extension of the alignment without state change from state changes without extension. This amounts to transforming the recursion (3.8) into the form The first alternative amounts to the usual recursion and accounts for all 2 N − 1 possible gap patterns π of the last and the preceding alignment columns. The second term separately considers all possible state changes. Instead of considering all states that are ≺-smaller than (A, D) one could also make use of the Hasse diagram of this partial order, Fig. 7, and recurse only over the immediate predecessors of (A, D). It follows immediately that (A, D) has at most N immediate predecessors. Scoring models for local alignments are usually parameterized in such a way that state changes do not incur a score contribution. In this case Eq. (3.9) can be simplified further Here we have omitted the possibility to attach scores to the transitions between states since this seems to be the most natural choice in the light of commonly used local alignment algorithms. One may add, however, a scoring term score (A , D → A, D) to the second case in Eq. (3.10) as in Eq. (3.9). Such a score may of course depend on I and X , and may be used e.g., to favor the aligned portions to start with a particular nucleotide or to end a given sequence motif. Since only transitions from immediate predecessors are admitted, the scoring model must assume additive contributions from each sequence that is "activated" or "retired". In contrast, Eq. In addition we need a starting rule of the form S → (R, X \R, τ ) for all τ that generates all alignments with the correct state (R, X \R) at the right end and all corresponding end patterns.

Explicit Recursions for Partition Functions of Alignments
The partition function over the set of alignments A(X |I, π, (A, D)) is defined as sum of the Boltzmann factors of their scores, i.e., exp(β score(μ)) (3.12) where β is the "inverse temperature" governing the relative importance of alignments with different scores. As an immediate consequence of Theorem 3.6 we obtain The rules of ADP would suggest that a partition function could be obtained from a "partition function version" of the recursion (3.9), which takes the form However, semantically, this is not what one wants to compute in a probabilistic setting because the grammar of Eq. (3.11) is ambiguous, i.e., it allows multiple distinct parses for the same alignment [29]. First, alignments in which a sequence X p that is both left and right local is completely deleted are represented multiple times, because the direct transition from not yet active to done may occur in every alignment step and all these possibilities are accounted for as separate alignments. This contradicts our intuition for the conditions under which two alignments should be regarded as the same. Of course this "overcounting" has no effect when the goal is to maximize the score. It does affect the result, however, when the task is compute probabilities over ensembles of alignments.
A second, even worse ambiguity arises from the fact that we allowed ourselves to perform a sequence state changes without emitting any alignment column in between. This amounts to multiple paths in the lattice of Fig. 7 that lead to the same overall state change. To remedy this problem, we need to restructure the grammar. The basic idea is to ensure that every production emits a column, and makes at most one state change at the same time. Given the order structure of the alignment columns and the states, this ensures unambiguity. This modification comes at a cost, however. Now we have to allow any combination of state changes in a single step-so as to count it only once. Thus, in general there are exponentially many (in N ) state transitions that need to be considered. The resulting grammar is of the form This grammar gives rise to the correct recursion (3.13).
To ensure that this grammar is unambiguous we need to enforce that a sequence X p can be activated or retired only if one of its characters is emitted in the column c π ←τ that is emitted in the same step. This restriction now requires a separate treatment of sequences that remain completely unaligned, i.e., the first type of ambiguity mentioned above. Note that this is implicit in the definition of MSAs μ, which allows a transition from • to • only after the zero-th column of the alignment.

Block-Local N-Way Alignments
The general treatment of partially local alignments suggests that further variations on the theme may also be of interest. In phylogenetic footprinting the goal is to find intervals of common length k such that their gap-less alignment maximizes a score [14]. In the present setting it would be natural to relax the no-gaps condition. Instead, one may ask for intervals [b p , e p ] for all p ∈ X such that the global alignment of the infixes/substrings X p [b p , e p ] maximizes the score. We call this the block alignment problem. This is can be seen as variant of the partial local N -way alignment where (1) all sequences are local at both ends, and (2) the state transitions are restricted to the rather trivial Hasse diagram (∅, ∅) → (X, ∅) → (∅, X ), i.e., all sequences are activated and then retired at common points in time. Naturally, end-gaps should be costly in this setting, i.e., in contrast to the unambiguous grammars discussed in the previous section we would also allow the first and last rows to contain gaps.

Concluding Remarks
In this contribution we have outlined a general formal framework towards treating partial, complex locality constraints in sequence alignment. Our approach was guided by exact DP algorithms for this class of problems. Of course the resulting algorithms are exponential in the number of sequences; after all they use the same recursive scheme as the well-known DP solution for global N -way MSA. This is not a fundamental shortcoming, however, since (a) the (decision version of the) global N -way sequence alignment problem is well known to be NP-hard [40,77], and (b) the problem is tractable in practice and of relevance for practical applications for small numbers of input sequences. Recent results from complexity theory, furthermore, strongly indicate that N -way MSA problems cannot be solved in O(n N − ) time for input sequences of length n unless common believed conjectures in the theory of computational complexity are false [1,2]. That is, the dynamic programming solutions are optimal (up to logarithmic factors e.g., by means of the well-known "Four Russians" methods [8]).
Much of the notational complications in this contribution are caused by the need to show that very general classes of scoring models are compatible with the notion of partially local alignments. This separation of concerns should in particular become practically useful when the grammar formalism is used to construct DP algorithms. The alignment types λ then simply become classes in a classified DP framework.
At present, we do not provide an implementation. A special case, however, is in practical use for breakpoint determination [4]. In [85] it has been demonstrated that N -way global alignments as well a variety of complex alignment algorithms can be constructed quite easily with the help of grammar products. In this context a general implementation in Haskell has been provided. A complete implementation of the framework outline in this contribution, however, requires some further developments, in particular an extension of the underlying theory to corresponding products of the scoring algebras and the development of a principled manner to construct the lattice of alignment states. While it is of course possible to implement models for a small number of sequences explicitly, this quickly becomes tedious and prone to errors. For alignment models are more complex than the example outlined in "Appendix A" a machinery that generates the code automatically along the line of [85] will be necessary in practice.
In practical applications we expect that partially local alignments will in general appear as components in workflows. This is indeed the case in application described in [4]: there, the putative location of a breakpoint is located approximately from a comparison of the gene orders in a pair of mitogenomes and the sequencing containing this region is extracted. The partially local alignment is then applied to these preprocessed input sequences. It is unlikely that the exact DP solutions to partially local alignments will be of much practical use for more than 3 input sequences-at least in computational biology application where sequences are typically long. This may well turn out differently in alignment applications to language data, such as the one briefly outlined in the introduction. At the very least, they may serve for comparisons in benchmarking heuristics for partially local alignment tasks. The grammar versions can in principle also be used to generate benchmark data for such problems. Currently test sets for benchmarking alignments such as BaliBase [75] are geared towards global alignments even though in some cases of the performance on particular blocks of conserved sequence enters the performance metrics [76].
The theoretical framework developed here is not limited to pure sequence alignments. It also pertains, for instance, to the concurrent alignments of RNA secondary structures. The Sankoff algorithm [67] also comes with local and semi-global variants, see e.g., [80,81] that could in principle be generalized to partially local N -way problems. Of course, the resulting algorithms will be even more expensive, scaling with O(n 3N ) rather than O(n N ) with sequence length due to the cubic scaling of the RNA folding part. Nevertheless it is at least of theoretical interest that the global alignment part, above denoted by A I ( . ), can be replaced without affecting the handling of partial locality constraints by any scheme that produces global alignments.
The Hasse diagram of this problem reads In the recursion we use the functions score() to evaluate the pairwise contributions to the sum-of-pairs score, as well as lookup() to evaluate the contribution of pairwise sequence (mis)matches. Their definitions for the specific application will be defined at the end of the section.
The Since we can reach S {1,3},{2} also from S {1,2,3},∅ , we will first define the latter, 3-dimensional matrix. The bases cases are the following: For the first and third sequence we get the following recursions, which in contrast to the above also include the activation of the third sequence: Since we "look back" two steps, it is necessary to initialize the first the diagonal step explicitly: S {1,2,3},∅, 1,1,1 = max S {1,2,3},∅ 0,0,0 + lookup(x 1 (1), x 2 (1)) + lookup(x 1 (1), x 3 (1)) + lookup(x 2 (1), x 3 (1)), We are now in the position to consider the general case. As in some of the pairwise alignments above, we have to track the activation of the third sequence. This may only occur when sequence 3 contributes a letter to the current alignment column, not when a gap is inserted for this sequence. Note that this handling of transitions to and from the active state is a modeling decision: it may be handled differently, although it seems most natural to assume that the activation of a sequence has to occur together with the insertion of a character to the alignment.  Fig. 8 Encoding of gap patterns for very general gap cost models. Occupied positions are shown in blue, gaps in white. Only three rows p, q, and r of a large MSA are show. The terminal gap between p and q is shown by the red bar; it corresponds to a deletion of the two rightmost blue boxes from p relative to q. Similarly, the blue part above the green bar are deleted from p relative to r . Comparing q and r , the last blue box is deleted from r , that terminal gap matches the red bar. The last column shows a possible extension of the alignment. In this case, the deletion from p to q is extended, while the gap in r relative to p (and q) is closed by a (mis)match shown in red (color figure online) Note the cases λ pq = 0 are consistently defined in the overlapping cases. Denoting this pairwise rule as λ = λ ⊕π allows us to write A(X, I, λ) = π λ :λ ⊕π =λ A(X, I − π, λ ) ++ c(X, I, π, λ , λ) (5. 2) The representation in particular supports sum-of-pair scores with arbitrary (subadditive) pairwise gap scores, since the type λ provides information on the true length of the pairwise gaps. We have made no effort here to devise a an encoding that is most efficient in practise. For instance, in an affine gap cost setting, the diagonal elements of λ are not used.