Abstract
Partial Order Alignment (POA) was introduced by Lee et al. in 2002 to allow the alignment of a string to a graph-like structure representing a set of aligned strings (a Multiple Sequence Alignment, MSA). However, the POA edit transcript (the sequence of edit operations that describes the alignment) does not reflect the possible elasticity of the MSA (different gap sizes in the aligned strings), leaving room for misalignment and its propagation in progressive MSA. Elastic-Degenerate Strings (ED-strings) are strings that can represent the outcome of an MSA by highlighting gaps and variants as a list of strings that can differ in size and that can include the empty string. In this paper, we define a method that optimally aligns a string to an ED-string, the latter compactly representing an MSA, overcoming the ambiguity in the POA edit transcript while maintaining its time and space complexity.
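To make the notion of aligning a string to an ED-string concrete, the following is a minimal sketch and not the method defined in the paper. It assumes an ED-string is given as a list of segments, each segment being a non-empty list of variant strings (possibly including the empty string), and it computes the smallest global alignment cost between a pattern and any string obtained by picking one variant per segment. The function names and the unit costs (match 0, mismatch 1, gap 1) are assumptions made for illustration only.

```python
def extend_row(row, pattern, variant, match=0, mismatch=1, gap=1):
    """Advance an edit-distance row over each character of `variant`.

    row[j] is the best cost of aligning pattern[:j] to the text read so far.
    """
    for c in variant:
        new = [row[0] + gap]  # pattern prefix is empty: c is unmatched
        for j in range(1, len(row)):
            diag = row[j - 1] + (match if pattern[j - 1] == c else mismatch)
            new.append(min(diag, row[j] + gap, new[j - 1] + gap))
        row = new
    return row  # unchanged when `variant` is the empty string


def align_to_ed_string(pattern, ed_string, match=0, mismatch=1, gap=1):
    """Minimum cost of aligning `pattern` to any string spelled by `ed_string`."""
    row = [j * gap for j in range(len(pattern) + 1)]
    for segment in ed_string:
        # Try every variant of the segment, then keep the column-wise best.
        candidates = [
            extend_row(row, pattern, v, match, mismatch, gap) for v in segment
        ]
        row = [min(column) for column in zip(*candidates)]
    return row[-1]


# Hypothetical ED-string spelling "ACGT" and "ACT" (the middle segment is elastic).
ed = [["AC"], ["G", ""], ["T"]]
print(align_to_ed_string("ACT", ed))
print(align_to_ed_string("AGT", ed))
```

Taking the column-wise minimum after each segment is what lets the empty variant act as a gap of variable size: the row simply passes through unchanged when a segment is skipped.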
Notes
- 1.
- 2. The method can generalize to the case of cost \(s>0\) for skips.
- 3. In an edit distance computation framework, a match typically has null cost in order to fulfill the metric requirement that a string has zero distance to itself; for this reason, in our examples we assume \(a=0\). However, the dynamic programming method we design also works when one wants to compute a similarity score rather than a distance (it suffices to adapt the penalty scores and seek the maximum instead of the minimum). Therefore, in our problem statement as well as in the recurrence formula that describes our algorithm, we parametrize the score of a match with \(a\).
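The remark in note 3 can be illustrated with a small sketch on two plain strings. This is an assumption-laden illustration, not the recurrence of the paper: the function name `align_score`, the parameter names, and the chosen score values are hypothetical. The same dynamic programming table yields a distance when the match score is \(a=0\) and the minimum is sought, and a similarity when \(a>0\), the penalties are negative, and the maximum is sought.

```python
def align_score(x, y, a, mismatch, gap, best):
    """Global alignment score of x and y.

    a        -- score of a match (0 for a distance, positive for a similarity)
    mismatch -- score of a mismatch
    gap      -- score of an insertion or deletion
    best     -- `min` for a distance, `max` for a similarity
    """
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (a if x[i - 1] == y[j - 1] else mismatch)
            cur.append(best(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


# Distance setting: a = 0, positive penalties, minimise.
print(align_score("ACGT", "ACGT", a=0, mismatch=1, gap=1, best=min))
# Similarity setting: a = 1, negative penalties, maximise.
print(align_score("ACGT", "AGT", a=1, mismatch=-1, gap=-1, best=max))
```

With \(a=0\) and the minimum, a string aligned to itself scores zero, which is the metric requirement mentioned in the note.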
References
Cisłak, A., Grabowski, S.: SOPanG2: online searching over a pan-genome without false positives. arXiv:2004.03033 [cs] (2020)
Cisłak, A., Grabowski, S., Holub, J.: SOPanG: online text searching over a pan-genome. Bioinformatics 34(24), 4290–4292 (2018)
Loytynoja, A.L., Goldman, N.: An algorithm for progressive multiple alignment of sequences with insertions. Proc. Natl. Acad. Sci. 102(30), 10557–10562 (2005)
Aoyama, K., Nakashima, Y., I, T., Inenaga, S., Bannai, H., Takeda, M.: Faster online elastic degenerate string matching. In: 29th Annual Symposium on Combinatorial Pattern Matching (CPM). LIPIcs, vol. 105 (2018)
Darby, C.A., Gaddipati, R., Schatz, M.C., Langmead, B.: Vargas: heuristic-free alignment for assessing linear and graph read aligners. Bioinformatics 36(12), 3712–3718 (2020)
Grasso, C., Lee, C.: Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 20(10), 1546–1556 (2004)
Lee, C., Grasso, C., Sharlow, M.F.: Multiple sequence alignment using partial order graphs. Bioinformatics 18(3), 452–464 (2002)
The Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Brief. Bioinform. 19(1), 118–135 (2018)
Iliopoulos, C.S., Kundu, R., Pissis, S.P.: Efficient pattern matching in elastic-degenerate texts. In: Drewes, F., Martín-Vide, C., Truthe, B. (eds.) LATA 2017. LNCS, vol. 10168, pp. 131–142. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-53733-7_9
Feng, D.-F., Doolittle, R.F.: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25(4), 351–360 (1987)
Higgins, D.G., Sharp, P.M.: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1), 237–244 (1988)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Birmelé, E., et al.: Efficient bubble enumeration in directed graphs. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 118–129. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_13
Bernardini, G., Pisanti, N., Pissis, S.P., Rosone, G.: Pattern matching on elastic-degenerate text with errors. In: Fici, G., Sciortino, M., Venturini, R. (eds.) SPIRE 2017. LNCS, vol. 10508, pp. 74–90. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67428-5_7
Bernardini, G., Pisanti, N., Pissis, S.P., Rosone, G.: Approximate pattern matching on elastic-degenerate text. Theor. Comput. Sci. 812, 109–122 (2020)
Bernardini, G., Gawrychowski, P., Pisanti, N., Pissis, S.P., Rosone, G.: Even faster elastic-degenerate string matching via fast matrix multiplication. In: 46th International Colloquium on Automata, Languages, and Programming (ICALP). LIPIcs, vol. 132, pp. 21:1–21:15 (2019)
Bernardini, G., Gawrychowski, P., Pisanti, N., Pissis, S.P., Rosone, G.: Elastic-degenerate string matching via fast matrix multiplication. SIAM J. Comput. 51(3), 549–576 (2022)
Li, H., Feng, X., Chu, C.: The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020)
Eizenga, J.M., et al.: Efficient dynamic variation graphs. Bioinformatics 36(21), 5139–5144 (2021)
Alzamel, M., et al.: Degenerate string comparison and applications. In: 18th International Workshop on Algorithms in Bioinformatics (WABI). LIPIcs, vol. 113, pp. 21:1–21:14 (2018)
Alzamel, M., et al.: Comparing degenerate strings. Fundamenta Informaticae 175(1–4), 41–58 (2020)
Rautiainen, M., Marschall, T.: GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020)
Mwaniki, N.M., Garrison, E., Pisanti, N.: Fast exact string to d-texts alignments. CoRR, abs/2206.03242 (2022)
Gotoh, O.: An improved algorithm for matching biological sequences. J. Mol. Biol. 162(3), 705–708 (1982)
Grossi, R., et al.: On-line pattern matching on similar texts. In: 28th Annual Symposium on Combinatorial Pattern Matching (CPM). LIPIcs, vol. 78, pp. 9:1–9:14 (2017)
Grossi, R., et al.: Circular sequence comparison: algorithms and applications. Algorithms Mol. Biol. 11, 12 (2016)
Vaser, R., Sović, I., Nagarajan, N., Šikić, M.: Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27(5), 737–746 (2017)
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
Carletti, V., Foggia, P., Garrison, E., Greco, L., Ritrovato, P., Vento, M.: Graph-based representations for supporting genome data analysis and visualization: opportunities and challenges. In: Conte, D., Ramel, J.-Y., Foggia, P. (eds.) GbRPR 2019. LNCS, vol. 11510, pp. 237–246. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20081-7_23
Gao, Y., Liu, Y., Ma, Y., Liu, B., Wang, Y., Xing, Y.: abPOA: an SIMD-based C library for fast partial order alignment using adaptive band. bioRxiv (2020)
Acknowledgment
This work is part of the ALPACA project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 956229.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mwaniki, N.M., Pisanti, N. (2022). Optimal Sequence Alignment to ED-Strings. In: Bansal, M.S., Cai, Z., Mangul, S. (eds) Bioinformatics Research and Applications. ISBRA 2022. Lecture Notes in Computer Science, vol 13760. Springer, Cham. https://doi.org/10.1007/978-3-031-23198-8_19
DOI: https://doi.org/10.1007/978-3-031-23198-8_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23197-1
Online ISBN: 978-3-031-23198-8
eBook Packages: Computer Science, Computer Science (R0)