Abstract
Aligning a sequence to a walk in a labeled graph is a problem of fundamental importance to Computational Biology. For finding a walk in an arbitrary graph with |E| edges that exactly matches a pattern of length m, a lower bound based on the Strong Exponential Time Hypothesis (SETH) implies that an algorithm significantly faster than \(\mathcal {O}(|E|m)\) time is unlikely [Equi et al., ICALP 2019]. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time [Bowe et al., WABI 2012]. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is again solvable in \(\mathcal {O}(|E|m)\) time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets [Jain et al., RECOMB 2019]. These results hold even when edits are restricted to substitutions only. Despite the popularity of de Bruijn graphs in Computational Biology, the complexity of approximate pattern matching on de Bruijn graphs remained open. We investigate this problem and show that the properties that make de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions-only case with alphabet size four. Specifically, we prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. In addition, we demonstrate that an algorithm significantly faster than \(\mathcal {O}(|E|m)\) is unlikely for de Bruijn graphs in the case where only substitutions are allowed to the pattern. This stands in contrast to pattern-to-text matching, where exact matching is solvable in linear time, as on de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic \(\tilde{O}(n\sqrt{m})\) time, where n is the text’s length [Abrahamson, SIAM J. Computing 1987].
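To make the quadratic baseline concrete, the following is a minimal sketch (not the authors' construction) of the standard dynamic program for the case where substitutions are allowed only to the pattern. It assumes the de Bruijn graph is given as a set of k-mers, with an edge from u to v whenever the last k−1 symbols of u equal the first k−1 symbols of v, and that a walk spells its first k-mer followed by the last symbol of each later k-mer. The function names `de_bruijn_predecessors` and `min_substitutions` are hypothetical, chosen only for this illustration.

```python
import math
from collections import defaultdict


def de_bruijn_predecessors(kmers):
    """Map each k-mer v to the k-mers u with u[1:] == v[:-1] (edges u -> v)."""
    kmers = set(kmers)
    by_suffix = defaultdict(list)
    for u in kmers:
        by_suffix[u[1:]].append(u)
    return {v: by_suffix.get(v[:-1], []) for v in kmers}


def min_substitutions(kmers, pattern):
    """Fewest substitutions to `pattern` so that it is spelled by some walk.

    Assumption for this sketch: all k-mers share one length k and
    len(pattern) >= k. Returns math.inf when no walk is long enough.
    """
    preds = de_bruijn_predecessors(kmers)
    if not preds:
        raise ValueError("need at least one k-mer")
    k = len(next(iter(preds)))
    if any(len(v) != k for v in preds) or len(pattern) < k:
        raise ValueError("k-mers must share length k <= len(pattern)")

    # cost[v]: fewest mismatches over walks ending at v that spell a string
    # aligned with the pattern prefix processed so far.
    cost = {v: sum(a != b for a, b in zip(pattern[:k], v)) for v in preds}

    # Each further pattern symbol extends every walk by one edge, so one
    # round touches every vertex and every edge exactly once.
    for symbol in pattern[k:]:
        nxt = {}
        for v, incoming in preds.items():
            best = min((cost[u] for u in incoming), default=math.inf)
            nxt[v] = best + (symbol != v[-1])
        cost = nxt
    return min(cost.values())


if __name__ == "__main__":
    graph = ["ACG", "CGT", "GTA", "TAC"]  # a single cycle of 3-mers
    print(min_substitutions(graph, "ACGAAC"))
```

Each of the m−k rounds scans all vertices and edges, so under these assumptions the sketch runs in \(\mathcal {O}(k|V| + (m-k)(|V|+|E|))\) time, which is the quadratic behaviour that the conditional lower bound suggests cannot be substantially improved. Comparing the returned value against a threshold gives the decision version of the problem. The sketch does not cover substitutions to the graph, the variant the abstract shows to be NP-complete.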
References
Abboud, A., Backurs, A., Hansen, T.D., Williams, V.V., Zamir, O.: Subtree isomorphism revisited. ACM Trans. Algorithms 14(3), 27:1–27:23 (2018). https://doi.org/10.1145/3093239
Abrahamson, K.R.: Generalized string matching. SIAM J. Comput. 16(6), 1039–1051 (1987). https://doi.org/10.1137/0216067
Alanko, J., D’Agostino, G., Policriti, A., Prezza, N.: Wheeler languages. CoRR abs/2002.10303 (2020). https://arxiv.org/abs/2002.10303
Alanko, J.N., Gagie, T., Navarro, G., Benkner, L.S.: Tunneling on Wheeler graphs. In: Data Compression Conference, DCC 2019, Snowbird, UT, USA, 26–29 March 2019. pp. 122–131 (2019). https://doi.org/10.1109/DCC.2019.00020
Almodaresi, F., Sarkar, H., Srivastava, A., Patro, R.: A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34(13), i169–i177 (2018). https://doi.org/10.1093/bioinformatics/bty292
Amir, A., Lewenstein, M., Lewenstein, N.: Pattern matching in hypertext. J. Algorithms 35(1), 82–99 (2000). https://doi.org/10.1006/jagm.1999.1063
Backurs, A., Indyk, P.: Which regular expression patterns are hard to match? In: IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9–11 October 2016, pp. 457–466. Hyatt Regency, New Brunswick (2016). https://doi.org/10.1109/FOCS.2016.56
Benoit, G., et al.: Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinform. 16, 288:1–288:14 (2015). https://doi.org/10.1186/s12859-015-0709-7
Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. J. Comput. Biol. 22(5), 336–352 (2015). https://doi.org/10.1089/cmb.2014.0160
Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 8, 22 (2013). https://doi.org/10.1186/1748-7188-8-22
Dondi, R., Mauri, G., Zoppis, I.: Complexity issues of string to graph approximate matching. In: Leporati, A., Martín-Vide, C., Shapira, D., Zandron, C. (eds.) LATA 2020. LNCS, vol. 12038, pp. 248–259. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-40608-0_17
Egidi, L., Louza, F.A., Manzini, G.: Space efficient merging of de Bruijn graphs and Wheeler graphs. CoRR abs/2009.03675 (2020). https://arxiv.org/abs/2009.03675
Equi, M., Grossi, R., Mäkinen, V., Tomescu, A.I.: On the complexity of string matching for graphs. In: 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, 9–12 July 2019, Patras, Greece. pp. 55:1–55:15 (2019). https://doi.org/10.4230/LIPIcs.ICALP.2019.55
Flick, P., Jain, C., Pan, T., Aluru, S.: Reprint of “a parallel connectivity algorithm for de Bruijn graphs in metagenomic applications”. Parallel Comput. 70, 54–65 (2017). https://doi.org/10.1016/j.parco.2017.09.002
Gagie, T.: \(r\)-indexing Wheeler graphs. CoRR abs/2101.12341 (2021). https://arxiv.org/abs/2101.12341
Gagie, T., Manzini, G., Sirén, J.: Wheeler graphs: a framework for BWT-based data structures. Theor. Comput. Sci. 698, 67–78 (2017). https://doi.org/10.1016/j.tcs.2017.06.016
Georganas, E., Buluç, A., Chapman, J., Oliker, L., Rokhsar, D., Yelick, K.A.: Parallel de Bruijn graph construction and traversal for de novo genome assembly. In: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014, New Orleans, LA, USA, 16–21 November 2014. pp. 437–448 (2014). https://doi.org/10.1109/SC.2014.41
Gibney, D.: An efficient elastic-degenerate text index? not likely. In: Boucher, C., Thankachan, S.V. (eds.) SPIRE 2020. LNCS, vol. 12303, pp. 76–88. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59212-7_6
Gibney, D., Hoppenworth, G., Thankachan, S.V.: Simple reductions from Formula-SAT to pattern matching on labeled graphs and subtree isomorphism. In: 4th Symposium on Simplicity in Algorithms, SOSA 2021, Virtual Conference, 11–12 January 2021. pp. 232–242 (2021). https://doi.org/10.1137/1.9781611976496.26
Gibney, D., Thankachan, S.V.: On the hardness and inapproximability of recognizing Wheeler graphs. In: 27th Annual European Symposium on Algorithms, ESA 2019, 9–11 September 2019, Munich/Garching, Germany. pp. 51:1–51:16 (2019). https://doi.org/10.4230/LIPIcs.ESA.2019.51
Gibney, D., Thankachan, S.V., Aluru, S.: The complexity of approximate pattern matching on de Bruijn graphs (2022)
Heydari, M., Miclotte, G., de Peer, Y.V., Fostier, J.: BrownieAligner: accurate alignment of Illumina sequencing data to de Bruijn graphs. BMC Bioinform. 19(1), 311:1–311:10 (2018). https://doi.org/10.1186/s12859-018-2319-7
Holley, G., Peterlongo, P.: BlastGraph: intensive approximate pattern matching in string graphs and de-Bruijn graphs. In: PSC 2012 (2012)
Holley, G., Wittler, R., Stoye, J., Hach, F.: Dynamic alignment-free and reference-free read compression. J. Comput. Biol. 25(7), 825–836 (2018). https://doi.org/10.1089/cmb.2018.0068
Hoppenworth, G., Bentley, J.W., Gibney, D., Thankachan, S.V.: The fine-grained complexity of median and center string problems under edit distance. In: 28th Annual European Symposium on Algorithms, ESA 2020, 7–9 September 2020, Pisa, Italy (Virtual Conference). pp. 61:1–61:19 (2020). https://doi.org/10.4230/LIPIcs.ESA.2020.61
Jain, C., Zhang, H., Gao, Yu., Aluru, S.: On the complexity of sequence to graph alignment. In: Cowen, L.J. (ed.) RECOMB 2019. LNCS, vol. 11467, pp. 85–100. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-17083-7_6
Kamal, M.S., Parvin, S., Ashour, A.S., Shi, F., Dey, N.: De-Bruijn graph with MapReduce framework towards metagenomic data classification. Int. J. Inf. Technol. 9(1), 59–75 (2017)
Kapun, E., Tsarev, F.: On NP-hardness of the paired de Bruijn sound cycle problem. In: Darling, A., Stoye, J. (eds.) WABI 2013. LNCS, vol. 8126, pp. 59–69. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40453-5_6
Kavya, V.N.S., Tayal, K., Srinivasan, R., Sivadasan, N.: Sequence alignment on directed graphs. J. Comput. Biol. 26(1), 53–67 (2019). https://doi.org/10.1089/cmb.2017.0264
Li, D., Liu, C., Luo, R., Sadakane, K., Lam, T.W.: MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31(10), 1674–1676 (2015). https://doi.org/10.1093/bioinformatics/btv033
Limasset, A., Cazaux, B., Rivals, E., Peterlongo, P.: Read mapping on de Bruijn graphs. BMC Bioinform. 17, 237 (2016). https://doi.org/10.1186/s12859-016-1103-9
Limasset, A., Flot, J., Peterlongo, P.: Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs. Bioinformatics 36(2), 651 (2020). https://doi.org/10.1093/bioinformatics/btz548
Lin, Y., Shen, M.W., Yuan, J., Chaisson, M., Pevzner, P.A.: Assembly of long error-prone reads using de Bruijn graphs. In: Proceedings of the Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Santa Monica, CA, USA, 17–21 April 2016, p. 265 (2016). https://link.springer.com/content/pdf/bbm%3A978-3-319-31957-5%2F1.pdf
Liu, B., Guo, H., Brudno, M., Wang, Y.: deBGA: read alignment with de Bruijn graph-based seed and extension. Bioinformatics 32(21), 3224–3232 (2016). https://doi.org/10.1093/bioinformatics/btw371
Morisse, P., Lecroq, T., Lefebvre, A.: Hybrid correction of highly noisy long reads using a variable-order de Bruijn graph. Bioinformatics 34(24), 4213–4222 (2018). https://doi.org/10.1093/bioinformatics/bty521
Navarro, G.: Improved approximate pattern matching on hypertext. Theor. Comput. Sci. 237(1–2), 455–463 (2000). https://doi.org/10.1016/S0304-3975(99)00333-3
Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J.M., Brown, C.T.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. USA 109(33), 13272–13277 (2012). https://doi.org/10.1073/pnas.1121464109
Peng, Y., Leung, H.C.M., Yiu, S., Chin, F.Y.L.: IDBA - a practical iterative de Bruijn graph de novo assembler. In: Proceedings of the 14th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2010, Lisbon, Portugal, 25–28 April 2010. pp. 426–440 (2010). https://doi.org/10.1007/978-3-642-12683-3_28
Peng, Y., Leung, H.C.M., Yiu, S., Lv, M., Zhu, X., Chin, F.Y.L.: IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels. Bioinformatics 29(13), 326–334 (2013). https://doi.org/10.1093/bioinformatics/btt219
Pevzner, P.A.: 1-tuple DNA sequencing: computer analysis. J. Biomol. Struc. Dyn. 7(1), 63–73 (1989)
Plesník, J.: The NP-completeness of the Hamiltonian cycle problem in planar digraphs with degree bound two. Inf. Process. Lett. 8(4), 199–201 (1979). https://doi.org/10.1016/0020-0190(79)90023-1
Rautiainen, M., Marschall, T.: Aligning sequences to general graphs in O(V + mE) time. bioRxiv p. 216127 (2017)
Ren, X., et al.: Evaluating de Bruijn graph assemblers on 454 transcriptomic data. PLoS ONE 7(12), e51188 (2012)
Williams, V.V.: Hardness of easy problems: basing hardness on popular conjectures such as the strong exponential time hypothesis (invited talk). In: 10th International Symposium on Parameterized and Exact Computation, IPEC 2015, 16–18 September 2015, Patras, Greece. pp. 17–29 (2015). https://doi.org/10.4230/LIPIcs.IPEC.2015.17
Ye, Y., Tang, H.: Utilizing de Bruijn graph of metagenome assembly for metatranscriptome analysis. Bioinformatics 32(7), 1001–1008 (2016). https://doi.org/10.1093/bioinformatics/btv510
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)
Acknowledgement
This research is supported in part by the U.S. National Science Foundation (NSF) grants CCF-1704552, CCF-1816027, CCF-2112643, and CCF-2146003.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gibney, D., Thankachan, S.V., Aluru, S. (2022). The Complexity of Approximate Pattern Matching on de Bruijn Graphs. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science, vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_16
DOI: https://doi.org/10.1007/978-3-031-04749-7_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer Science, Computer Science (R0)