Abstract
Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices \(j \ne i\) such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous work on this problem has focused on heuristic approaches that compute a rough approximation of the result, or on the case \(k=1\). We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in \(\mathcal {O}(n \min \{m^k,\log ^{k+1} n\})\) time and \(\mathcal {O}(n)\) space for \(k=\mathcal {O}(1)\). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show \(\mathcal {O}(n^2)\)-time algorithms to compute all results for a fixed m and all \(k=0,\ldots ,m\), or for a fixed k and all \(m=k,\ldots ,n-1\). Finally, we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for \(k,m = \varTheta (\log n)\) unless the Strong Exponential Time Hypothesis fails.
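To make the problem definition concrete, the following is a minimal brute-force sketch that computes the (k, m)-mappability table directly from the definition. It is not one of the algorithms of the paper: it runs in \(\mathcal {O}(n^2 m)\) time and only serves as a reference for what the table contains. The function name `mappability` and the 0-based indexing are assumptions made for this illustration.

```python
def mappability(T: str, m: int, k: int) -> list:
    """Brute-force (k, m)-mappability table (illustrative only).

    counts[i] is the number of indices j != i such that the length-m
    substrings of T starting at i and j (0-based) have at most k mismatches.
    """
    starts = len(T) - m + 1  # number of length-m substrings of T
    if m <= 0 or starts <= 0:
        return []
    counts = [0] * starts
    for i in range(starts):
        # Hamming distance is symmetric, so each unordered pair is checked once.
        for j in range(i + 1, starts):
            mismatches = 0
            for a, b in zip(T[i:i + m], T[j:j + m]):
                if a != b:
                    mismatches += 1
                    if mismatches > k:
                        break  # already too far apart, stop early
            if mismatches <= k:
                counts[i] += 1
                counts[j] += 1
    return counts


if __name__ == "__main__":
    # Length-2 substrings of "aabab" are aa, ab, ba, ab.
    print(mappability("aabab", 2, 1))
```

For `T = "aabab"`, `m = 2`, `k = 1`, the substring `aa` is within one mismatch of all three other substrings, while `ba` is within one mismatch only of `aa`, so the table is `[3, 2, 1, 2]`.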
J. Radoszewski and J. Straszyński—Supported by the “Algorithms for text processing with errors and uncertainties” project carried out within the HOMING programme of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund.
Notes
1. The algorithm does not actually perform most of its operations on a compact trie, but the intuition is best conveyed by visualizing them this way.
References
Alamro, H., Ayad, L.A.K., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P.: Longest common prefixes with k-mismatches and applications. In: Tjoa, A.M., Bellatreche, L., Biffl, S., van Leeuwen, J., Wiedermann, J. (eds.) SOFSEM 2018. LNCS, vol. 10706, pp. 636–649. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73117-9_45
Alzamel, M., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P., Radoszewski, J., Sung, W.-K.: Faster algorithms for 1-mappability of a sequence. In: Gao, X., Du, H., Han, M. (eds.) COCOA 2017. LNCS, vol. 10628, pp. 109–121. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71147-8_8
Antoniou, P., Daykin, J.W., Iliopoulos, C.S., Kourie, D., Mouchard, L., Pissis, S.P.: Mapping uniquely occurring short sequences derived from high throughput technologies to a reference genome. In: Information Technology and Applications in Biomedicine, ITAB 2009. IEEE (2009). https://doi.org/10.1109/itab.2009.5394394
Ayad, L.A.K., Barton, C., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P.: Longest common prefixes with \(k\)-errors and applications. In: Gagie, T., et al. (eds.) SPIRE 2018. LNCS, vol. 11147, pp. 27–41. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00479-8_3
Charalampopoulos, P., et al.: Linear-time algorithm for long LCF with \(k\) mismatches. In: Navarro, G., Sankoff, D., Zhu, B. (eds.) Combinatorial Pattern Matching, CPM 2018. LIPIcs, vol. 105, pp. 23:1–23:16. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2018). https://doi.org/10.4230/LIPIcs.CPM.2018.23
Cole, R., Gottlieb, L., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Babai, L. (ed.) 36th Annual ACM Symposium on Theory of Computing, STOC 2004, pp. 91–100. ACM (2004). https://doi.org/10.1145/1007352.1007374
Derrien, T.: Fast computation and applications of genome mappability. PLoS ONE 7(1), e30377 (2012). https://doi.org/10.1371/journal.pone.0030377
Eades, P., McKay, B.D.: An algorithm for generating subsets of fixed size with a strong minimal change property. Inf. Process. Lett. 19(3), 131–133 (1984). https://doi.org/10.1016/0020-0190(84)90091-7
Farach, M.: Optimal suffix tree construction with large alphabets. In: 38th IEEE Annual Symposium on Foundations of Computer Science, FOCS 1997, pp. 137–143. IEEE Computer Society (1997). https://doi.org/10.1109/SFCS.1997.646102
Fonseca, N.A., Rung, J., Brazma, A., Marioni, J.C.: Tools for mapping high-throughput sequencing data. Bioinformatics 28(24), 3169–3177 (2012). https://doi.org/10.1093/bioinformatics/bts605
Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \(O(1)\) worst case access time. J. ACM 31(3), 538–544 (1984). https://doi.org/10.1145/828.1884
Impagliazzo, R., Paturi, R.: On the complexity of \(k\)-SAT. J. Comput. Syst. Sci. 62(2), 367–375 (2001). https://doi.org/10.1006/jcss.2000.1727
Impagliazzo, R., Paturi, R., Zane, F.: Which problems have strongly exponential complexity? J. Comput. Syst. Sci. 63(4), 512–530 (2001). https://doi.org/10.1006/jcss.2001.1774
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 918–936 (2006). https://doi.org/10.1145/1217856.1217858
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48194-X_17
Kociumaka, T., Radoszewski, J., Starikovskaya, T.A.: Longest common substring with approximately \(k\) mismatches (2017). arXiv:1712.08573
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993). https://doi.org/10.1137/0222058
Manzini, G.: Longest common prefix with mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 299–310. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23826-5_29
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976). https://doi.org/10.1145/321941.321946
Thankachan, S.V., Aluru, C., Chockalingam, S.P., Aluru, S.: Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In: Raphael, B.J. (ed.) RECOMB 2018. LNCS, vol. 10812, pp. 211–224. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89929-9_14
Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016). https://doi.org/10.1089/cmb.2015.0235
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Alzamel, M. et al. (2018). Efficient Computation of Sequence Mappability. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science, vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_2
DOI: https://doi.org/10.1007/978-3-030-00479-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00478-1
Online ISBN: 978-3-030-00479-8
eBook Packages: Computer Science (R0)