Efficient Computation of Sequence Mappability

Alzamel, Mai; Charalampopoulos, Panagiotis; Iliopoulos, Costas S.; Kociumaka, Tomasz; Pissis, Solon P.; Radoszewski, Jakub; Straszyński, Juliusz

doi:10.1007/978-3-030-00479-8_2

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11147))

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

808 Accesses
2 Citations

Abstract

Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose ith entry is the number of indices \(j \ne i\) such that length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of \(k=1\). We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in \(\mathcal {O}(n \min \{m^k,\log ^{k+1} n\})\) time and \(\mathcal {O}(n)\) space for \(k=\mathcal {O}(1)\). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show \(\mathcal {O}(n^2)\)-time algorithms to compute all results for a fixed m and all \(k=0,\ldots ,m\) or a fixed k and all \(m=k,\ldots ,n-1\). Finally we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for \(k,m = \varTheta (\log n)\) unless the Strong Exponential Time Hypothesis fails.

J. Radoszewski and J. Straszyński—Supported by the “Algorithms for text processing with errors and uncertainties” project carried out within the HOMING programme of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
The true course of the algorithm will not actually perform much of its operations on a compact trie, but the intuition is best conveyed by visualizing them this way.

References

Alamro, H., Ayad, L.A.K., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P.: Longest common prefixes with k-mismatches and applications. In: Tjoa, A.M., Bellatreche, L., Biffl, S., van Leeuwen, J., Wiedermann, J. (eds.) SOFSEM 2018. LNCS, vol. 10706, pp. 636–649. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73117-9_45
Chapter Google Scholar
Alzamel, M., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P., Radoszewski, J., Sung, W.-K.: Faster algorithms for 1-mappability of a sequence. In: Gao, X., Du, H., Han, M. (eds.) COCOA 2017. LNCS, vol. 10628, pp. 109–121. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71147-8_8
Chapter Google Scholar
Antoniou, P., Daykin, J.W., Iliopoulos, C.S., Kourie, D., Mouchard, L., Pissis, S.P.: Mapping uniquely occurring short sequences derived from high throughput technologies to a reference genome. In: Information Technology and Applications in Biomedicine, ITAB 2009. IEEE (2009). https://doi.org/10.1109/itab.2009.5394394
Ayad, L.A.K., Barton, C., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P.: Longest common prefixes with \(k\)-errors and applications. In: Gagie, T., et al. (eds.) SPIRE 2018. LNCS, vol. 11147, pp. 27–41. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00479-8_3
Charalampopoulos, P., et al.: Linear-time algorithm for long LCF with \(k\) mismatches. In: Navarro, G., Sankoff, D., Zhu, B. (eds.) Combinatorial Pattern Matching, CPM 2018. LIPIcs, vol. 105, pp. 23:1–23:16. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2018). https://doi.org/10.4230/LIPIcs.CPM.2018.23
Cole, R., Gottlieb, L., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Babai, L. (ed.) 36th Annual ACM Symposium on Theory of Computing, STOC 2004, pp. 91–100. ACM (2004). https://doi.org/10.1145/1007352.1007374
Derrien, T.: Fast computation and applications of genome mappability. PLoS ONE 7(1), e30377 (2012). https://doi.org/10.1371/journal.pone.0030377
Article Google Scholar
Eades, P., McKay, B.D.: An algorithm for generating subsets of fixed size with a strong minimal change property. Inf. Process. Lett. 19(3), 131–133 (1984). https://doi.org/10.1016/0020-0190(84)90091-7
Article MathSciNet MATH Google Scholar
Farach, M.: Optimal suffix tree construction with large alphabets. In: 38th IEEE Annual Symposium on Foundations of Computer Science, FOCS 1997, pp. 137–143. IEEE Computer Society (1997). https://doi.org/10.1109/SFCS.1997.646102
Fonseca, N.A., Rung, J., Brazma, A., Marioni, J.C.: Tools for mapping high-throughput sequencing data. Bioinformatics 28(24), 3169–3177 (2012). https://doi.org/10.1093/bioinformatics/bts605
Article Google Scholar
Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \(O(1)\) worst case access time. J. ACM 31(3), 538–544 (1984). https://doi.org/10.1145/828.1884
Article MathSciNet MATH Google Scholar
Impagliazzo, R., Paturi, R.: On the complexity of \(k\)-SAT. J. Comput. Syst. Sci. 62(2), 367–375 (2001). https://doi.org/10.1006/jcss.2000.1727
Article MathSciNet MATH Google Scholar
Impagliazzo, R., Paturi, R., Zane, F.: Which problems have strongly exponential complexity? J. Comput. Syst. Sci. 63(4), 512–530 (2001). https://doi.org/10.1006/jcss.2001.1774
Article MathSciNet MATH Google Scholar
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 918–936 (2006). https://doi.org/10.1145/1217856.1217858
Article MathSciNet MATH Google Scholar
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48194-X_17
Chapter Google Scholar
Kociumaka, T., Radoszewski, J., Starikovskaya, T.A.: Longest common substring with approximately \(k\) mismatches (2017). arxiv.1712.08573
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993). https://doi.org/10.1137/0222058
Article MathSciNet MATH Google Scholar
Manzini, G.: Longest common prefix with mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 299–310. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23826-5_29
Chapter Google Scholar
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976). https://doi.org/10.1145/321941.321946
Article MathSciNet MATH Google Scholar
Thankachan, S.V., Aluru, C., Chockalingam, S.P., Aluru, S.: Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In: Raphael, B.J. (ed.) RECOMB 2018. LNCS, vol. 10812, pp. 211–224. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89929-9_14
Chapter Google Scholar
Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016). https://doi.org/10.1089/cmb.2015.0235
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Department of Informatics, King’s College London, London, UK
Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos & Solon P. Pissis
Computer Science Department, King Saud University, Riyadh, Saudi Arabia
Mai Alzamel
Institute of Informatics, University of Warsaw, Warsaw, Poland
Tomasz Kociumaka, Jakub Radoszewski & Juliusz Straszyński

Authors

Mai Alzamel
View author publications
You can also search for this author in PubMed Google Scholar
Panagiotis Charalampopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Costas S. Iliopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Tomasz Kociumaka
View author publications
You can also search for this author in PubMed Google Scholar
Solon P. Pissis
View author publications
You can also search for this author in PubMed Google Scholar
Jakub Radoszewski
View author publications
You can also search for this author in PubMed Google Scholar
Juliusz Straszyński
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Juliusz Straszyński .

Editor information

Editors and Affiliations

Diego Portales University, Santiago, Chile
Travis Gagie
The University of Melbourne, Melbourne, VIC, Australia
Alistair Moffat
University of Chile, Santiago, Chile
Gonzalo Navarro
Universidad de Ingeniería y Tecnología, Lima, Peru
Ernesto Cuadros-Vargas

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Alzamel, M. et al. (2018). Efficient Computation of Sequence Mappability. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science(), vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_2

Download citation

DOI: https://doi.org/10.1007/978-3-030-00479-8_2
Published: 14 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00478-1
Online ISBN: 978-3-030-00479-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics