On-Line Pattern Matching on Uncertain Sequences and Applications

  • Carl Barton
  • Chang Liu
  • Solon P. Pissis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10043)

Abstract

We study the fundamental problem of pattern matching in the case where the string data is weighted: for every position of the string and every letter of the alphabet, a probability of occurrence for this letter at this position is given. Sequences of this type are commonly used to represent uncertain data. They are of particular interest in computational molecular biology as they can represent different kinds of ambiguity in DNA sequences: distributions of SNPs in genome populations; position frequency matrices of DNA binding profiles; or even sequencing-related uncertainties. A weighted string may thus represent many different strings, each with probability of occurrence equal to the product of the probabilities of its letters at successive positions. In this article, we present new average-case results on pattern matching on weighted strings and show how they can be applied effectively in several biological contexts. A free open-source implementation of our algorithms is made available.
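To make the definition of a weighted string concrete, the following is a minimal Python sketch, not the authors' algorithm or implementation. It assumes a weighted string is stored as a list of per-position letter distributions, and it assumes that an occurrence is reported when its probability reaches a user-chosen threshold; the data `X`, the function names, and the `threshold` parameter are all hypothetical and introduced only for illustration. The matcher is a naive quadratic scan.

```python
from math import prod

# Hypothetical weighted string of length 4 over the DNA alphabet:
# each position maps a letter to its probability of occurrence there.
X = [
    {"A": 1.0},
    {"A": 0.5, "C": 0.5},
    {"G": 1.0},
    {"C": 0.25, "T": 0.75},
]


def occurrence_probability(weighted, pattern, start):
    """Probability that `pattern` occurs at position `start`:
    the product of the probabilities of its letters, position by position."""
    return prod(
        weighted[start + j].get(letter, 0.0)
        for j, letter in enumerate(pattern)
    )


def naive_occurrences(weighted, pattern, threshold):
    """Illustrative naive scan (assumption, not the paper's method):
    report every start position whose occurrence probability is at
    least `threshold`."""
    m = len(pattern)
    return [
        i
        for i in range(len(weighted) - m + 1)
        if occurrence_probability(weighted, pattern, i) >= threshold
    ]


if __name__ == "__main__":
    print(occurrence_probability(X, "ACGT", 0))
    print(naive_occurrences(X, "CG", 0.25))
```

In this toy input, the string ACGT is one of the four strings that `X` represents, and its probability is obtained by multiplying one entry from each position.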

Keywords

  • Amyotrophic Lateral Sclerosis
  • Pattern Match
  • Valid Factor
  • Suffix Tree
  • Extensive Experimental Result
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. The Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK
  2. Department of Informatics, King’s College London, London, UK
