Lower Bounds on the Generalized Central Moments of the Optimal Alignments Score of Random Sequences

Abstract

We present a general approach to the problem of determining tight asymptotic lower bounds for generalized central moments of the optimal alignment score of two independent sequences of i.i.d. random variables. These bounds are first obtained under a main assumption, for which sufficient conditions are provided. When the main assumption fails, we nevertheless develop a “uniform approximation” method that still leads to asymptotic lower bounds. Our general results are then applied to the length of the longest common subsequences of binary strings, in which case asymptotic lower bounds are obtained for the moments and the exponential moments of the optimal score. As a by-product, a local upper bound on the rate function associated with the length of the longest common subsequences of two binary strings is also obtained.
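To make the objects in the abstract concrete, the following minimal Python sketch computes the length of a longest common subsequence of two strings and estimates an absolute central moment of that length by simulation for i.i.d. binary sequences. It is an illustration only, not the method of the paper: the function names, the Bernoulli parameter `p`, the moment order `r`, and the Monte Carlo estimator are all assumptions introduced here.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the best common subsequence of the two prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last symbol of one of the two prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def empirical_central_moment(n, r, trials, p=0.5, seed=0):
    """Monte Carlo estimate of E|L_n - E L_n|^r for two i.i.d. binary words.

    Hypothetical helper: L_n is the LCS length of two independent words of
    length n with Bernoulli(p) letters; the mean is replaced by the sample mean.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        x = [1 if rng.random() < p else 0 for _ in range(n)]
        y = [1 if rng.random() < p else 0 for _ in range(n)]
        samples.append(lcs_length(x, y))
    mean = sum(samples) / trials
    return sum(abs(s - mean) ** r for s in samples) / trials
```

Calling `empirical_central_moment(n, 2, trials)` for increasing `n` gives a rough numerical view of how the variance of the LCS length grows; such simulations are only suggestive and do not replace the asymptotic bounds discussed above.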

Notes

  1. In [4], the scoring function takes the value 1 for matches and the penalty \(-\mu \) for mismatches. Moreover, the gap in [4] is an indel in one sequence, and the gap price \(-\delta \) is assumed to be negative.
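The scoring convention described in this note can be sketched as a small dynamic program. The Python example below computes an optimal global alignment score with reward 1 for a match, penalty \(-\mu \) for a mismatch, and penalty \(-\delta \) for each gap. The choice of global alignment, the function name, and the treatment of `mu` and `delta` as nonnegative magnitudes are assumptions made for illustration; this is not necessarily the exact setting studied in [4].

```python
def alignment_score(x, y, mu, delta):
    """Optimal global alignment score of x and y.

    Assumed scoring (illustrative): +1 for a match, -mu for a mismatch,
    -delta for each symbol aligned with a gap; mu and delta are >= 0.
    """
    # Aligning the empty prefix of x with y[:j] costs j gaps.
    prev = [-delta * j for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        # Aligning x[:i] with the empty prefix of y costs i gaps.
        curr = [-delta * i]
        for j, b in enumerate(y, start=1):
            pair = 1 if a == b else -mu
            curr.append(max(
                prev[j - 1] + pair,   # align a with b
                prev[j] - delta,      # align a with a gap
                curr[j - 1] - delta,  # align b with a gap
            ))
        prev = curr
    return prev[-1]
```

In the limiting case `delta = 0` with `mu >= 0`, mismatches are never preferable to free gaps, so the score returned coincides with the length of a longest common subsequence.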

References

  1. Alexander, K.S.: The rate of convergence of the mean length of the longest common subsequence. Ann. Appl. Probab. 4(4), 1074–1082 (1994)

  2. Amsalu, S., Houdré, C., Matzinger, H.: Sparse long blocks and the micro-structure of the longest common subsequences. J. Stat. Phys. 154(6), 1516–1549 (2014)

  3. Amsalu, S., Houdré, C., Matzinger, H.: Sparse Long Blocks and the Variance of the Length of Longest Common Subsequences in Random Words. arXiv:1204.1009v2 (2016)

  4. Arratia, R., Waterman, M.S.: A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Probab. 4(1), 200–225 (1994)

  5. Bonetto, F., Matzinger, H.: Fluctuations of the longest common subsequence in the asymmetric case of 2- and 3-letter alphabets. Lat. Am. J. Probab. Math. Stat. 2, 195–216 (2006)

  6. Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)

  7. Cristianini, N., Hahn, M.W.: Introduction to Computational Genomics: A Case Studies Approach. Cambridge University Press, Cambridge (2007)

  8. Chvátal, V., Sankoff, D.: Longest common subsequences of two random sequences. J. Appl. Probab. 12(2), 306–315 (1975)

  9. Cover, T., Thomas, J.: Elements of Information Theory, 2nd edn. Wiley, Hoboken (2006)

  10. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998)

  11. Durringer, C., Lember, J., Matzinger, H.: Deviation from the mean in sequence comparison with a periodic sequence. Lat. Am. J. Probab. Math. Stat. 3, 1–29 (2007)

  12. Gong, R., Houdré, C., Işlak, Ü.: A Central Limit Theorem for the Optimal Alignments Score in Multiple Random Words. arXiv:1512.05699v2 (2016)

  13. Grossmann, S., Yakir, B.: Large deviations for global maxima of independent superadditive processes with negative drift and an application to optimal sequence alignments. Bernoulli 10(5), 829–845 (2004)

  14. Hammersley, J.M.: Postulates for subadditive processes. Ann. Probab. 2(4), 652–680 (1974)

  15. Houdré, C., Işlak, Ü.: A Central Limit Theorem for the Length of the Longest Common Subsequences in Random Words. arXiv:1408.1559v3 (2015)

  16. Houdré, C., Ma, J.: On the order of the central moments of the length of the longest common subsequences in random words. In: High Dimensional Probability VII: The Cargèse Volume. Progress in Probability 71, pp. 105–136. Birkhäuser (2016)

  17. Houdré, C., Matzinger, H.: On the variance of the optimal alignments score for binary random words and an asymmetric scoring function. J. Stat. Phys. 164(3), 693–734 (2016)

  18. Kečkić, J.D., Vasić, P.M.: Some inequalities for the gamma function. Publications de l’Institut Mathématique, Nouvelle Série 25, 107–114 (1971)

  19. Lember, J., Matzinger, H.: Standard deviation of the longest common subsequence. Ann. Probab. 37(3), 1192–1235 (2009)

  20. Lember, J., Matzinger, H., Torres, F.: The rate of the convergence of the mean score in random sequence comparison. Ann. Appl. Probab. 22(3), 1046–1058 (2012)

  21. Lember, J., Matzinger, H., Torres, F.: General Approach to the Fluctuations Problem in Random Sequence Comparison. arXiv:1211.5072v1 (2012)

  22. Lember, J., Matzinger, H., Torres, F.: Proportion of gaps and fluctuations of the optimal score in random sequence comparison. In: Limit Theorems in Probability, Statistics and Number Theory (In Honor of Friedrich Götze), vol. 42, pp. 207–234. Springer, Berlin (2013)

  23. Lin, C.Y., Och, F.J.: Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: ACL’04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p. 605 (2004)

  24. Melamed, I.D.: Automatic evaluation and uniform filter cascades for inducing N-best translation lexicons. In: Proceedings of the Third Workshop on Very Large Corpora (1995)

  25. Melamed, I.D.: Bitext maps and alignment via pattern recognition. Comput. Linguist. 25(1), 107–130 (1999)

  26. Pevzner, P.A.: Computational Molecular Biology: An Algorithmic Approach. MIT Press, Cambridge (2000)

  27. Shiryaev, A.N.: Probability, 2nd edn. Springer, New York (1995)

  28. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)

  29. Steele, J.M.: An Efron–Stein inequality for nonsymmetric statistics. Ann. Stat. 14(2), 753–758 (1986)

  30. Torres, F.: On the Probabilistic Longest Common Subsequence Problem for Sequences of Independent Blocks. Ph.D. thesis, Bielefeld University (2009)

  31. Waterman, M.S.: Estimating statistical significance of sequence alignments. Philos. Trans. R. Soc. Biol. Sci. 344(1310), 383–390 (1994)

  32. Waterman, M.S.: Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall/CRC Press, Virginia Beach (1995)

  33. Yang, C.C., Li, K.W.: Automatic construction of English/Chinese parallel corpora. J. Am. Soc. Inf. Sci. Technol. 54(8), 730–742 (2003)

Author information

Corresponding author

Correspondence to Ruoting Gong.

Additional information

C. Houdré: Research supported in part by Grant # 246283 from the Simons Foundation and by Simons Fellowship Grant # 267336. J. Lember: Research supported by Estonian Science Foundation Grant No. 5822 and by institutional research funding IUT34-5.

About this article

Cite this article

Gong, R., Houdré, C. & Lember, J. Lower Bounds on the Generalized Central Moments of the Optimal Alignments Score of Random Sequences. J Theor Probab 31, 643–683 (2018). https://doi.org/10.1007/s10959-016-0730-4

  • DOI: https://doi.org/10.1007/s10959-016-0730-4

Keywords

  • Longest common subsequence
  • Optimal alignment
  • Last passage percolation

Mathematics Subject Classification (2010)

  • 05A05
  • 60C05
  • 60F10