Tweedie Distributions for Biological Sequences Alignments

Statistics in Biosciences

Abstract

An important technique for studying the similarity between biological sequences is the analysis of the distribution of their alignment scores. Estimating this distribution plays a central role in evaluating the statistical significance of the alignments. For amino acid sequence alignments, the scores of ungapped aligned segments have been proven to follow the extreme value law asymptotically, whereas the scores of gapped alignments are generally fitted with Poisson or Gumbel distributions. To widen the range of candidate distributions, other classes of statistical models can be used. In this paper, we propose using the class of exponential dispersion models, which includes several common laws such as the Gaussian, Poisson, and Gamma distributions, among many others. In this context, we introduce a new algorithm for estimating the parameters of this model. The proposed approach is based on selecting the appropriate distribution and on maximum likelihood estimation. An asymptotic confidence interval is provided for the dispersion parameter. Finally, the performance of the suggested algorithm is evaluated through numerical experiments on random sequences generated with different techniques.
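The idea of selecting a distribution and then estimating its parameters by maximum likelihood can be illustrated with a minimal sketch. This is not the algorithm of the paper, whose details are not given in the abstract. It assumes only three continuous members of the Tweedie class with fixed power parameter (Gaussian, p = 0; Gamma, p = 2; inverse Gaussian, p = 3) and picks the one with the highest maximized log-likelihood. The function names `fit_candidates` and `select_best` are hypothetical, and the scores in the usage note are simulated, not alignment scores from the study.

```python
import numpy as np
from scipy import stats


def fit_candidates(scores):
    """Fit three candidate laws by maximum likelihood.

    Returns a dict mapping each candidate name to its maximized
    log-likelihood. Illustrative only: a restricted, fixed-power
    stand-in for a search over the full Tweedie class.
    """
    x = np.asarray(scores, dtype=float)
    loglik = {}

    # Gaussian (p = 0): the MLEs are the sample mean and the biased sample std.
    mu, sigma = x.mean(), x.std()
    loglik["gaussian"] = float(stats.norm.logpdf(x, loc=mu, scale=sigma).sum())

    # Gamma and inverse Gaussian are only defined for strictly positive data.
    if np.all(x > 0):
        # Gamma (p = 2): numerical MLE with the location fixed at 0.
        shape, loc, scale = stats.gamma.fit(x, floc=0)
        loglik["gamma"] = float(
            stats.gamma.logpdf(x, shape, loc=loc, scale=scale).sum()
        )

        # Inverse Gaussian (p = 3): closed-form MLEs for mean and lambda,
        # mapped to SciPy's (mu, scale) parametrization: mean = mu * scale.
        mean = x.mean()
        lam = 1.0 / np.mean(1.0 / x - 1.0 / mean)
        loglik["inverse_gaussian"] = float(
            stats.invgauss.logpdf(x, mean / lam, scale=lam).sum()
        )

    return loglik


def select_best(scores):
    """Return the candidate with the highest log-likelihood and all values.

    All candidates have two free parameters, so comparing raw
    log-likelihoods is equivalent to comparing AIC here.
    """
    loglik = fit_candidates(scores)
    best = max(loglik, key=loglik.get)
    return best, loglik
```

For example, `select_best(np.random.default_rng(0).gamma(5.0, 2.0, 2000))` returns the selected family name together with the log-likelihood of each candidate. Estimating the power parameter itself, as a full Tweedie fit would require, needs a density evaluation method that this sketch does not attempt.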

Acknowledgements

This research has received no external funding.

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Corresponding author

Correspondence to Ben Hassen Hanen.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Appendix A: Details on the Reference Sequences

Tables 3 and 4 display detailed information about the reference sequences used in this study.

Table 3 Detailed information for the protein reference sequences
Table 4 Detailed information for the DNA reference sequences

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hanen, B.H., Khalil, M. & Afif, M. Tweedie Distributions for Biological Sequences Alignments. Stat Biosci 16, 165–184 (2024). https://doi.org/10.1007/s12561-023-09388-4

  • DOI: https://doi.org/10.1007/s12561-023-09388-4
