Tweedie Distributions for Biological Sequences Alignments

Hanen, Ben Hassen; Khalil, Masmoudi; Afif, Masmoudi

doi:10.1007/s12561-023-09388-4

Published: 09 October 2023

Volume 16, pages 165–184, (2024)
Statistics in Biosciences

Abstract

An important technique for studying the similarity between biological sequences is the analysis of the distribution of their alignment scores. Estimating this distribution plays a central role in evaluating the statistical significance of the alignments. For amino acid sequence alignment, the scores of ungapped aligned segments are proven to be asymptotically distributed according to the extreme value law, whereas gapped alignment scores are generally fitted with Poisson or Gumbel distributions. To widen the scope of candidate distributions, other classes of statistical models can be used. In this paper, we propose the class of exponential dispersion models, which includes several common laws such as the Gaussian, Poisson and Gamma distributions, among many others. In this context, we introduce a new algorithm for estimating the parameters of this model. The proposed approach is based on selecting the appropriate distribution and on maximum likelihood estimation. An asymptotic confidence interval is provided for the dispersion parameter. Finally, the performance of the suggested algorithm is evaluated through numerical experiments on random sequences produced with different generation techniques.
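
The idea of selecting a distribution by maximum likelihood can be illustrated with a minimal sketch. This is not the authors' algorithm, which works over the exponential dispersion (Tweedie) class. As a simplifying assumption it compares only two candidates, a Gaussian and a Gumbel law, and keeps the one with the larger maximised log-likelihood. All function names (`fit_normal`, `fit_gumbel`, `fit_candidates`, `select_best`, `simulate_gumbel_scores`) are hypothetical, and the scores are simulated rather than taken from real alignments.

```python
import math
import random


def fit_normal(scores):
    """Closed-form Gaussian MLE: sample mean and (biased) sample variance."""
    n = len(scores)
    mu = sum(scores) / n
    var = sum((x - mu) ** 2 for x in scores) / n
    # Log-likelihood evaluated at the MLE.
    loglik = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
    return {"loc": mu, "scale": math.sqrt(var), "loglik": loglik}


def fit_gumbel(scores, iterations=80):
    """Gumbel (maximum) MLE; the scale equation is solved by bisection."""
    n = len(scores)
    mean = sum(scores) / n
    low = min(scores)

    def score_equation(beta):
        # Shifting by min(scores) keeps every exponent <= 0 (no overflow).
        w = [math.exp(-(x - low) / beta) for x in scores]
        weighted_mean = sum(x * wi for x, wi in zip(scores, w)) / sum(w)
        return beta - mean + weighted_mean

    # The score equation is increasing in beta, negative near 0 and
    # non-negative at beta = mean - min, so the root lies in between.
    hi = mean - low
    lo = hi * 1e-9
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score_equation(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    mean_w = sum(math.exp(-(x - low) / beta) for x in scores) / n
    mu = low - beta * math.log(mean_w)
    loglik = 0.0
    for x in scores:
        z = (x - mu) / beta
        loglik += -math.log(beta) - z - math.exp(-z)
    return {"loc": mu, "scale": beta, "loglik": loglik}


def fit_candidates(scores):
    """Fit every candidate law by maximum likelihood."""
    if max(scores) == min(scores):
        raise ValueError("scores must not all be identical")
    return {"normal": fit_normal(scores), "gumbel": fit_gumbel(scores)}


def select_best(fits):
    """Both candidates have two parameters, so comparing the maximised
    log-likelihoods is equivalent to comparing AIC values."""
    return max(fits, key=lambda name: fits[name]["loglik"])


def simulate_gumbel_scores(n, loc, scale, seed=0):
    """Stand-in for alignment scores: inverse-CDF sampling from a Gumbel law."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        u = max(rng.random(), 1e-300)  # guard against log(0)
        scores.append(loc - scale * math.log(-math.log(u)))
    return scores
```

Calling `select_best(fit_candidates(scores))` on a list of scores returns the name of the preferred candidate. Extending the sketch to further laws with different numbers of parameters would call for an explicit penalty such as AIC rather than the raw log-likelihood.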

Acknowledgements

This research has received no external funding.

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Authors and Affiliations

Laboratory of Probability and Statistics, Faculty of Sciences of Sfax, University of Sfax, PB 1171, Sfax, 3000, Tunisia
Ben Hassen Hanen, Masmoudi Khalil & Masmoudi Afif

Corresponding author

Correspondence to Ben Hassen Hanen.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Appendix A: Details on the Reference Sequences

Tables 3 and 4 display detailed information about the reference sequences used in this study.

Table 3 Detailed information for the protein reference sequences

Table 4 Detailed information for the DNA reference sequences

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hanen, B.H., Khalil, M. & Afif, M. Tweedie Distributions for Biological Sequences Alignments. Stat Biosci 16, 165–184 (2024). https://doi.org/10.1007/s12561-023-09388-4

Received: 28 October 2022
Revised: 28 July 2023
Accepted: 02 September 2023
Published: 09 October 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s12561-023-09388-4
