Abstract
An important technique in the study of similarity between biological sequences is the analysis of the distribution of their alignment scores. Estimating this distribution plays a central role in evaluating the statistical significance of the alignments. For amino acid sequence alignments, the scores of ungapped aligned segments have been proven to follow the extreme value law asymptotically. Gapped alignment scores are generally fitted with Poisson or Gumbel distributions. To widen the scope of candidate distributions, other classes of statistical models can be used. In this paper, we propose the class of exponential dispersion models, which includes several common laws such as the Gaussian, Poisson and Gamma distributions, among many others. In this context, we introduce a new algorithm for estimating the parameters of the model. The proposed approach is based on selecting the appropriate distribution and on maximum likelihood estimation. An asymptotic confidence interval is provided for the dispersion parameter. Finally, the performance of the suggested algorithm is evaluated through numerical experiments on random sequences produced with different generation techniques.
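To make concrete how one family can cover the Gaussian, Poisson and Gamma laws, recall that in the Tweedie subclass of exponential dispersion models the variance is tied to the mean by Var(Y) = φ·μ^p, with p = 0, 1 and 2 giving the Gaussian, Poisson and Gamma cases respectively. The sketch below is only an illustration of that relation: it computes a Pearson (moment) estimate of the dispersion φ for a candidate power p, which is a simpler stand-in and not the maximum likelihood algorithm proposed in the paper. The names `TWEEDIE_POWER` and `pearson_dispersion` are hypothetical, and the sample mean is assumed to be an adequate estimate of μ.

```python
from statistics import fmean

# Power index p of the Tweedie variance function V(mu) = mu**p
# for the three laws named in the abstract (illustrative lookup table).
TWEEDIE_POWER = {"gaussian": 0.0, "poisson": 1.0, "gamma": 2.0}


def pearson_dispersion(scores, p):
    """Moment (Pearson) estimate of phi, assuming Var(Y) = phi * mu**p.

    This is a simple stand-in for illustration, not the likelihood-based
    estimator described in the paper.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mu = fmean(scores)  # sample mean used as the estimate of mu
    if p != 0 and mu <= 0:
        raise ValueError("mean must be positive when p != 0")
    # Sum of squared deviations from the mean, scaled by V(mu) = mu**p.
    squared_dev = sum((y - mu) ** 2 for y in scores)
    return squared_dev / ((n - 1) * mu ** p)


if __name__ == "__main__":
    toy_scores = [2, 4, 6, 8]  # made-up numbers, not real alignment scores
    for law, p in TWEEDIE_POWER.items():
        print(law, pearson_dispersion(toy_scores, p))
```

Comparing the estimates across candidate values of p shows how strongly the implied dispersion depends on the assumed variance function, which is why selecting the appropriate member of the family matters before estimating φ.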
Acknowledgements
This research has received no external funding.
Funding
The authors did not receive support from any organization for the submitted work.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hanen, B.H., Khalil, M. & Afif, M. Tweedie Distributions for Biological Sequences Alignments. Stat Biosci 16, 165–184 (2024). https://doi.org/10.1007/s12561-023-09388-4