Choice of Metric Divergence in Genome Sequence Comparison

Ghosh, Soumen; Pal, Jayanta; Maji, Bansibadan; Cattani, Carlo; Bhattacharya, Dilip Kumar

doi:10.1007/s10930-024-10189-x

Published: 16 March 2024

The Protein Journal, volume 43, pages 259–273 (2024)

Abstract

The paper introduces a novel probability descriptor for genome sequence comparison and pairs it with a generalized form of the Jensen-Shannon divergence. This divergence metric comes from a one-parameter family in which the parameter takes fractional values up to a maximum of one-half. Using the metric as a distance measure, a distance matrix is computed for the new probability descriptor, and phylogenetic trees are built from it by the neighbor-joining method. The initial exploration sets the parameter to one-half for various species. To assess the impact of parameter variation, trees are then drawn at different parameter values (one-half, one-fourth, one-eighth). The measurement scale decreases as the parameter value increases, and higher similarity accuracy corresponds to lower scale values, so the highest accuracy is obtained at the maximum parameter value of one-half. Comparative analyses against previous methods, evaluated via Symmetric Distance (SD) values and rationalized perception, consistently favor the results of the present approach, with the outcomes at the maximum parameter value being the most accurate.
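The pipeline described in the abstract (descriptor, divergence, distance matrix) can be illustrated with a minimal sketch. The abstract does not define the descriptor or the exact generalized divergence, so everything here is an assumption: the descriptor is replaced by plain nucleotide frequencies as a hypothetical stand-in, and the one-parameter family is taken to be the power JSD^alpha with 0 < alpha <= 1/2, which is a metric because the square root of the Jensen-Shannon divergence is one. The function names are illustrative and not taken from the paper.

```python
import math
from collections import Counter


def nucleotide_frequencies(seq, alphabet="ACGT"):
    """Hypothetical stand-in descriptor: relative frequency of each nucleotide.

    The paper's own probability descriptor is not given in the abstract.
    """
    counts = Counter(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    return [counts[ch] / total for ch in alphabet]


def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Classical Jensen-Shannon divergence in bits, bounded by 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    value = shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2
    return max(0.0, value)  # guard against tiny negative rounding error


def js_power_distance(p, q, alpha=0.5):
    """Assumed family: JSD raised to alpha, restricted to 0 < alpha <= 1/2."""
    if not 0 < alpha <= 0.5:
        raise ValueError("alpha must lie in (0, 1/2]")
    return js_divergence(p, q) ** alpha


def distance_matrix(descriptors, alpha=0.5):
    """Symmetric pairwise distance matrix with a zero diagonal."""
    n = len(descriptors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = js_power_distance(descriptors[i], descriptors[j], alpha)
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy sequences, purely illustrative (not data from the paper).
sequences = ["AACC", "ACGT", "GGGT"]
descriptors = [nucleotide_frequencies(s) for s in sequences]
for alpha in (0.5, 0.25, 0.125):
    print(alpha, distance_matrix(descriptors, alpha)[0][1])
```

Under this assumed reading, the divergence in bits never exceeds 1, so raising it to a larger exponent yields smaller distances, which is at least consistent with the abstract's remark that the measurement scale shrinks as the parameter grows. A matrix of this shape is the kind of input a neighbor-joining routine expects.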

Data Availability

No datasets were generated or analysed during the current study.

Author information

Authors and Affiliations

Information Technology, Narula Institute of Technology, Kolkata, West Bengal, India
Soumen Ghosh
Computer Science & Engineering, Narula Institute of Technology, Kolkata, West Bengal, India
Jayanta Pal
Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
Bansibadan Maji
DEIM, University of Tuscia, Largo Dell’Universita, 01100, Viterbo, Italy
Carlo Cattani
Pure Mathematics, University of Calcutta, Kolkata, West Bengal, India
Dilip Kumar Bhattacharya

Contributions

SG: Design and development of the work and finalization of the draft. JP: Data collection, analysis, and interpretation. BM: Initial drafting of the article. CC: Critical revision of the article after the final draft. DKB: Conception of the work and critical revision of the article after the final draft.

Corresponding author

Correspondence to Soumen Ghosh.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ghosh, S., Pal, J., Maji, B. et al. Choice of Metric Divergence in Genome Sequence Comparison. Protein J 43, 259–273 (2024). https://doi.org/10.1007/s10930-024-10189-x

Accepted: 28 February 2024
Published: 16 March 2024
Issue Date: April 2024
DOI: https://doi.org/10.1007/s10930-024-10189-x
