A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering

  • Animal Genetics • Original Paper
Journal of Applied Genetics

Abstract

Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of the uncertainty in a set of information. In this study, entropy of orders one to four was calculated for each gene and its set of exons. The Kullback-Leibler divergence was then calculated from the relative entropy of the genes and exons. The resulting Kullback-Leibler distances for the gene and exon sets were used as input to seven clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. The AdaBoost algorithm was used to aggregate the clustering results. Finally, the AdaBoost results were examined with the GeneMANIA prediction server to explore them from a gene annotation point of view. All calculations were performed in MATLAB (2015). Examination of the genes' metabolic pathways, based on gene annotations, indicated that the proposed clustering method yielded correct, logical, and fast results. Because the method does not rely on sequence alignment, it avoids the disadvantages of alignment, allows genes to be considered with their actual length and content, and does not require large amounts of memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
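The quantities named in the abstract can be illustrated with a minimal sketch. It is written in Python rather than the MATLAB used in the study, and it rests on several assumptions that the abstract does not confirm: "order k" entropy is taken to be the Shannon entropy of overlapping k-mer frequencies, add-one pseudocounts are used to keep the divergence finite, and the distance is the symmetrized sum of the two Kullback-Leibler divergences. The sequences and function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"

def kmer_distribution(seq, k, pseudocount=1.0):
    """Smoothed relative frequencies of all 4**k k-mers (overlapping windows).

    Windows containing symbols outside ACGT are ignored. The pseudocount is an
    assumption of this sketch; it keeps every probability strictly positive.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(t) for t in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}

def shannon_entropy(dist):
    """Shannon entropy in bits of a probability dictionary."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits; needs q[m] > 0 where p[m] > 0."""
    return sum(p[m] * log2(p[m] / q[m]) for m in p if p[m] > 0)

def symmetric_kl(p, q):
    """Symmetrized divergence, used here as a distance (an assumption)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical toy sequences standing in for genes or exon sets.
genes = {
    "geneA": "ATGGCGTACGTTAGC",
    "geneB": "ATGGCGTTCGTTAGC",
    "geneC": "TTTTTTAAAAAATTT",
}

for k in range(1, 5):  # orders one to four
    dists = {name: kmer_distribution(seq, k) for name, seq in genes.items()}
    entropies = {name: round(shannon_entropy(d), 3) for name, d in dists.items()}
    d_ab = symmetric_kl(dists["geneA"], dists["geneB"])
    d_ac = symmetric_kl(dists["geneA"], dists["geneC"])
    print(k, entropies, round(d_ab, 3), round(d_ac, 3))
```

Because the comparison works on k-mer frequency tables rather than aligned positions, sequences of different lengths can be compared directly, which is the alignment-free property the abstract emphasizes.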

Notes

  1. Nearest distance (single linkage method)

  2. Furthest distance (complete linkage method)

  3. Unweighted pair group method average (UPGMA, group average)

  4. Weighted pair group method average (WPGMA)

  5. Unweighted pair group method centroid (UPGMC)

  6. Weighted pair group method centroid (WPGMC)
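The six linkage rules listed above can all be written as Lance-Williams updates, which recompute the distance from a newly merged cluster to every other cluster from the previous distances alone. The following is a minimal sketch in Python, not the authors' MATLAB code; the function name `agglomerate` and the toy distance matrix are hypothetical. One caveat applies: the centroid and median updates are geometrically exact only for squared Euclidean distances, so applying them to Kullback-Leibler distances is a heuristic.

```python
# Lance-Williams coefficients (alpha_i, alpha_j, beta, gamma) as functions of
# the sizes ni and nj of the two clusters being merged.
LANCE_WILLIAMS = {
    "single":   lambda ni, nj: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda ni, nj: (0.5, 0.5, 0.0, 0.5),
    "average":  lambda ni, nj: (ni / (ni + nj), nj / (ni + nj), 0.0, 0.0),
    "weighted": lambda ni, nj: (0.5, 0.5, 0.0, 0.0),
    "centroid": lambda ni, nj: (ni / (ni + nj), nj / (ni + nj),
                                -ni * nj / (ni + nj) ** 2, 0.0),
    "median":   lambda ni, nj: (0.5, 0.5, -0.25, 0.0),
}

def agglomerate(dist, method="average"):
    """Bottom-up clustering of a symmetric distance matrix.

    Returns a list of merges as (members_a, members_b, height) tuples.
    """
    coeffs = LANCE_WILLIAMS[method]
    n = len(dist)
    members = {i: (i,) for i in range(n)}  # active cluster id -> item indices
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}

    def get(a, b):
        return d[(a, b) if a < b else (b, a)]

    merges = []
    next_id = n
    while len(members) > 1:
        i, j = min(d, key=d.get)  # closest pair of active clusters
        dij = d[(i, j)]
        ai, aj, beta, gamma = coeffs(len(members[i]), len(members[j]))
        updated = {}
        for k in members:
            if k in (i, j):
                continue
            dki, dkj = get(k, i), get(k, j)
            # Lance-Williams update for the distance from k to the merged cluster.
            updated[(k, next_id)] = (ai * dki + aj * dkj
                                     + beta * dij + gamma * abs(dki - dkj))
        merges.append((members[i], members[j], dij))
        # Drop distances that involve the two merged clusters, then add new ones.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(updated)
        members[next_id] = members[i] + members[j]
        del members[i], members[j]
        next_id += 1
    return merges

# Hypothetical distance matrix for four genes.
toy = [
    [0, 1, 3, 7],
    [1, 0, 2, 6],
    [3, 2, 0, 4],
    [7, 6, 4, 0],
]
for name in LANCE_WILLIAMS:
    print(name, [round(h, 3) for _, _, h in agglomerate(toy, name)])
```

The method names match those listed in the abstract, so the same distance matrix can be passed through each rule and the resulting merge heights compared.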

Acknowledgments

The authors express their gratitude to Kamyar Shiouee for his comments on the MATLAB code, Dr. Saeid Ansari-Mahyari for providing the necessary infrastructure and resources, and Professor Ahmad Oryan for his support.

Author information

Contributions

Houshang Dehghanzadeh ran the algorithm, wrote the paper, and conducted the literature review.

Mostafa Ghaderi-Zefrehei was the principal investigator of this study, developed the ideas from scratch, conducted the computational pipelines, and participated in writing the paper.

Seyed Ziaeddin Mirhoseini and Saeid Esmaeilkhaniyan performed data collection and explored the results.

Ishaku Lemu Haruna edited the paper.

Hamed Amirpour-Najafabadi was involved in drafting the manuscript, revising it critically for important intellectual content, and giving final approval of the version to be published.

Corresponding author

Correspondence to Mostafa Ghaderi-Zefrehei.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with animals performed by any of the authors.

Additional information

Communicated by: Maciej Szydlowski

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

ESM 1 (JPG 50 kb)
ESM 2 (JPG 47 kb)
ESM 3 (JPG 45 kb)
ESM 4 (JPG 45 kb)
ESM 5 (JPG 52 kb)
ESM 6 (JPG 85 kb)
ESM 7 (JPG 83 kb)
ESM 8 (JPG 82 kb)
ESM 9 (JPG 87 kb)
ESM 10 (JPG 84 kb)
ESM 11 (JPG 89 kb)
ESM 12 (DOCX 30 kb)
ESM 13 (XLSX 11 kb)
ESM 14 (XLSX 14 kb)
ESM 15 (XLSX 60 kb)
ESM 16 (XLSX 13 kb)

About this article

Cite this article

Dehghanzadeh, H., Ghaderi-Zefrehei, M., Mirhoseini, S.Z. et al. A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering. J Appl Genetics 61, 231–238 (2020). https://doi.org/10.1007/s13353-020-00543-x

  • DOI: https://doi.org/10.1007/s13353-020-00543-x
