Abstract
Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of the uncertainty in a set of information. In this study, the entropy of each gene and of its set of exons was calculated for orders one to four. The Kullback-Leibler divergence was then calculated from the relative entropy of genes and exons. The resulting Kullback-Leibler distances for the gene and exon sets were used as input to seven clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. The AdaBoost algorithm was used to aggregate the clustering results. Finally, the AdaBoost results were examined with the GeneMANIA prediction server to assess them from a gene annotation point of view. All calculations were performed in MATLAB (2015). Examination of the genes' metabolic pathways based on gene annotations showed that the proposed clustering method yielded correct, logical, and fast results. Because the method does not rely on sequence alignment, it avoids the disadvantages of alignment, allows genes to be considered with their actual length and content, and does not require large amounts of memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
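The entropy and Kullback-Leibler steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: it assumes that "order k" means the Shannon entropy of overlapping k-mer frequencies, that a pseudocount is used to keep the divergence finite, and that the asymmetric divergence is turned into a distance by summing both directions. The function names (`kmer_probs`, `entropy`, `kl_divergence`, `symmetric_kl`) are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"  # assumption: plain DNA alphabet; other symbols (e.g. N) are ignored


def kmer_probs(seq, k, pseudocount=1.0):
    """Probabilities of all 4**k overlapping k-mers in seq.

    The pseudocount (an assumption of this sketch) keeps every probability
    positive so that the KL divergence below stays finite.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def entropy(probs):
    """Shannon entropy in bits: H(P) = -sum p * log2(p)."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)


def kl_divergence(p, q):
    """D(P || Q) = sum p * log2(p / q); requires q > 0 wherever p > 0."""
    return sum(p[m] * log2(p[m] / q[m]) for m in p if p[m] > 0)


def symmetric_kl(seq_a, seq_b, k):
    """Symmetrized divergence D(P||Q) + D(Q||P), used here as a distance."""
    p, q = kmer_probs(seq_a, k), kmer_probs(seq_b, k)
    return kl_divergence(p, q) + kl_divergence(q, p)
```

Calling `symmetric_kl(gene_a, gene_b, k)` for every pair of sequences and for k from 1 to 4 would give one distance matrix per order. No alignment is involved, so sequences of different lengths can be compared directly.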
Notes
Nearest distance (single linkage method)
Furthest distance (complete linkage method)
Unweighted pair group method average (UPGMA, group average)
Weighted pair group method average (WPGMA)
Unweighted pair group method centroid (UPGMC)
Weighted pair group method centroid (WPGMC)
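The six linkage rules named in these notes can all be written as Lance-Williams updates of a pairwise distance matrix. The following Python sketch illustrates that idea under stated assumptions and does not reproduce the authors' MATLAB pipeline. The names `lance_williams` and `agglomerate` are hypothetical, and the centroid and median rules are applied to squared distances, which is strictly justified only for Euclidean distances and not for a Kullback-Leibler-based distance.

```python
from itertools import combinations


def lance_williams(method, n_i, n_j):
    """Return (alpha_i, alpha_j, beta, gamma, use_squared) for a linkage rule."""
    total = n_i + n_j
    table = {
        "single":   (0.5, 0.5, 0.0, -0.5, False),           # nearest distance
        "complete": (0.5, 0.5, 0.0, 0.5, False),            # furthest distance
        "average":  (n_i / total, n_j / total, 0.0, 0.0, False),  # UPGMA
        "weighted": (0.5, 0.5, 0.0, 0.0, False),            # WPGMA
        "centroid": (n_i / total, n_j / total,
                     -n_i * n_j / total ** 2, 0.0, True),   # UPGMC
        "median":   (0.5, 0.5, -0.25, 0.0, True),           # WPGMC
    }
    return table[method]


def agglomerate(dist, method, n_clusters):
    """Merge items bottom-up until n_clusters remain.

    dist is a symmetric matrix (list of lists); the result is a sorted list
    of clusters, each a sorted list of item indices.
    """
    n = len(dist)
    squared = lance_williams(method, 1, 1)[4]
    # Working dissimilarities, keyed by the unordered pair of cluster ids.
    d = {}
    for a, b in combinations(range(n), 2):
        d[frozenset((a, b))] = dist[a][b] ** 2 if squared else dist[a][b]
    members = {i: [i] for i in range(n)}
    next_id = n
    while len(members) > n_clusters:
        pair = min(d, key=d.get)  # closest pair of current clusters
        i, j = sorted(pair)
        a_i, a_j, beta, gamma, _ = lance_williams(
            method, len(members[i]), len(members[j]))
        d_ij = d.pop(pair)
        for k in list(members):
            if k in (i, j):
                continue
            d_ki = d.pop(frozenset((k, i)))
            d_kj = d.pop(frozenset((k, j)))
            # Lance-Williams update for the distance to the merged cluster.
            d[frozenset((k, next_id))] = (a_i * d_ki + a_j * d_kj
                                          + beta * d_ij
                                          + gamma * abs(d_ki - d_kj))
        members[next_id] = members.pop(i) + members.pop(j)
        next_id += 1
    return sorted(sorted(m) for m in members.values())
```

Running `agglomerate` with each of the six method names on the same distance matrix gives six partitions that can then be compared or combined.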
Acknowledgments
The authors express their gratitude to Kamyar Shiouee for his comments on the MATLAB code, Dr. Saeid Ansari-Mahyari for providing the necessary infrastructure and resources, and Professor Ahmad Oryan for his support.
Author information
Contributions
Houshang Dehghanzadeh ran the algorithm, wrote the paper, and conducted the literature review.
Mostafa Ghaderi-Zefrehei was the principal investigator of this study, developed the ideas from scratch, conducted the computational pipelines, and participated in writing the paper.
Seyed Ziaeddin Mirhoseini and Saeid Esmaeilkhaniyan performed data collection and explored the results.
Ishaku Lemu Haruna edited the paper.
Hamed Amirpour-Najafabadi was involved in drafting the manuscript, revising it critically for important intellectual content, and giving final approval of the version to be published.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with animals performed by any of the authors.
Additional information
Communicated by: Maciej Szydlowski
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Dehghanzadeh, H., Ghaderi-Zefrehei, M., Mirhoseini, S.Z. et al. A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering. J Appl Genetics 61, 231–238 (2020). https://doi.org/10.1007/s13353-020-00543-x
DOI: https://doi.org/10.1007/s13353-020-00543-x