A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering

Dehghanzadeh, Houshang; Ghaderi-Zefrehei, Mostafa; Mirhoseini, Seyed Ziaeddin; Esmaeilkhaniyan, Saeid; Haruna, Ishaku Lemu; Amirpour Najafabadi, Hamed

doi:10.1007/s13353-020-00543-x

Animal Genetics • Original Paper
Published: 24 January 2020

Volume 61, pages 231–238, (2020)

Journal of Applied Genetics

Houshang Dehghanzadeh¹,
Mostafa Ghaderi-Zefrehei ORCID: orcid.org/0000-0002-9710-883X²,
Seyed Ziaeddin Mirhoseini³,
Saeid Esmaeilkhaniyan⁴,
Ishaku Lemu Haruna⁵ &
Hamed Amirpour Najafabadi ORCID: orcid.org/0000-0002-6057-5514⁵

Abstract

Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of the uncertainty in a set of information. In this study, the entropy of each gene and of its set of exons was calculated at orders one to four. The Kullback-Leibler divergence was then calculated from the relative entropy of the genes and exons. The resulting Kullback-Leibler distances for the gene and exon sets were used as input to seven clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. The AdaBoost algorithm was used to aggregate the clustering results. Finally, the AdaBoost results were examined with the GeneMANIA prediction server to assess them from a gene annotation point of view. All calculations were performed in MATLAB (2015). Examination of the metabolic pathways of the clustered genes, based on their gene annotations, indicated that the proposed clustering method yielded correct, logical, and fast results. Because the method does not rely on alignment, it avoids the disadvantages of alignment, allows genes to be considered with their actual length and content, and does not require large amounts of memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
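As a rough illustration of the first two steps described in the abstract, the sketch below computes an order-k entropy and a Kullback-Leibler distance between two DNA sequences in Python. It is not the authors' MATLAB code. It assumes that "order k" means the distribution of overlapping k-letter words, that a pseudocount is added so that every word has non-zero probability, and that the distance is the symmetrised sum of the two directed divergences; the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"


def kmer_distribution(seq, k, pseudocount=1.0):
    """Relative frequencies of all 4**k overlapping k-mers in seq.

    The pseudocount is an assumption of this sketch: it keeps every
    probability non-zero so that the KL divergence stays finite.
    Words containing letters outside ACGT are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def entropy(dist):
    """Shannon entropy in bits of a word distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def kl_divergence(p, q):
    """Directed divergence D(p || q) in bits; q must be non-zero wherever p is."""
    return sum(p[w] * log2(p[w] / q[w]) for w in p if p[w] > 0)


def symmetric_kl(seq_a, seq_b, k):
    """Symmetrised KL distance between the order-k word distributions."""
    p = kmer_distribution(seq_a, k)
    q = kmer_distribution(seq_b, k)
    return kl_divergence(p, q) + kl_divergence(q, p)


if __name__ == "__main__":
    gene_a = "ACGT" * 10   # toy sequences, not real genes
    gene_b = "AAAC" * 10
    for k in range(1, 5):  # orders one to four, as in the abstract
        print(k, entropy(kmer_distribution(gene_a, k)), symmetric_kl(gene_a, gene_b, k))
```

Evaluating `symmetric_kl` for every pair of genes gives a distance matrix without any alignment step, and the word counts need memory proportional to 4**k rather than to the sequence length.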

Notes

Nearest distance (single linkage method)
Furthest distance (complete linkage method)
Unweighted pair group method average (UPGMA, group average)
Weighted pair group method average (WPGMA)
Unweighted pair group method centroid (UPGMC)
Weighted pair group method centroid (WPGMC)
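
The first four linkage rules listed above differ only in how the distance from a newly merged cluster to every other cluster is updated. The Python sketch below shows this on a precomputed distance matrix, using the standard Lance-Williams updates. It is a naive illustration, not the MATLAB routine used in the study, and the function name `agglomerate` is hypothetical. The centroid and median rules are left out because their usual update formulas assume squared Euclidean distances.

```python
def agglomerate(dist, method="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the merges in order as (members_a, members_b, distance) tuples,
    where members are tuples of original item indices.
    """
    if method not in ("single", "complete", "average", "weighted"):
        raise ValueError("unsupported method: " + method)
    # Active clusters: id -> tuple of original item indices.
    clusters = {i: (i,) for i in range(len(dist))}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(dist)
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        na, nb = len(clusters[a]), len(clusters[b])
        updated = {}
        for c in clusters:
            if c in (a, b):
                continue
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            if method == "single":        # nearest distance
                updated[c] = min(dac, dbc)
            elif method == "complete":    # furthest distance
                updated[c] = max(dac, dbc)
            elif method == "average":     # UPGMA: weight by cluster size
                updated[c] = (na * dac + nb * dbc) / (na + nb)
            else:                         # WPGMA: plain mean of the two
                updated[c] = (dac + dbc) / 2
        merges.append((clusters[a], clusters[b], dab))
        # Drop everything involving a or b, then register the merged cluster.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        merged = clusters.pop(a) + clusters.pop(b)
        for c, v in updated.items():
            d[(c, next_id)] = v           # next_id is larger than any live id
        clusters[next_id] = merged
        next_id += 1
    return merges


if __name__ == "__main__":
    toy = [[0, 1, 4, 5],
           [1, 0, 3, 6],
           [4, 3, 0, 2],
           [5, 6, 2, 0]]
    for name in ("single", "complete", "average", "weighted"):
        print(name, agglomerate(toy, name))
```

On the toy matrix all four rules merge items 0 and 1 first and items 2 and 3 second; they differ only in the height of the final merge, which is what changes the shape of the resulting dendrogram.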

Acknowledgments

The authors express their gratitude to Kamyar Shiouee for his comments on the MATLAB code, Dr. Saeid Ansari-Mahyari for providing the necessary infrastructure and resources, and Professor Ahmad Oryan for his support.

Author information

Authors and Affiliations

Department of Animal Science Research, Guilan Agricultural and Natural Resources Research and Education Center, AREEO, Rasht, Iran
Houshang Dehghanzadeh
Department of Animal Science, Faculty of Agriculture, University of Yasouj, P. O. Box: 75914, Yasouj, Iran
Mostafa Ghaderi-Zefrehei
Department of Animal Science, Faculty of Agricultural Sciences, University of Guilan, Rasht, Iran
Seyed Ziaeddin Mirhoseini
Animal Science Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Karaj, Iran
Saeid Esmaeilkhaniyan
Faculty of Agriculture and Life Sciences, Lincoln University, Lincoln, New Zealand
Ishaku Lemu Haruna & Hamed Amirpour Najafabadi

Contributions

Houshang Dehghanzadeh ran the algorithm, wrote the paper, and carried out the literature review.

Mostafa Ghaderi-Zefrehei was the principal investigator of this study, developed the ideas from scratch, conducted the computational pipelines, and participated in writing the paper.

Seyed Ziaeddin Mirhoseini and Saeid Esmaeilkhaniyan performed data collection and explored the results.

Ishaku Lemu Haruna edited the paper.

Hamed Amirpour-Najafabadi was involved in drafting the manuscript, revising it critically for important intellectual content, and giving final approval of the version to be published.

Corresponding author

Correspondence to Mostafa Ghaderi-Zefrehei.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with animals performed by any of the authors.

Additional information

Communicated by: Maciej Szydlowski

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

ESM 1 (JPG 50 kb)
ESM 2 (JPG 47 kb)
ESM 3 (JPG 45 kb)
ESM 4 (JPG 45 kb)
ESM 5 (JPG 52 kb)
ESM 6 (JPG 85 kb)
ESM 7 (JPG 83 kb)
ESM 8 (JPG 82 kb)
ESM 9 (JPG 87 kb)
ESM 10 (JPG 84 kb)
ESM 11 (JPG 89 kb)
ESM 12 (DOCX 30 kb)
ESM 13 (XLSX 11 kb)
ESM 14 (XLSX 14 kb)
ESM 15 (XLSX 60 kb)
ESM 16 (XLSX 13 kb)

About this article

Cite this article

Dehghanzadeh, H., Ghaderi-Zefrehei, M., Mirhoseini, S.Z. et al. A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering. J Appl Genetics 61, 231–238 (2020). https://doi.org/10.1007/s13353-020-00543-x

Received: 20 May 2019
Revised: 07 September 2019
Accepted: 08 January 2020
Published: 24 January 2020
Issue Date: May 2020
DOI: https://doi.org/10.1007/s13353-020-00543-x
