On Entropy-Based Data Mining

  • Andreas Holzinger
  • Matthias Hörtenhuber
  • Christopher Mayer
  • Martin Bachler
  • Siegfried Wassertheurer
  • Armando J. Pinho
  • David Koslicki
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8401)

Abstract

In the real world, we are confronted not only with complex and high-dimensional data sets, but usually also with noisy, incomplete, and uncertain data, where the application of traditional methods of knowledge discovery and data mining always entails the danger of modeling artifacts. Information entropy was originally introduced by Shannon (1949) as a measure of uncertainty in the data. Since then, many different entropy methods have emerged, serving a large number of different purposes and possible application areas. In this paper, we briefly discuss the applicability of entropy methods for use in knowledge discovery and data mining, with particular emphasis on biomedical data. We present a very short overview of the state of the art, focusing on four methods: Approximate Entropy (ApEn), Sample Entropy (SampEn), Fuzzy Entropy (FuzzyEn), and Topological Entropy (FiniteTopEn). Finally, we discuss some open problems and future research challenges.

Keywords

Entropy; Data Mining; Knowledge Discovery; Topological Entropy; FiniteTopEn; Approximate Entropy; Fuzzy Entropy; Sample Entropy; Biomedical Informatics

References

  1. Holzinger, A.: On knowledge discovery and interactive intelligent visualization of biomedical data - challenges in human computer interaction and biomedical informatics. In: DATA 2012, vol. 1, pp. 9–20. INSTICC (2012)
  2. Downarowicz, T.: Entropy in dynamical systems, vol. 18. Cambridge University Press, Cambridge (2011)
  3. Shannon, C.E., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana (1949)
  4. Pincus, S.M.: Approximate entropy as a measure of system complexity. Proceedings of the National Academy of Sciences 88(6), 2297–2301 (1991)
  5. Pincus, S.: Approximate entropy (ApEn) as a complexity measure. Chaos: An Interdisciplinary Journal of Nonlinear Science 5(1), 110–117 (1995)
  6. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Comput. Surv. 41(3), 1–58 (2009)
  7. Batini, C., Scannapieco, M.: Data Quality: Concepts, Methodologies and Techniques. Springer, Berlin (2006)
  8. Holzinger, A., Simonic, K.-M. (eds.): Information Quality in e-Health. LNCS, vol. 7058. Springer, Heidelberg (2011)
  9. Kim, W., Choi, B.J., Hong, E.K., Kim, S.K., Lee, D.: A taxonomy of dirty data. Data Mining and Knowledge Discovery 7(1), 81–99 (2003)
  10. Gschwandtner, T., Gärtner, J., Aigner, W., Miksch, S.: A taxonomy of dirty time-oriented data. In: Quirchmayr, G., Basl, J., You, I., Xu, L., Weippl, E. (eds.) CD-ARES 2012. LNCS, vol. 7465, pp. 58–72. Springer, Heidelberg (2012)
  11. Clausius, R.: On the motive power of heat, and on the laws which can be deduced from it for the theory of heat. Poggendorff's Annalen der Physik LXXIX (1850)
  12. Sethna, J.P.: Statistical mechanics: Entropy, order parameters, and complexity, vol. 14. Oxford University Press, New York (2006)
  13. Jaynes, E.T.: Information theory and statistical mechanics. Physical Review 106(4), 620 (1957)
  14. Golan, A.: Information and entropy econometrics: A review and synthesis. Now Publishers Inc. (2008)
  15. Holzinger, A.: Biomedical Informatics: Discovering Knowledge in Big Data. Springer, New York (2014)
  16. Jaynes, E.T.: Information theory and statistical mechanics. Physical Review 106(4), 620 (1957)
  17. Mowshowitz, A.: Entropy and the complexity of graphs: I. An index of the relative complexity of a graph. The Bulletin of Mathematical Biophysics 30(1), 175–204 (1968)
  18. Körner, J.: Coding of an information source having ambiguous alphabet and the entropy of graphs. In: 6th Prague Conference on Information Theory, pp. 411–425 (1973)
  19. Holzinger, A., Ofner, B., Stocker, C., Calero Valdez, A., Schaar, A.K., Ziefle, M., Dehmer, M.: On graph entropy measures for knowledge discovery from publication network data. In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES 2013. LNCS, vol. 8127, pp. 354–362. Springer, Heidelberg (2013)
  20. Dehmer, M., Mowshowitz, A.: A history of graph entropy measures. Information Sciences 181(1), 57–78 (2011)
  21. Posner, E.C.: Random coding strategies for minimum entropy. IEEE Transactions on Information Theory 21(4), 388–391 (1975)
  22. Yuan, L., Kesavan, H.: Minimum entropy and information measure. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 28(3), 488–491 (1998)
  23. Rubinstein, R.Y.: Optimization of computer simulation models with rare events. European Journal of Operational Research 99(1), 89–112 (1997)
  24. De Boer, P.T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A tutorial on the cross-entropy method. Annals of Operations Research 134(1), 19–67 (2005)
  25. Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics 52(1-2), 479–487 (1988)
  26. de Albuquerque, M.P., Esquef, I.A., Mello, A.R.G., de Albuquerque, M.P.: Image thresholding using Tsallis entropy. Pattern Recognition Letters 25(9), 1059–1065 (2004)
  27. Richman, J.S., Moorman, J.R.: Physiological time-series analysis using approximate entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. 278(6), H2039–H2049 (2000)
  28. Chen, W., Wang, Z., Xie, H., Yu, W.: Characterization of surface EMG signal based on fuzzy entropy. IEEE Transactions on Neural Systems and Rehabilitation Engineering 15(2), 266–272 (2007)
  29. Liu, C., Li, K., Zhao, L., Liu, F., Zheng, D., Liu, C., Liu, S.: Analysis of heart rate variability using fuzzy measure entropy. Comput. Biol. Med. 43(2), 100–108 (2013)
  30. Adler, R.L., Konheim, A.G., McAndrew, M.H.: Topological entropy. Transactions of the American Mathematical Society 114(2), 309–319 (1965)
  31. Adler, R., Downarowicz, T., Misiurewicz, M.: Topological entropy. Scholarpedia 3(2), 2200 (2008)
  32. Koslicki, D.: Topological entropy of DNA sequences. Bioinformatics 27(8), 1061–1067 (2011)
  33. Solomonoff, R.J.: A formal theory of inductive inference. Part I. Information and Control 7(1), 1–22 (1964)
  34. Solomonoff, R.J.: A formal theory of inductive inference. Part II. Information and Control 7(2), 224–254 (1964)
  35. Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Problems of Information Transmission 1(1), 1–7 (1965)
  36. Chaitin, G.J.: On the length of programs for computing finite binary sequences. Journal of the ACM 13, 547–569 (1966)
  37. Acharya, U.R., Molinari, F., Sree, S.V., Chattopadhyay, S., Ng, K.-H., Suri, J.S.: Automated diagnosis of epileptic EEG using entropies. Biomedical Signal Processing and Control 7(4), 401–408 (2012)
  38. Hornero, R., Aboy, M., Abasolo, D., McNames, J., Wakeland, W., Goldstein, B.: Complex analysis of intracranial hypertension using approximate entropy. Crit. Care. Med. 34(1), 87–95 (2006)
  39. Batchinsky, A.I., Salinas, J., Cancio, L.C., Holcomb, J.: Assessment of the need to perform life-saving interventions using comprehensive analysis of the electrocardiogram and artificial neural networks. Use of Advanced Technologies and New Procedures in Medical Field Operations 39, 1–16 (2010)
  40. Sarlabous, L., Torres, A., Fiz, J.A., Gea, J., Martínez-Llorens, J.M., Morera, J., Jané, R.: Interpretation of the approximate entropy using fixed tolerance values as a measure of amplitude variations in biomedical signals. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 5967–5970 (2010)
  41. Yentes, J., Hunt, N., Schmid, K., Kaipust, J., McGrath, D., Stergiou, N.: The appropriate use of approximate entropy and sample entropy with short data sets. Annals of Biomedical Engineering 41(2), 349–365 (2013)
  42. Roerdink, M., De Haart, M., Daffertshofer, A., Donker, S.F., Geurts, A.C., Beek, P.J.: Dynamical structure of center-of-pressure trajectories in patients recovering from stroke. Exp. Brain Res. 174(2), 256–269 (2006)
  43. Clift, B., Haussler, D., McConnell, R., Schneider, T.D., Stormo, G.D.: Sequence landscapes. Nucleic Acids Research 14(1), 141–158 (1986)
  44. Schneider, T.D., Stephens, R.M.: Sequence logos: A new way to display consensus sequences. Nucleic Acids Research 18(20), 6097–6100 (1990)
  45. Pinho, A.J., Garcia, S.P., Pratas, D., Ferreira, P.J.S.G.: DNA sequences at a glance. PLoS ONE 8(11), e79922 (2013)
  46. Jeffrey, H.J.: Chaos game representation of gene structure. Nucleic Acids Research 18(8), 2163–2170 (1990)
  47. Goldman, N.: Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Research 21(10), 2487–2491 (1993)
  48. Oliver, J.L., Bernaola-Galván, P., Guerrero-García, J., Román-Roldán, R.: Entropic profiles of DNA sequences through chaos-game-derived images. Journal of Theoretical Biology 160, 457–470 (1993)
  49. Vinga, S., Almeida, J.S.: Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics 8(393) (2007)
  50. Crochemore, M., Vérin, R.: Zones of low entropy in genomic sequences. Computers & Chemistry, 275–282 (1999)
  51. Allison, L., Stern, L., Edgoose, T., Dix, T.I.: Sequence complexity for biological sequence analysis. Computers & Chemistry 24, 43–55 (2000)
  52. Stern, L., Allison, L., Coppel, R.L., Dix, T.I.: Discovering patterns in Plasmodium falciparum genomic DNA. Molecular & Biochemical Parasitology 118, 174–186 (2001)
  53. Cao, M.D., Dix, T.I., Allison, L., Mears, C.: A simple statistical algorithm for biological sequence compression. In: Proc. of the Data Compression Conf., DCC 2007, Snowbird, Utah, pp. 43–52 (March 2007)
  54. Dix, T.I., Powell, D.R., Allison, L., Bernal, J., Jaeger, S., Stern, L.: Comparative analysis of long DNA sequences by per element information content using different contexts. BMC Bioinformatics 8(suppl. 2), 10 (2007)
  55. Grumbach, S., Tahi, F.: Compression of DNA sequences. In: Proc. of the Data Compression Conf., DCC 93, Snowbird, Utah, pp. 340–350 (1993)
  56. Rivals, E., Delgrange, O., Delahaye, J.-P., Dauchet, M., Delorme, M.-O., Hénaut, A., Ollivier, E.: Detection of significant patterns by compression algorithms: The case of approximate tandem repeats in DNA sequences. Computer Applications in the Biosciences 13, 131–136 (1997)
  57. Gusev, V.D., Nemytikova, L.A., Chuzhanova, N.A.: On the complexity measures of genetic sequences. Bioinformatics 15(12), 994–999 (1999)
  58. Nan, F., Adjeroh, D.: On the complexity measures for biological sequences. In: Proc. of the IEEE Computational Systems Bioinformatics Conference, CSB-2004, Stanford, CA (August 2004)
  59. Pirhaji, L., Kargar, M., Sheari, A., Poormohammadi, H., Sadeghi, M., Pezeshk, H., Eslahchi, C.: The performances of the chi-square test and complexity measures for signal recognition in biological sequences. Journal of Theoretical Biology 251(2), 380–387 (2008)
  60. Turing, A.: On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society 42(2), 230–265 (1936)
  61. Li, M., Vitányi, P.: An introduction to Kolmogorov complexity and its applications, 3rd edn. Springer (2008)
  62. Chen, X., Kwong, S., Li, M.: A compression algorithm for DNA sequences and its applications in genome comparison. In: Asai, K., Miyano, S., Takagi, T. (eds.) Proc. of the 10th Workshop, Genome Informatics 1999, Tokyo, Japan, pp. 51–61 (1999)
  63. Pinho, A.J., Ferreira, P.J.S.G., Neves, A.J.R., Bastos, C.A.C.: On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS ONE 6(6), e21588 (2011)
  64. Pinho, A.J., Garcia, S.P., Ferreira, P.J.S.G., Afreixo, V., Bastos, C.A.C., Neves, A.J.R., Rodrigues, J.M.O.S.: Exploring homology using the concept of three-state entropy vector. In: Dijkstra, T.M.H., Tsivtsivadze, E., Marchiori, E., Heskes, T. (eds.) PRIB 2010. LNCS (LNBI), vol. 6282, pp. 161–170. Springer, Heidelberg (2010)
  65. Garcia, S.P., Rodrigues, J.M.O.S., Santos, S., Pratas, D., Afreixo, V., Bastos, C.A.C., Ferreira, P.J.S.G., Pinho, A.J.: A genomic distance for assembly comparison based on compressed maximal exact matches. IEEE/ACM Trans. on Computational Biology and Bioinformatics 10(3), 793–798 (2013)
  66. Holzinger, A., Stocker, C., Peischl, B., Simonic, K.M.: On using entropy for enhancing handwriting preprocessing. Entropy 14(11), 2324–2350 (2012)
  67. Holzinger, A., Dehmer, M., Jurisica, I.: Knowledge discovery and interactive data mining in bioinformatics - state-of-the-art, future challenges and research directions. BMC Bioinformatics 15(suppl. 6), 11 (2014)
  68. Zhou, Z., Feng, L.: Twelve open problems on the exact value of the Hausdorff measure and on topological entropy: A brief survey of recent results. Nonlinearity 17(2), 493–502 (2004)
  69. Chon, K., Scully, C.G., Lu, S.: Approximate entropy for all signals. IEEE Eng. Med. Biol. Mag. 28(6), 18–23 (2009)
  70. Liu, C., Liu, C., Shao, P., Li, L., Sun, X., Wang, X., Liu, F.: Comparison of different threshold values r for approximate entropy: Application to investigate the heart rate variability between heart failure and healthy control groups. Physiol. Meas. 32(2), 167–180 (2011)
  71. Mayer, C., Bachler, M., Hörtenhuber, M., Stocker, C., Holzinger, A., Wassertheurer, S.: Selection of entropy-measure parameters for knowledge discovery in heart rate variability data. BMC Bioinformatics 15
  72. Boskovic, A., Loncar-Turukalo, T., Japundzic-Zigon, N., Bajic, D.: The flip-flop effect in entropy estimation, pp. 227–230 (2011)

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Andreas Holzinger (1)
  • Matthias Hörtenhuber (2)
  • Christopher Mayer (2)
  • Martin Bachler (2)
  • Siegfried Wassertheurer (2)
  • Armando J. Pinho (3)
  • David Koslicki (4)

  1. Institute for Medical Informatics, Statistics & Documentation, Research Unit Human-Computer Interaction, Medical University Graz, Graz, Austria
  2. Health & Environment Department, Biomedical Systems, AIT Austrian Institute of Technology GmbH, Vienna, Austria
  3. IEETA / Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
  4. Mathematics Department, Oregon State University, Corvallis, USA
