An Empirical Study of Strategies Boosts Performance of Mutual Information Similarity

  • Ole Kristian Ekseth
  • Svein-Olav Hvasshovd
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10842)

Abstract

In recent years, mutual information based measures have become widely popular. The mutual information MINE measure is asserted to be the best strategy for identifying relationships in challenging data sets. A major weakness of the MINE similarity metric is its high execution time. To address this performance issue, numerous approaches have been suggested, covering both improved software implementations and simplified heuristics. However, none of these approaches manages to resolve the high execution time of MINE computation.

In this work, we address the latter issue. This paper presents a novel MINE implementation that achieves a performance increase of more than 530x compared to established approaches. The novel high-performance approach is the result of a structural evaluation of more than 30 different MINE software implementations, none of which make use of simplified heuristics. Hence, the proposed strategy for computing MINE mutual information is both accurate and fast. The novel mutual information MINE software is available at https://bitbucket.org/oekseth/mine-data-analysis/downloads/. To broaden its applicability, the high-performance MINE metric is integrated into the hpLysis machine learning library (https://bitbucket.org/oekseth/hplysis-cluster-analysis-software).

Notes

Acknowledgements

The authors would like to thank MD K.I. Ekseth at UIO, Dr. O.V. Solberg at SINTEF, Dr. S.A. Aase at GE Healthcare, MD B.H. Helleberg at NTNU–medical, Dr. Y. Dahl, Dr. T. Aalberg, and K.T. Dragland at NTNU, and Professor P. Sætrom and the High Performance Computing Group at NTNU for their support.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science (IDI), NTNU, Trondheim, Norway
