
An Information-theoretic approach to dimensionality reduction in data science


Abstract

Data reduction is crucial for turning large datasets into information, the major purpose of data science. The classic and rich area of dimensionality reduction (DR) has traditionally been based on feature extraction, combining primary features linearly with the aim of preserving covariance/correlations between features. Nonlinear alternatives have been developed, including information-theoretic approaches using mutual information as well as conditional entropy relative to target features. Here, we further this approach with feature selection/reduction strategies based on the concept of conditional Shannon entropy of two random variables. Novel results include (a) a dimensionality reduction method, in two variants, based on conditional entropy between the predictors themselves, disregarding the influence of the target feature; (b) an error-prevention method for DR with genomic data, inspired by error detection and correction in information theory, that can be used for abiotic data as well; and (c) a comparative assessment of the performance of several machine learning models on input features selected by these methods. We assess the quality of the techniques by their performance in solving three application problems of varying degrees of difficulty (Malware Classification, BioTaxonomy, and Noisy Classification), with competitive outcomes. The analysis of the results yields some useful heuristics and suggests several problems of interest for further research.
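To give a concrete flavor of a predictor-to-predictor conditional-entropy criterion of the kind described above, the Python sketch below greedily keeps features that remain most unpredictable given the features already selected. It is a minimal sketch under assumed design choices (equal-width discretization, a greedy minimum-conditional-entropy score, starting from the first column), not the paper's exact method; the helper names conditional_entropy and select_features are illustrative.

```python
import numpy as np

def conditional_entropy(x, y):
    """Empirical conditional entropy H(X | Y) in bits for two discrete-valued arrays."""
    _, joint_counts = np.unique(np.column_stack([x, y]), axis=0, return_counts=True)
    _, y_counts = np.unique(y, return_counts=True)
    p_xy = joint_counts / joint_counts.sum()
    p_y = y_counts / y_counts.sum()
    h_xy = -np.sum(p_xy * np.log2(p_xy))   # joint entropy H(X, Y)
    h_y = -np.sum(p_y * np.log2(p_y))      # marginal entropy H(Y)
    return h_xy - h_y                      # chain rule: H(X | Y) = H(X, Y) - H(Y)

def select_features(X, k, bins=10):
    """Greedily pick k columns of X that stay least predictable from those already chosen."""
    # Discretize each (possibly continuous) column into equal-width bins.
    Xd = np.stack([np.digitize(col, np.histogram_bin_edges(col, bins=bins)[1:-1])
                   for col in X.T], axis=1)
    selected = [0]  # simple start; one could instead begin with the highest-entropy column
    while len(selected) < k:
        candidates = [j for j in range(Xd.shape[1]) if j not in selected]
        # Score each candidate by how unpredictable it remains given any selected feature.
        scores = [min(conditional_entropy(Xd[:, j], Xd[:, s]) for s in selected)
                  for j in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))   # synthetic data for illustration only
    print(select_features(X, k=6))
```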

Data availability

All data for the Malware Classification (MC) and BioTaxonomy (BT) problems are publicly available. The synthetic dataset used in the Noisy Classification problem is provided in the supplementary materials.

Code Availability Statement

The software used to run these applications is either part of the scikit-learn (sklearn) package or is publicly available at bmc.memphis.edu/DSAx/, where 3D plots allowing a full comparison of the results across all models and datasets can also be found.
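The comparative assessment across machine learning models can be reproduced in spirit with standard scikit-learn estimators. The sketch below is an assumed setup only (the particular models, hyperparameters, and the helper name compare_models are illustrative, not the paper's configuration): it cross-validates a few classifiers on the columns retained by a feature-selection step such as the one sketched after the abstract.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def compare_models(X, y, selected, cv=5):
    """Cross-validated accuracy of several sklearn classifiers on a reduced feature subset."""
    models = {
        "svm": make_pipeline(StandardScaler(), SVC()),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "mlp": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
    }
    X_sel = X[:, selected]  # columns chosen by a DR/feature-selection step
    return {name: cross_val_score(model, X_sel, y, cv=cv).mean()
            for name, model in models.items()}
```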

Acknowledgements

The use of high-performance computing (HPC) facilities at the University of Memphis for processing datasets and training models is gratefully acknowledged. We are also grateful to the reviewers for valuable comments that resulted in substantial improvements to the quality and presentation of this work.

Funding

Not applicable.

Author information

Corresponding author

Correspondence to Sambriddhi Mainali.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Table 5 Performance of DR methods reducing to 6 features across various solution methods for the Malware Classification problem
Table 6 Performance of DR methods reducing to 12 features across various solution methods for the Malware Classification problem
Table 7 Performance of DR methods reducing to 6 features across various solution methods for the BioTaxonomy problem
Table 8 Performance of DR methods reducing to 12 features across various solution methods for the BioTaxonomy problem
Table 9 Performance of DR methods reducing to 6 features across various solution methods for the Noisy Classification problem (SYN13)
Table 10 Performance of DR methods reducing to 12 features across various solution methods for the Noisy Classification problem (SYN13)
Table 11 Performance of DR methods reducing to 6 features across various solution methods for the Noisy Classification problem (SYN22)
Table 12 Performance of DR methods reducing to 12 features across various solution methods for the Noisy Classification problem (SYN22)

About this article

Cite this article

Mainali, S., Garzon, M., Venugopal, D. et al. An Information-theoretic approach to dimensionality reduction in data science. Int J Data Sci Anal 12, 185–203 (2021). https://doi.org/10.1007/s41060-021-00272-2
