Textual data dimensionality reduction - a deep learning approach

  • Neetu Kushwaha
  • Millie Pant


The growth of the Internet has produced a high volume of natural language textual data. Such data can be sparse and may contain uninformative features that increase its dimensionality. This high dimensionality, in turn, decreases the efficiency of text mining tasks such as clustering. Transforming high-dimensional data into a lower-dimensional representation is therefore an important pre-processing step before clustering. In this paper, a dimensionality reduction method based on a deep autoencoder neural network, named DRDAE, is proposed to provide optimized and robust features for text clustering. DRDAE selects a less correlated and salient feature space from the high-dimensional feature space. To evaluate the proposed algorithm, k-means is used to cluster text documents. The proposed method is tested on five benchmark text datasets. Simulation results demonstrate that the proposed algorithm clearly outperforms other conventional dimensionality reduction methods in the literature in terms of the RI measure.
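As an illustration of the evaluation step, the following minimal sketch computes a pairwise agreement score between ground-truth classes and cluster assignments. It assumes that RI denotes the Rand index, which the abstract does not spell out. The function name `rand_index` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Return the fraction of sample pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two samples
    in the same group, or both put them in different groups.
    Assumption: this is the plain (unadjusted) Rand index.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agreements = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agreements += 1
    return agreements / len(pairs)


# Toy example (hypothetical data): six documents, three true classes.
true_classes = [0, 0, 1, 1, 2, 2]
kmeans_output = [1, 1, 0, 0, 0, 2]  # cluster ids are arbitrary
score = rand_index(true_classes, kmeans_output)
```

Because the score depends only on whether pairs are grouped together, it is unaffected by how the cluster identifiers happen to be numbered, which makes it suitable for comparing k-means output against class labels.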


Autoencoder, Clustering, Feature extraction, Dimensionality reduction




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of ASE, Indian Institute of Technology Roorkee, Roorkee, India
