Feature Selection and Clustering of Documents Using Random Feature Set Generation Technique

  • A. Christy
  • G. Meera Gandhi
Conference paper
Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 37)


In this digital world, information exists in abundance. Document clustering and other pattern-mining tasks require a subset of relevant features for analysis. Feature sets, each consisting of one or more words drawn from the content of a document, play a vital role in document clustering. This paper proposes a filter-based feature selection algorithm, random feature set generation (RFG), for document clustering. During feature selection, the quality of each selected feature is checked using a quality metric; the best-quality terms are used as the basis for document clustering, while the worst terms and the frequently occurring common terms in the corpus are ignored. Exhaustive search over all candidate feature sets is infeasible because of the computational effort it demands. The advantage of the RFG approach lies in selecting “good” quality terms for document clustering. The resulting feature sets are used to cluster documents with the K-means and X-means algorithms. Experimental results show that the “good” quality terms filtered by RFG do not rely on monotonicity assumptions and show a more positive correlation than the feature sets obtained using sequential forward feature set generation.
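As a rough illustration of the idea summarised above, the following Python sketch samples random term subsets, scores each with a quality metric, and keeps the best one after discarding very common terms. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not define the quality metric, so `set_quality` is a hypothetical stand-in, and the names `random_feature_sets`, `TOY_DOCS`, `set_size`, `n_trials` and `max_df` are likewise invented for this example.

```python
import random
from collections import Counter

# Toy corpus (illustrative only): each document is a list of tokens.
TOY_DOCS = [
    ["the", "cat", "purrs"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "barks"],
    ["the", "dog", "runs"],
]


def set_quality(feature_set, docs):
    """Hypothetical quality metric (an assumption, not from the paper):
    the fraction of documents that contain exactly one term of the
    feature set, so sets that split the corpus cleanly score near 1."""
    terms = set(feature_set)
    hits = sum(1 for doc in docs if len(set(doc) & terms) == 1)
    return hits / len(docs)


def random_feature_sets(docs, set_size=2, n_trials=200, max_df=0.9, seed=0):
    """Sample random term subsets and keep the best-scoring one."""
    if not docs:
        raise ValueError("docs must not be empty")
    rng = random.Random(seed)

    # Document frequency: number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))

    # Filter step: drop terms occurring in more than max_df of the documents.
    vocab = sorted(t for t, n in df.items() if n / len(docs) <= max_df)
    if len(vocab) < set_size:
        raise ValueError("not enough candidate terms after filtering")

    best_set, best_score = None, -1.0
    for _ in range(n_trials):
        # Random generation: no forward or backward ordering of candidates.
        candidate = sorted(rng.sample(vocab, set_size))
        score = set_quality(candidate, docs)
        if score > best_score:
            best_set, best_score = candidate, score
    return best_set, best_score


best_terms, best_score = random_feature_sets(TOY_DOCS)
print(best_terms, best_score)
```

Because each candidate set is drawn independently, the search does not assume that adding a term to a good set keeps it good, which is the monotonicity assumption that sequential forward generation relies on. In a full pipeline the selected terms would then serve as the feature space for K-means or X-means clustering; that step is omitted here.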


Keywords: Feature sets; Clustering; Sequential forward; Random feature set; K-means; X-means



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. School of Computing, Sathyabama Institute of Science and Technology, Chennai, India
