Feature Selection and Clustering of Documents Using Random Feature Set Generation Technique

  • Conference paper
Advances in Data Science and Management

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 37)

Abstract

In this digital world, information exists in abundance everywhere. Document clustering and other pattern-mining tasks require a subset of relevant features for analysis. Feature sets, each consisting of one or more words drawn from the content of a document, play a vital role in document clustering. This paper proposes a filter-based feature selection algorithm, random feature set generation (RFG), for document clustering. During feature selection, the quality of the selected features is checked with a quality metric; the best-quality terms form the basis for document clustering, while the worst terms and the frequently occurring common terms in the corpus are ignored. Exhaustively searching for feature sets is infeasible because of the computational effort it demands. The advantage of the RFG approach lies in selecting “good” quality terms for document clustering. The feature sets thus obtained are used to cluster documents with the K-means and X-means algorithms. Experimental results show that the “good” quality terms filtered by RFG do not rely on monotonicity assumptions and show a stronger positive correlation than the feature sets obtained by sequential forward feature set generation.
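The filter step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the quality metric, so the `quality` function below (variance of per-document hit counts) is a hypothetical stand-in, and the function names, the `max_df_ratio` cutoff for common terms, and the toy corpus are all assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import pvariance


def tokenize(doc):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return doc.lower().split()


def candidate_terms(docs, max_df_ratio=0.8):
    """Vocabulary minus terms that occur in too many documents.

    max_df_ratio is an assumed cutoff for "frequently occurring common terms".
    """
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    limit = max_df_ratio * len(docs)
    return sorted(term for term, count in df.items() if count <= limit)


def quality(feature_set, docs):
    """Hypothetical stand-in metric: variance of per-document hit counts.

    A set whose terms concentrate in some documents and are absent from
    others scores higher than one whose terms are spread evenly.
    """
    hits = [sum(tok in feature_set for tok in tokenize(doc)) for doc in docs]
    return pvariance(hits)


def random_feature_sets(docs, n_sets=50, set_size=2, keep=5, seed=0):
    """Draw random term subsets, score each one, and keep the best few."""
    rng = random.Random(seed)
    vocab = candidate_terms(docs)
    set_size = min(set_size, len(vocab))
    drawn = {frozenset(rng.sample(vocab, set_size)) for _ in range(n_sets)}
    scored = sorted(((quality(fs, docs), sorted(fs)) for fs in drawn), reverse=True)
    return scored[:keep]


# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell sharply",
    "the market rallied as stock prices rose",
]

for score, terms in random_feature_sets(docs):
    print(f"{score:.3f}", terms)
```

Because each subset is drawn and scored independently, the search does not assume that adding a term to a good set keeps it good, which is the monotonicity assumption a sequential forward search relies on. In a fuller pipeline, the documents would then be represented by the retained terms and passed to a clustering algorithm such as K-means.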

Author information

Corresponding author

Correspondence to A. Christy.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Christy, A., Gandhi, G.M. (2020). Feature Selection and Clustering of Documents Using Random Feature Set Generation Technique. In: Borah, S., Emilia Balas, V., Polkowski, Z. (eds) Advances in Data Science and Management. Lecture Notes on Data Engineering and Communications Technologies, vol 37. Springer, Singapore. https://doi.org/10.1007/978-981-15-0978-0_6
