Feature Selection and Clustering of Documents Using Random Feature Set Generation Technique

Christy, A.; Gandhi, G. Meera

doi:10.1007/978-981-15-0978-0_6

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 37)

Abstract

In this digital world, information exists in abundance everywhere. Document clustering and other pattern mining tasks require a subset of relevant features for analysis. Feature sets, which consist of one or more words drawn from the content of a document, play a vital role in document clustering. This paper proposes a filter-based feature selection algorithm for document clustering called random feature set generation (RFG). During feature selection, the selected features are checked for their quality using a quality metric, and the best-quality terms are used as the basis for document clustering, ignoring the worst terms as well as common terms that occur frequently in the corpus. Exhaustively evaluating every candidate feature set is impractical because of the computational effort it demands. The advantage of the RFG approach lies in selecting “good” quality terms for document clustering. The feature sets thus obtained are used for document clustering with the K-means and X-means clustering algorithms. Experimental results show that the “good” quality terms filtered using RFG do not rely on monotonicity assumptions and show a stronger positive correlation than the feature sets obtained using sequential forward feature set generation.
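
The abstract does not give the details of RFG, so the following is only a minimal sketch of the general idea it describes: drop terms that occur in most documents, sample random term subsets, score each subset with a quality metric, and keep the best one as the feature set for clustering. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the `max_df_ratio` cutoff, the fixed subset size, the number of trials, and the use of term-count variance as the quality metric.

```python
import random
from collections import Counter


def candidate_terms(docs, max_df_ratio=0.8):
    """Return sorted terms, dropping those present in too many documents.

    max_df_ratio is an assumed cutoff for "frequently occurring common terms".
    """
    df = Counter(term for doc in docs for term in set(doc))
    limit = max_df_ratio * len(docs)
    return sorted(t for t, n in df.items() if n <= limit)


def term_variance(term, docs):
    """Variance of a term's raw count across documents.

    Placeholder quality metric (an assumption, not the paper's metric).
    """
    counts = [doc.count(term) for doc in docs]
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)


def random_feature_set(docs, set_size=3, trials=200, seed=0):
    """Sample random term subsets and keep the one with the best mean quality."""
    rng = random.Random(seed)
    terms = candidate_terms(docs)
    if not terms:
        return [], 0.0
    set_size = min(set_size, len(terms))
    scores = {t: term_variance(t, docs) for t in terms}
    best_set, best_score = [], float("-inf")
    for _ in range(trials):
        subset = rng.sample(terms, set_size)
        score = sum(scores[t] for t in subset) / set_size
        if score > best_score:
            best_set, best_score = sorted(subset), score
    return best_set, best_score


def vectorize(docs, features):
    """Represent each document by its counts of the selected features."""
    return [[doc.count(t) for t in features] for doc in docs]


# Toy corpus: each document is a list of tokens.
docs = [
    ["the", "cluster", "cluster", "kmeans"],
    ["the", "cluster", "kmeans", "centroid"],
    ["the", "feature", "feature", "filter"],
    ["the", "feature", "filter", "metric"],
]
features, score = random_feature_set(docs)
vectors = vectorize(docs, features)
```

The resulting `vectors` could then be passed to any K-means or X-means implementation. Averaging the per-term score means that adding a term can lower a set's quality, so the search does not assume that larger sets are always at least as good as smaller ones.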

Author information

Authors and Affiliations

School of Computing, Sathyabama Institute of Science and Technology, Chennai, India
A. Christy & G. Meera Gandhi

Corresponding author

Correspondence to A. Christy.

Editor information

Editors and Affiliations

Department of Computer Application, SMIT, Sikkim Manipal University, Sikkim, India
Samarjeet Borah
Department of Automatics and Applied Software at the Faculty of Engineering, Aurel Vlaicu University of Arad, Arad, Romania
Valentina Emilia Balas
Faculty of Technical Sciences, Jan Wyzykowski University, Polkowice, Poland
Zdzislaw Polkowski

About this paper

Cite this paper

Christy, A., Gandhi, G.M. (2020). Feature Selection and Clustering of Documents Using Random Feature Set Generation Technique. In: Borah, S., Emilia Balas, V., Polkowski, Z. (eds) Advances in Data Science and Management. Lecture Notes on Data Engineering and Communications Technologies, vol 37. Springer, Singapore. https://doi.org/10.1007/978-981-15-0978-0_6

DOI: https://doi.org/10.1007/978-981-15-0978-0_6
Published: 14 January 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0977-3
Online ISBN: 978-981-15-0978-0
eBook Packages: Intelligent Technologies and Robotics (R0)
