Abstract
In the digital world, information is abundant. Document clustering and other pattern-mining tasks require a subset of relevant features for analysis. Feature sets, each consisting of one or more words drawn from the content of a document, play a vital role in document clustering. This paper proposes a filter-based feature selection algorithm for document clustering called random feature set generation (RFG). In feature selection, the selected features are checked for quality using a quality metric, and the best-quality terms are used as the basis for document clustering, while the worst terms and the frequently occurring common terms in the corpus are ignored. Exhaustively evaluating every candidate feature set is infeasible because of the computational effort it demands. The advantage of the RFG approach lies in selecting “good” quality terms for document clustering. The feature sets thus obtained are used for document clustering with the K-means and X-means clustering algorithms. Experimental results show that the “good” quality terms filtered using RFG do not rely on monotonicity assumptions, and that they show a stronger positive correlation than the feature sets obtained using sequential forward feature set generation.
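The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the quality metric, so `quality` below is a hypothetical stand-in (the share of distinct term-presence patterns a feature set induces over the documents), and the function names, the `max_df_ratio` cutoff for common terms, and the toy corpus are all assumptions made for illustration.

```python
import random
from collections import Counter


def candidate_terms(docs, max_df_ratio=0.8):
    """Return sorted terms, dropping those found in more than max_df_ratio of the documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    limit = max_df_ratio * len(docs)
    return sorted(t for t, n in df.items() if n <= limit)


def quality(feature_set, docs):
    """Stand-in metric (assumption): share of distinct presence patterns over the documents."""
    patterns = {tuple(t in tokens for t in feature_set) for tokens in docs}
    return len(patterns) / len(docs)


def random_feature_sets(docs, set_size=2, n_trials=50, seed=0):
    """Sample random feature sets and return (set, score) pairs ranked best-first."""
    rng = random.Random(seed)
    terms = candidate_terms(docs)
    doc_sets = [set(tokens) for tokens in docs]
    scored = {}
    for _ in range(n_trials):
        # Each trial draws a set at random instead of growing one term at a time,
        # so no monotonicity of the metric is assumed.
        fs = tuple(sorted(rng.sample(terms, set_size)))
        scored[fs] = quality(fs, doc_sets)
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (hypothetical), already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell sharply".split(),
    "the market rallied after the stock split".split(),
]
ranked = random_feature_sets(docs)
best_set, best_score = ranked[0]
```

In a fuller pipeline, the top-ranked terms would be used to build the document vectors passed to a clustering algorithm such as K-means; that step is omitted here to keep the sketch dependency-free.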
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Christy, A., Gandhi, G.M. (2020). Feature Selection and Clustering of Documents Using Random Feature Set Generation Technique. In: Borah, S., Emilia Balas, V., Polkowski, Z. (eds) Advances in Data Science and Management. Lecture Notes on Data Engineering and Communications Technologies, vol 37. Springer, Singapore. https://doi.org/10.1007/978-981-15-0978-0_6
DOI: https://doi.org/10.1007/978-981-15-0978-0_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0977-3
Online ISBN: 978-981-15-0978-0
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)