Abstract
Web document clustering identifies relevant and useful information in tasks such as comparing shopping service providers from flipkart.com and retrieving information from web search engines. Choosing the best representation and enhancing knowledge discovery for a given task in very large textual data stores is the most critical step in web document clustering. This work considers the problem of discovering the most predominant word under a similar semantic model and measuring the relative strength of that predominant word in a web document. The paper presents an efficient technique called Rayleigh Clustering with Self Organizing Map (RC-SOM) for the web document domain, built on three components: generation of self organizing patterns, clustering of predominant words, and the Rayleigh distribution. First, self organizing patterns are generated to identify the most predominant word in each web document. Then, predominant words with similar semantics are clustered across all web documents. Finally, the efficiency of web document clustering is improved by applying the Rayleigh distribution, which lists the relative strength of the predominant word for each web document. Experiments with the RC-SOM technique are conducted on the Anonymous Microsoft Web Data dataset and evaluated on factors such as cluster accuracy, execution time, and computation space for clustering.
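As a rough illustration of the final step, the sketch below scores words in a single document with the Rayleigh distribution. The abstract does not give the exact formulation used by RC-SOM, so everything here is an assumption made for illustration: word counts stand in for the measured quantity, the scale parameter sigma is fitted by maximum likelihood, and the Rayleigh cumulative distribution value of each count is taken as its "relative strength". The function names are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def rayleigh_pdf(x, sigma):
    """Rayleigh density f(x; sigma) = (x / sigma^2) * exp(-x^2 / (2 sigma^2)) for x >= 0."""
    if x < 0:
        return 0.0
    return (x / sigma ** 2) * math.exp(-(x ** 2) / (2 * sigma ** 2))


def rayleigh_cdf(x, sigma):
    """Rayleigh cumulative distribution F(x; sigma) = 1 - exp(-x^2 / (2 sigma^2))."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-(x ** 2) / (2 * sigma ** 2))


def rayleigh_sigma_mle(values):
    """Maximum likelihood estimate of sigma: sqrt(sum(x_i^2) / (2 n))."""
    n = len(values)
    return math.sqrt(sum(v * v for v in values) / (2 * n))


def word_strengths(tokens):
    """Map each word to an assumed 'relative strength' in [0, 1).

    Assumption (not from the paper): strength is the Rayleigh CDF of the
    word's count, with sigma fitted to all word counts in the document.
    """
    counts = Counter(tokens)
    sigma = rayleigh_sigma_mle(list(counts.values()))
    return {word: rayleigh_cdf(count, sigma) for word, count in counts.items()}


if __name__ == "__main__":
    doc = "web page web cluster web page".split()
    strengths = word_strengths(doc)
    # The word with the highest score is treated as the predominant word.
    predominant = max(strengths, key=strengths.get)
    print(predominant, strengths)
```

Using the cumulative distribution rather than the density keeps the score monotone in the count, so a more frequent word never scores lower than a less frequent one; this is a design choice of the sketch, not a claim about the published method.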
Cite this article
Srikanth, D., Sakthivel, S. Time and Space Efficient Web Document Clustering Using Rayleigh Distribution. Wireless Pers Commun 102, 3255–3268 (2018). https://doi.org/10.1007/s11277-018-5366-5
DOI: https://doi.org/10.1007/s11277-018-5366-5