
Time and Space Efficient Web Document Clustering Using Rayleigh Distribution

Wireless Personal Communications

Abstract

Web document clustering identifies relevant and useful information, for example when comparing shopping service providers on flipkart.com or retrieving information from web search engines. Choosing the best representation and enhancing knowledge discovery for a given task in very large textual data stores is the most critical step in web document clustering. This work considers the problem of discovering the most predominant word under a similar semantic model and measuring the relative strength of that predominant word in a web document. The paper presents an efficient technique called Rayleigh Clustering with Self Organizing Map (RC-SOM) for the web document domain, based on generation of self-organizing patterns, clustering of predominant words, and the Rayleigh distribution. Self-organizing patterns are generated to identify the most predominant word in a web document. Predominant words with similar semantics are then clustered across all web documents. Finally, the efficiency of web document clustering is improved by applying the Rayleigh distribution, which lists the relative strength of the predominant word for each web document. Experiments with the RC-SOM technique are conducted on the Anonymous Microsoft Web Data dataset and evaluated on factors such as cluster accuracy, execution time, and computation space per cluster.
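The abstract does not state how the Rayleigh distribution is applied, so the following is only a minimal sketch under stated assumptions, not the RC-SOM method itself. It assumes that a word's "relative strength" in a document can be approximated by the Rayleigh cumulative distribution function evaluated at the word's count, with the scale parameter fitted to the document's word counts by maximum likelihood. The function names (`rayleigh_scale`, `rayleigh_cdf`, `relative_strength`) and the sample document are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def rayleigh_scale(values):
    """Maximum-likelihood Rayleigh scale: sigma^2 = sum(x^2) / (2n)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    return math.sqrt(sum(v * v for v in values) / (2 * n))


def rayleigh_cdf(x, sigma):
    """Rayleigh CDF: F(x) = 1 - exp(-x^2 / (2 sigma^2)) for x >= 0."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-(x * x) / (2.0 * sigma * sigma))


def relative_strength(tokens):
    """Assumed scoring: Rayleigh CDF of each word's count in one document."""
    counts = Counter(tokens)
    sigma = rayleigh_scale(list(counts.values()))
    return {word: rayleigh_cdf(count, sigma) for word, count in counts.items()}


# Hypothetical tokenized web document.
doc = "price phone price offer phone price delivery".split()
scores = relative_strength(doc)

# The word with the highest score is treated as the predominant word.
predominant = max(scores, key=scores.get)
```

In this sketch the scores are monotone in the word count, so the predominant word is simply the most frequent one; the Rayleigh CDF only rescales counts into the interval from 0 to 1 so that strengths are comparable within a document. The self-organizing map and semantic clustering steps named in the abstract are not modeled here.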





Author information


Corresponding author

Correspondence to D. Srikanth.


About this article


Cite this article

Srikanth, D., Sakthivel, S. Time and Space Efficient Web Document Clustering Using Rayleigh Distribution. Wireless Pers Commun 102, 3255–3268 (2018). https://doi.org/10.1007/s11277-018-5366-5


  • DOI: https://doi.org/10.1007/s11277-018-5366-5
