A Review on Web Pages Clustering Techniques

  • Conference paper
Trends in Network and Communications (WeST 2011, NeCoM 2011, WiMoN 2011)

Abstract

The World Wide Web (WWW) has become the largest source of information. This abundance of information, combined with the dynamic and heterogeneous nature of the web, makes information retrieval difficult for the average user. A technique is required that helps users organize, summarize, and browse the information available on the web so that their information needs are satisfied effectively. Clustering organizes a collection of objects into related groups. Web page clustering is a key technique for quickly finding desired information in the massive store of web pages on the WWW. Many researchers have proposed web document clustering techniques. In this paper, we present a detailed survey of existing web document clustering techniques along with document representation techniques. We also describe some evaluation measures for assessing cluster quality.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Patel, D., Zaveri, M. (2011). A Review on Web Pages Clustering Techniques. In: Wyld, D.C., Wozniak, M., Chaki, N., Meghanathan, N., Nagamalai, D. (eds) Trends in Network and Communications. WeST 2011, NeCoM 2011, WiMoN 2011. Communications in Computer and Information Science, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22543-7_72

  • DOI: https://doi.org/10.1007/978-3-642-22543-7_72

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22542-0

  • Online ISBN: 978-3-642-22543-7

  • eBook Packages: Computer Science, Computer Science (R0)
