An Overview of Web Data Clustering Practices

  • Conference paper
Current Trends in Database Technology - EDBT 2004 Workshops (EDBT 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3268)

Abstract

Clustering is a challenging topic in the area of Web data management. Various forms of clustering are required in a wide range of applications, including finding mirrored Web pages, detecting copyright violations, and reporting search results in a structured way. Clustering can be performed either once offline (independently of search queries) or online (on the results of search queries). Important efforts have focused on mining Web access logs and on clustering search engine results on the fly. Online methods based on link structure and text have been applied successfully to finding pages on related topics. This paper presents an overview of the most popular methodologies and implementations for clustering either Web users or Web sources, and surveys the current status and future trends of clustering on the Web.


Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Vakali, A., Pokorný, J., Dalamagas, T. (2004). An Overview of Web Data Clustering Practices. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_59

  • DOI: https://doi.org/10.1007/978-3-540-30192-9_59

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23305-3

  • Online ISBN: 978-3-540-30192-9

  • eBook Packages: Computer Science, Computer Science (R0)
