Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution

  • Filippo Geraci
  • Marco Pellegrini
  • Marco Maggini
  • Fabrizio Sebastiani
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4209)


This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
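The furthest-point-first heuristic for metric k-center clustering mentioned above can be illustrated with a minimal sketch. This is the basic textbook form of the heuristic, not the faster variant used in Armil, whose details are not given here. Representing each snippet as a set of tokens and comparing snippets with the Jaccard distance are assumptions made for illustration, and the names `jaccard_distance` and `furthest_point_first` are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two token sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def furthest_point_first(points, k, dist=jaccard_distance):
    """Basic furthest-point-first heuristic for k-center clustering.

    Returns (centers, assignment): the indices of the chosen centers and,
    for each point, the position in `centers` of its closest center.
    """
    if not points or k <= 0:
        return [], []
    centers = [0]  # assumption: the first point seeds the first cluster
    nearest = [dist(p, points[0]) for p in points]
    assignment = [0] * len(points)
    while len(centers) < min(k, len(points)):
        # The next center is the point furthest from all current centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        if nearest[far] == 0.0:
            break  # every remaining point coincides with some center
        centers.append(far)
        # Move points to the new center when it is strictly closer.
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = len(centers) - 1
    return centers, assignment


# Toy snippets for an ambiguous query, already tokenised (illustrative data).
snippets = [
    {"jaguar", "car", "engine"},
    {"jaguar", "car", "dealer"},
    {"jaguar", "cat", "jungle"},
    {"jaguar", "cat", "habitat"},
]
centers, assignment = furthest_point_first(snippets, 2)
print(centers, assignment)
```

Each round costs one pass over the points, so the sketch computes on the order of n·k distances for n snippets and k clusters, which is what makes this family of algorithms attractive for on-the-fly clustering.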


Keywords: User Evaluation, External Knowledge, Document Cluster, Cluster Labelling, Jaccard Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Filippo Geraci (1, 2)
  • Marco Pellegrini (1)
  • Marco Maggini (2)
  • Fabrizio Sebastiani (3)
  1. Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, Italy
  2. Dipartimento di Ingegneria dell’Informazione, Università di Siena, Siena, Italy
  3. Istituto di Scienza e Tecnologia dell’Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy
