Knowledge and Information Systems, Volume 36, Issue 3, pp 693–729

Query directed clustering

  • Daniel Crabtree
  • Xiaoying Gao
  • Peter Andreae
Regular Paper


This paper identifies the conditions under which web page clustering algorithms are effective and the problems that cause them to fail. It then presents Query Directed Clustering (QDC), a web page clustering algorithm that produces higher-quality clusterings than other clustering algorithms for easy ambiguous queries, while performing at least as well as they do on queries for which clustering is not well suited. QDC has five key innovations: a new cluster quality guide based on the relationship between clusters and the query; an improved cluster merging method that considers both cluster overlap and cluster description similarity; a new cluster splitting method that addresses the cluster chaining (drifting) problem; an improved heuristic for selecting good clusters; and a new method that improves the clusters by ranking the pages in each cluster. Our experiments evaluate QDC both quantitatively and qualitatively and show that QDC significantly improves clustering performance while being substantially more efficient than existing approaches.
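To make the idea of merging on both cluster overlap and cluster description similarity concrete, the following is a minimal sketch and not the paper's actual method. It assumes that a cluster can be represented as a set of description terms plus a set of page ids, that both overlap and description similarity can be approximated with Jaccard similarity, and that two clusters merge only when both scores reach a threshold. The names `Cluster`, `jaccard`, and `should_merge` and the threshold values are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Cluster(NamedTuple):
    # Hypothetical representation: the abstract does not specify one.
    description: FrozenSet[str]  # terms that describe the cluster
    pages: FrozenSet[int]        # ids of the pages in the cluster


def jaccard(a: FrozenSet, b: FrozenSet) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def should_merge(c1: Cluster, c2: Cluster,
                 page_threshold: float = 0.5,
                 desc_threshold: float = 0.5) -> bool:
    """Merge only if BOTH page overlap and description similarity are high.

    Assumption: Jaccard similarity and fixed thresholds stand in for
    whatever measures QDC actually uses.
    """
    page_overlap = jaccard(c1.pages, c2.pages)
    desc_similarity = jaccard(c1.description, c2.description)
    return page_overlap >= page_threshold and desc_similarity >= desc_threshold


cars = Cluster(frozenset({"jaguar", "car"}), frozenset({1, 2, 3, 4}))
dealers = Cluster(frozenset({"jaguar", "car", "dealer"}), frozenset({2, 3, 4, 5}))
cats = Cluster(frozenset({"jaguar", "cat"}), frozenset({1, 2, 3, 4}))

print(should_merge(cars, dealers))  # similar pages and similar descriptions
print(should_merge(cars, cats))     # identical pages, dissimilar descriptions
```

Requiring both conditions means that two clusters sharing many pages are still kept apart when their descriptions point to different topics, which is one plausible reading of why overlap alone is not used as the merge criterion.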


Keywords: Web page clustering, Data mining, Clustering



Copyright information

© Springer-Verlag London 2012

Authors and Affiliations

  1. School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand
