Clustering Retrieved Web Documents to Speed Up Web Searches

  • Rani Qumsiyeh
  • Yiu-Kai Ng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9155)


Current web search engines, such as Google, Bing, and Yahoo!, rank the set of documents S retrieved in response to a user query and display the URL of each document D in S with a title and a snippet, which serves as an abstract of D. Snippets are meant to help users quickly identify results of interest, if any exist, but they are less useful than intended: they fail to (i) provide distinct information and (ii) capture the main content of the corresponding documents. Moreover, when the information need expressed in a search query is ambiguous, it is very difficult, if not impossible, for a search engine to identify precisely the set of documents that satisfy the user’s intended request without additional input. Furthermore, a document title is not always a good indicator of the content of the corresponding document. We address all of these problems with our proposed query-based clusterer and labeler, called QClus. QClus generates concise clusters of the documents retrieved in response to a user query, covering various subject areas, which saves the user the time and effort of browsing through the documents one by one in search of specific information of interest. Experimental results show that QClus is effective and efficient in generating high-quality clusters of documents on specific topics with informative labels.
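The abstract does not describe how QClus forms or labels its clusters, so the following is only a minimal illustrative sketch of the general idea of grouping retrieved documents under short labels. It is not the authors' method. The function names (`tokenize`, `cluster_by_label`), the stopword list, and the labeling rule (pick the non-query term shared by the most documents) are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to", "on", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cluster_by_label(query, documents):
    """Group documents under one label each.

    The label of a document is its non-query term that appears in the
    largest number of documents (ties broken alphabetically). Documents
    with no such term go into a catch-all "other" cluster.
    """
    query_terms = set(tokenize(query))
    doc_terms = [set(tokenize(d)) - query_terms for d in documents]

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for terms in doc_terms for t in terms)

    clusters = defaultdict(list)
    for doc, terms in zip(documents, doc_terms):
        if not terms:
            clusters["other"].append(doc)
            continue
        label = min(terms, key=lambda t: (-doc_freq[t], t))
        clusters[label].append(doc)
    return dict(clusters)
```

For an ambiguous query such as "jaguar", a call like `cluster_by_label("jaguar", results)` would separate results that share the term "car" from those that share the term "cat", which is the kind of subject-area grouping the abstract motivates. A real system would need far more robust term weighting and label selection than this toy rule.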


Keywords: Clustering, Cluster labels, User queries, Web documents





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Computer Science Department, Brigham Young University, Provo, USA
