Website Community Mining from Query Logs with Two-Phase Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Abstract

A website community is a set of websites that concentrate on the same or similar topics. Website community mining faces two major challenges. First, websites on the same topic may not link to one another directly because of competition concerns. Second, one website may contain information about several topics. A website community mining method should therefore capture both phenomena and be able to assign such a website to several communities. In this paper, we propose a method that automatically mines website communities by exploiting the query log data of Web search. Query log data can be regarded as a comprehensive summarization of the real Web, and the queries that lead to clicks on a particular website can be regarded as a summarization of that website's content. Websites on the same topic are indirectly connected by the queries that convey the information need for that topic. This observation helps us overcome the first challenge. The proposed two-phase method tackles the second challenge. In the first phase, we cluster the queries of the same host to obtain the different content aspects of that host. In the second phase, we further cluster the content aspects obtained from different hosts. Because of the two-phase clustering, one host may appear in more than one website community.
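The two-phase idea can be illustrated with a minimal sketch. The abstract does not specify the query representation, the similarity measure, or the clustering algorithm, so everything below is an assumption made for illustration only: queries are treated as bags of words, similarity is cosine, clustering is a greedy single-pass (leader) scheme, and the thresholds `t1` and `t2`, the function names, and the example hosts are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def leader_cluster(vectors, threshold):
    """Greedy single-pass clustering (an assumed stand-in, not the paper's algorithm).

    Each vector joins the first cluster whose centroid (summed term counts)
    is at least `threshold` similar; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters, centroids = [], []
    for i, vec in enumerate(vectors):
        for members, centroid in zip(clusters, centroids):
            if cosine(vec, centroid) >= threshold:
                members.append(i)
                centroid.update(vec)  # fold the new member into the centroid
                break
        else:
            clusters.append([i])
            centroids.append(Counter(vec))  # copy so the input is not mutated
    return clusters


def mine_communities(click_log, t1=0.3, t2=0.3):
    """click_log: iterable of (query, clicked_host) pairs."""
    # Phase 1: cluster the queries of each host into content aspects.
    queries_by_host = defaultdict(list)
    for query, host in click_log:
        queries_by_host[host].append(query)

    aspects = []  # (host, aggregated term vector of one content aspect)
    for host, queries in queries_by_host.items():
        vecs = [Counter(q.lower().split()) for q in queries]
        for members in leader_cluster(vecs, t1):
            aspect_vec = Counter()
            for i in members:
                aspect_vec.update(vecs[i])
            aspects.append((host, aspect_vec))

    # Phase 2: cluster the content aspects across all hosts.
    groups = leader_cluster([vec for _, vec in aspects], t2)
    return [sorted({aspects[i][0] for i in group}) for group in groups]


# Hypothetical click log: travelsite.example covers two topics.
click_log = [
    ("cheap flights", "travelsite.example"),
    ("cheap flights london", "travelsite.example"),
    ("hotel deals", "travelsite.example"),
    ("cheap flights paris", "flights.example"),
    ("hotel deals rome", "hotels.example"),
]
communities = mine_communities(click_log)
print(communities)
```

In this toy log, `travelsite.example` receives clicks for both flight and hotel queries, so phase 1 splits it into two content aspects and phase 2 places it in two communities. The other two hosts never link to it, yet they are grouped with it through shared query terms.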

The work described in this paper is also supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: CUHK413510).

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bing, L., Lam, W., Jameel, S., Lu, C. (2014). Website Community Mining from Query Logs with Two-Phase Clustering. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_17

  • DOI: https://doi.org/10.1007/978-3-642-54903-8_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54902-1

  • Online ISBN: 978-3-642-54903-8

  • eBook Packages: Computer Science, Computer Science (R0)
