Website Community Mining from Query Logs with Two-Phase Clustering

Bing, Lidong; Lam, Wai; Jameel, Shoaib; Lu, Chunliang

doi:10.1007/978-3-642-54903-8_17

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Abstract

A website community is a set of websites that concentrate on the same or similar topics. The website community mining task poses two major challenges. First, websites on the same topic may not link to one another directly because of competition concerns. Second, one website may contain information about several topics, so a mining method should be able to capture this and assign such a website to several communities. In this paper, we propose a method that automatically mines website communities by exploiting query log data from Web search. Query log data can be regarded as a comprehensive summarization of the real Web, and the queries that lead to clicks on a particular website can be regarded as a summarization of that website's content. Websites on the same topic are indirectly connected by the queries that convey an information need on that topic. This observation helps us overcome the first challenge. The proposed two-phase method tackles the second challenge. In the first phase, we cluster the queries of the same host to obtain the different content aspects of that host. In the second phase, we further cluster the content aspects obtained from different hosts. Because of the two-phase clustering, one host may appear in more than one website community.
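The two-phase idea can be illustrated with a minimal sketch. The abstract does not state which similarity measure or clustering algorithm the paper uses, so the Jaccard term overlap, the greedy single-pass clustering, the thresholds, the function names, and the toy hosts below are all assumptions chosen for illustration, not the authors' method.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two term sets (assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(items, threshold):
    """Single-pass clustering (an assumption, not the paper's algorithm).

    Each (label, terms) item joins the first cluster whose pooled term set
    is similar enough; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"terms": set, "labels": list}
    for label, terms in items:
        for cluster in clusters:
            if jaccard(terms, cluster["terms"]) >= threshold:
                cluster["terms"] |= terms
                cluster["labels"].append(label)
                break
        else:
            clusters.append({"terms": set(terms), "labels": [label]})
    return clusters


def mine_communities(click_log, t1=0.2, t2=0.2):
    """click_log: iterable of (query, clicked_host) pairs."""
    # Group the queries by the host that was clicked.
    by_host = defaultdict(list)
    for query, host in click_log:
        by_host[host].append((query, set(query.lower().split())))

    # Phase 1: cluster the queries of each host into content aspects.
    aspects = []
    for host, queries in by_host.items():
        for cluster in greedy_cluster(queries, t1):
            aspects.append((host, cluster["terms"]))

    # Phase 2: cluster the aspects across hosts into communities.
    communities = greedy_cluster(aspects, t2)
    return [sorted(set(c["labels"])) for c in communities]


# Toy click log with made-up hosts.
click_log = [
    ("cheap flights paris", "travel.example"),
    ("paris flights deals", "travel.example"),
    ("hotel booking rome", "travel.example"),
    ("cheap flights london", "air.example"),
    ("rome hotel rooms", "stay.example"),
]
communities = mine_communities(click_log)
print(communities)
```

In this toy log, travel.example has a flights aspect and a hotel aspect after the first phase, so the second phase places it in two communities, one with air.example and one with stay.example, even though none of the hosts link to each other.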

The work described in this paper is also supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: CUHK413510).

Author information

Authors and Affiliations

Key Laboratory of High Confidence Software Technologies, Ministry of Education (CUHK Sub-Lab), Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Lidong Bing, Wai Lam, Shoaib Jameel & Chunliang Lu

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Av. Juan Dios Bátiz, Col. Nueva Industrial Vallejo, 07738, Mexico D.F, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Bing, L., Lam, W., Jameel, S., Lu, C. (2014). Website Community Mining from Query Logs with Two-Phase Clustering. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_17

DOI: https://doi.org/10.1007/978-3-642-54903-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science, Computer Science (R0)
