Semantic Clustering of Website Based on Its Hypertext Structure
Abstract
The volume of unstructured information on the Internet is constantly increasing, together with the number of websites and the amount of content they hold. To process this information, it is important to distinguish clusters of related webpages. Such clusters are used, for example, for knowledge extraction, named entity recognition, and recommendation algorithms. A variety of applications (such as semantic analysis systems, crawlers, and search engines) utilize semantic clustering algorithms to recognize thematically connected webpages. The majority of these algorithms rely on text analysis of the web documents' content, which leads to certain limitations, such as long processing time, the need for representative text content, and the vagueness of natural language. In this article, we present a framework for unsupervised, domain- and language-independent semantic clustering of a website, which utilizes the website's internal hypertext structure and does not require text analysis. As a basis, we represent the hypertext structure as a graph and apply known flow simulation clustering algorithms to this graph to produce a set of webpage clusters. We assume that these clusters contain thematically connected webpages. We evaluate our clustering approach on a corpus of real-world webpages and compare it with well-known text document clustering algorithms.
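To make the graph-based idea more concrete, the following is a minimal sketch of one flow simulation algorithm, Markov clustering (MCL, due to Van Dongen), applied to a list of internal links. The abstract does not state which algorithms or parameters the framework uses, so everything here is an assumption for illustration: the function name `markov_cluster`, the inflation value of 2.0, the treatment of links as undirected, the added self-loops, and the example URL paths are all hypothetical.

```python
def normalize_columns(matrix):
    """Scale every column in place so that it sums to 1."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total
    return matrix


def multiply(a, b):
    """Plain square matrix product; adequate for small illustrative graphs."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(links, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Cluster pages from (source, target) link pairs with basic MCL.

    Assumption: links are treated as undirected and every page gets a
    self-loop, which is a common MCL convention and not something taken
    from the article.
    """
    pages = sorted({page for link in links for page in link})
    n = len(pages)
    if n == 0:
        return []
    index = {page: i for i, page in enumerate(pages)}

    # Adjacency matrix with self-loops, then made column-stochastic.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for source, target in links:
        i, j = index[source], index[target]
        m[i][j] = m[j][i] = 1.0
    m = normalize_columns(m)

    for _ in range(max_iterations):
        expanded = multiply(m, m)  # expansion: flow spreads along two-step walks
        inflated = normalize_columns(  # inflation: strong flows are reinforced
            [[value ** inflation for value in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tolerance:
            break

    # Each row that still carries flow is an attractor; the columns it
    # reaches form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(pages[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


if __name__ == "__main__":
    example_links = [
        ("/news", "/news/sports"), ("/news/sports", "/news/politics"),
        ("/news/politics", "/news"),
        ("/shop", "/shop/books"), ("/shop/books", "/shop/music"),
        ("/shop/music", "/shop"),
    ]
    for cluster in markov_cluster(example_links):
        print(cluster)
```

The sketch uses dense list-of-lists matrices to stay dependency-free, which is only practical for small sites; a real crawl would likely need sparse matrices and pruning of near-zero entries. Note that no page text is read at any point: only the link pairs enter the computation.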