Abstract
Web menus are one of the key elements of a website: they provide fundamental information about the topology of the website itself. Menu detection is useful for humans, but also for crawlers and indexers, because the menu provides essential information about the structure and contents of a website. For humans, identifying the main menu of a website is a relatively easy task. For software tools, however, identifying the menu is far from trivial; in fact, it remains a challenging unsolved problem. In this work, we propose a novel method for automatic Web menu detection that works at the level of the DOM.
This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Economía y Competitividad under grants TIN2013-44742-C4-1-R and TIN2016-76843-C4-1-R, and by the Generalitat Valenciana under grant PROMETEO-II/2015/013 (SmartLogic).
Notes
- 1. HTML Unordered List.
- 2. We consider the nav tag because it is the specific tag (and recommendation) in HTML5 for representing menus. However, note that it can be changed if we want to focus on other technologies.
- 3. We designed and implemented the suite of benchmarks before we constructed our technique to avoid their interference.
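To make notes 1 and 2 concrete, the following minimal sketch walks an HTML document and counts the hyperlinks found under each `nav` or `ul` element. It is only an illustration of inspecting menu-related tags at the DOM level, not the detection method proposed in the paper. The names `MenuCandidateCollector`, `menu_candidates`, and `menu_tags` are hypothetical, and the configurable `menu_tags` parameter merely mirrors the remark that the chosen tag can be changed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MenuCandidateCollector(HTMLParser):
    """Counts hyperlinks under each element whose tag is in `menu_tags`.

    Illustrative assumption: an element is a "candidate" simply because of
    its tag name; the paper's actual criteria are not reproduced here.
    """

    def __init__(self, menu_tags=("nav", "ul")):
        super().__init__()
        self.menu_tags = set(menu_tags)
        self.stack = []       # (tag, candidate index or None) per open element
        self.candidates = []  # [tag, link_count] in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "a" and any(name == "href" for name, _ in attrs):
            # Credit the link to every enclosing candidate element.
            for _, idx in self.stack:
                if idx is not None:
                    self.candidates[idx][1] += 1
        if tag in self.menu_tags:
            self.candidates.append([tag, 0])
            self.stack.append((tag, len(self.candidates) - 1))
        else:
            self.stack.append((tag, None))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, tolerating
        # unclosed children; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def menu_candidates(html, menu_tags=("nav", "ul")):
    """Return (tag, link_count) for each candidate containing at least one link."""
    collector = MenuCandidateCollector(menu_tags)
    collector.feed(html)
    collector.close()
    return [(tag, links) for tag, links in collector.candidates if links > 0]


PAGE = """
<body>
  <nav><ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul></nav>
  <p>Read the <a href="/post">latest post</a>.<br></p>
  <ul><li>No links here</li></ul>
</body>
"""

print(menu_candidates(PAGE))  # [('nav', 2), ('ul', 2)]
```

Passing `menu_tags=("ul",)` restricts the scan to unordered lists, which is one way a page written without HTML5 `nav` elements could still be inspected under the same assumptions.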
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Alarte, J., Insa, D., Silva, J. (2017). Webpage Menu Detection Based on DOM. In: Steffen, B., Baier, C., van den Brand, M., Eder, J., Hinchey, M., Margaria, T. (eds) SOFSEM 2017: Theory and Practice of Computer Science. SOFSEM 2017. Lecture Notes in Computer Science, vol 10139. Springer, Cham. https://doi.org/10.1007/978-3-319-51963-0_32
DOI: https://doi.org/10.1007/978-3-319-51963-0_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-51962-3
Online ISBN: 978-3-319-51963-0
eBook Packages: Computer Science (R0)