Webpage Menu Detection Based on DOM

Alarte, Julian; Insa, David; Silva, Josep

doi:10.1007/978-3-319-51963-0_32

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10139)

Included in the following conference series:

International Conference on Current Trends in Theory and Practice of Informatics

Abstract

Web menus are one of the key elements of a website: they provide fundamental information about the topology of the website itself. Menu detection is useful for humans, but also for crawlers and indexers, because the menu provides essential information about the structure and contents of a website. For humans, identifying the main menu of a website is a relatively easy task. For computer tools, however, identifying the menu is far from trivial; in fact, it remains a challenging unsolved problem. In this work, we propose a novel method for automatic Web menu detection that works at the level of the DOM.

This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Economía y Competitividad under grants TIN2013-44742-C4-1-R and TIN2016-76843-C4-1-R, and by the Generalitat Valenciana under grant PROMETEO-II/2015/013 (SmartLogic).

Notes

1. HTML Unordered List.
2. We consider the nav tag because it is the specific tag (and recommendation) in HTML5 for representing menus. However, note that it can be changed if we want to focus on other technologies.
3. We designed and implemented the suite of benchmarks before we constructed our technique to avoid their interference.
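
To make note 2 concrete, the following minimal sketch shows what treating the menu tag as a configurable parameter could look like. It is not the detection method proposed in the paper, which is not described on this page; it only collects elements with a candidate tag (nav and ul by default, following notes 1 and 2) and counts the links each one encloses. The names CANDIDATE_TAGS, MenuCandidateCollector and menu_candidates are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser

# Assumption: candidate tags follow notes 1 and 2 (HTML5 nav and unordered lists).
CANDIDATE_TAGS = frozenset({"nav", "ul"})


class MenuCandidateCollector(HTMLParser):
    """Counts the links enclosed by each candidate element (illustration only)."""

    def __init__(self, candidate_tags=CANDIDATE_TAGS):
        super().__init__()
        self.candidate_tags = set(candidate_tags)
        self.candidates = []       # [tag, link_count] per candidate, in document order
        self.open_candidates = []  # indexes into self.candidates that are still open

    def handle_starttag(self, tag, attrs):
        if tag in self.candidate_tags:
            self.candidates.append([tag, 0])
            self.open_candidates.append(len(self.candidates) - 1)
        elif tag == "a":
            # A link counts towards every candidate that currently encloses it.
            for index in self.open_candidates:
                self.candidates[index][1] += 1

    def handle_endtag(self, tag):
        if tag not in self.candidate_tags:
            return
        # Close the innermost open candidate with this tag (and anything left open inside it).
        for pos in range(len(self.open_candidates) - 1, -1, -1):
            if self.candidates[self.open_candidates[pos]][0] == tag:
                del self.open_candidates[pos:]
                break


def menu_candidates(html, candidate_tags=CANDIDATE_TAGS):
    """Return (tag, link_count) pairs for every candidate element in the page."""
    collector = MenuCandidateCollector(candidate_tags)
    collector.feed(html)
    collector.close()
    return [tuple(candidate) for candidate in collector.candidates]
```

Calling menu_candidates(page, candidate_tags={"nav"}) restricts the search to nav elements, while passing a different set of tags targets other markup conventions, which is the kind of change note 2 alludes to.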

Author information

Authors and Affiliations

Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València, Camino de Vera s/n, 46022, Valencia, Spain
Julian Alarte, David Insa & Josep Silva

Corresponding author

Correspondence to Josep Silva.

Editor information

Editors and Affiliations

TU Dortmund, Dortmund, Germany
Bernhard Steffen
TU Dresden, Dresden, Germany
Christel Baier
Eindhoven University of Technology, Eindhoven, The Netherlands
Mark van den Brand
Alpen Adria University Klagenfurt, Klagenfurt, Austria
Johann Eder
Lero - Irish Software Research Center, Limerick, Ireland
Mike Hinchey
Lero - Irish Software Research Center, Limerick, Ireland
Tiziana Margaria

About this paper

Cite this paper

Alarte, J., Insa, D., Silva, J. (2017). Webpage Menu Detection Based on DOM. In: Steffen, B., Baier, C., van den Brand, M., Eder, J., Hinchey, M., Margaria, T. (eds) SOFSEM 2017: Theory and Practice of Computer Science. SOFSEM 2017. Lecture Notes in Computer Science, vol 10139. Springer, Cham. https://doi.org/10.1007/978-3-319-51963-0_32

DOI: https://doi.org/10.1007/978-3-319-51963-0_32
Published: 11 January 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-51962-3
Online ISBN: 978-3-319-51963-0
eBook Packages: Computer Science, Computer Science (R0)
