Webpage Menu Detection Based on DOM

  • Conference paper
SOFSEM 2017: Theory and Practice of Computer Science (SOFSEM 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10139)

Abstract

Web menus are one of the key elements of a website: they provide fundamental information about the topology of the website itself. Menu detection is useful for humans, and also for crawlers and indexers, because the menu provides essential information about the structure and contents of a website. For humans, identifying the main menu of a website is a relatively easy task. For software tools, however, identifying the menu is far from trivial; in fact, it remains a challenging unsolved problem. In this work, we propose a novel method for automatic Web menu detection that works at the level of the DOM.

This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Economía y Competitividad under grants TIN2013-44742-C4-1-R and TIN2016-76843-C4-1-R, and by the Generalitat Valenciana under grant PROMETEO-II/2015/013 (SmartLogic).

Notes

  1. HTML Unordered List.

  2. We consider the nav tag because it is the specific tag (and recommendation) in HTML5 for representing menus. However, note that it can be changed if we want to focus on other technologies.

  3. We designed and implemented the suite of benchmarks before we constructed our technique to avoid their interference.
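
The paper's actual algorithm is not reproduced in this preview, so the following is only an illustrative sketch of what inspecting menu candidates at the DOM level might look like. It assumes that nav and ul elements (the two tags mentioned in the notes above) are the candidate containers and that a candidate is interesting when it holds several links. The class name, the function name, and the `min_links` threshold are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Assumption: nav and ul are the candidate containers (see the notes above).
CANDIDATE_TAGS = {"nav", "ul"}
# Void elements never get an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MenuCandidateFinder(HTMLParser):
    """Records every closed nav/ul element with the number of links inside it."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements as [tag, link_count]
        self.candidates = []  # (tag, link_count) for each closed candidate

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "a":
            # Credit the link to every element that is currently open.
            for entry in self.stack:
                entry[1] += 1
        self.stack.append([tag, 0])

    def handle_endtag(self, tag):
        # Find the nearest matching open element, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                for closed_tag, links in self.stack[i:]:
                    if closed_tag in CANDIDATE_TAGS:
                        self.candidates.append((closed_tag, links))
                del self.stack[i:]
                break


def find_menu_candidates(html, min_links=3):
    """Return (tag, link_count) pairs for candidates with at least min_links links.

    Elements that are never closed in the input are not reported.
    """
    finder = MenuCandidateFinder()
    finder.feed(html)
    finder.close()
    return [c for c in finder.candidates if c[1] >= min_links]
```

Calling `find_menu_candidates(page_source)` on the text of a page returns the nav and ul elements that contain at least three links. A real detector would need further evidence (for example, position in the tree or link targets) to choose among them; this sketch only enumerates candidates.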

Author information

Corresponding author

Correspondence to Josep Silva.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Alarte, J., Insa, D., Silva, J. (2017). Webpage Menu Detection Based on DOM. In: Steffen, B., Baier, C., van den Brand, M., Eder, J., Hinchey, M., Margaria, T. (eds) SOFSEM 2017: Theory and Practice of Computer Science. SOFSEM 2017. Lecture Notes in Computer Science, vol 10139. Springer, Cham. https://doi.org/10.1007/978-3-319-51963-0_32

  • DOI: https://doi.org/10.1007/978-3-319-51963-0_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-51962-3

  • Online ISBN: 978-3-319-51963-0

  • eBook Packages: Computer Science, Computer Science (R0)
