Flexible Detection of Similar DOM Elements

  • Conference paper
Web Information Systems and Technologies (WEBIST 2020, WEBIST 2021)

Abstract

Several web-related research fields require detecting similarity between DOM elements. In information extraction, many approaches have emerged to extract structured data from web documents, most of which compare sample documents to infer their underlying structure. Other application areas, such as web augmentation and transcoding, also require analyzing structural similarity, but on UI components whose structures are much smaller than full documents, which makes them unsuitable for the algorithms generally used in information extraction. Instead, approaches in these areas tend to rely on the DOM elements’ location, but this is not robust to structural changes in the document and cannot locate similar elements placed in different positions. In this paper we present two flexible algorithms that measure similarity between DOM elements using a mixed approach that considers both the elements’ location and their inner structure, together with a wrapper induction technique. We evaluated our algorithms against other known approaches from the literature by comparing how they cluster a dataset of 1200+ DOM elements, using a manual clustering as ground truth. Results show that both proposed algorithms outperform all baselines. The proposed algorithms run in linear time, so they are faster than most approaches that analyze structural similarity.
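
To make the idea of a mixed measure more concrete, the following is a minimal Python sketch that combines a location score with an inner-structure score. It is not the algorithm proposed in the paper: the `Node` class, the Jaccard overlap of tag paths, the use of `difflib.SequenceMatcher` on root-to-element paths, and the `weight` parameter are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag name plus child nodes."""
    tag: str
    children: list = field(default_factory=list)


def inner_paths(node, prefix=()):
    """Collect the tag path from `node` down to itself and each descendant."""
    current = prefix + (node.tag,)
    paths = {current}
    for child in node.children:
        paths |= inner_paths(child, current)
    return paths


def structure_similarity(a, b):
    """Jaccard overlap of the two elements' inner tag paths, in [0, 1]."""
    paths_a, paths_b = inner_paths(a), inner_paths(b)
    return len(paths_a & paths_b) / len(paths_a | paths_b)


def location_similarity(path_a, path_b):
    """Similarity of two root-to-element tag paths, in [0, 1]."""
    return SequenceMatcher(None, path_a, path_b).ratio()


def mixed_similarity(a, path_a, b, path_b, weight=0.5):
    """Weighted mean of location and structure scores (assumed weighting)."""
    location = location_similarity(path_a, path_b)
    structure = structure_similarity(a, b)
    return weight * location + (1 - weight) * structure


# Two product cards in different page regions, and an unrelated menu.
card_a = Node("div", [Node("img"), Node("a")])
card_b = Node("div", [Node("img"), Node("a")])
menu = Node("ul", [Node("li", [Node("a")])])

print(mixed_similarity(card_a, ["html", "body", "main", "div"],
                       card_b, ["html", "body", "aside", "div"]))
print(mixed_similarity(card_a, ["html", "body", "main", "div"],
                       menu, ["html", "body", "nav", "ul"]))
```

In this sketch the two cards score higher than the card and the menu even though the cards sit in different positions, which is the behavior a purely location-based comparison would miss.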

Notes

  1. https://github.com/juliangrigera/scoring-map

  2. https://addons.mozilla.org/en-GB/firefox/addon/greasemonkey/

  3. https://pharo.org

  4. https://github.com/hyperopt/hyperopt/

Acknowledgements

The authors acknowledge the support from the Argentinian National Agency for Scientific and Technical Promotion (ANPCyT), grant number PICT-2019-02485.

Author information

Corresponding author

Correspondence to Julián Grigera.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Grigera, J., Gardey, J.C., Rossi, G., Garrido, A. (2023). Flexible Detection of Similar DOM Elements. In: Marchiori, M., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2020, WEBIST 2021. Lecture Notes in Business Information Processing, vol 469. Springer, Cham. https://doi.org/10.1007/978-3-031-24197-0_10

  • DOI: https://doi.org/10.1007/978-3-031-24197-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24196-3

  • Online ISBN: 978-3-031-24197-0

  • eBook Packages: Computer Science, Computer Science (R0)
