Flexible Detection of Similar DOM Elements

Grigera, Julián; Gardey, Juan Cruz; Rossi, Gustavo; Garrido, Alejandra

doi:10.1007/978-3-031-24197-0_10

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 469)

Abstract

Different research fields related to the web require detecting similarity between DOM elements. In the field of information extraction, many approaches emerged to extract structured data from web documents, most of which require comparing sample documents to extract their underlying structure. Other fields of applicability like web augmentation or transcoding also require analyzing structural similarity, but on UI components with smaller structures than full documents, making them unsuitable for the algorithms generally used in information extraction. Instead, these approaches tend to rely on the DOM elements’ location, but this does not resist structural changes in the document, and cannot locate similar elements placed in different positions. In this paper we present two flexible algorithms to measure similarity between DOM elements by using a mixed approach that considers both elements’ location and inner structure, together with a wrapper induction technique. We evaluated our algorithms with respect to other known approaches in the literature by comparing how they cluster a dataset of 1200+ DOM elements, using a manual clustering as ground truth. Results show that both proposed algorithms outperform all baseline ones. The proposed algorithms run in linear time, so they are faster than most approaches that analyze structural similarity.
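
To make the idea of a mixed measure concrete, the following is a minimal sketch of how a location score and an inner-structure score could be combined into one similarity value. It is not the authors' algorithm, and it makes no claim about their linear-time behavior: the `Node` class, the Jaccard overlap of inner tag paths, the use of `difflib.SequenceMatcher` for comparing locations, and the `location_weight` parameter are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    """Toy DOM node: a tag name plus child nodes (hypothetical structure)."""
    tag: str
    children: list[Node] = field(default_factory=list)


def inner_paths(node: Node, prefix: str = "") -> set[str]:
    """Collect every tag path from `node` down to each of its descendants."""
    path = f"{prefix}/{node.tag}"
    paths = {path}
    for child in node.children:
        paths |= inner_paths(child, path)
    return paths


def structure_similarity(a: Node, b: Node) -> float:
    """Jaccard overlap of the two elements' inner tag paths (assumed measure)."""
    paths_a, paths_b = inner_paths(a), inner_paths(b)
    return len(paths_a & paths_b) / len(paths_a | paths_b)


def location_similarity(path_a: list[str], path_b: list[str]) -> float:
    """Share of matching steps between two root-to-element tag paths."""
    return SequenceMatcher(None, path_a, path_b).ratio()


def mixed_similarity(
    a: Node,
    path_a: list[str],
    b: Node,
    path_b: list[str],
    location_weight: float = 0.5,  # illustrative value, not taken from the paper
) -> float:
    """Weighted blend of location and inner-structure similarity."""
    location = location_similarity(path_a, path_b)
    structure = structure_similarity(a, b)
    return location_weight * location + (1 - location_weight) * structure


# Two similar "cards" placed in different regions of a page.
card_a = Node("div", [Node("img"), Node("h2"), Node("span")])
card_b = Node("div", [Node("img"), Node("h2"), Node("a")])
path_a = ["html", "body", "main", "ul", "li"]
path_b = ["html", "body", "aside", "ul", "li"]

score = mixed_similarity(card_a, path_a, card_b, path_b)
print(f"mixed similarity: {score:.2f}")
```

In this toy case the two elements sit in different containers (`main` versus `aside`), so a purely location-based comparison would penalize them, while the shared inner structure keeps the blended score relatively high. Thresholding such a score is one way the clustering evaluation described above could be approximated.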

Acknowledgements

The authors acknowledge the support from the Argentinian National Agency for Scientific and Technical Promotion (ANPCyT), grant number PICT-2019-02485.

Author information

Authors and Affiliations

LIFIA, Fac. de Informática, Univ. Nac. La Plata, 1900, La Plata, CP, Argentina
Julián Grigera, Juan Cruz Gardey, Gustavo Rossi & Alejandra Garrido
CONICET, Buenos Aires, Argentina
Julián Grigera, Juan Cruz Gardey, Gustavo Rossi & Alejandra Garrido
CICPBA, La Plata, Argentina
Julián Grigera

Corresponding author

Correspondence to Julián Grigera.

Editor information

Editors and Affiliations

University of Padua, Padua, Italy
Massimo Marchiori
University of Seville, Seville, Spain
Francisco José Domínguez Mayo
Polytechnic Institute of Setúbal/INSTICC, Setubal, Portugal
Joaquim Filipe

About this paper

Cite this paper

Grigera, J., Gardey, J.C., Rossi, G., Garrido, A. (2023). Flexible Detection of Similar DOM Elements. In: Marchiori, M., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2020, WEBIST 2021. Lecture Notes in Business Information Processing, vol 469. Springer, Cham. https://doi.org/10.1007/978-3-031-24197-0_10

DOI: https://doi.org/10.1007/978-3-031-24197-0_10
Published: 18 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24196-3
Online ISBN: 978-3-031-24197-0
eBook Packages: Computer Science, Computer Science (R0)
