Abstract
Several web-related research fields require detecting similarity between DOM elements. In information extraction, many approaches have emerged to extract structured data from web documents, most of which compare sample documents to infer their underlying structure. Other application areas, such as web augmentation or transcoding, also require analyzing structural similarity, but on UI components whose structures are much smaller than full documents, which makes the algorithms generally used in information extraction unsuitable for them. Instead, these approaches tend to rely on the DOM elements’ location, but location alone is not robust to structural changes in the document and cannot identify similar elements placed in different positions. In this paper we present two flexible algorithms that measure similarity between DOM elements using a mixed approach that considers both the elements’ location and their inner structure, together with a wrapper induction technique. We evaluated our algorithms against other known approaches from the literature by comparing how they cluster a dataset of 1200+ DOM elements, using a manual clustering as ground truth. Results show that both proposed algorithms outperform all baselines. The proposed algorithms run in linear time, so they are also faster than most approaches that analyze structural similarity.
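To make the idea of a mixed measure concrete, the following is a minimal sketch of how a location score and an inner-structure score could be combined. It is not the algorithms proposed in this paper: the tuple representation of elements, the function names, the Jaccard comparison of tag paths, and the `weight` parameter are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def path_similarity(path_a, path_b):
    """Similarity in [0, 1] between two XPath-like locations, compared step by step."""
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    return SequenceMatcher(None, steps_a, steps_b).ratio()


def tag_paths(node, prefix=""):
    """Collect the set of tag paths found inside an element.

    A node is a hypothetical (tag, children) tuple, e.g. ("li", [("a", [])]).
    """
    tag, children = node
    current = f"{prefix}/{tag}"
    paths = {current}
    for child in children:
        paths |= tag_paths(child, current)
    return paths


def structure_similarity(node_a, node_b):
    """Jaccard similarity between the tag-path sets of two elements."""
    paths_a, paths_b = tag_paths(node_a), tag_paths(node_b)
    return len(paths_a & paths_b) / len(paths_a | paths_b)


def mixed_similarity(path_a, node_a, path_b, node_b, weight=0.5):
    """Weighted blend of location and inner-structure similarity (weight is an assumption)."""
    location = path_similarity(path_a, path_b)
    structure = structure_similarity(node_a, node_b)
    return weight * location + (1 - weight) * structure


# Two list items: one nested one level deeper and with an extra <span> child.
item_a = ("li", [("a", [])])
item_b = ("li", [("a", []), ("span", [])])
score = mixed_similarity("/html/body/ul/li", item_a, "/html/body/div/ul/li", item_b)
```

In this sketch, two elements that moved within the page but kept their inner structure still score high on the structural term, while elements that share a location but differ internally are penalized by it, which is the intuition behind combining both signals.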
Acknowledgements
The authors acknowledge the support from the Argentinian National Agency for Scientific and Technical Promotion (ANPCyT), grant number PICT-2019-02485.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Grigera, J., Gardey, J.C., Rossi, G., Garrido, A. (2023). Flexible Detection of Similar DOM Elements. In: Marchiori, M., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2020, WEBIST 2021. Lecture Notes in Business Information Processing, vol 469. Springer, Cham. https://doi.org/10.1007/978-3-031-24197-0_10
DOI: https://doi.org/10.1007/978-3-031-24197-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24196-3
Online ISBN: 978-3-031-24197-0
eBook Packages: Computer Science (R0)