Abstract
The growing shift toward real-time applications is producing rapidly increasing volumes of JSON data on the web. The heterogeneous structures of JSON document collections make efficient data management and knowledge discovery challenging, so clustering JSON documents has become an important step in organizing large collections. Existing research clusters JSON documents using structural or semantic similarity measures. However, differently annotated JSON structures can also be related through the context of their attributes, and existing work does not identify this context hidden in the schemas. This highlights the importance of leveraging the syntactic, semantic, and contextual properties of heterogeneous JSON schemas. To address this gap, this work proposes JSON Similarity (JSim), a novel approach for clustering JSON documents by combining the structural and semantic similarity scores of JSON schemas. To capture more semantics, a semantic fusion method is proposed that correlates schemas using both semantic and contextual similarity measures. The JSON documents are then clustered based on the weighted similarity matrix. The results show that the proposed approach significantly outperforms current approaches.
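The idea of clustering on a weighted similarity matrix can be illustrated with a minimal sketch. The abstract does not specify the measures, weights, or clustering algorithm used by JSim, so everything below is an assumption made for illustration: Jaccard overlap of attribute paths stands in for structural similarity, token overlap of attribute names stands in for the semantic score, `alpha` is a hypothetical weight, and a simple threshold-based grouping stands in for the clustering step. All function names are hypothetical.

```python
import re
from itertools import combinations


def attribute_paths(doc, prefix=""):
    """Collect the dotted attribute paths that make up a document's schema."""
    paths = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(value, path)
    elif isinstance(doc, list):
        for item in doc:
            paths |= attribute_paths(item, prefix)
    return paths


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def name_tokens(paths):
    """Split attribute names into lowercase word tokens (stand-in for semantics)."""
    tokens = set()
    for path in paths:
        for part in path.split("."):
            tokens.update(t.lower() for t in re.findall(r"[A-Za-z][a-z]*", part))
    return tokens


def weighted_similarity_matrix(docs, alpha=0.5):
    """Combine structural and (stand-in) semantic scores with weight alpha."""
    schemas = [attribute_paths(d) for d in docs]
    tokens = [name_tokens(s) for s in schemas]
    n = len(docs)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        structural = jaccard(schemas[i], schemas[j])
        semantic = jaccard(tokens[i], tokens[j])
        sim[i][j] = sim[j][i] = alpha * structural + (1 - alpha) * semantic
    return sim


def threshold_clusters(sim, threshold=0.5):
    """Label connected components of the graph whose edges have sim >= threshold."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Illustrative documents (made up for this sketch, not taken from the paper).
docs = [
    {"title": "A", "author": {"name": "X"}},
    {"title": "B", "author": {"name": "Y"}, "year": 2020},
    {"product": "P", "price": 3.5},
]
sim = weighted_similarity_matrix(docs, alpha=0.5)
labels = threshold_clusters(sim, threshold=0.5)
```

In this sketch the two publication-like documents share most attribute paths and name tokens, so they fall into one group, while the product document forms its own. A real pipeline would replace the stand-in semantic score with a lexical or embedding-based measure and the threshold grouping with a proper clustering algorithm over the similarity matrix.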
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Uma Priya, D., Santhi Thilagam, P. (2023). JSON Document Clustering Based on Structural Similarity and Semantic Fusion. In: Chaki, N., Devarakonda, N., Cortesi, A. (eds) Proceedings of International Conference on Computational Intelligence and Data Engineering. ICCIDE 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 163. Springer, Singapore. https://doi.org/10.1007/978-981-99-0609-3_4
DOI: https://doi.org/10.1007/978-981-99-0609-3_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-0608-6
Online ISBN: 978-981-99-0609-3
eBook Packages: Intelligent Technologies and Robotics (R0)