JSON Document Clustering Based on Structural Similarity and Semantic Fusion

Uma Priya, D.; Santhi Thilagam, P.

doi:10.1007/978-981-99-0609-3_4

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 163)

Included in the following conference series:

International Conference on Computational Intelligence and Data Engineering

Abstract

The growing shift toward real-time applications is generating rapidly increasing volumes of JSON data on the web. The heterogeneous structures of JSON document collections make efficient data management and knowledge discovery challenging, so clustering JSON documents has become an important problem in organizing large data collections. Existing research has focused on clustering JSON documents using structural or semantic similarity measures. However, differently annotated JSON structures are also related by the context of the JSON attributes. Existing approaches therefore fail to identify the context hidden in the schemas, which underlines the importance of leveraging the syntactic, semantic, and contextual properties of heterogeneous JSON schemas. To address this research gap, this work proposes JSON Similarity (JSim), a novel approach for clustering JSON documents by combining the structural and semantic similarity scores of JSON schemas. To capture more semantics, a semantic fusion method is proposed, which correlates schemas using both semantic and contextual similarity measures. The JSON documents are then clustered based on the resulting weighted similarity matrix. The results show that the proposed approach significantly outperforms existing approaches.
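The general idea of clustering on a weighted combination of structural and semantic similarity can be illustrated with a minimal sketch. It is not the JSim implementation, which the abstract does not detail. Everything in it is an assumption made for illustration: Jaccard overlap of attribute paths as the structural score, a hypothetical `SYNONYMS` table standing in for a lexical resource or embedding model, the weight `alpha`, and threshold-based connected components as the clustering step. The contextual part of semantic fusion is not modelled.

```python
from itertools import combinations

def attribute_paths(doc, prefix=""):
    """Return the set of dotted attribute paths in a parsed JSON value."""
    paths = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(value, path)
    elif isinstance(doc, list):
        for item in doc:
            paths |= attribute_paths(item, prefix)
    return paths

def structural_similarity(paths_a, paths_b):
    """Jaccard overlap of attribute paths (assumed structural measure)."""
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)

# Hypothetical synonym groups; a real system would use a lexical
# database or an embedding model instead.
SYNONYMS = [{"author", "writer", "creator"}, {"title", "name"}, {"year", "date"}]

def term_similarity(a, b):
    """1.0 for identical or synonymous attribute names, else 0.0."""
    if a == b:
        return 1.0
    return 1.0 if any(a in g and b in g for g in SYNONYMS) else 0.0

def semantic_similarity(paths_a, paths_b):
    """Symmetrised average best-match score over attribute names
    (the last segment of each path)."""
    names_a = {p.split(".")[-1].lower() for p in paths_a}
    names_b = {p.split(".")[-1].lower() for p in paths_b}
    if not names_a or not names_b:
        return 0.0
    def directed(xs, ys):
        return sum(max(term_similarity(x, y) for y in ys) for x in xs) / len(xs)
    return 0.5 * (directed(names_a, names_b) + directed(names_b, names_a))

def weighted_similarity_matrix(docs, alpha=0.5):
    """sim = alpha * structural + (1 - alpha) * semantic, for every pair."""
    paths = [attribute_paths(d) for d in docs]
    n = len(docs)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        score = (alpha * structural_similarity(paths[i], paths[j])
                 + (1 - alpha) * semantic_similarity(paths[i], paths[j]))
        sim[i][j] = sim[j][i] = score
    return sim

def cluster(sim, threshold=0.4):
    """Label connected components of the graph whose edges join
    documents with similarity >= threshold."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

docs = [
    {"title": "A", "author": "X", "year": 2020},
    {"name": "B", "writer": "Y", "date": 2021},
    {"sensor": {"id": 1, "temp": 20.5}},
    {"sensor": {"id": 2, "temp": 19.0, "unit": "C"}},
]

print(cluster(weighted_similarity_matrix(docs, alpha=0.5)))  # [0, 0, 1, 1]
print(cluster(weighted_similarity_matrix(docs, alpha=1.0)))  # [0, 1, 2, 2]
```

With `alpha=1.0` only structure counts, so the first two documents, which describe the same kind of record with different attribute names, fall into separate clusters. Giving the semantic score some weight groups them together, which is the behaviour the abstract argues for.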

Author information

Authors and Affiliations

National Institute of Technology Karnataka, Mangalore, India
D. Uma Priya & P. Santhi Thilagam

Authors

D. Uma Priya
P. Santhi Thilagam

Corresponding author

Correspondence to D. Uma Priya.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, University of Calcutta, Kolkata, India
Nabendu Chaki
VIT-AP University, Amaravati, Andhra Pradesh, India
Nagaraju Devarakonda
Ca’ Foscari University, Venice, Italy
Agostino Cortesi

About this paper

Cite this paper

Uma Priya, D., Santhi Thilagam, P. (2023). JSON Document Clustering Based on Structural Similarity and Semantic Fusion. In: Chaki, N., Devarakonda, N., Cortesi, A. (eds) Proceedings of International Conference on Computational Intelligence and Data Engineering. ICCIDE 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 163. Springer, Singapore. https://doi.org/10.1007/978-981-99-0609-3_4

DOI: https://doi.org/10.1007/978-981-99-0609-3_4
Published: 18 June 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-0608-6
Online ISBN: 978-981-99-0609-3
eBook Packages: Intelligent Technologies and Robotics (R0)
