JSON Document Clustering Based on Structural Similarity and Semantic Fusion

  • Conference paper
Proceedings of International Conference on Computational Intelligence and Data Engineering (ICCIDE 2022)

Abstract

The growing shift toward real-time applications is generating rapidly increasing volumes of JSON data on the web. The heterogeneous structures of JSON document collections make efficient data management and knowledge discovery challenging, so clustering JSON documents has become an important step in organizing large data collections. Existing research clusters JSON documents using either structural or semantic similarity measures. However, differently annotated JSON structures can also be related through the context of their attributes, and existing approaches cannot identify this context hidden in the schemas. This highlights the importance of leveraging the syntactic, semantic, and contextual properties of heterogeneous JSON schemas. To address this research gap, this work proposes JSON Similarity (JSim), a novel approach for clustering JSON documents that combines the structural and semantic similarity scores of JSON schemas. To capture more semantics, a semantic fusion method is proposed that correlates schemas using both semantic and contextual similarity measures. The JSON documents are then clustered based on the weighted similarity matrix. The results show that the proposed approach significantly outperforms existing approaches.
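The idea of clustering on a weighted similarity matrix can be illustrated with a minimal Python sketch. This is not the authors' JSim implementation. As assumptions for illustration only, structural similarity is taken to be Jaccard overlap of attribute paths, semantic similarity is Jaccard overlap of attribute names after normalization through a small hypothetical synonym table (`SYNONYMS`, standing in for a lexical resource or embedding model), the weight `alpha` and the threshold are arbitrary, and clustering is done by linking documents whose fused score reaches the threshold.

```python
# Hypothetical synonym table: a stand-in for a real lexical or embedding model.
SYNONYMS = {"writer": "author", "name": "title", "cost": "price"}


def attribute_paths(doc, prefix=""):
    """Return the set of dotted attribute paths in a parsed JSON value."""
    paths = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(value, path)
    elif isinstance(doc, list):
        for item in doc:
            paths |= attribute_paths(item, prefix)
    return paths


def attribute_names(doc):
    """Return attribute names, lowercased and mapped through SYNONYMS."""
    names = (p.split(".")[-1].lower() for p in attribute_paths(doc))
    return {SYNONYMS.get(n, n) for n in names}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def weighted_similarity(docs, alpha=0.5):
    """Fuse structural and semantic scores: alpha * struct + (1 - alpha) * sem."""
    paths = [attribute_paths(d) for d in docs]
    names = [attribute_names(d) for d in docs]
    n = len(docs)
    return [
        [
            alpha * jaccard(paths[i], paths[j])
            + (1 - alpha) * jaccard(names[i], names[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def cluster(matrix, threshold=0.5):
    """Group document indices into connected components of the graph
    whose edges are pairs with similarity >= threshold."""
    n = len(matrix)
    assigned = [False] * n
    clusters = []
    for start in range(n):
        if assigned[start]:
            continue
        assigned[start] = True
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if not assigned[j] and matrix[i][j] >= threshold:
                    assigned[j] = True
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


docs = [
    {"title": "A", "author": "X", "price": 10},
    {"name": "B", "writer": "Y", "cost": 12},
    {"latitude": 1.0, "longitude": 2.0},
    {"latitude": 3.0, "longitude": 4.0, "altitude": 5},
]
print(cluster(weighted_similarity(docs, alpha=0.5)))  # [[0, 1], [2, 3]]
print(cluster(weighted_similarity(docs, alpha=1.0)))  # [[0], [1], [2, 3]]
```

With `alpha=1.0` only structure counts, so the two differently annotated book records fall into separate clusters; adding the semantic term groups them together, which is the kind of case the abstract describes.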

Corresponding author

Correspondence to D. Uma Priya.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Uma Priya, D., Santhi Thilagam, P. (2023). JSON Document Clustering Based on Structural Similarity and Semantic Fusion. In: Chaki, N., Devarakonda, N., Cortesi, A. (eds) Proceedings of International Conference on Computational Intelligence and Data Engineering. ICCIDE 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 163. Springer, Singapore. https://doi.org/10.1007/978-981-99-0609-3_4
