Parametric schema inference for massive JSON datasets

Baazizi, Mohamed-Amine; Colazzo, Dario; Ghelli, Giorgio; Sartiani, Carlo

doi:10.1007/s00778-018-0532-7

Regular Paper
Published: 05 January 2019

The VLDB Journal, Volume 28, pages 497–521 (2019)

Abstract

In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this brings several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
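
To make the idea of a parametric, parallelizable inference concrete, the following is a minimal single-machine sketch and not the authors' Spark implementation. It assumes a map step that types each JSON value and an associative, commutative fusion step that could serve as a reduce operator. The parameter values "K" and "L" borrow the names used in the notes; their meaning here (K merges all record types into one with optional fields, L merges only record types with the same key set) is an assumption of this sketch, as are all function names.

```python
import json
from functools import reduce

# A type is a union, represented as a frozenset of structural types.
# Structural types: ("Null",), ("Bool",), ("Num",), ("Str",),
# ("Arr", element_union), ("Rec", ((key, field_union, is_optional), ...)).

def infer(value, equiv):
    """Map step: type one JSON value as a union with a single member."""
    if value is None:
        return frozenset({("Null",)})
    if isinstance(value, bool):  # test bool before int: bool is a subclass of int
        return frozenset({("Bool",)})
    if isinstance(value, (int, float)):
        return frozenset({("Num",)})
    if isinstance(value, str):
        return frozenset({("Str",)})
    if isinstance(value, list):
        elems = (infer(v, equiv) for v in value)
        return frozenset({("Arr", reduce(lambda a, b: fuse(a, b, equiv), elems, frozenset()))})
    fields = tuple(sorted((k, infer(v, equiv), False) for k, v in value.items()))
    return frozenset({("Rec", fields)})

def class_key(s, equiv):
    """Equivalence class of a structural type (assumed reading of K and L)."""
    if s[0] == "Rec" and equiv == "L":
        return ("Rec", frozenset(k for k, _, _ in s[1]))
    return s[0]

def merge(s1, s2, equiv):
    """Merge two structural types that belong to the same equivalence class."""
    if s1[0] == "Rec":
        f1 = {k: (t, o) for k, t, o in s1[1]}
        f2 = {k: (t, o) for k, t, o in s2[1]}
        fields = []
        for k in sorted(f1.keys() | f2.keys()):
            if k in f1 and k in f2:
                fields.append((k, fuse(f1[k][0], f2[k][0], equiv), f1[k][1] or f2[k][1]))
            else:  # key missing on one side: the field becomes optional
                t, _ = f1.get(k) or f2.get(k)
                fields.append((k, t, True))
        return ("Rec", tuple(fields))
    if s1[0] == "Arr":
        return ("Arr", fuse(s1[1], s2[1], equiv))
    return s1  # identical basic types

def fuse(t1, t2, equiv):
    """Reduce step: fuse two unions, merging members of the same class."""
    classes = {}
    for s in t1 | t2:
        key = class_key(s, equiv)
        classes[key] = merge(classes[key], s, equiv) if key in classes else s
    return frozenset(classes.values())

def infer_schema(values, equiv):
    return reduce(lambda a, b: fuse(a, b, equiv), (infer(v, equiv) for v in values), frozenset())

def show(t):
    """Render a union as text; '?' marks an optional field."""
    def one(s):
        if s[0] == "Rec":
            return "{" + ", ".join(f"{k}{'?' if o else ''}: {show(ft)}" for k, ft, o in s[1]) + "}"
        if s[0] == "Arr":
            return "[" + show(s[1]) + "]"
        return s[0]
    return " + ".join(sorted(one(s) for s in t)) or "Empty"

lines = ['{"a": 1, "b": "x"}', '{"a": 2}', '{"a": null, "b": "y"}']
docs = [json.loads(line) for line in lines]
print(show(infer_schema(docs, "K")))  # {a: Null + Num, b?: Str}
print(show(infer_schema(docs, "L")))  # {a: Null + Num, b: Str} + {a: Num}
```

In this sketch the parameter trades conciseness for precision: "K" yields one record type with an optional field, while "L" keeps one record type per distinct key set. Because `fuse` does not depend on the order in which values are combined, the same function could in principle be passed to a distributed reduce.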

Notes

We are here ignoring empty structural types, which are record types where one mandatory field has type $\emptyset$, since they are never inferred, and we could even forbid them in the syntax.
https://developer.nytimes.com.
https://dumps.wikimedia.org/wikidatawiki/entities/.
https://www.kaggle.com/borisch/russian-election-2018-twitter.
https://www.kaggle.com/borisch/russian-election-2018-vkcom-user-activity/feed.
https://vk.com/dev/streaming_api_docs_2.
https://core.ac.uk.
https://core.ac.uk/services#dump-structure.
The inferred types for each dataset are reported in [19].
We may be more formal, as follows: Consider n keys and a space where every point is a set of shapes, that is, a set of subsets of the set of keys. In this setting, every ${\mathcal {L}}$-reduced type indicates exactly one point of a space whose size is $2^{2^n}$; hence, each ${\mathcal {L}}$-reduced type brings exactly the same amount of information: $2^n$ bits. On the other hand, a ${\mathcal {K}}$-reduced type is, in general, compatible with many different points in this space; hence, it brings a lower number of bits, which depends on the number of optional keys, and may be computed for each specific ${\mathcal {K}}$-reduced type. We may compare this number with $2^n$ in order to mathematically quantify the information gain. We do not pursue this avenue because this model embeds the unrealistic idea that every distribution of shapes has the same probability, and because we do not believe that this model, although mathematically coherent, is a useful model of the information needs of the data analyst.
In the case of VK, we replicated the original dataset 4, 6, ..., 20 times, so that the largest replicated dataset reaches at least 100 GB.
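
The counting argument in the note on ${\mathcal {L}}$-reduced and ${\mathcal {K}}$-reduced types can be checked by brute force for small n. The sketch below is only an illustration under an assumed reading of "compatible": a set of shapes is taken to be compatible with a record type when the keys present in every shape are exactly the mandatory keys and the keys present in at least one shape are exactly all keys of the type. The function names are hypothetical.

```python
from itertools import combinations
from math import log2

def powerset(items):
    """All subsets of items, each as a frozenset."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def compatible_points(mandatory, optional):
    """Count the sets of shapes compatible with a record type (assumed reading)."""
    mandatory = frozenset(mandatory)
    keys = mandatory | frozenset(optional)
    shapes = powerset(keys)          # 2^n shapes
    count = 0
    for point in powerset(shapes):   # 2^(2^n) points, so keep n small
        if not point:
            continue
        in_all = frozenset.intersection(*point)
        in_some = frozenset.union(*point)
        if in_all == mandatory and in_some == keys:
            count += 1
    return count

n = 3
total_bits = 2 ** n                              # bits carried by one exact point
points = compatible_points({"a"}, {"b", "c"})    # type {a, b?, c?}
print(points)                                    # 7
print(round(total_bits - log2(points), 2))       # 5.19
print(compatible_points({"a", "b"}, {"c"}))      # 1
```

Under these assumptions, with three keys a type with two optional keys is compatible with 7 of the 256 points and so carries about 5.19 of the 8 bits, whereas a type with a single optional key pins down one point and loses nothing. This matches the note's remark that the loss depends on the number of optional keys.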

References

[1] Apache Spark. http://spark.apache.org
[2] Baazizi, M.-A., Ben Lahmar, H., Colazzo, D., Ghelli, G., Sartiani, C.: Schema inference for massive JSON datasets. In: EDBT ’17 (2017)
[3] Baazizi, M.-A., Bidoit, N., Colazzo, D., Malla, N., Sahakyan, M.: Projection for XML update optimization. In: EDBT ’11, pp. 307–318 (2011)
[4] Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Counting types for massive JSON datasets. In: DBPL ’17 (2017)
[5] Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Proofs for parametric schema inference for massive JSON datasets. Working paper or preprint (2018). https://hal.archives-ouvertes.fr/hal-01960464/
[6] Benzaken, V., Castagna, G., Colazzo, D., Nguyên, K.: Type-based XML projection. In: VLDB ’06, pp. 271–282 (2006)
[7] Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: VLDB ’06, pp. 115–126 (2006)
[8] Beyer, K.S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M.Y., Kanne, C., Özcan, F., Shekita, E.J.: Jaql: a scripting language for large scale semistructured data analysis. PVLDB 4(12), 1272–1283 (2011)
[9] Bonetta, D., Brantner, M.: Fad.js: fast JSON data access using JIT-based speculative optimizations. PVLDB 10(12), 1778–1789 (2017)
[10] Bourhis, P., Reutter, J.L., Suárez, F., Vrgoc, D.: JSON: data model, query languages and schema specification. In: PODS ’17, pp. 123–135 (2017)
[11] Bray, T.: The JavaScript object notation (JSON) data interchange format (2014). https://tools.ietf.org/html/rfc7159
[12] Cebiric, S., Goasdoué, F., Manolescu, I.: Query-oriented summarization of RDF graphs. PVLDB 8(12), 2012–2015 (2015)
[13] Ciucanu, R., Staworko, S.: Learning schemas for unordered XML. In: DBPL ’13 (2013)
[14] Colazzo, D., Ghelli, G., Sartiani, C.: Typing massive JSON datasets. In: XLDI ’12, Affiliated with ICFP (2012)
[15] DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: Özcan, F., Koutrika, G., Madden, S. (eds.) SIGMOD ’16, pp. 295–310. ACM (2016)
[16] Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst. 57(4), 1114–1158 (2015)
[17] Garofalakis, M.N., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: a system for extracting document type descriptors from XML documents. In: SIGMOD ’00, pp. 165–176 (2000)
[18] Goldman, R., Widom, J.: Dataguides: enabling query formulation and optimization in semistructured databases. In: VLDB ’97, pp. 436–445 (1997)
[19] http://webia.lip6.fr/~baazizi/rs/js/vj18
[20] JSON schema definition language. http://jsoniq.org/docs/JSound/html-single/
[21] JSON schema language. http://json-schema.org
[22] Labs, T.S.: Studio 3T (2017). https://studio3t.com
[23] Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J., Kossmann, D.: Mison: a fast JSON parser for data analytics. PVLDB 10(10), 1118–1129 (2017)
[24] Liu, Z.H., Hammerschmidt, B., McMahon, D.: JSON data management: supporting schema-less development in RDBMS. In: SIGMOD ’14, pp. 1247–1258 (2014)
[25] Lohrey, M., Maneth, S., Reh, C.P.: Compression of unordered XML trees. In: ICDT ’17, pp. 18:1–18:17 (2017)
[26] McHugh, J., Widom, J.: Query optimization for XML. In: VLDB ’99, pp. 315–326. Morgan Kaufmann Publishers Inc. (1999)
[27] Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5(4), 660–704 (2005)
[28] Nestorov, S., Abiteboul, S., Motwani, R.: Inferring structure in semistructured data. SIGMOD Rec. 26(4), 39–43 (1997)
[29] Nestorov, S., Abiteboul, S., Motwani, R.: Extracting schema from semistructured data. In: SIGMOD ’98, pp. 295–306 (1998)
[30] Pezoa, F., Reutter, J.L., Suarez, F., Ugarte, M., Vrgoč, D.: Foundations of JSON Schema. In: WWW ’16, pp. 263–273 (2016)
[31] Scherzinger, S., de Almeida, E.C., Cerqueus, T., de Almeida, L.B., Holanda, P.: Finding and fixing type mismatches in the evolution of object-NoSQL mappings. In: Proceedings of the Workshops of the EDBT/ICDT 2016 (2016)
[32] Schmidt, P.: mongodb-schema (2017). https://github.com/mongodb-js/mongodb-schema
[33] scrapinghub: Skinfer (2015). https://github.com/scrapinghub/skinfer
[34] Spark dataframe. https://spark.apache.org/docs/latest/sql-programming-guide.html
[35] The JSON Query Language. http://www.jsoniq.org
[36] Wang, L., Zhang, S., Shi, J., Jiao, L., Hassanzadeh, O., Zou, J., Wang, C.: Schema management for document stores. Proc. VLDB Endow. 8(9), 922–933 (2015)

Author information

Authors and Affiliations

CNRS, Laboratoire d’Informatique de Paris 6, Sorbonne Université, 75005, Paris, France
Mohamed-Amine Baazizi
CNRS, LAMSADE - Université Paris Dauphine, PSL Research University, 75016, Paris, France
Dario Colazzo
Dipartimento di Informatica, Università di Pisa, Pisa, Italy
Giorgio Ghelli
DIMIE, Università della Basilicata, Potenza, Italy
Carlo Sartiani

Corresponding author

Correspondence to Mohamed-Amine Baazizi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Baazizi, MA., Colazzo, D., Ghelli, G. et al. Parametric schema inference for massive JSON datasets. The VLDB Journal 28, 497–521 (2019). https://doi.org/10.1007/s00778-018-0532-7

Received: 01 February 2018
Revised: 06 November 2018
Accepted: 29 November 2018
Published: 05 January 2019
Issue Date: 01 August 2019
DOI: https://doi.org/10.1007/s00778-018-0532-7
