Abstract
In recent years, JSON established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences as well: Data analysts and programmers cannot exploit a schema for a reliable description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric as the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report about an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
Similar content being viewed by others
Notes
We are here ignoring empty structural types, which are record types where one mandatory field has type \(\emptyset \), since they are never inferred, and we could even forbid them in the syntax.
The inferred types for each dataset are reported in [19].
We may be more formal, as follows: Consider n keys and a space where every point is a set of shapes, that is, a set of subsets of . In this setting, every \({\mathcal {L}}\)-reduced type exactly indicates one point of a space whose size is \(2^{2^n}\); hence, each \({\mathcal {L}}\)-reduced type brings exactly the same amount of information: \(2^n\) bits. On the other side, a \({\mathcal {K}}\)-reduced type is, in general, compatible with many different points in this space; hence, it brings a lower number of bits, which depends on the number of optional keys, and may be computed for each specific \({\mathcal {K}}\)-reduced type. We may compare this number with \(2^n\) in order to mathematically quantify the information gain. We do not pursue this avenue because this model embeds the unrealistic idea that every distribution of shapes has the same probability, and because we do not believe that this model, although mathematically coherent, is a useful model of the information needs of the data analyst.
In the case of VK, we multiplied the original datasets 4, 6, ..., 20 times to reach a minimum size of 100 GB as the largest size.
References
Apache Spark. http://spark.apache.org
Baazizi, M.A., Ben Lahmar, H., Colazzo, D., Ghelli, G., Sartiani, C.: Schema inference for massive JSON datasets. In: EDBT ’17 (2017)
Baazizi, M.A., Bidoit, N., Colazzo, D., Malla, N., Sahakyan, M.: Projection for XML update optimization. In: EDBT ’11, pp. 307–318 (2011)
Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Counting types for massive JSON datasets. In: DBPL ’17 (2017)
Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Proofs for parametric schema inference for massive JSON datasets. Working paper or preprint (2018). https://hal.archives-ouvertes.fr/hal-01960464/
Benzaken, V., Castagna, G., Colazzo, D., Nguyên, K.: Type-based XML projection. In: VLDB ’06, pp. 271–282 (2006)
Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: VLDB ‘06, pp. 115–126 (2006)
Beyer, K.S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M.Y., Kanne, C., Özcan, F., Shekita, E.J.: Jaql: a scripting language for large scale semistructured data analysis. PVLDB 4(12), 1272–1283 (2011)
Bonetta, D., Brantner, M.: Fad.js: fast JSON data access using JIT-based speculative optimizations. PVLDB 10(12), 1778–1789 (2017)
Bourhis, P., Reutter, J.L., Suárez, F., Vrgoc, D.: JSON: data model, query languages and schema specification. In: PODS ’17, pp. 123–135 (2017)
Bray, T.: The JavaScript object notation (JSON) data interchange format (2014). https://tools.ietf.org/html/rfc7159
Cebiric, S., Goasdoué, F., Manolescu, I.: Query-oriented summarization of RDF graphs. PVLDB 8(12), 2012–2015 (2015)
Ciucanu, R., Staworko, S.: Learning schemas for unordered XML. In: DBPL ‘13 (2013)
Colazzo, D., Ghelli, G., Sartiani, C.: Typing massive JSON datasets. In: XLDI ’12, Affiliated with ICFP (2012)
DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: Özcan, F., Koutrika, G., Madden, S. (eds.) SIGMOD ’16, pp. 295–310. ACM (2016)
Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst. 57(4), 1114–1158 (2015)
Garofalakis, M.N., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: a system for extracting document type descriptors from XML documents. In: SIGMOD ’00, pp. 165–176 (2000)
Goldman, R., Widom, J.: Dataguides: enabling query formulation and optimization in semistructured databases. In: VLDB’97, pp. 436–445 (1997)
JSON schema definition language. http://jsoniq.org/docs/JSound/html-single/
JSON schema language. http://json-schema.org
Labs, T.S.: Studio 3T, 2017. https://studio3t.com
Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J., Kossmann, D.: Mison: a fast JSON parser for data analytics. PVLDB 10(10), 1118–1129 (2017)
Liu, Z.H., Hammerschmidt, B., McMahon, D.: JSON data management: supporting schema-less development in RDBMS. In: SIGMOD ’14, pp. 1247–1258 (2014)
Lohrey, M., Maneth, S., Reh, C.P.: Compression of unordered XML trees. In: ICDT’07, pp. 18:1–18:17 (2017)
McHugh, J., Widom, J.: Query optimization for XML. In: VLDB ’99, pp. 315–326. Morgan Kaufmann Publishers Inc. (1999)
Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5(4), 660–704 (2005)
Nestorov, S., Abiteboul, S., Motwani, R.: Inferring structure in semistructured data. SIGMOD Rec. 26(4), 39–43 (1997)
Nestorov, S., Abiteboul, S., Motwani, R.: Extracting schema from semistructured data. In: SIGMOD ’98, pp. 295–306 (1998)
Pezoa, F., Reutter, J.L., Suarez, F., Ugarte, M., Vrgoč, D.: Foundations of JSON Schema. In: WWW ’16, pp. 263–273 (2016)
Scherzinger, S., de Almeida, E.C., Cerqueus, T., de Almeida, L.B., Holanda, P.: Finding and fixing type mismatches in the evolution of object-nosql mappings. In: Proceedings of the Workshops of the EDBT/ICDT 2016 (2016)
Schmidt, P.: mongodb-schema (2017). https://github.com/mongodb-js/mongodb-schema
scrapinghub. Skinfer (2015). https://github.com/scrapinghub/skinfer
Spark dataframe. https://spark.apache.org/docs/latest/sql-programming-guide.html
The JSON Query Language. http://www.jsoniq.org
Wang, L., Zhang, S., Shi, J., Jiao, L., Hassanzadeh, O., Zou, J., Wangz, C.: Schema management for document stores. Proc. VLDB Endow. 8(9), 922–933 (2015)
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Baazizi, MA., Colazzo, D., Ghelli, G. et al. Parametric schema inference for massive JSON datasets. The VLDB Journal 28, 497–521 (2019). https://doi.org/10.1007/s00778-018-0532-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-018-0532-7