Parametric schema inference for massive JSON datasets

  • Mohamed-Amine Baazizi
  • Dario Colazzo
  • Giorgio Ghelli
  • Carlo Sartiani
Regular Paper


In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this brings several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
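To make the idea of parametric, parallelizable inference concrete, the following is a minimal Python sketch and not the algorithm or the type language of the paper. It assumes a map step that infers a type for each JSON value and a reduce step that fuses two types. All names here (`infer`, `fuse`, `merge_records`, `infer_schema`) and the boolean `precise` flag are hypothetical stand-ins: `precise=False` fuses all record types into one (more concise), while `precise=True` fuses only records with the same set of labels (more precise). Plain `functools.reduce` stands in for a distributed reduce such as the one Spark provides.

```python
import json
from functools import reduce

# A type is a frozenset of alternatives (a union). Each alternative is a basic
# type name, ("arr", element_type), or ("rec", ((label, type, required), ...)).
BASIC = {type(None): "Null", bool: "Bool", int: "Num", float: "Num", str: "Str"}

def infer(value, precise):
    """Map step: infer the type of a single parsed JSON value."""
    if isinstance(value, list):
        elem = reduce(lambda a, b: fuse(a, b, precise),
                      (infer(v, precise) for v in value), frozenset())
        return frozenset({("arr", elem)})
    if isinstance(value, dict):
        fields = tuple((k, infer(value[k], precise), True) for k in sorted(value))
        return frozenset({("rec", fields)})
    return frozenset({BASIC[type(value)]})

def merge_records(f1, f2, precise):
    """Fuse two record types; a label missing on either side becomes optional."""
    d1 = {k: (t, req) for k, t, req in f1}
    d2 = {k: (t, req) for k, t, req in f2}
    merged = []
    for k in sorted(d1.keys() | d2.keys()):
        if k in d1 and k in d2:
            merged.append((k, fuse(d1[k][0], d2[k][0], precise),
                           d1[k][1] and d2[k][1]))
        else:
            t, _ = d1[k] if k in d1 else d2[k]
            merged.append((k, t, False))
    return tuple(merged)

def fuse(t1, t2, precise):
    """Reduce step: fuse two types into one union with merged arrays/records."""
    atoms = t1 | t2
    out = {a for a in atoms if isinstance(a, str)}
    arrays = [a[1] for a in atoms if not isinstance(a, str) and a[0] == "arr"]
    records = [a[1] for a in atoms if not isinstance(a, str) and a[0] == "rec"]
    if arrays:
        out.add(("arr", reduce(lambda a, b: fuse(a, b, precise), arrays)))
    groups = {}
    for fields in records:
        # The (assumed) parameter decides which record types are fused together.
        group_key = frozenset(k for k, _, _ in fields) if precise else None
        groups.setdefault(group_key, []).append(fields)
    for group in groups.values():
        out.add(("rec", reduce(lambda a, b: merge_records(a, b, precise), group)))
    return frozenset(out)

def infer_schema(lines, precise=False):
    """Infer one type for a collection given as one JSON document per line."""
    types = (infer(json.loads(line), precise) for line in lines)
    return reduce(lambda a, b: fuse(a, b, precise), types, frozenset())

data = ['{"a": 1, "b": "x"}', '{"a": null}', '{"c": [1, "y"]}']
concise_schema = infer_schema(data, precise=False)  # one record, optional fields
precise_schema = infer_schema(data, precise=True)   # one record per label set
```

In this sketch `fuse` is commutative and associative, which is the property that lets a reduce step be applied to partitions of the data in any order. With `precise=False` the three sample objects collapse into a single record type whose fields are all optional; with `precise=True` they stay as three separate record types.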


JSON · Schema inference · Map-reduce · Spark · Big data collections




Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  • Mohamed-Amine Baazizi (1, corresponding author)
  • Dario Colazzo (2)
  • Giorgio Ghelli (3)
  • Carlo Sartiani (4)

  1. CNRS, Laboratoire d’Informatique de Paris 6, Sorbonne Université, Paris, France
  2. CNRS, LAMSADE - Université Paris Dauphine, PSL Research University, Paris, France
  3. Dipartimento di Informatica, Università di Pisa, Pisa, Italy
  4. DIMIE, Università della Basilicata, Potenza, Italy
