Parametric schema inference for massive JSON datasets

Abstract

In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this brings several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a reliable description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we address the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language that is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about the input data. We then present our contributions: the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, which enables reasonable schema inference times for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
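
To make the idea of a parametric, parallelizable inference concrete, the following is a minimal Python sketch and not the implementation described in the paper. It assumes a map phase that types each JSON value and a reduce phase that fuses types pairwise, with an equivalence parameter deciding which record types are merged. The function names (`infer`, `fuse`, `merge`, `class_key`, `infer_schema`), the tuple encoding of types, and the reading of the parameter values `"K"` (merge all record types) and `"L"` (merge only record types with the same set of keys) are assumptions made for illustration.

```python
import json
from functools import reduce

# A type is a union, encoded as a frozenset of structural types:
#   ("Null",), ("Bool",), ("Num",), ("Str",)
#   ("Arr", element_union)
#   ("Rec", ((key, field_union, is_optional), ...))   # fields sorted by key

def infer(value, equiv):
    """Map phase: type of a single JSON value, as a one-element union."""
    if value is None:
        return frozenset({("Null",)})
    if isinstance(value, bool):            # bool first: bool is an int in Python
        return frozenset({("Bool",)})
    if isinstance(value, (int, float)):
        return frozenset({("Num",)})
    if isinstance(value, str):
        return frozenset({("Str",)})
    if isinstance(value, list):
        elem = frozenset()                 # empty union for an empty array
        for item in value:
            elem = fuse(elem, infer(item, equiv), equiv)
        return frozenset({("Arr", elem)})
    fields = sorted(((k, infer(v, equiv), False) for k, v in value.items()),
                    key=lambda f: f[0])
    return frozenset({("Rec", tuple(fields))})

def class_key(s, equiv):
    """Equivalence class of a structural type under the chosen parameter."""
    if s[0] == "Rec" and equiv == "L":
        return ("Rec", frozenset(k for k, _, _ in s[1]))  # same key set only
    return (s[0],)                                         # same kind

def merge(s1, s2, equiv):
    """Fuse two structural types that belong to the same class."""
    if s1[0] == "Arr":
        return ("Arr", fuse(s1[1], s2[1], equiv))
    if s1[0] == "Rec":
        f1 = {k: (t, opt) for k, t, opt in s1[1]}
        f2 = {k: (t, opt) for k, t, opt in s2[1]}
        fields = []
        for k in sorted(f1.keys() | f2.keys()):
            if k in f1 and k in f2:
                fields.append((k, fuse(f1[k][0], f2[k][0], equiv),
                               f1[k][1] or f2[k][1]))
            else:                          # key missing on one side: optional
                t = (f1[k] if k in f1 else f2[k])[0]
                fields.append((k, t, True))
        return ("Rec", tuple(fields))
    return s1                              # identical basic types

def fuse(t1, t2, equiv):
    """Reduce phase: union of two types, merging equivalent members."""
    classes = {}
    for s in list(t1) + list(t2):
        key = class_key(s, equiv)
        classes[key] = merge(classes[key], s, equiv) if key in classes else s
    return frozenset(classes.values())

def infer_schema(lines, equiv):
    """Type every JSON line, then fuse all the resulting types."""
    types = (infer(json.loads(line), equiv) for line in lines)
    return reduce(lambda a, b: fuse(a, b, equiv), types, frozenset())

docs = ['{"a": 1, "b": "x"}', '{"a": null}']
print(len(infer_schema(docs, "K")))  # 1: one record type, with "b" optional
print(len(infer_schema(docs, "L")))  # 2: one record type per key set
```

With `"K"` the result is more concise, while `"L"` keeps one record type per distinct set of keys and is therefore more precise. Because the reduce step is a pairwise fusion, the same two functions could in principle be passed to the `map` and `reduce` operations of a Spark RDD; this relies on the fusion being associative and commutative, which the sketch assumes rather than proves.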

Notes

  1. Here we ignore empty structural types, which are record types where one mandatory field has type \(\emptyset \), since they are never inferred, and we could even forbid them in the syntax.
  2. https://developer.nytimes.com.
  3. https://dumps.wikimedia.org/wikidatawiki/entities/.
  4. https://www.kaggle.com/borisch/russian-election-2018-twitter.
  5. https://www.kaggle.com/borisch/russian-election-2018-vkcom-user-activity/feed.
  6. https://vk.com/dev/streaming_api_docs_2.
  7. https://core.ac.uk.
  8. https://core.ac.uk/services#dump-structure.
  9. The inferred types for each dataset are reported in [19].
  10. We may be more formal, as follows: Consider n keys and a space where every point is a set of shapes, that is, a set of subsets of the set of keys. In this setting, every \({\mathcal {L}}\)-reduced type indicates exactly one point of a space whose size is \(2^{2^n}\); hence, each \({\mathcal {L}}\)-reduced type carries exactly the same amount of information: \(2^n\) bits. On the other hand, a \({\mathcal {K}}\)-reduced type is, in general, compatible with many different points in this space; hence, it carries a lower number of bits, which depends on the number of optional keys and may be computed for each specific \({\mathcal {K}}\)-reduced type. We may compare this number with \(2^n\) in order to quantify the information gain mathematically. We do not pursue this avenue because this model embeds the unrealistic idea that every distribution of shapes has the same probability, and because we do not believe that this model, although mathematically coherent, is a useful model of the information needs of the data analyst.
  11. In the case of VK, we replicated the original dataset 4, 6, ..., 20 times, so that the largest resulting dataset reaches at least 100 GB.
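
The counting argument in note 10 can be checked by brute force for a small number of keys. The sketch below is illustrative only: the three key names and the helper `powerset` are hypothetical. It enumerates the shapes (subsets of the key set) and the points (sets of shapes), and confirms that singling out one point takes \(2^n\) bits.

```python
from itertools import combinations
from math import log2

def powerset(items):
    """All subsets of `items`, each as a frozenset."""
    items = list(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

keys = ["a", "b", "c"]         # n = 3 hypothetical keys
shapes = powerset(keys)        # every shape is a subset of the keys: 2^n
points = powerset(shapes)      # every point is a set of shapes: 2^(2^n)
bits = log2(len(points))       # bits needed to identify one point

print(len(shapes), len(points), bits)  # 8 256 8.0
```

The enumeration grows doubly exponentially, so it is only practical for very small n; it is meant as a sanity check of the formula, not as a tool.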

References

  1. Apache Spark. http://spark.apache.org
  2. Baazizi, M.-A., Ben Lahmar, H., Colazzo, D., Ghelli, G., Sartiani, C.: Schema inference for massive JSON datasets. In: EDBT ’17 (2017)
  3. Baazizi, M.-A., Bidoit, N., Colazzo, D., Malla, N., Sahakyan, M.: Projection for XML update optimization. In: EDBT ’11, pp. 307–318 (2011)
  4. Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Counting types for massive JSON datasets. In: DBPL ’17 (2017)
  5. Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Proofs for parametric schema inference for massive JSON datasets. Working paper or preprint (2018). https://hal.archives-ouvertes.fr/hal-01960464/
  6. Benzaken, V., Castagna, G., Colazzo, D., Nguyên, K.: Type-based XML projection. In: VLDB ’06, pp. 271–282 (2006)
  7. Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: VLDB ’06, pp. 115–126 (2006)
  8. Beyer, K.S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M.Y., Kanne, C., Özcan, F., Shekita, E.J.: Jaql: a scripting language for large scale semistructured data analysis. PVLDB 4(12), 1272–1283 (2011)
  9. Bonetta, D., Brantner, M.: Fad.js: fast JSON data access using JIT-based speculative optimizations. PVLDB 10(12), 1778–1789 (2017)
  10. Bourhis, P., Reutter, J.L., Suárez, F., Vrgoč, D.: JSON: data model, query languages and schema specification. In: PODS ’17, pp. 123–135 (2017)
  11. Bray, T.: The JavaScript object notation (JSON) data interchange format (2014). https://tools.ietf.org/html/rfc7159
  12. Cebiric, S., Goasdoué, F., Manolescu, I.: Query-oriented summarization of RDF graphs. PVLDB 8(12), 2012–2015 (2015)
  13. Ciucanu, R., Staworko, S.: Learning schemas for unordered XML. In: DBPL ’13 (2013)
  14. Colazzo, D., Ghelli, G., Sartiani, C.: Typing massive JSON datasets. In: XLDI ’12, Affiliated with ICFP (2012)
  15. DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: Özcan, F., Koutrika, G., Madden, S. (eds.) SIGMOD ’16, pp. 295–310. ACM (2016)
  16. Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst. 57(4), 1114–1158 (2015)
  17. Garofalakis, M.N., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: a system for extracting document type descriptors from XML documents. In: SIGMOD ’00, pp. 165–176 (2000)
  18. Goldman, R., Widom, J.: Dataguides: enabling query formulation and optimization in semistructured databases. In: VLDB ’97, pp. 436–445 (1997)
  19. http://webia.lip6.fr/~baazizi/rs/js/vj18
  20. JSON schema definition language. http://jsoniq.org/docs/JSound/html-single/
  21. JSON schema language. http://json-schema.org
  22. Labs, T.S.: Studio 3T, 2017. https://studio3t.com
  23. Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J., Kossmann, D.: Mison: a fast JSON parser for data analytics. PVLDB 10(10), 1118–1129 (2017)
  24. Liu, Z.H., Hammerschmidt, B., McMahon, D.: JSON data management: supporting schema-less development in RDBMS. In: SIGMOD ’14, pp. 1247–1258 (2014)
  25. Lohrey, M., Maneth, S., Reh, C.P.: Compression of unordered XML trees. In: ICDT ’17, pp. 18:1–18:17 (2017)
  26. McHugh, J., Widom, J.: Query optimization for XML. In: VLDB ’99, pp. 315–326. Morgan Kaufmann Publishers Inc. (1999)
  27. Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5(4), 660–704 (2005)
  28. Nestorov, S., Abiteboul, S., Motwani, R.: Inferring structure in semistructured data. SIGMOD Rec. 26(4), 39–43 (1997)
  29. Nestorov, S., Abiteboul, S., Motwani, R.: Extracting schema from semistructured data. In: SIGMOD ’98, pp. 295–306 (1998)
  30. Pezoa, F., Reutter, J.L., Suárez, F., Ugarte, M., Vrgoč, D.: Foundations of JSON Schema. In: WWW ’16, pp. 263–273 (2016)
  31. Scherzinger, S., de Almeida, E.C., Cerqueus, T., de Almeida, L.B., Holanda, P.: Finding and fixing type mismatches in the evolution of object-NoSQL mappings. In: Proceedings of the Workshops of the EDBT/ICDT 2016 (2016)
  32. Schmidt, P.: mongodb-schema (2017). https://github.com/mongodb-js/mongodb-schema
  33. scrapinghub. Skinfer (2015). https://github.com/scrapinghub/skinfer
  34. Spark dataframe. https://spark.apache.org/docs/latest/sql-programming-guide.html
  35. The JSON Query Language. http://www.jsoniq.org
  36. Wang, L., Zhang, S., Shi, J., Jiao, L., Hassanzadeh, O., Zou, J., Wang, C.: Schema management for document stores. Proc. VLDB Endow. 8(9), 922–933 (2015)

Author information

Corresponding author

Correspondence to Mohamed-Amine Baazizi.

About this article

Cite this article

Baazizi, MA., Colazzo, D., Ghelli, G. et al. Parametric schema inference for massive JSON datasets. The VLDB Journal 28, 497–521 (2019). https://doi.org/10.1007/s00778-018-0532-7

Keywords

  • JSON
  • Schema inference
  • Map-reduce
  • Spark
  • Big data collections