Parsing gigabytes of JSON per second

Langdale, Geoff; Lemire, Daniel

doi:10.1007/s00778-019-00578-5

Regular Paper
Published: 11 October 2019

Volume 28, pages 941–960, (2019)
The VLDB Journal

Abstract

JavaScript Object Notation or JSON is a ubiquitous data exchange format on the web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.

Notes

https://github.com/microsoft/FishStore.
We simplify this sequence for clarity. Our results are affected by the previous iteration over the preceding 64-byte input, if any. Suppose a single backslash ended the previous 64-byte input; this alters the results of the previous algorithm. We similarly elide the full details of the adjustments for previous loop state in our presentation of subsequent algorithms.
We use the convention that 0b100010000 is the binary value with the fifth and ninth least significant bits set to 1.
Scripts, code, and raw results are available online: https://github.com/lemire/simdjson and https://github.com/lemire/simdjson_experiments_vldb2019.
https://github.com/miloyip/nativejson-benchmark.
https://github.com/chadaustin/sajson/tree/master/testdata.
https://github.com/dropbox/json11.
https://github.com/mikeando/fastjson.
https://github.com/vivkin/gason.
https://github.com/esnme/ujson4c.
https://github.com/zserge/jsmn.
https://github.com/DaveGamble/cJSON.
https://github.com/open-source-parsers/jsoncpp.
https://nlohmann.github.io/json/.
The simdjson library works on 64-bit ARM processors.
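
The bit-numbering convention and the carry between 64-byte inputs mentioned in the notes above can be illustrated with a small scalar sketch. This is not the vectorized routine from the paper: the function name `escaped_mask` and its `carry_in` parameter are hypothetical, and the loop is a plain byte-by-byte scan that assumes only that a byte is escaped when an unmatched backslash immediately precedes it.

```python
# Convention from the notes: 0b100010000 has its fifth and ninth
# least significant bits set (bit indexes 4 and 8 when counting from 0).
value = 0b100010000
set_positions = [i + 1 for i in range(value.bit_length()) if (value >> i) & 1]


def escaped_mask(block, carry_in=False):
    """Scalar illustration (hypothetical helper, not the paper's algorithm).

    Returns (mask, carry_out) for one block of bytes. Bit i of mask is 1
    when byte i is escaped by a preceding backslash. carry_in tells us
    whether the previous block ended with an unmatched backslash.
    """
    mask = 0
    escape_next = carry_in
    for i, byte in enumerate(block):
        if escape_next:
            # This byte is escaped, even if it is itself a backslash.
            mask |= 1 << i
            escape_next = False
        elif byte == 0x5C:  # backslash
            escape_next = True
    return mask, escape_next


# A backslash at the end of one block changes the result for the next block.
first_mask, carry = escaped_mask(b"abc\\")
second_mask, _ = escaped_mask(b'"def', carry)
```

Without the carry, the quote that opens the second block would be wrongly treated as an ordinary, unescaped quote.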

Acknowledgements

The vectorized UTF-8 validation was motivated by a blog post by O. Goffart. K. Willets helped design the current vectorized UTF-8 validation. In particular, he provided the algorithm and code to check that sequences of two, three and four non-ASCII bytes match the leading byte. The authors are grateful to W. Muła for sharing related number parsing code online. The software library has benefited from the contributions of T. Navennec, K. Wolf, T. Kennedy, F. Wessels, G. Fotopoulos, H. N. Gies, E. Gedda, G. Floros, D. Xie, N. Xiao, E. Bogatov, J. Wang, L. F. Peres, W. Bolsterlee, A. Karandikar, R. Urban, T. Dyson, I. Dotsenko, A. Milovidov, C. Liu, S. Gleason, J. Keiser, Z. Bjornson, V. Baranov, I. A. Daza Dillon and others.

The work is supported in part by the Natural Sciences and Engineering Research Council of Canada under grant RGPIN-2017-03910.

Author information

Authors and Affiliations

branchfree.org, Sydney, NSW, Australia
Geoff Langdale
Université du Québec (TELUQ), Montreal, QC, Canada
Daniel Lemire

Corresponding author

Correspondence to Daniel Lemire.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Effect of minification

Table 11 Millions of cycles required to parse and validate selected documents (Skylake), before and after minification

To ease readability, JSON documents may contain a variable number of white space characters between atoms, within objects and arrays. Intuitively, these superfluous characters should reduce parsing speed. To verify this intuition, we minified the documents prior to parsing. We find that all three parsers (simdjson, RapidJSON, sajson) use fewer CPU cycles to parse minified documents (see Table 11). However, the benefits in processing speed are often less than the benefits in storage. For example, the minified apache_builds file is 74% of the original size, yet the processing time is only reduced to between 82% and 90% of the original, depending on the parser. The sajson parser often benefits more from minification. Thus, while simdjson is more than 2.1 times faster than sajson on the original twitter document, it is only 1.9 times faster after minification.
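
As an illustration of what minification does to a document, the following sketch removes optional white space with Python's standard `json` module. The paper does not say which tool was used to minify the test files, so the `minify` helper and the sample document are assumptions. Re-serializing through a parser may also normalize number formatting and drop duplicate keys, which a dedicated minifier would not do.

```python
import json


def minify(text):
    """Re-serialize a JSON document without optional white space.

    Hypothetical helper: it parses and re-emits the document, so it is
    only a stand-in for a dedicated minifier.
    """
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


# A small, made-up document, pretty-printed with four-space indentation.
pretty = json.dumps({"user": {"id": 1, "tags": ["a", "b"]}, "ok": True}, indent=4)
small = minify(pretty)

# Size of the minified document relative to the original.
ratio = len(small) / len(pretty)
```

The `ratio` value plays the role of the storage figure quoted above (74% for apache_builds); the time saved when parsing has to be measured separately.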

Effect of specific optimizations

In a complex task like JSON parsing, no single optimization is likely to make a large difference. Nonetheless, it may be useful to quantify the effect of some optimizations.

We can compute the string masks (indicating the span of the strings) using a single carry-less multiplication between a word containing the location of the quote characters (as 1-bits) and a word containing only ones (see Sect. 3.1.1). Alternatively, we can achieve the same result with a series of shifts and XOR:
Instead of extracting the set bits using our optimized algorithm (see Fig. 6 in Sect. 3.1.4), we can use a naive approach:
Instead of using vectorized classification (Sect. 3.1.2), we can use a more naive approach where we detect the locations of structural characters by doing one vectorized comparison per structural character and doing a bitwise OR. Similarly, we can detect spaces by doing one comparison per allowable white space character and doing a bitwise OR.
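
The three unoptimized alternatives listed above might be sketched as follows. This is a scalar Python illustration, not the authors' code: all function names are hypothetical, Python integers masked to 64 bits stand in for machine words, and `eq_mask` stands in for a single vectorized byte comparison.

```python
MASK64 = (1 << 64) - 1


def string_mask_shift_xor(quotes):
    """Prefix XOR of a 64-bit word using shifts and XOR.

    Bit i of the result is the XOR of bits 0..i of `quotes`, which matches
    the low 64 bits of a carry-less product with an all-ones word.
    """
    mask = quotes & MASK64
    shift = 1
    while shift < 64:
        mask ^= (mask << shift) & MASK64
        shift *= 2
    return mask


def string_mask_reference(quotes):
    """Bit-by-bit reference: toggle an 'inside string' flag at each quote."""
    mask, inside = 0, 0
    for i in range(64):
        inside ^= (quotes >> i) & 1
        mask |= inside << i
    return mask


def naive_set_bits(bits):
    """Naive index extraction: test each of the 64 bit positions in turn."""
    return [i for i in range(64) if (bits >> i) & 1]


def eq_mask(block, byte_value):
    """Stand-in for one vectorized comparison: bit i is set if block[i] matches."""
    mask = 0
    for i, b in enumerate(block):
        if b == byte_value:
            mask |= 1 << i
    return mask


def naive_classify(block):
    """One comparison per character, combined with bitwise OR."""
    structurals = 0
    for ch in b"{}[]:,":  # JSON structural characters
        structurals |= eq_mask(block, ch)
    whitespace = 0
    for ch in b" \t\n\r":  # white space characters allowed by JSON
        whitespace |= eq_mask(block, ch)
    return structurals, whitespace


# Quotes at bit indexes 2 and 7: the mask covers indexes 2 through 6.
example_mask = string_mask_shift_xor(0b10000100)
structurals, whitespace = naive_classify(b'{"a": [1]}')
```

In this sketch the string mask includes the opening quote and excludes the closing one, which follows directly from the prefix XOR definition.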

We present the number of cycles per input byte in Table 12 with the three optimizations disabled one by one. The standard error of our measure is about 0.02 cycles per byte, so small differences (< 0.05 cycles) may not be statistically significant. The fast index extraction reduces the cost of stage 1 by over 10% in several instances (e.g., twitter, update-center). The carry-less multiplication yields gains of over 5% in some instances; the gains reach nearly 20% in one instance (mesh). Vectorized classification is similarly helpful.

Table 12 Performance in cycles per byte of the simdjson parser during stage 1 over several files

Large files

All three fast parsers (simdjson, RapidJSON, and sajson) mostly read and write sequentially in memory. Furthermore, the memory bandwidth of our systems is far higher than our parsing speeds. Thus we can expect them to perform well even when the data does not all fit in cache. To verify that we can still process JSON data at high speed when the input data exceeds the cache, we created a large file (refsnp-unsupported35000) made of the first 35,000 entries from a file describing human single nucleotide variations (refsnp-unsupported.json). Table 13 shows that we far exceed 1 GB per second with simdjson in this instance, although the file does not fit in CPU cache.
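
For readers who want to reproduce this kind of measurement, the sketch below shows how a throughput figure in GB/s can be derived from an input size and a parsing time. It is only an illustration under stated assumptions: it takes 1 GB to mean 10^9 bytes, it times Python's standard `json.loads` rather than any of the parsers discussed here, and the `measure` helper and the synthetic input are hypothetical.

```python
import json
import time


def throughput_gb_per_s(num_bytes, seconds):
    """Throughput in GB/s, assuming 1 GB = 10**9 bytes."""
    return num_bytes / seconds / 1e9


def measure(parse, data, repeat=3):
    """Time parse(data) several times and report the best throughput."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        parse(data)
        best = min(best, time.perf_counter() - start)
    return throughput_gb_per_s(len(data), best)


# Arithmetic check with a made-up timing: an 84 MB input parsed in
# 0.084 s corresponds to 1 GB/s.
one_gb_per_s = throughput_gb_per_s(84_000_000, 0.084)

# Synthetic input (not the refsnp file), parsed with the standard library.
entries = [{"id": i, "name": "rs%d" % i} for i in range(2000)]
data = json.dumps(entries).encode("utf-8")
speed = measure(json.loads, data)
```

Taking the best of several runs reduces the influence of one-off delays; the numbers obtained this way say nothing about simdjson itself.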

Table 13 Throughput in GB/s when parsing a large (84 MB) file: refsnp-unsupported35000

About this article

Cite this article

Langdale, G., Lemire, D. Parsing gigabytes of JSON per second. The VLDB Journal 28, 941–960 (2019). https://doi.org/10.1007/s00778-019-00578-5

Received: 20 February 2019
Revised: 12 August 2019
Accepted: 27 September 2019
Published: 11 October 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s00778-019-00578-5
