Storage Size Estimation for Schemaless Big Data Applications: A JSON-Based Overview

  • Conference paper
Intelligent Communication and Computational Technologies

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 19)

Abstract

Numerous technologies have been proposed for storing big data on the Cloud platform. However, the choice among these technologies is always application specific. Determining a strong model is a difficult task, which makes it necessary for architects and designers to review the requirements before choosing a solution. This paper presents 14 data models available in the market to choose from. In addition, more than 45 database solutions are available in the market, each of which can be categorized into one of these data models, and each data model is applicable to its own set of use cases (although a few products could not be categorized into any of the 14 data models). The authors found that, when storing schemaless information, the size of the data stored in the database is larger than its original size. Metadata and the physical schema are the two factors responsible for this additional storage requirement. Mathematical models and experimental evaluations show that MongoDB requires many times more storage space than the original size of the data. A storage space estimation equation for JSON-based solutions is suggested, which expresses the storage requirement relative to the space required by CSV as a base. This may be used to approximate the amount of storage space an application requires before buying storage space in the Cloud environment.
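The idea of estimating JSON storage from a CSV baseline can be illustrated with a small Python sketch. It is not the paper's equation, which is not reproduced in this preview. It assumes a simple rule of thumb: a schemaless JSON store repeats every field name in every record, so the estimate is the CSV size plus the key names once per record, ignoring punctuation. The field names, the synthetic rows, and the function names are all hypothetical, and the additional metadata overhead of a real store such as MongoDB is not modelled.

```python
import csv
import io
import json

# Hypothetical field names, loosely modelled on taxi trip records.
FIELDS = ["trip_distance", "passenger_count", "fare_amount"]


def make_rows(n):
    """Build n synthetic records (illustrative data, not from the paper)."""
    return [
        {"trip_distance": i * 0.5, "passenger_count": i % 4 + 1, "fare_amount": 2.5 + i}
        for i in range(n)
    ]


def csv_size(rows, fields):
    """Bytes needed to store the rows as CSV with a single header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def json_lines_size(rows):
    """Bytes needed to store the rows as one compact JSON object per line."""
    return sum(
        len(json.dumps(row, separators=(",", ":")).encode("utf-8")) + 1  # +1 for the newline
        for row in rows
    )


def estimated_json_size(csv_bytes, fields, n_rows):
    """Assumed rough estimate: CSV size plus the key names repeated per record.

    Quotes, colons, braces, and commas are deliberately ignored.
    """
    key_bytes = sum(len(name.encode("utf-8")) for name in fields)
    return csv_bytes + n_rows * key_bytes


if __name__ == "__main__":
    rows = make_rows(1000)
    csv_bytes = csv_size(rows, FIELDS)
    json_bytes = json_lines_size(rows)
    estimate = estimated_json_size(csv_bytes, FIELDS, len(rows))
    print(f"CSV: {csv_bytes} B, JSON: {json_bytes} B, estimate: {estimate} B")
    print(f"JSON/CSV ratio: {json_bytes / csv_bytes:.2f}")
```

Because punctuation is left out, the estimate in this sketch falls between the CSV size and the measured JSON size for the synthetic data; it is meant only as a lower-bound style approximation for capacity planning.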

Notes

  1. MessagePack is a JSON-like format but comparatively smaller in size [22].

  2. We use the term amortize because we do not consider the size of other characters such as commas, carriage returns, spaces for null values, and other special characters.

  3. We do not include commas, other special characters, or null values, since we are only after a rough estimate.

References

  1. Whitehouse, O.: FEA consolidated reference model document (2005)

  2. Codd, E.F.: A relational model of data for large shared data banks. Communications of the ACM 13(6), 377–387 (1970)

  3. Gartner.com: Gartner report

  4. Gibson, G.A., Vitter, J.S., Wilkes, J.: Strategic directions in storage I/O issues in large-scale computing. ACM Computing Surveys (CSUR) 28(4), 779–793 (1996)

  5. Stonebraker, M., Hellerstein, J.: What goes around comes around. Readings in Database Systems 4 (2005)

  6. Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.A., Mankovskii, S.: Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment 5(12), 1724–1735 (2012)

  7. Demirkan, H., Delen, D.: Leveraging the capabilities of service-oriented decision support systems: Putting analytics and big data in cloud. Decision Support Systems 55(1), 412–421 (2013)

  8. Chen, C.P., Zhang, C.Y.: Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences 275, 314–347 (2014)

  9. Kambatla, K., Kollias, G., Kumar, V., Grama, A.: Trends in big data analytics. Journal of Parallel and Distributed Computing 74(7), 2561–2573 (2014)

  10. Ndbcluster size requirement estimator. https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-programs-ndb-size-pl.html, accessed: 2016-09-30

  11. Hardware sizing calculator. https://neo4j.com/hardware-sizing/, accessed: 2016-09-30

  12. Padhy, R.P., Patra, M.R., Satapathy, S.C.: RDBMS to NoSQL: reviewing some next-generation non-relational databases. International Journal of Advanced Engineering Science and Technologies 11(1), 15–30 (2011)

  13. Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools. Addison-Wesley, Boston (1986)

  14. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. pp. 1247–1250. ACM (2008)

  15. World Wide Web Consortium, et al.: JSON-LD 1.0: a JSON-based serialization for linked data (2014)

  16. Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., et al.: The Pfam protein families database. Nucleic Acids Research, gkp985 (2009)

  17. del Alba, L.: Data serialization comparison: JSON, YAML, BSON, MessagePack. https://www.sitepoint.com/data-serialization-comparison-json-yaml-bson-messagepack/, accessed: 2016-09-26

  18. Cook, K.B., Kazan, H., Zuberi, K., Morris, Q., Hughes, T.R.: RBPDB: a database of RNA-binding specificities. Nucleic Acids Research 39(suppl 1), D301–D308 (2011)

  19. Cranford, K.: How to excel with SAS. In: Proceedings of the 28th Annual SCSUG Conference, Austin, Texas, September (2007)

  20. Shafranovich, Y.: Common format and MIME type for comma-separated values (CSV) files (2005)

  21. Sharma, T.C., Jain, M.: Weka approach for comparative study of classification algorithm. International Journal of Advanced Research in Computer and Communication Engineering 2(4), 1925–1931 (2013)

  22. MessagePack. http://msgpack.org/index.html, accessed: 2016-09-26

  23. NYC Taxi and Limousine Commission: TLC yellow taxi trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml, accessed: 2016-09-30

  24. DB-engines.com: DBMS rankings 2017 (2016)


Author information

Corresponding author

Correspondence to Devang Swami.

Copyright information

© 2018 Springer Nature Singapore Pte. Ltd.

About this paper

Cite this paper

Swami, D., Sahoo, B. (2018). Storage Size Estimation for Schemaless Big Data Applications: A JSON-Based Overview. In: Hu, YC., Tiwari, S., Mishra, K., Trivedi, M. (eds) Intelligent Communication and Computational Technologies. Lecture Notes in Networks and Systems, vol 19. Springer, Singapore. https://doi.org/10.1007/978-981-10-5523-2_29

  • DOI: https://doi.org/10.1007/978-981-10-5523-2_29

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-5522-5

  • Online ISBN: 978-981-10-5523-2

  • eBook Packages: Engineering, Engineering (R0)
