Abstract
Numerous technologies have been proposed for storing big data on Cloud platforms. However, the choice among these technologies is always application specific. Selecting a suitable data model is a difficult task, so architects and designers must review the requirements before choosing a solution. This paper presents 14 data models available in the market. More than 45 database solutions are available, each of which can be categorized into one of these data models, and each model applies to its own set of use cases (a few products, however, could not be categorized into any of the 14 data models). The authors found that when schemaless information is stored, the size of the data in the database is larger than its original size. Metadata and the physical schema are the two factors responsible for this additional storage requirement. Mathematical models and experimental evaluations show that MongoDB requires many times more storage space than the original size of the data. A storage space estimation equation for JSON-based solutions is suggested, which expresses the storage requirement using the space required by CSV as a base. It may be used to approximate the storage space an application requires before storage is purchased in the Cloud environment.
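The effect described in the abstract can be illustrated with a small sketch. It is not the estimation equation from the paper, which this abstract does not reproduce; it only assumes that a JSON-style record repeats every key name, whereas CSV writes the names once in a header. The field names, the sample rows, and the functions `csv_size`, `json_lines_size`, and `rough_json_estimate` are all hypothetical, and the rough estimate deliberately ignores quotes, colons, and braces, in the spirit of the rough estimate mentioned in the notes.

```python
import csv
import io
import json


def make_sample_rows(n):
    """Build n made-up records (hypothetical field names and values)."""
    return [
        {
            "vendor_id": i % 2 + 1,
            "passenger_count": i % 4 + 1,
            "trip_distance": round(0.5 * i, 1),
        }
        for i in range(n)
    ]


def csv_size(rows, fieldnames):
    """Bytes needed to write the rows as CSV with a single header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def json_lines_size(rows):
    """Bytes needed to write one compact JSON object per line."""
    return sum(
        len(json.dumps(row, separators=(",", ":")).encode("utf-8")) + 1
        for row in rows
    )


def rough_json_estimate(csv_bytes, n_rows, fieldnames):
    """CSV size plus the key names repeated once per record.

    Assumption: quotes, colons, and braces are ignored, so this
    undercounts the real JSON size for all but very small inputs.
    """
    key_bytes = sum(len(name.encode("utf-8")) for name in fieldnames)
    return csv_bytes + n_rows * key_bytes


rows = make_sample_rows(100)
fields = list(rows[0])
csv_bytes = csv_size(rows, fields)
json_bytes = json_lines_size(rows)
estimate = rough_json_estimate(csv_bytes, len(rows), fields)

print(f"CSV bytes:           {csv_bytes}")
print(f"JSON bytes:          {json_bytes}")
print(f"Rough JSON estimate: {estimate}")
print(f"JSON / CSV ratio:    {json_bytes / csv_bytes:.2f}")
```

The sketch measures plain-text JSON only. MongoDB stores documents as BSON and adds its own storage overhead, so the ratio printed here should not be read as the figure the paper reports.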
Notes
1. MessagePack is a JSON-like format but comparatively smaller in size [22].
2. We use the term amortize because we do not consider the size of other characters such as commas, carriage returns, spaces for null values, and other special characters.
3. We do not include commas, other special characters, or null values, since we are only after a rough estimate.
References
Whitehouse, O.: FEA consolidated reference model document (2005)
Codd, E.F.: A relational model of data for large shared data banks. Communications of the ACM 13(6), 377–387 (1970)
Gartner.com: Gartner report
Gibson, G.A., Vitter, J.S., Wilkes, J.: Strategic directions in storage I/O issues in large-scale computing. ACM Computing Surveys (CSUR) 28(4), 779–793 (1996)
Stonebraker, M., Hellerstein, J.: What goes around comes around. Readings in Database Systems 4 (2005)
Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.A., Mankovskii, S.: Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment 5(12), 1724–1735 (2012)
Demirkan, H., Delen, D.: Leveraging the capabilities of service-oriented decision support systems: Putting analytics and big data in cloud. Decision Support Systems 55(1), 412–421 (2013)
Chen, C.P., Zhang, C.Y.: Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences 275, 314–347 (2014)
Kambatla, K., Kollias, G., Kumar, V., Grama, A.: Trends in big data analytics. Journal of Parallel and Distributed Computing 74(7), 2561–2573 (2014)
NDBCLUSTER size requirement estimator. https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-programs-ndb-size-pl.html, accessed: 2016-09-30
Hardware sizing calculator. https://neo4j.com/hardware-sizing/, accessed: 2016-09-30
Padhy, R.P., Patra, M.R., Satapathy, S.C.: RDBMS to NoSQL: reviewing some next-generation non-relational databases. International Journal of Advanced Engineering Science and Technologies 11(1), 15–30 (2011)
Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools. Addison-Wesley, Boston (1986)
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. pp. 1247–1250. ACM (2008)
World Wide Web Consortium, et al.: JSON-LD 1.0: a JSON-based serialization for linked data (2014)
Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., et al.: The Pfam protein families database. Nucleic Acids Research, gkp985 (2009)
del Alba, L.: Data serialization comparison: JSON, YAML, BSON, MessagePack. https://www.sitepoint.com/data-serialization-comparison-json-yaml-bson-messagepack/, accessed: 2016-09-26
Cook, K.B., Kazan, H., Zuberi, K., Morris, Q., Hughes, T.R.: RBPDB: a database of RNA-binding specificities. Nucleic Acids Research 39(suppl 1), D301–D308 (2011)
Cranford, K.: How to Excel with SAS. In: Proceedings of the 28th Annual SCSUG Conference, Austin, Texas, September (2007)
Shafranovich, Y.: Common format and MIME type for comma-separated values (CSV) files (2005)
Sharma, T.C., Jain, M.: Weka approach for comparative study of classification algorithm. International Journal of Advanced Research in Computer and Communication Engineering 2(4), 1925–1931 (2013)
MessagePack. http://msgpack.org/index.html, accessed: 2016-09-26
NYC Taxi and Limousine Commission: TLC yellow taxi trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml, accessed: 2016-09-30
DB-engines.com: DBMS rankings 2017 (2016)
© 2018 Springer Nature Singapore Pte. Ltd.
About this paper
Cite this paper
Swami, D., Sahoo, B. (2018). Storage Size Estimation for Schemaless Big Data Applications: A JSON-Based Overview. In: Hu, YC., Tiwari, S., Mishra, K., Trivedi, M. (eds) Intelligent Communication and Computational Technologies. Lecture Notes in Networks and Systems, vol 19. Springer, Singapore. https://doi.org/10.1007/978-981-10-5523-2_29
DOI: https://doi.org/10.1007/978-981-10-5523-2_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-5522-5
Online ISBN: 978-981-10-5523-2
eBook Packages: Engineering, Engineering (R0)