NoSQL database systems: a survey and decision guidance

Abstract

Today, data is generated and consumed at unprecedented scale. This has led to novel approaches for scalable data management, subsumed under the term “NoSQL” database systems, to handle the ever-increasing data volume and request loads. However, the heterogeneity and diversity of the numerous existing systems impede the well-informed selection of a data store appropriate for a given application context. Therefore, this article gives a top-down overview of the field: instead of contrasting the implementation specifics of individual representatives, we propose a comparative classification model that relates functional and non-functional requirements to techniques and algorithms employed in NoSQL databases. This NoSQL Toolbox allows us to derive a simple decision tree to help practitioners and researchers filter potential system candidates based on central application requirements.

Notes

  1. The JavaScript Object Notation (JSON) is a standard format consisting of nested attribute-value pairs and lists.

  2. In some systems (e.g. Bigtable and HBase), multi-versioning is implemented by adding a timestamp as a third-level key.

  3. ACID [23]: Atomicity, Consistency, Isolation, Durability.

  4. BASE [42]: Basically Available, Soft-state, Eventually consistent.

  5. The FLP theorem states that in a distributed system with asynchronous message delivery, no algorithm can guarantee to reach consensus among the participating nodes if one or more of them can fail by stopping.

  6. A read/write register is a data structure with only two operations: setting a specific value (set) and returning the latest value that was set (get).

  7. Therefore, consensus, as used for coordination in many NoSQL systems either natively [4] or through coordination services like Chubby and Zookeeper [28], is even harder to achieve with high availability than strong consistency [19].

  8. Low-end hardware is used because it is substantially more cost-efficient than high-end hardware [27, Sect. 3.1].

  9. Currently, only RethinkDB can perform general \(\theta\)-joins. MongoDB’s aggregation framework supports left-outer equi-joins, and CouchDB allows joins for pre-declared map-reduce views.

  10. An alternative to MapReduce are generalized data processing pipelines, where the database tries to optimize the flow of data and locality of computation based on a more declarative query language (e.g. MongoDB’s aggregation framework).
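The multi-versioning scheme mentioned in Note 2 can be pictured as a three-level map: row key, then column key, then timestamp. The following is only a minimal in-memory sketch under that assumption. The class `VersionedTable`, its methods `put` and `get`, and the parameter `as_of` are hypothetical names chosen for illustration; they are not the API of Bigtable, HBase, or any other system.

```python
from collections import defaultdict
from typing import Any, Dict, Optional


class VersionedTable:
    """Toy table (hypothetical): row key -> column key -> timestamp -> value."""

    def __init__(self) -> None:
        # Three nested levels; the timestamp is the third-level key.
        self._rows: Dict[str, Dict[str, Dict[int, Any]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row: str, column: str, value: Any, timestamp: int) -> None:
        """Store a new version instead of overwriting the previous one."""
        self._rows[row][column][timestamp] = value

    def get(self, row: str, column: str, as_of: Optional[int] = None) -> Any:
        """Return the newest version, or the newest one not later than as_of."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if as_of is None or ts <= as_of]
        if not candidates:
            return None
        return versions[max(candidates)]


table = VersionedTable()
table.put("user1", "email", "a@example.com", timestamp=1)
table.put("user1", "email", "b@example.com", timestamp=5)
latest = table.get("user1", "email")             # newest version
earlier = table.get("user1", "email", as_of=3)   # version visible at time 3
```

Because every write adds a new timestamped entry, older versions stay readable, which is what makes reads "as of" an earlier point in time possible in this sketch.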

References

  1. Abadi D (2012) Consistency tradeoffs in modern distributed database system design: cap is only part of the story. Computer 45(2):37–42

  2. Attiya H, Bar-Noy A et al (1995) Sharing memory robustly in message-passing systems. J ACM 42(1)

  3. Bailis P, Kingsbury K (2014) The network is reliable. Commun ACM 57(9):48–55

  4. Baker J, Bond C, Corbett JC et al (2011) Megastore: providing scalable, highly available storage for interactive services. In: CIDR, pp 223–234

  5. Bernstein PA, Cseri I, Dani N et al (2011) Adapting microsoft sql server for cloud computing. In: 27th ICDE, IEEE, pp 1255–1263

  6. Boykin O, Ritchie S, O’Connell I, Lin J (2014) Summingbird: a framework for integrating batch and online mapreduce computations. VLDB 7(13)

  7. Brewer EA (2000) Towards robust distributed systems

  8. Calder B, Wang J, Ogus A et al (2011) Windows azure storage: a highly available cloud storage service with strong consistency. In: 23rd SOSP. ACM

  9. Chang F, Dean J, Ghemawat S et al (2006) Bigtable: a distributed storage system for structured data. In: 7th OSDI, USENIX Association, pp 15–15

  10. Charron-Bost B, Pedone F, Schiper A (2010) Replication: theory and practice, lecture notes in computer science, vol. 5959. Springer

  11. Cooper BF, Ramakrishnan R, Srivastava U et al (2008) Pnuts: Yahoo!’s hosted data serving platform. Proc VLDB Endow 1(2):1277–1288

  12. Corbett JC, Dean J, Epstein M et al (2012) Spanner: Google’s globally-distributed database. In: Proceedings of OSDI, USENIX Association, pp 251–264

  13. Curino C, Jones E, Popa RA et al (2011) Relational cloud: a database service for the cloud. In: 5th CIDR

  14. Das S, Agrawal D, El Abbadi A et al (2010) G-store: a scalable data store for transactional multi key access in the cloud. In: 1st SoCC, ACM, pp 163–174

  15. Davidson SB, Garcia-Molina H, Skeen D et al (1985) Consistency in a partitioned network: a survey. ACM Comput Surv 17(3):341–370

  16. Dean J (2009) Designs, lessons and advice from building large distributed systems. Keynote talk at LADIS 2009

  17. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1)

  18. DeCandia G, Hastorun D et al (2007) Dynamo: amazon’s highly available key-value store. In: 21st SOSP, ACM, pp 205–220

  19. Fischer MJ, Lynch NA, Paterson MS (1985) Impossibility of distributed consensus with one faulty process. J ACM 32(2):374–382

  20. Gessert F, Schaarschmidt M, Wingerath W, Friedrich S, Ritter N (2015) The cache sketch: Revisiting expiration-based caching in the age of cloud data management. In: BTW, pp 53–72

  21. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33(2):51–59

  22. Gray J, Helland P (1996) The dangers of replication and a solution. SIGMOD Rec 25(2):173–182

  23. Haerder T, Reuter A (1983) Principles of transaction-oriented database recovery. ACM Comput Surv 15(4):287–317

  24. Hamilton J (2007) On designing and deploying internet-scale services. In: 21st LISA. USENIX Association

  25. Hellerstein JM, Stonebraker M, Hamilton J (2007) Architecture of a database system. Now Publishers Inc

  26. Herlihy MP, Wing JM (1990) Linearizability: a correctness condition for concurrent objects. TOPLAS 12

  27. Hoelzle U, Barroso LA (2009) The Datacenter As a Computer: an introduction to the design of warehouse-scale machines. Morgan and Claypool Publishers

  28. Hunt P, Konar M, Junqueira FP, Reed B (2010) Zookeeper: wait-free coordination for internet-scale systems. In: USENIXATC. USENIX Association

  29. Kallman R, Kimura H, Natkins J et al (2008) H-store: a high-performance, distributed main memory transaction processing system. VLDB Endowment

  30. Karger D, Lehman E, Leighton T et al (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: 29th STOC, ACM

  31. Kleppmann M (2016) Designing data-intensive applications. O’Reilly, to appear

  32. Kraska T, Pang G, Franklin MJ et al (2013) Mdcc: Multi-data center consistency. In: 8th EuroSys, ACM

  33. Kreps J (2014) Questioning the lambda architecture. Accessed: 17 Dec 2015

  34. Lakshman A, Malik P (2010) Cassandra: a decentralized structured storage system. SIGOPS Oper Syst Rev 44(2):35–40

  35. Laney D (2001) 3d data management: Controlling data volume, velocity, and variety. Tech. rep, META Group

  36. Lloyd W, Freedman MJ, Kaminsky M et al (2011) Don’t settle for eventual: scalable causal consistency for wide-area storage with cops. In: 23rd SOSP. ACM

  37. Mahajan P, Alvisi L, Dahlin M et al (2011) Consistency, availability, and convergence. University of Texas at Austin Tech Report 11

  38. Mao Y, Junqueira FP, Marzullo K (2008) Mencius: building efficient replicated state machines for wans. OSDI 8:369–384

  39. Marz N, Warren J (2015) Big data: principles and best practices of scalable realtime data systems. Manning Publications Co

  40. Min C, Kim K, Cho H et al (2012) Sfs: random write considered harmful in solid state drives. In: FAST

  41. Özsu MT, Valduriez P (2011) Principles of distributed database systems. Springer Science & Business Media

  42. Pritchett D (2008) Base: an acid alternative. Queue 6(3):48–55

  43. Qiao L, Surlaker K, Das S et al (2013) On brewing fresh espresso: Linkedin’s distributed data serving platform. In: SIGMOD, ACM, pp 1135–1146

  44. Sadalage PJ, Fowler M (2013) NoSQL distilled: a brief guide to the emerging world of polyglot persistence. Addison-Wesley, Upper Saddle River

  45. Shapiro M, Preguica N, Baquero C et al (2011) A comprehensive study of convergent and commutative replicated data types. Ph.D. thesis, INRIA

  46. Shukla D, Thota S, Raman K et al (2015) Schema-agnostic indexing with azure documentdb. PVLDB 8(12)

  47. Sovran Y, Power R, Aguilera MK, Li J (2011) Transactional storage for geo-replicated systems. In: 23rd SOSP, ACM, pp 385–400

  48. Stonebraker M, Madden S, Abadi DJ et al (2007) The end of an architectural era: (it’s time for a complete rewrite). In: 33rd VLDB, pp 1150–1160

  49. Wiese L et al (2015) Advanced data management: for SQL, NoSQL, cloud and distributed databases. Walter de Gruyter GmbH & Co KG

  50. Zhang H, Chen G et al (2015) In-memory big data management and processing: a survey. TKDE

Author information

Corresponding author

Correspondence to Felix Gessert.

About this article

Cite this article

Gessert, F., Wingerath, W., Friedrich, S. et al. NoSQL database systems: a survey and decision guidance. Comput Sci Res Dev 32, 353–365 (2017). https://doi.org/10.1007/s00450-016-0334-3

Keywords

  • NoSQL
  • Data management
  • Scalability
  • Data models
  • Sharding
  • Replication