Abstract
Document stores have become one of the key NoSQL storage solutions. They have been widely adopted across domains thanks to their ability to store semi-structured data and their expressive query capabilities. However, implementations differ in how data are concretely stored and retrieved. Unfortunately, there is no standard framework for data design and query optimization in document stores; only implementation-specific design and query guidelines exist. Hence, the goal of this work is to help automate data design for document stores based on query costs rather than generic design rules. To this end, we define a generic storage and query cost model based on disk access and memory allocation that allows estimating the impact of design decisions. Since all document stores carry out data operations in memory, we first estimate memory usage by considering the characteristics of the stored documents, their access patterns, and memory management algorithms. Then, using this estimate and the metadata storage size, we introduce a cost model for random access queries. We validate our work on two well-known document store implementations: MongoDB and Couchbase. The results show that the memory usage estimates have an average precision of 91% and that the predicted costs are highly correlated with the actual execution times. In the course of this work, we were also able to suggest several improvements to document storage systems. Thus, the cost model also contributes to identifying discordance between document store implementations and their theoretical expectations.
Notes
A demonstration of DocDesign is available at https://www.essi.upc.edu/dtim/tools/DocDesign.
http://source.wiredtiger.com/3.2.1/tune_cache.html (eviction_target).
The data generation and the experimental setup can be found at https://github.com/modithah/MongoExperiments.
Acknowledgements
This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate Information Technologies for Business Intelligence - Doctoral College (IT4BI-DC).
Appendices
Appendix A: Calculating internal B-tree blocks
We calculated the probability \(P^{\textit{req}}_d(C)\) that a leaf block of the data B-tree held in memory is requested as \(1-(1-SF(C))^{R_d(C)}\) in Eq. 10. We now continue with the corresponding calculation for the internal nodes of the data B-tree.
If a document is in the cache, then both the data block containing it and the internal block holding the reference entry to that data block must also be in the cache. Conversely, if an internal block is not in the cache, none of the data blocks pointed to by its reference entries can be in the cache; that is, for an internal block to be absent from the cache, all of its reference entries must point to blocks that are not cached. Since the probability that a single reference entry points to a leaf block not in the cache is \(1-P^{\textit{req}}_d(C)\), and a single internal block holds \(R_{\textit{int}}(C)\) reference entries, the probability of an internal block in memory being requested can be defined as
\(P^{\textit{req}}_{\textit{int}}(C) = 1-\big(1-P^{\textit{req}}_d(C)\big)^{R_{\textit{int}}(C)}\).
Moreover, we estimate the number of internal blocks pointing to the leaves, Inter(C), as follows.
Finally, we can express the number of cached internal blocks \(M_{\textit{int}}(C)\) as follows.
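The derivation above can be sketched in Python. The leaf and internal request probabilities follow the formulas given in the text; the forms used for \(\textit{Inter}(C)\) and \(M_{\textit{int}}(C)\), whose equations appear as displayed formulas in the original, are one plausible reading (one internal level of \(\lceil \textit{leaf blocks}/R_{\textit{int}}\rceil\) blocks, of which the requested fraction is cached) and should be taken as an illustration rather than the paper's exact equations.

```python
import math

def p_req_leaf(sf: float, r_d: int) -> float:
    """Probability that a cached leaf block of the data B-tree is
    requested: 1 - (1 - SF(C))^{R_d(C)}  (Eq. 10)."""
    return 1.0 - (1.0 - sf) ** r_d

def p_req_internal(p_leaf: float, r_int: int) -> float:
    """An internal block goes unrequested only if all of its
    R_int(C) reference entries point to unrequested leaf blocks."""
    return 1.0 - (1.0 - p_leaf) ** r_int

def cached_internal_blocks(leaf_blocks: int, r_int: int, p_int: float) -> float:
    """Hypothetical reading of Inter(C) and M_int(C): the internal
    level holds ceil(leaf_blocks / R_int) blocks, of which the
    requested fraction stays cached."""
    inter = math.ceil(leaf_blocks / r_int)
    return inter * p_int

# Example with illustrative numbers (not taken from the paper):
p_leaf = p_req_leaf(sf=0.001, r_d=800)     # roughly 0.55
p_int = p_req_internal(p_leaf, r_int=100)  # very close to 1
m_int = cached_internal_blocks(16_000, 100, p_int)
```

Note how quickly \(P^{\textit{req}}_{\textit{int}}(C)\) approaches 1: with even a moderate leaf request probability, an internal block with a hundred reference entries is almost certainly requested, which matches the intuition that upper B-tree levels stay cached.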
Appendix B: Cost calculation examples for MongoDB
We present the application of our cost model in MongoDB with two examples. First, a single collection is accessed through the primary index, with the complete set of equations and calculations. Second, we present a real-world use case with only the initial calculation of the inputs, because the complexity and the number of equations increase considerably in that scenario.
B.1 Single collection with primary index
Let us take a scenario with Test 1 (13 million documents) and an average document size of 40 bytes. First, we calculate the average number of documents and index entries in a block, together with the total number of data and index blocks, as follows (by applying Eqs. 1 and 2).
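This step can be sketched as a short script. The 32 KiB block size and the 64-byte index entry size below are illustrative assumptions, not values from the paper, so the resulting counts only show the shape of Eqs. 1 and 2, not the numbers used in the example.

```python
import math

# Assumed inputs: 13 million documents of ~40 bytes each; the block
# and index entry sizes are illustrative assumptions.
NUM_DOCS = 13_000_000
DOC_SIZE = 40           # bytes, average
BLOCK_SIZE = 32 * 1024  # bytes (assumed)
INDEX_ENTRY_SIZE = 64   # bytes (assumed)

# Sketch of Eqs. 1 and 2: entries per block, then total blocks.
r_d = BLOCK_SIZE // DOC_SIZE           # documents per data block
r_id = BLOCK_SIZE // INDEX_ENTRY_SIZE  # entries per index block
b_d = math.ceil(NUM_DOCS / r_d)        # total data blocks
b_id = math.ceil(NUM_DOCS / r_id)      # total index blocks
```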
Since we have only the primary index, \(\textit{Rep}_{\textit{\_id}}=1\), \(P(C,\textit{\_id}) = 0.5\), and \(P(C)=0.5\). Now, applying Eqs. 7–13 together with Eqs. 4 and 19, we come up with the following set of equations.
By solving the above set of equations, we obtain the values \(|Q| = 4242.85\), \(M^{\textit{sat}}_d(C) = 3038.98\), and \(M^{\textit{sat}}_{\textit{\_id}}(C) =2847.82\). Using these values at the memory saturation point, we obtain the following set of equations by applying Eqs. 21–27 together with Eqs. 4 and 19.
By solving the above equations, we obtain \(M_d(C) = 2847.82 \) and \(M_{\textit{\_id}}(C)= 3038.98\). Applying these values to Eqs. 3 and 4, we get the relative cost of a query through \(\textit{\_id}\) as follows.
B.2 Multiple collections
Let us take a use case of storing the data of authors and their books. Assume that, out of the possible design choices, we chose to keep a reference to the authors inside each of the books (as shown in Listing 2). Moreover, let us also assume that, on average, each author has 5 books and each book has 3 authors.
In this scenario, we have two collections and three indexes (two primary indexes, one on each collection, and one secondary index on \(A\_ID\) in Books). We calculate the number of documents/index entries per block and the total number of blocks for each of these five B-trees as follows.
The following queries are executed with equal probability (0.25) on our document store.
- Q1: Find the author name by _id
- Q2: Find the book name by _id
- Q3: Find all the book names with a given _id of an author
- Q4: Find all the author names with a given _id of a book
Since our cost model depends on the access probability of each of the B-tree structures, we calculate them as shown in Table 2. In Q3, we have a single access to the secondary index on A_ID and five accesses to the data B-tree. Q4 involves two queries: first, one to retrieve a book through its _id; then, on average, there are 3 A_IDs, which need to be retrieved as three independent requests through _id on the authors collection.
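The access counts described above can be turned into per-B-tree probabilities with a short script. The per-query access counts follow the description in the text; the normalization into probabilities is our assumption about how Table 2 is derived.

```python
# Sketch of the access-probability computation behind Table 2.
QUERY_PROB = 0.25  # each query is equally likely

# Accesses per execution, per B-tree, in the order:
# (authors_pk, authors_data, books_pk, books_data, books_secondary_A_ID)
ACCESSES = {
    "Q1": (1, 1, 0, 0, 0),  # author name by _id
    "Q2": (0, 0, 1, 1, 0),  # book name by _id
    "Q3": (0, 0, 0, 5, 1),  # books of an author via secondary A_ID
    "Q4": (3, 3, 1, 1, 0),  # authors of a book: 1 book lookup + 3 author lookups
}

# Expected accesses per B-tree under the query mix, then normalized.
expected = [QUERY_PROB * sum(acc[i] for acc in ACCESSES.values())
            for i in range(5)]
total = sum(expected)
access_prob = [e / total for e in expected]
```

Under these counts, the books data B-tree receives the largest share of accesses, driven by the five document fetches of Q3.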
Now, we have the final inputs to our cost model, together with \(\textit{Rep}_{\textsf {A\_ID}} (\textsf {Books})\) obtained by applying Eq. 6, as follows:
Now, we can apply Eqs. 7–13 together with Eqs. 4 and 19 to these inputs to obtain the memory distribution at the saturation point. Then, using these results in Eqs. 21–27 together with Eqs. 4 and 19, we obtain the following final memory distribution. These calculations are similar to the example in Appendix B.1, but we omit them here for brevity.
Finally, we calculate the miss rates and the relative cost of each of the queries as follows.
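As a hedged sketch of this final step (the concrete equations appear as displayed formulas in the original), one can take the miss rate of a B-tree as the fraction of its blocks not held in cache, and a query's relative cost as its per-B-tree accesses weighted by those miss rates. This is our reading of the role of Eqs. 3 and 4, under simple assumptions, not the paper's exact equations.

```python
def miss_rate(cached_blocks: float, total_blocks: float) -> float:
    """Fraction of a B-tree's blocks that must be read from disk."""
    return 1.0 - cached_blocks / total_blocks

def relative_cost(accesses: dict, rates: dict) -> float:
    """accesses: B-tree name -> accesses per query execution;
    rates: B-tree name -> miss rate of that B-tree."""
    return sum(n * rates[tree] for tree, n in accesses.items())

# Illustrative miss rates (not from the paper):
rates = {"authors_pk": 0.2, "authors_data": 0.4,
         "books_pk": 0.3, "books_data": 0.6, "books_secondary": 0.5}

# Q1 touches the authors primary index and data B-tree once each;
# Q3 touches the secondary index once and the books data B-tree 5 times.
q1_cost = relative_cost({"authors_pk": 1, "authors_data": 1}, rates)
q3_cost = relative_cost({"books_secondary": 1, "books_data": 5}, rates)
```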
Hewasinghage, M., Abelló, A., Varga, J. et al. A cost model for random access queries in document stores. The VLDB Journal 30, 559–578 (2021). https://doi.org/10.1007/s00778-021-00660-x