
A cost model for random access queries in document stores

  • Regular Paper
  • Published in The VLDB Journal

Abstract

Document stores have become one of the key NoSQL storage solutions. They have been widely adopted in different domains due to their ability to store semi-structured data and their expressive query capabilities. However, implementations differ in terms of concrete data storage and retrieval. Unfortunately, a standard framework for data and query optimization for document stores is nonexistent, and only implementation-specific design and query guidelines are used. Hence, the goal of this work is to help automate the data design for document stores based on query costs instead of generic design rules. For this, we define a generic storage and query cost model based on disk access and memory allocation that allows estimating the impact of design decisions. Since all document stores carry out data operations in memory, we first estimate the memory usage by considering characteristics of the stored documents, their access patterns, and memory management algorithms. Then, using this estimation and the metadata storage size, we introduce a cost model for random access queries. We validate our work on two well-known document store implementations: MongoDB and Couchbase. The results show that the memory usage estimates have an average precision of 91% and that the predicted costs are highly correlated with the actual execution times. In the course of this work, we also identified several improvements to document storage systems. Thus, this cost model also contributes to identifying discordance between document store implementations and their theoretical expectations.




Acknowledgements

This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate Information Technologies for Business Intelligence - Doctoral College (IT4BI-DC).

Author information

Correspondence to Moditha Hewasinghage.

Appendices

Calculating internal B-tree blocks

We calculated \(P^{\textit{req}}_d(C)\), the probability that a leaf block of the data B-tree in memory is requested, as \(1-(1-SF(C))^{R_d(C)}\) in Eq. 10. We now continue the calculation for the internal nodes of the data B-tree.

If a document is in the cache, then both the data block containing the document and the internal block containing the reference entry to that data block must be in the cache. Conversely, if an internal block is not in the cache, none of the data blocks pointed to by its reference entries can be in the cache. Therefore, for an internal block not to be in the cache, all of its reference entries must point to blocks that are not in the cache. Since the probability of a single reference entry referring to a leaf block that is not in the cache is \(1-P^{\textit{req}}_d(C)\), and there are \(R_{\textit{int}}(C)\) reference entries in a single internal block, the probability of an internal block in memory being requested can be defined as follows.

$$\begin{aligned} P^{\textit{req}}_{\textit{int}}(C)=1-(1-P^{\textit{req}}_d(C))^{R_{\textit{int}}(C)} \end{aligned}$$
(28)

Moreover, we estimate the number of internal blocks pointing to the leaves, Inter(C), as follows.

$$\begin{aligned} \textit{Inter}(C)=\Bigg \lceil \frac{\lceil \frac{|C|}{R_d(C)}\rceil }{R_{\textit{int}}(C)}\Bigg \rceil \end{aligned}$$
(29)

Finally, we can state the cached internal blocks \(M_{\textit{int}}(C)\) as follows.

$$\begin{aligned} M_{\textit{int}}(C)&=\textit{Inter}(C)*P^{\textit{req}}_{\textit{int}}(C) \end{aligned}$$
(30)
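The three equations above can be checked numerically. The sketch below uses the Test 1 constants from Appendix B.1 (\(|C|=13\) million, \(R_d(C)=573\)); since the internal-node fanout \(R_{\textit{int}}(C)\) and the leaf-request probability are not listed there, the values 1042 and 0.09 are assumptions for illustration only.

```python
import math

def internal_blocks(n_docs, r_d, r_int, p_req_d):
    """Expected number of cached internal B-tree blocks (Eqs. 28-30)."""
    # Eq. 28: an internal block stays out of the cache only if none of
    # the r_int leaf blocks it references is in the cache.
    p_req_int = 1 - (1 - p_req_d) ** r_int
    # Eq. 29: number of internal blocks pointing to the leaves.
    inter = math.ceil(math.ceil(n_docs / r_d) / r_int)
    # Eq. 30: expected cached internal blocks.
    return inter, inter * p_req_int

# Test 1 constants (|C| = 13M, R_d = 573); R_int = 1042 and
# p_req_d = 0.09 are assumed values for illustration.
inter, m_int = internal_blocks(13_000_000, 573, 1042, 0.09)
```

With these inputs, almost every internal block ends up requested, so nearly all of the 22 internal blocks are expected to be cached; this is why internal nodes contribute little to cache pressure compared to the leaves.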

Cost calculation examples for MongoDB

We present the application of our cost model in MongoDB with two examples. First, a single collection is accessed through the primary index, with the complete set of equations and calculations. Second, we present a real-world use case with only the initial calculation of the inputs, because the complexity and the number of equations grow considerably in such a scenario.

1.1 Single collection with primary index

Let us take a scenario with Test 1 (13 million documents) and an average document size of 40 bytes. First, we calculate the average number of documents and index entries in a block, together with the total number of data and index blocks, as follows (by applying Eqs. 1 and 2).

$$\begin{aligned} \begin{aligned}&|C| = 13*10^6 \quad {\textit{Bsize}}_{d}=32Kb \quad {\textit{Bsize}}_{i}=32Kb\\&{\textit{Size}}_d(C) = 40b \quad {\textit{Size}}_{i_{\_id}}(C) = 22b \quad F=0.7\\&R_d(C) = 0.7\cdot \Big \lfloor \frac{32768}{40}\Big \rfloor = 573\\&R_{{{\_id}}}(C) = 0.7\cdot \Big \lfloor \frac{32768}{22}\Big \rfloor = 1042\\&B_d(C) =\Big \lceil \frac{13*10^6}{ 573}\Big \rceil = 22676 \quad K= 10Mb \quad M=256Mb \\&B_{{{\_id}}}(C)=\Big \lceil \frac{13*10^6 * 1 }{1042}\Big \rceil = 12473 \quad u = 0.80\\ \end{aligned} \end{aligned}$$
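As a sanity check, these block counts can be reproduced with a few lines of Python. Note that the fill-factor products (573.3 and 1042.3) must be kept unrounded for the block counts to match, even though they are printed above as 573 and 1042.

```python
import math

bsize = 32 * 1024            # Bsize_d = Bsize_i = 32 Kb
size_d, size_id = 40, 22     # average document / index entry size (bytes)
F = 0.7                      # fill factor
n = 13_000_000               # |C|

r_d = F * math.floor(bsize / size_d)     # 573.3, printed as 573
r_id = F * math.floor(bsize / size_id)   # 1042.3, printed as 1042
b_d = math.ceil(n / r_d)                 # 22676 data blocks
b_id = math.ceil(n / r_id)               # 12473 primary-index blocks
```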

Since we have only the primary index, \(\textit{Rep}_{\textit{\_id}}=1\), \(P(C,\textit{\_id}) = 0.5\), and \(P(C)=0.5\). Now, applying Eqs. 7–13 together with Eqs. 4 and 19, we come up with the following set of equations.

$$\begin{aligned} \begin{aligned}&{Req}_{\textit{\_id}}(C) = |Q|\cdot 0.5\\&E_{{\_id}}(C) = 13*10^6*\bigg (1-\Big (\frac{13*10^6-1}{13*10^6}\Big )^{\textit{Req}_{\textit{\_id}}(C)}\bigg )\\&SF_{{\_id}}(C) = \frac{E_{\textit{\_id}}(C)}{13*10^6}= \bigg (1-\Big (\frac{13*10^6-1}{13*10^6}\Big )^{\textit{Req}_{\textit{\_id}}(C)}\bigg )\\&SF(C) = SF_{\textit{\_id}}(C) = \bigg (1-\Big (\frac{13*10^6-1}{13*10^6}\Big )^{\textit{Req}_{\textit{\_id}}(C)}\bigg )\\&P^{{req}}_d(C) = 1-(1-SF(C))^{573} \quad \\&P^{{req}}_{\textit{\_id}}(C) = 1-(1-SF_{\textit{\_id}}(C))^{1042}\\&M^{{sat}}_d(C) =22676*P^{\textit{req}}_d(C) \quad M^{{sat}}_{\textit{\_id}}(C) =12473*P^{\textit{req}}_{\textit{\_id}}(C)\\&\quad M^{{sat}}_d(C)*32768 +M^{\textit{sat}}_{\textit{\_id}}(C)* 32768 = ((0.8*256)-10)*1024^2 \end{aligned} \end{aligned}$$
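Such a system can be handed to any numerical solver; since the cached-memory total grows monotonically with \(|Q|\), even a simple bisection suffices. The sketch below uses only the equations reproduced above, with rounded constants, so it illustrates the solution method rather than reproducing the exact \(|Q|\) obtained from the paper's full system.

```python
N = 13_000_000                      # |C|
R_D, R_ID = 573.3, 1042.3           # entries per data / index block
B_D, B_ID = 22676, 12473            # total data / index blocks
BSIZE = 32 * 1024                   # block size in bytes
TARGET = ((0.8 * 256) - 10) * 1024 ** 2   # (u*M - K): usable cache bytes

def cached_bytes(q):
    """Bytes the cached working set occupies after q queries (P(C,_id) = 0.5)."""
    req = q * 0.5                              # Req_id(C)
    sf = 1 - ((N - 1) / N) ** req              # selectivity factor SF(C)
    m_d = B_D * (1 - (1 - sf) ** R_D)          # cached data blocks
    m_id = B_ID * (1 - (1 - sf) ** R_ID)       # cached index blocks
    return (m_d + m_id) * BSIZE

# cached_bytes is monotonically increasing in q, so the saturation
# point is the unique root of cached_bytes(q) - TARGET: bisect it.
lo, hi = 1.0, 1e6
for _ in range(100):
    mid = (lo + hi) / 2
    if cached_bytes(mid) < TARGET:
        lo = mid
    else:
        hi = mid
q_sat = (lo + hi) / 2
```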

By solving the above set of equations, we obtain the values \(|Q| = 4242.85\), \(M^{\textit{sat}}_d(C) = 3038.98\), and \(M^{\textit{sat}}_{\textit{\_id}}(C) =2847.82\). Using these values at the memory saturation point, we can come up with the following set of equations by applying Eqs. 21–27 together with Eqs. 4 and 19.

$$\begin{aligned} \begin{aligned}&P_d^{\textit{in}}(C) = P_d^{\textit{out}}(C) \quad P_{\textit{\_id}}^{\textit{in}}(C) = P_{\textit{\_id}}^{\textit{out}}(C) \\&{\textit{Shots}}_d^{\textit{in}}(C)=3038.98 \cdot \frac{M_d(C)}{22676} \\&{\textit{Shots}}_{\textit{\_id}}^{\textit{in}}(C)= 2847.82\cdot \frac{M_{\textit{\_id}}(C)}{24962} \\&E_d^{}(C)= M_d(C)-{\textit{Shots}}_d^{\textit{in}}(C) \\&E_{\textit{\_id}}^{}(C)= M_{\textit{\_id}}(C)-{\textit{Shots}}_{\textit{\_id}}^{\textit{in}}(C) \\&W_d(C) = \frac{M_d(C)\cdot 32768}{((0.8*256)-10)*1024^2} \\&W_{\_id}(C) = \frac{M_{\_id}(C)\cdot 32768}{((0.8*256)-10)*1024^2} \\&P_d^{\textit{out}}(C)=\frac{W_d(C) \cdot \frac{E_d^{}(C)}{M_d(C)}}{W_d(C) \cdot \frac{E_d^{}(C)}{M_d(C)} + W_{\textit{\_id}}(C) \cdot \frac{E_{\textit{\_id}}^{}(C)}{M_{\textit{\_id}}(C)}} \\&P_{\textit{\_id}}^{\textit{out}}(C)=\frac{W_{\textit{\_id}}(C) \cdot \frac{E_{\textit{\_id}}^{}(C)}{M_{\textit{\_id}}(C)}}{W_d(C) \cdot \frac{E_d^{}(C)}{M_d(C)} + W_{\textit{\_id}}(C) \cdot \frac{E_{\textit{\_id}}^{}(C)}{M_{\textit{\_id}}(C)}} \\&P_d^{\textit{in}}(C) = \frac{3038.98 \cdot (1-0.5)}{3038.98 \cdot (1-0.5)+2847.82 \cdot (1-0.5)} = 0.52\\&P_{\textit{\_id}}^{\textit{in}}(C) = \frac{2847.82 \cdot (1-0.5)}{3038.98 \cdot (1-0.5)+2847.82 \cdot (1-0.5)} = 0.48 \\&M_d(C)* 32768 +M_{\textit{\_id}}(C)\cdot 32768 = ((.8*256)-10)*1024^2 \end{aligned} \end{aligned}$$

By solving the above equations, we obtain \(M_d(C) = 2847.82 \) and \(M_{\textit{\_id}}(C)= 3038.98\). By applying these values on Eqs. 3 and 4, we get a relative cost for a query through \(\textit{\_id}\) as follows.

$$\begin{aligned} \begin{aligned}&P_d(C) = \frac{2847.82}{22676} = 0.12 \quad P_{\textit{\_id}}(C) = \frac{3038.98}{12473} = 0.24 \\&\textit{Cost}_{\textit{Rand}} = 2 - (0.24+0.12) = 1.64 \end{aligned} \end{aligned}$$
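Plugging the solved values back in is a one-liner. Note that keeping the cached fractions unrounded yields a relative cost of about 1.63; the 1.64 above comes from rounding the two fractions to two decimals first.

```python
b_d, b_id = 22676, 12473        # total data / index blocks
m_d, m_id = 2847.82, 3038.98    # cached blocks from the solved system

p_d = m_d / b_d                 # fraction of data blocks cached  (~0.126)
p_id = m_id / b_id              # fraction of index blocks cached (~0.244)
# A query through _id touches one index B-tree and one data B-tree,
# so its expected disk reads are the two miss rates added together.
cost = (1 - p_id) + (1 - p_d)   # = 2 - (p_id + p_d)
```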

1.2 Multiple collections

Let us take a use case of storing the data of authors and their books. Assume that, out of the possible design choices, we chose to keep a reference to the authors inside each of the books (as shown in Listing 2). Moreover, assume that each author has 5 books and each book has 3 authors on average.

[Listing 2: Books documents referencing Authors]
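Listing 2 itself is not reproduced here. The following is a hypothetical sketch of the chosen design, in which each book embeds the \(\textit{\_id}\)s of its authors in an \(A\_ID\) array; all field names other than _id and A_ID are invented for illustration.

```python
# Hypothetical documents for the chosen design: each book references
# its authors through an A_ID array. Only _id and A_ID are taken from
# the text; "name" and "title" are invented field names.
author = {"_id": "a1", "name": "An Example Author"}

book = {
    "_id": "b1",
    "title": "An Example Book",
    "A_ID": ["a1", "a2", "a3"],   # three authors per book on average
}
```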

In this scenario, we have two collections and three indexes (a primary index on each of the two collections and a secondary index on \(A\_ID\) in Books). We calculate the number of documents/index entries per block and the total number of blocks for each of these five B-trees as follows.

$$\begin{aligned} \begin{aligned}&|\textsf {Books}| = 4*10^6 \quad |\textsf {Authors}|= 2.5*10^6 \quad \textit{Bsize}_{d/i}=32Kb\\&\textit{Size}_d(\textsf {Books}) = 265b \quad \textit{Size}_d(\textsf {Authors}) = 150b \quad F=0.7\\&\textit{Size}_i(\textsf {Books}/\textsf {Authors}) = 22b \quad R_d(\textsf {Authors})= 152\quad \\&R_d(\textsf {Books})=89 \quad B_d(\textsf {Authors})=16448 \\&B_d(\textsf {Books}) =44944\\&R_{\textsf {\_id}}(\textsf {Books})= R_{\textsf {A\_{ID}}}(\textsf {Books})= R_{\textsf {\_id}}(\textsf {Authors}) =1042\\&B_{\textsf {\_id}}(\textsf {Books})=3843 \quad B_{\textsf {\_id}}(\textsf {Authors}) =2402\\&\textit{Mult}_{{\textsf {A\_ID}}}(\textsf {Books})=5 \quad B_{{\textsf {A\_ID}}}(\textsf {Books})=19213\\ \end{aligned} \end{aligned}$$

The following queries are executed with equal probability (0.25) on our document store.

Q1:

Find the author name by _id

Q2:

Find the book name by _id

Q3:

Find all the book names with a given _id of an author

Q4:

Find all the author names with a given _id of a book

Since our cost model depends on the access probability of each of the B-tree structures, we calculate them as shown in Table 2. In Q3, we have a single access to the secondary index on A_ID and five accesses to the data B-tree. Q4 involves two steps: first, one query retrieves a book through its _id; then, on average, there are 3 A_IDs, which are retrieved as three independent requests through the _id of the Authors collection.

Table 2 Calculating the access probability of the B-trees

Now, we have the final input for our cost model, together with \(\textit{Rep}_{\textsf {A\_ID}} (\textsf {Books})\) by applying Eq. 6, as follows:

$$\begin{aligned} \begin{aligned}&P(\textsf {Book}) = 0.389 \quad P(\textsf {Book},\textsf {\_id}) = 0.111 \quad \\&P(\textsf {Book},\textsf {A\_ID}) = 0.056 \\&P(\textsf {Author}) = 0.222 \quad P(\textsf {Author},\textsf {\_id}) = 0.222\\&\textit{Rep}_{\textsf {A\_ID}} (\textsf {Books})= \frac{5* 2.5*10^6 }{4*10^6 } = 3.125 \end{aligned} \end{aligned}$$
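These inputs follow directly from the per-query B-tree access counts described above and in Table 2; a short script makes the arithmetic explicit.

```python
# Per-query accesses to the five B-trees, in the order:
# (Books data, Books _id, Books A_ID, Authors data, Authors _id)
queries = {
    "Q1": (0, 0, 0, 1, 1),   # author name by _id
    "Q2": (1, 1, 0, 0, 0),   # book name by _id
    "Q3": (5, 0, 1, 0, 0),   # all books of an author (5 on average)
    "Q4": (1, 1, 0, 3, 3),   # all authors of a book (3 on average)
}
w = 0.25                                            # each query equally likely
total = sum(w * sum(v) for v in queries.values())   # expected accesses per query
p = [sum(w * v[i] for v in queries.values()) / total for i in range(5)]

rep_a_id = 5 * 2.5e6 / 4e6   # Eq. 6: Rep_A_ID(Books) = 3.125
```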

Now, we can apply Eqs. 7–13 together with Eqs. 4 and 19 on the inputs to obtain the values for the memory distribution at the saturation point. Then, using these results in Eqs. 21–27 together with Eqs. 4 and 19, we obtain the following final memory distribution. These calculations are similar to the example in Appendix B.1, but we omit them here for brevity.

$$\begin{aligned} \begin{aligned}&M_d(\textsf {Books})= 2532\quad M_{\textsf {\_id}}(\textsf {Books})= 638\quad \\&M_{\textsf {A\_ID}}(\textsf {Books})= 351\quad \\&M_d(\textsf {Authors}) = 1401\quad M_{\textsf {\_id}}(\textsf {Authors}) = 935\quad \end{aligned} \end{aligned}$$

Finally, we calculate the miss rates and the relative cost of each of the queries as follows.

$$\begin{aligned} \begin{aligned} P_d(\textsf {Books})&= \frac{2532}{44944} = 0.056\\ P_{\textsf {\_id}}(\textsf {Books})&= \frac{638}{3843} = 0.166\quad \\ P_{\textsf {A\_ID}}(\textsf {Books})&= \frac{351}{19213} = 0.018\\ P_d(\textsf {Authors})&= \frac{1401}{16448} = 0.08\quad \\ P_{\textsf {\_id}}(\textsf {Authors})&= \frac{935}{2402} = 0.38 \quad \\ \textit{Cost}(Q1)&= 2-(0.38+0.08) = 1.54\\ \textit{Cost}(Q2)&= 2-(0.166+ 0.056) = 1.778\\ \textit{Cost}(Q3)&= 1-0.018 + 5*(1-(0.056)) = 5.702\\ \textit{Cost}(Q4)&= 2-(0.166+ 0.056) + 3*(2-(0.38+0.08)) \\&= 6.398 \end{aligned} \end{aligned}$$
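Recomputing these figures without intermediate rounding is straightforward; the results agree with the values above to within a few hundredths (the small differences come from the truncated cached fractions used above).

```python
# Cached-block counts (numerators) over total blocks (denominators),
# i.e., the fraction of each B-tree resident in memory.
p_bd = 2532 / 44944     # Books data     ~0.056
p_bid = 638 / 3843      # Books _id      ~0.166
p_baid = 351 / 19213    # Books A_ID     ~0.018
p_ad = 1401 / 16448     # Authors data   ~0.085
p_aid = 935 / 2402      # Authors _id    ~0.389

cost_q1 = (1 - p_aid) + (1 - p_ad)            # author by _id
cost_q2 = (1 - p_bid) + (1 - p_bd)            # book by _id
cost_q3 = (1 - p_baid) + 5 * (1 - p_bd)       # 1 index probe + 5 fetches
cost_q4 = cost_q2 + 3 * cost_q1               # book + 3 author lookups
```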


About this article


Cite this article

Hewasinghage, M., Abelló, A., Varga, J. et al. A cost model for random access queries in document stores. The VLDB Journal 30, 559–578 (2021). https://doi.org/10.1007/s00778-021-00660-x

