
A cost model for random access queries in document stores

  • Regular Paper
  • Published in The VLDB Journal

Abstract

Document stores have become one of the key NoSQL storage solutions. They have been widely adopted in different domains due to their ability to store semi-structured data and their expressive query capabilities. However, implementations differ in terms of concrete data storage and retrieval. Unfortunately, a standard framework for data and query optimization for document stores is nonexistent, and only implementation-specific design and query guidelines are used. Hence, the goal of this work is to help automate the data design for document stores based on query costs instead of generic design rules. For this, we define a generic storage and query cost model based on disk access and memory allocation that allows estimating the impact of design decisions. Since all document stores carry out data operations in memory, we first estimate the memory usage by considering characteristics of the stored documents, their access patterns, and memory management algorithms. Then, using this estimation and the metadata storage size, we introduce a cost model for random access queries. We validate our work on two well-known document store implementations: MongoDB and Couchbase. The results show that the memory usage estimates have an average precision of 91% and that the predicted costs are highly correlated with the actual execution times. In the course of this work, we also identified several improvements to document storage systems. Thus, this cost model also contributes to identifying discordance between document store implementations and their theoretical expectations.




Acknowledgements

This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate Information Technologies for Business Intelligence - Doctoral College (IT4BI-DC).

Author information

Correspondence to Moditha Hewasinghage.

Appendices

Calculating internal B-tree blocks

We calculated \(P^{\textit{req}}_d(C)\), the probability that a leaf block of the data B-tree in memory is requested, as \(1-(1-SF(C))^{R_d(C)}\) in Eq. 10. We now continue the calculation for the internal nodes of the data B-tree.

If a document is in the cache, then both the data block containing the document and the internal block containing the reference entry to that data block must be in the cache. Conversely, if an internal block is not in the cache, none of the data blocks pointed to by its reference entries can be in the cache. Therefore, for an internal block not to be in the cache, all of its reference entries must point to blocks that are not in the cache. Since the probability of a single reference entry referring to a leaf block that is not in the cache is \(1-P^{\textit{req}}_d(C)\), and there are \(R_{\textit{int}}(C)\) reference entries in a single internal block, the probability of an internal block in memory being requested can be defined as follows.

$$\begin{aligned} P^{\textit{req}}_{\textit{int}}(C)=1-(1-P^{\textit{req}}_d(C))^{R_{\textit{int}}(C)} \end{aligned}$$
(28)

Moreover, we estimate the number of internal blocks pointing to the leaves, Inter(C), as follows.

$$\begin{aligned} \textit{Inter}(C)=\Bigg \lceil \frac{\lceil \frac{|C|}{R_d(C)}\rceil }{R_{\textit{int}}(C)}\Bigg \rceil \end{aligned}$$
(29)

Finally, we can state the cached internal blocks \(M_{\textit{int}}(C)\) as follows.

$$\begin{aligned} M_{\textit{int}}(C)&=\textit{Inter}(C)*P^{\textit{req}}_{\textit{int}}(C) \end{aligned}$$
(30)
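The three equations above can be checked numerically. The sketch below uses the Test 1 constants from Appendix B.1 (\(|C|=13\) million, \(R_d(C)=573\)); since the internal-node fanout \(R_{\textit{int}}(C)\) and the leaf-request probability are not listed there, the values 1042 and 0.09 are assumptions for illustration only.

```python
import math

def internal_blocks(n_docs, r_d, r_int, p_req_d):
    """Expected number of cached internal B-tree blocks (Eqs. 28-30)."""
    # Eq. 28: an internal block stays out of the cache only if none of
    # the r_int leaf blocks it references is in the cache.
    p_req_int = 1 - (1 - p_req_d) ** r_int
    # Eq. 29: number of internal blocks pointing to the leaves.
    inter = math.ceil(math.ceil(n_docs / r_d) / r_int)
    # Eq. 30: expected cached internal blocks.
    return inter, inter * p_req_int

# Test 1 constants (|C| = 13M, R_d = 573); R_int = 1042 and
# p_req_d = 0.09 are assumed values for illustration.
inter, m_int = internal_blocks(13_000_000, 573, 1042, 0.09)
```

With these inputs, almost every internal block ends up requested, so nearly all of the 22 internal blocks are expected to be cached; this is why internal nodes contribute little to cache pressure compared to the leaves.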

Cost calculation examples for MongoDB

We present the application of our cost model in MongoDB with two examples. First, a single collection is accessed through the primary index, with the complete set of equations and calculations. Second, we present a real-world use case with only the initial calculation of the inputs, because the complexity and the number of equations grow considerably in such a scenario.

1.1 Single collection with primary index

Let us take a scenario with Test 1 (13 million documents) and an average document size of 40 bytes. First, we calculate the average number of documents and index entries in a block, together with the total number of data and index blocks, as follows (by applying Eqs. 1 and 2).

$$\begin{aligned} \begin{aligned}&|C| = 13*10^6 \quad {\textit{Bsize}}_{d}=32Kb \quad {\textit{Bsize}}_{i}=32Kb\\&{\textit{Size}}_d(C) = 40b \quad {\textit{Size}}_{i_{\_id}}(C) = 22b \quad F=0.7\\&R_d(C) = 0.7\cdot \Big \lfloor \frac{32768}{40}\Big \rfloor = 573\\&R_{{{\_id}}}(C) = 0.7\cdot \Big \lfloor \frac{32768}{22}\Big \rfloor = 1042\\&B_d(C) =\Big \lceil \frac{13*10^6}{ 573}\Big \rceil = 22676 \quad K= 10Mb \quad M=256Mb \\&B_{{{\_id}}}(C)=\Big \lceil \frac{13*10^6 * 1 }{1042}\Big \rceil = 12473 \quad u = 0.80\\ \end{aligned} \end{aligned}$$
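As a sanity check, these block counts can be reproduced with a few lines of Python. Note that the fill-factor products (573.3 and 1042.3) must be kept unrounded for the block counts to match, even though they are printed above as 573 and 1042.

```python
import math

bsize = 32 * 1024            # Bsize_d = Bsize_i = 32 Kb
size_d, size_id = 40, 22     # average document / index entry size (bytes)
F = 0.7                      # fill factor
n = 13_000_000               # |C|

r_d = F * math.floor(bsize / size_d)     # 573.3, printed as 573
r_id = F * math.floor(bsize / size_id)   # 1042.3, printed as 1042
b_d = math.ceil(n / r_d)                 # 22676 data blocks
b_id = math.ceil(n / r_id)               # 12473 primary-index blocks
```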

Since we have only the primary index, \(\textit{Rep}_{\textit{\_id}}=1\), \(P(C,\textit{\_id}) = 0.5\), and \(P(C)=0.5\). Now, applying Eqs. 7–13 together with Eqs. 4 and 19, we come up with the following set of equations.

$$\begin{aligned} \begin{aligned}&{Req}_{\textit{\_id}}(C) = |Q|\cdot 0.5\\&E_{{\_id}}(C) = 13*10^6*\bigg (1-\Big (\frac{13*10^6-1}{13*10^6}\Big )^{\textit{Req}_{\textit{\_id}}(C)}\bigg )\\&SF_{{\_id}}(C) = \frac{E_{\textit{\_id}}(C)}{13*10^6}= \bigg (1-\Big (\frac{13*10^6-1}{13*10^6}\Big )^{\textit{Req}_{\textit{\_id}}(C)}\bigg )\\&SF(C) = SF_{\textit{\_id}}(C) = \bigg (1-\Big (\frac{13*10^6-1}{13*10^6}\Big )^{\textit{Req}_{\textit{\_id}}(C)}\bigg )\\&P^{{req}}_d(C) = 1-(1-SF(C))^{573} \quad \\&P^{{req}}_{\textit{\_id}}(C) = 1-(1-SF_{\textit{\_id}}(C))^{1042}\\&M^{{sat}}_d(C) =22676*P^{\textit{req}}_d(C) \quad M^{{sat}}_{\textit{\_id}}(C) =12473*P^{\textit{req}}_{\textit{\_id}}(C)\\&\quad M^{{sat}}_d(C)*32768 +M^{\textit{sat}}_{\textit{\_id}}(C)* 32768 = ((0.8*256)-10)*1024^2 \end{aligned} \end{aligned}$$
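Such a system can be handed to any numerical solver; since the cached-memory total grows monotonically with \(|Q|\), even a simple bisection suffices. The sketch below uses only the equations reproduced above, with rounded constants, so it illustrates the solution method rather than reproducing the exact \(|Q|\) obtained from the paper's full system.

```python
N = 13_000_000                      # |C|
R_D, R_ID = 573.3, 1042.3           # entries per data / index block
B_D, B_ID = 22676, 12473            # total data / index blocks
BSIZE = 32 * 1024                   # block size in bytes
TARGET = ((0.8 * 256) - 10) * 1024 ** 2   # (u*M - K): usable cache bytes

def cached_bytes(q):
    """Bytes the cached working set occupies after q queries (P(C,_id) = 0.5)."""
    req = q * 0.5                              # Req_id(C)
    sf = 1 - ((N - 1) / N) ** req              # selectivity factor SF(C)
    m_d = B_D * (1 - (1 - sf) ** R_D)          # cached data blocks
    m_id = B_ID * (1 - (1 - sf) ** R_ID)       # cached index blocks
    return (m_d + m_id) * BSIZE

# cached_bytes is monotonically increasing in q, so the saturation
# point is the unique root of cached_bytes(q) - TARGET: bisect it.
lo, hi = 1.0, 1e6
for _ in range(100):
    mid = (lo + hi) / 2
    if cached_bytes(mid) < TARGET:
        lo = mid
    else:
        hi = mid
q_sat = (lo + hi) / 2
```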

By solving the above set of equations, we obtain the values \(|Q| = 4242.85\), \(M^{\textit{sat}}_d(C) = 3038.98\), and \(M^{\textit{sat}}_{\textit{\_id}}(C) =2847.82\). Using these values at the memory saturation point, we can come up with the following set of equations by applying Eqs. 21–27 together with Eqs. 4 and 19.

$$\begin{aligned} \begin{aligned}&P_d^{\textit{in}}(C) = P_d^{\textit{out}}(C) \quad P_{\textit{\_id}}^{\textit{in}}(C) = P_{\textit{\_id}}^{\textit{out}}(C) \\&{\textit{Shots}}_d^{\textit{in}}(C)=3038.98 \cdot \frac{M_d(C)}{22676} \\&{\textit{Shots}}_{\textit{\_id}}^{\textit{in}}(C)= 2847.82\cdot \frac{M_{\textit{\_id}}(C)}{24962} \\&E_d^{}(C)= M_d(C)-{\textit{Shots}}_d^{\textit{in}}(C) \\&E_{\textit{\_id}}^{}(C)= M_{\textit{\_id}}(C)-{\textit{Shots}}_{\textit{\_id}}^{\textit{in}}(C) \\&W_d(C) = \frac{M_d(C)\cdot 32768}{((0.8*256)-10)*1024^2} \\&W_{\_id}(C) = \frac{M_{\_id}(C)\cdot 32768}{((0.8*256)-10)*1024^2} \\&P_d^{\textit{out}}(C)=\frac{W_d(C) \cdot \frac{E_d^{}(C)}{M_d(C)}}{W_d(C) \cdot \frac{E_d^{}(C)}{M_d(C)} + W_{\textit{\_id}}(C) \cdot \frac{E_{\textit{\_id}}^{}(C)}{M_{\textit{\_id}}(C)}} \\&P_{\textit{\_id}}^{\textit{out}}(C)=\frac{W_{\textit{\_id}}(C) \cdot \frac{E_{\textit{\_id}}^{}(C)}{M_{\textit{\_id}}(C)}}{W_d(C) \cdot \frac{E_d^{}(C)}{M_d(C)} + W_{\textit{\_id}}(C) \cdot \frac{E_{\textit{\_id}}^{}(C)}{M_{\textit{\_id}}(C)}} \\&P_d^{\textit{in}}(C) = \frac{3038.98 \cdot (1-0.5)}{3038.98 \cdot (1-0.5)+2847.82 \cdot (1-0.5)} = 0.52\\&P_{\textit{\_id}}^{\textit{in}}(C) = \frac{2847.82 \cdot (1-0.5)}{3038.98 \cdot (1-0.5)+2847.82 \cdot (1-0.5)} = 0.48 \\&M_d(C)* 32768 +M_{\textit{\_id}}(C)\cdot 32768 = ((.8*256)-10)*1024^2 \end{aligned} \end{aligned}$$

By solving the above equations, we obtain \(M_d(C) = 2847.82 \) and \(M_{\textit{\_id}}(C)= 3038.98\). By applying these values on Eqs. 3 and 4, we get a relative cost for a query through \(\textit{\_id}\) as follows.

$$\begin{aligned} \begin{aligned}&P_d(C) = \frac{2847.82}{22676} = 0.12 \quad P_{\textit{\_id}}(C) = \frac{3038.98}{12473} = 0.24 \\&\textit{Cost}_{\textit{Rand}} = 2 - (0.24+0.12) = 1.64 \end{aligned} \end{aligned}$$
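Plugging the solved values back in is a one-liner. Note that keeping the cached fractions unrounded yields a relative cost of about 1.63; the 1.64 above comes from rounding the two fractions to two decimals first.

```python
b_d, b_id = 22676, 12473        # total data / index blocks
m_d, m_id = 2847.82, 3038.98    # cached blocks from the solved system

p_d = m_d / b_d                 # fraction of data blocks cached  (~0.126)
p_id = m_id / b_id              # fraction of index blocks cached (~0.244)
# A query through _id touches one index B-tree and one data B-tree,
# so its expected disk reads are the two miss rates added together.
cost = (1 - p_id) + (1 - p_d)   # = 2 - (p_id + p_d)
```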

1.2 Multiple collections

Let us take a use case of storing the data of authors and their books. Assume that, out of the possible design choices, we chose to keep a reference to the authors inside each of the books (as shown in Listing 2). Moreover, assume that each author has 5 books and each book has 3 authors on average.

[Listing 2: Books documents referencing Authors]
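Listing 2 itself is not reproduced here. The following is a hypothetical sketch of the chosen design, in which each book embeds the \(\textit{\_id}\)s of its authors in an \(A\_ID\) array; all field names other than _id and A_ID are invented for illustration.

```python
# Hypothetical documents for the chosen design: each book references
# its authors through an A_ID array. Only _id and A_ID are taken from
# the text; "name" and "title" are invented field names.
author = {"_id": "a1", "name": "An Example Author"}

book = {
    "_id": "b1",
    "title": "An Example Book",
    "A_ID": ["a1", "a2", "a3"],   # three authors per book on average
}
```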

In this scenario, we have two collections and three indexes (a primary index on each of the two collections and a secondary index on \(A\_ID\) in Books). We calculate the number of documents/index entries per block and the total number of blocks for each of these five B-trees as follows.

$$\begin{aligned} \begin{aligned}&|\textsf {Books}| = 4*10^6 \quad |\textsf {Authors}|= 2.5*10^6 \quad \textit{Bsize}_{d/i}=32Kb\\&\textit{Size}_d(\textsf {Books}) = 265b \quad \textit{Size}_d(\textsf {Authors}) = 150b \quad F=0.7\\&\textit{Size}_i(\textsf {Books}/\textsf {Authors}) = 22b \quad R_d(\textsf {Authors})= 152\quad \\&R_d(\textsf {Books})=89 \quad B_d(\textsf {Authors})=16448 \\&B_d(\textsf {Books}) =44944\\&R_{\textsf {\_id}}(\textsf {Books})= R_{\textsf {A\_{ID}}}(\textsf {Books})= R_{\textsf {\_id}}(\textsf {Authors}) =1042\\&B_{\textsf {\_id}}(\textsf {Books})=3843 \quad B_{\textsf {\_id}}(\textsf {Authors}) =2402\\&\textit{Mult}_{{\textsf {A\_ID}}}(\textsf {Books})=5 \quad B_{{\textsf {A\_ID}}}(\textsf {Books})=19213\\ \end{aligned} \end{aligned}$$

The following queries are executed with equal probability (0.25) on our document store.

Q1:

Find the author name by _id

Q2:

Find the book name by _id

Q3:

Find all the book names with a given _id of an author

Q4:

Find all the author names with a given _id of a book

Since our cost model depends on the access probability of each of the B-tree structures, we calculate them as shown in Table 2. In Q3, we have a single access to the secondary index on A_ID and five accesses to the data B-tree. Q4 involves two steps: first, one query retrieves a book through its _id; then, on average, there are 3 A_IDs, which are retrieved as three independent requests through the _id of the Authors collection.

Table 2 Calculating the access probability of the B-trees

Now, we have the final input for our cost model, together with \(\textit{Rep}_{\textsf {A\_ID}} (\textsf {Books})\) by applying Eq. 6, as follows:

$$\begin{aligned} \begin{aligned}&P(\textsf {Book}) = 0.389 \quad P(\textsf {Book},\textsf {\_id}) = 0.111 \quad \\&P(\textsf {Book},\textsf {A\_ID}) = 0.056 \\&P(\textsf {Author}) = 0.222 \quad P(\textsf {Author},\textsf {\_id}) = 0.222\\&\textit{Rep}_{\textsf {A\_ID}} (\textsf {Books})= \frac{5* 2.5*10^6 }{4*10^6 } = 3.125 \end{aligned} \end{aligned}$$
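These inputs follow directly from the per-query B-tree access counts described above and in Table 2; a short script makes the arithmetic explicit.

```python
# Per-query accesses to the five B-trees, in the order:
# (Books data, Books _id, Books A_ID, Authors data, Authors _id)
queries = {
    "Q1": (0, 0, 0, 1, 1),   # author name by _id
    "Q2": (1, 1, 0, 0, 0),   # book name by _id
    "Q3": (5, 0, 1, 0, 0),   # all books of an author (5 on average)
    "Q4": (1, 1, 0, 3, 3),   # all authors of a book (3 on average)
}
w = 0.25                                            # each query equally likely
total = sum(w * sum(v) for v in queries.values())   # expected accesses per query
p = [sum(w * v[i] for v in queries.values()) / total for i in range(5)]

rep_a_id = 5 * 2.5e6 / 4e6   # Eq. 6: Rep_A_ID(Books) = 3.125
```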

Now, we can apply Eqs. 7–13 together with Eqs. 4 and 19 on the inputs to obtain the values for the memory distribution at the saturation point. Then, using these results in Eqs. 21–27 together with Eqs. 4 and 19, we obtain the following final memory distribution. These calculations are similar to the example in Appendix B.1, but we omit them here for brevity.

$$\begin{aligned} \begin{aligned}&M_d(\textsf {Books})= 2532\quad M_{\textsf {\_id}}(\textsf {Books})= 638\quad \\&M_{\textsf {A\_ID}}(\textsf {Books})= 351\quad \\&M_d(\textsf {Authors}) = 1401\quad M_{\textsf {\_id}}(\textsf {Authors}) = 935\quad \end{aligned} \end{aligned}$$

Finally, we calculate the miss rates and the relative cost of each of the queries as follows.

$$\begin{aligned} \begin{aligned} P_d(\textsf {Books})&= \frac{2532}{44944} = 0.056\\ P_{\textsf {\_id}}(\textsf {Books})&= \frac{638}{3843} = 0.166\quad \\ P_{\textsf {A\_ID}}(\textsf {Books})&= \frac{351}{19213} = 0.018\\ P_d(\textsf {Authors})&= \frac{1401}{16448} = 0.08\quad \\ P_{\textsf {\_id}}(\textsf {Authors})&= \frac{935}{2402} = 0.38 \quad \\ \textit{Cost}(Q1)&= 2-(0.38+0.08) = 1.54\\ \textit{Cost}(Q2)&= 2-(0.166+ 0.056) = 1.778\\ \textit{Cost}(Q3)&= 1-0.018 + 5*(1-(0.056)) = 5.702\\ \textit{Cost}(Q4)&= 2-(0.166+ 0.056) + 3*(2-(0.38+0.08)) \\&= 6.398 \end{aligned} \end{aligned}$$
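Recomputing these figures without intermediate rounding is straightforward; the results agree with the values above to within a few hundredths (the small differences come from the truncated cached fractions used above).

```python
# Cached-block counts (numerators) over total blocks (denominators),
# i.e., the fraction of each B-tree resident in memory.
p_bd = 2532 / 44944     # Books data     ~0.056
p_bid = 638 / 3843      # Books _id      ~0.166
p_baid = 351 / 19213    # Books A_ID     ~0.018
p_ad = 1401 / 16448     # Authors data   ~0.085
p_aid = 935 / 2402      # Authors _id    ~0.389

cost_q1 = (1 - p_aid) + (1 - p_ad)            # author by _id
cost_q2 = (1 - p_bid) + (1 - p_bd)            # book by _id
cost_q3 = (1 - p_baid) + 5 * (1 - p_bd)       # 1 index probe + 5 fetches
cost_q4 = cost_q2 + 3 * cost_q1               # book + 3 author lookups
```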


About this article


Cite this article

Hewasinghage, M., Abelló, A., Varga, J. et al. A cost model for random access queries in document stores. The VLDB Journal 30, 559–578 (2021). https://doi.org/10.1007/s00778-021-00660-x

