Cache-Based Multi-Query Optimization for Data-Intensive Scalable Computing Frameworks

Michiardi, Pietro; Carra, Damiano; Migliorini, Sara

doi:10.1007/s10796-020-09995-2

Published: 04 March 2020

Volume 23, pages 35–51, (2021)
Information Systems Frontiers

Abstract

In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in redundant and wasteful processing, multi-query optimization techniques can be employed to save a considerable amount of cluster resources. In this work, we introduce a novel method combining in-memory cache primitives and multi-query optimization, to improve the efficiency of data-intensive, scalable computing frameworks. By careful selection and exploitation of common (sub)expressions, while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient one which avoids unnecessary recomputations. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Extensive experiments on a prototype implementation of our system show significant benefits of worksharing for both TPC-DS workloads and detailed micro-benchmarks.
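
As a rough illustration of what a multiple-choice knapsack formulation looks like, the following sketch selects at most one candidate per group under a memory budget using dynamic programming. It is not the authors' implementation: the function name `select_candidates`, the grouping of candidates, the integer memory weights, the "saving" values, and the "at most one per group" reading are all assumptions made for this example.

```python
from typing import List, Tuple

# A candidate is (memory_weight, saving). Both fields are hypothetical:
# the weight stands for the memory footprint of a cached expression,
# the saving for the work avoided by caching it.
Candidate = Tuple[int, float]


def select_candidates(groups: List[List[Candidate]],
                      budget: int) -> Tuple[float, List[int]]:
    """Pick at most one candidate per group, maximizing total saving
    while keeping the total weight within the budget.

    Returns (best_saving, choice), where choice[g] is the index picked
    in group g, or -1 if the group is skipped.
    """
    # best[w] = best saving achievable with capacity w over the groups seen so far.
    best = [0.0] * (budget + 1)
    picks: List[List[int]] = []

    for group in groups:
        cur = best[:]                     # default: skip this group
        cur_pick = [-1] * (budget + 1)
        for idx, (weight, saving) in enumerate(group):
            for w in range(weight, budget + 1):
                # Read from `best` (previous groups only), so that at most
                # one candidate of the current group is ever selected.
                value = best[w - weight] + saving
                if value > cur[w]:
                    cur[w] = value
                    cur_pick[w] = idx
        best = cur
        picks.append(cur_pick)

    # Walk back through the groups to recover the chosen candidates.
    choice = [-1] * len(groups)
    w = budget
    for g in range(len(groups) - 1, -1, -1):
        idx = picks[g][w]
        choice[g] = idx
        if idx >= 0:
            w -= groups[g][idx][0]
    return best[budget], choice


# Illustrative numbers only, not taken from the article.
example_groups = [
    [(4, 10.0), (6, 15.0)],
    [(3, 7.0)],
    [(5, 9.0), (2, 4.0)],
]
best_saving, chosen = select_candidates(example_groups, budget=9)
```

With these illustrative numbers, the sketch picks the second candidate of the first group and the only candidate of the second group, and skips the third group, because no combination that includes the third group fits the budget with a higher saving.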

Notes

Our method can be easily extended to share similar join operators, for example by applying the “equivalence classes” approach used in (Zhou et al. 2007). Despite its technical simplicity, our current optimization problem formulation would end up discarding such potential SEs, due to their large memory footprints. Hence, we currently exclude such SEs from consideration.
For the sake of readability, we omit the description of several other optimizations – such as the removal of duplicate predicates – that we have implemented.
In light of the end-to-end MQO process, the last phase amounts to rewriting the queries in the input set to use selected CEs. Such a rewrite can introduce additional work, which we currently neglect in our modeling approach: indeed, query rewriting involves highly selective operations, with low cost. This means we assume the dominating cost to be that of reading from RAM, which we found experimentally to be true.
The source code of our prototype is available as an open-source contribution at https://github.com/DistributedSystemsGroup/spark-sql-worksharing
Note that the operator in Apache Spark is a transformation. As a consequence, it takes effect only upon the first call to an action, with the first (rewritten) query. Thus, the first query effectively “pays the price” for caching.
The attentive reader might have noticed that our method also eventually spills some contents of the cached data to disk. This is explained by two effects: i) Apache Spark dynamically adjusts at runtime the amount of memory dedicated to storing cached data, and thus overrides the 50% setting we use in our experiments; ii) our methodology is based on cardinality estimation to compute the weight of a CE: as a consequence, estimation errors might induce the system to spill some records to disk.
Data compression techniques can be helpful in this case, but we defer their analysis to future work.
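
The note above on caching being a transformation that takes effect only at the first action can be illustrated with a small toy model. This sketch deliberately does not use Apache Spark: the class `LazyCachedDataset` and its methods are hypothetical stand-ins that only mimic the lazy behavior described in the note, so that the example runs without a cluster.

```python
from typing import Callable, List, Optional


class LazyCachedDataset:
    """Toy stand-in for a dataset with lazy caching (not Spark's API)."""

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute
        self._cache_requested = False
        self._cached: Optional[List[int]] = None
        self.compute_calls = 0  # how many times the data was recomputed

    def cache(self) -> "LazyCachedDataset":
        # Transformation: only records the intent to cache; nothing runs here.
        self._cache_requested = True
        return self

    def collect(self) -> List[int]:
        # Action: serves from the cache when it is already materialized.
        if self._cached is not None:
            return list(self._cached)
        # Otherwise compute the rows; the first action after cache()
        # also materializes the cache, so it "pays the price".
        self.compute_calls += 1
        rows = self._compute()
        if self._cache_requested:
            self._cached = list(rows)
        return rows


dataset = LazyCachedDataset(lambda: [x * x for x in range(5)]).cache()
calls_after_cache = dataset.compute_calls   # caching alone triggers no work
first_result = dataset.collect()            # first action computes and caches
second_result = dataset.collect()           # later actions read the cache
calls_after_two_actions = dataset.compute_calls
```

In this toy model, calling `cache()` performs no computation, the first `collect()` computes the rows and fills the cache, and the second `collect()` reuses the cached rows without recomputing them.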

References

Agrawal, S., Chaudhuri, S., & Narasayya, V. R. (2000). Automated selection of materialized views and indexes in sql databases. VLDB, 2000, 496–505.
Agrawal, P., Kifer, D., & Olston, C. (2008). Scheduling shared scans of large data files. Proceedings of the VLDB Endowment, 1(1), 958–969.
Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., et al. (2015). Spark sql: Relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383–1394. ACM.
Arumugam, S., Dobra, A., Jermaine, C.M., Pansare, N., Perez, L. (2010). The datapath system: A data-centric analytic processing engine for large data warehouses. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 519–530. ACM, New York, NY, USA. https://doi.org/10.1145/1807167.1807224.
Azim, T., Karpathiotakis, M., & Ailamaki, A. (2017). Recache: Reactive caching for fast analytics over heterogeneous data. Proceedings of the VLDB Endowment, 11(3).
Baril, X., Bellahsene, Z. (2003). Selection of materialized views: A cost-based approach. In: Advanced Information Systems Engineering, pp. 665–680. Springer.
Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U.A., Pasquin, R. (2011). Incoop: Mapreduce for incremental computations. In: Proceedings of the 2nd ACM Symposium on Cloud Computing, p. 7. ACM.
Candea, G., Polyzotis, N., & Vingralek, R. (2009). A scalable, predictable join operator for highly concurrent data warehouses. Proc. VLDB Endow, 2(1), 277–288. https://doi.org/10.14778/1687627.1687659.
Candea, G., Polyzotis, N., & Vingralek, R. (2011). Predictable performance and high query concurrency for data analytics. The VLDB Journal, 20(2), 227–248. https://doi.org/10.1007/s00778-011-0221-2.
Dalvi, N.N., Sanghai, S.K., Parsan, R., Sudarshan, S. (2001). Pipelining in multi-query optimization. In: ACM PODS, pp. 59–70. ACM.
Databricks: Spark sql performance test (2018). https://github.com/databricks/spark-sql-perf
Dean, J., & Ghemawat, S. (2008). Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Derakhshan, R., Dehne, F.K., Korn, O., Stantic, B. (2006). Simulated annealing for materialized view selection in data warehousing environment. In: Databases and applications, pp. 89–94.
Dursun, K., Binnig, C., Cetintemel, U., Kraska, T. (2017). Revisiting reuse in main memory database systems. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1275–1289. ACM.
Elghandour, I., & Aboulnaga, A. (2012). Restore: Reusing results of mapreduce jobs. Proceedings of the VLDB Endowment, 5(6), 586–597.
El-Helw, A., Raghavan, V., Soliman, M. A., Caragea, G., Gu, Z., & Petropoulos, M. (2015). Optimization of common table expressions in mpp database systems. Proceedings of the VLDB Endowment, 8(12), 1704–1715.
Finkelstein, S. (1982). Common expression analysis in database applications. In: Proceedings of the 1982 ACM SIGMOD international conference on Management of data, pp. 235–245. ACM.
Floratou, A., Megiddo, N., Potti, N., Özcan, F., Kale, U., Schmitz-Hermes, J. (2016). Adaptive caching in big sql using the hdfs cache. In: Proceedings of the Seventh ACM Symposium on Cloud Computing, pp. 321–333. ACM.
Giannikis, G., Alonso, G., & Kossmann, D. (2012). Shareddb: Killing one thousand queries with one stone. Proc. VLDB Endow, 5(6), 526–537. https://doi.org/10.14778/2168651.2168654.
Goldstein, J., Larson, P.Å. (2001). Optimizing queries using materialized views: a practical, scalable solution. In: ACM SIGMOD Record, vol. 30, pp. 331–342. ACM.
Gunda, P. K., Ravindranath, L., Thekkath, C. A., Yu, Y., & Zhuang, L. (2010). Nectar: Automatic management of data and computation in datacenters. OSDI, 10, 1–8.
Harizopoulos, S., Shkapenyuk, V., Ailamaki, A. (2005). Qpipe: A simultaneously pipelined relational query engine. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pp. 383–394. ACM, New York, NY, USA. https://doi.org/10.1145/1066157.1066201.
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D. (2007). Dryad: distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS Operating Systems Review, vol. 41, pp. 59–72. ACM.
Ivanova, M. G., Kersten, M. L., Nes, N. J., & Gonçalves, R. A. (2010). An architecture for recycling intermediates in a column-store. ACM Transactions on Database Systems (TODS), 35(4), 24.
Kalnis, P., Mamoulis, N., & Papadias, D. (2002). View selection using randomized search. Data & Knowledge Engineering, 42(1), 89–111.
Kellerer, H., Pferschy, U., & Pisinger, D. (2004). Introduction to NP-completeness of knapsack problems. Berlin: Springer.
Li, B., Mazur, E., Diao, Y., McGregor, A., Shenoy, P. (2011). A platform for scalable one-pass analytics using mapreduce. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 985–996. ACM.
Li, B., Mazur, E., Diao, Y., McGregor, A., & Shenoy, P. (2012). Scalla: a platform for scalable one-pass analytics using mapreduce. ACM Transactions on Database Systems (TODS), 37(4), 27.
Merkle, R.C. (1980). Protocols for public key cryptosystems. Security and Privacy, IEEE Symposium on p. 122.
Michiardi, P., Carra, D., Migliorini, S. (2019). In-memory caching for multi-query optimization of data-intensive scalable computing workloads. In: Workshops of the EDBT/ICDT Joint Conference, EDBT/ICDT-WS.
Mistry, H., Roy, P., Sudarshan, S., Ramamritham, K. (2001). Materialized view selection and maintenance using multi-query optimization. In: ACM SIGMOD Record, vol. 30, pp. 307–318. ACM.
Nagel, F., Boncz, P., Viglas, S.D. (2013). Recycling in pipelined query evaluation. In: Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pp. 338–349. IEEE.
Nykiel, T., Potamias, M., Mishra, C., Kollios, G., & Koudas, N. (2010). Mrshare2: Sharing across multiple queries in mapreduce. Proc. VLDB Endow, 3(1–2), 494–505. https://doi.org/10.14778/1920841.1920906.
Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.G. (2015). Making sense of performance in data analytics frameworks. In: 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pp. 293–307. USENIX Association.
Psaroudakis, I., Athanassoulis, M., & Ailamaki, A. (2013). Sharing data and work across concurrent analytical queries. Proc. VLDB Endow, 6(9), 637–648. https://doi.org/10.14778/2536360.2536364.
Roy, P., Seshadri, S., Sudarshan, S., Bhobe, S. (2000). Efficient and extensible algorithms for multi query optimization. In: ACM SIGMOD Record, vol. 29, pp. 249–260. ACM.
Sellis, T. K. (1988). Multiple-query optimization. ACM Trans. Database Syst., 13(1), 23–52. https://doi.org/10.1145/42201.42203.
Shim, J., Scheuermann, P., Vingralek, R. (1999). Dynamic caching of query results for decision support systems. In: IEEE SSDBM, SSDBM ‘99, pp. 254–. IEEE.
Silva, Y.N., Larson, P.A., Zhou, J. (2012). Exploiting common subexpressions for cloud query processing. In: Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pp. 1337–1348. IEEE.
Sinha, P., & Zoltners, A. A. (1979). The multiple-choice knapsack problem. Operations Research, 27(3), 503–515.
Wang, G., & Chan, C. Y. (2013). Multi-query optimization in mapreduce framework. Proc. VLDB Endow, 7(3), 145–156. https://doi.org/10.14778/2732232.2732234.
Yang, J., Karlapalem, K., & Li, Q. (1997). Algorithms for materialized view design in data warehousing environment. VLDB, 97, 25–29.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 2–2. USENIX Association.
Zhang, C., Yang, J. (1999). Genetic algorithm for materialized view selection in data warehouse environments. In: DataWarehousing and Knowledge Discovery, pp. 116–125. Springer.
Zhou, J., Larson, P.A., Freytag, J.C., Lehner, W. (2007). Efficient exploitation of similar subexpressions for query processing. In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 533–544. ACM.
Zhu, C., Zhu, Q., Zuzarte, C., & Ma, W. (2016). Optimization of generic progressive queries based on dependency analysis and materialized views. Information Systems Frontiers, 18(1), 205–231.

Acknowledgements

This work was partially supported by the Italian National Group for Scientific Computation (GNCS-INDAM) and by “Progetto di Eccellenza” of the Computer Science Dept., Univ. of Verona, Italy.

Author information

Authors and Affiliations

Data Science Department, Eurecom, Biot, France
Pietro Michiardi
Computer Science Department, University of Verona, Verona, Italy
Damiano Carra & Sara Migliorini

Corresponding author

Correspondence to Damiano Carra.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Michiardi, P., Carra, D. & Migliorini, S. Cache-Based Multi-Query Optimization for Data-Intensive Scalable Computing Frameworks. Inf Syst Front 23, 35–51 (2021). https://doi.org/10.1007/s10796-020-09995-2

Published: 04 March 2020
Issue Date: February 2021
DOI: https://doi.org/10.1007/s10796-020-09995-2
