Parallel query processing in a polystore

Abstract

The proliferation of data stores has made polystores a major topic in the cloud and big data landscape. As data volumes grow rapidly, it becomes critical to exploit the inherent parallel processing capabilities of the underlying data stores and data processing platforms. To fully achieve this, a polystore should: (i) preserve the expressivity of each data store’s native query or scripting language and (ii) leverage a distributed architecture to enable parallel data integration, i.e., joins, on top of parallel retrieval of the underlying partitioned datasets. In this paper, we address these points by: (i) using the polyglot approach of the CloudMdsQL query language, which allows native queries to be expressed as inline scripts and combined with SQL statements for ad hoc integration, and (ii) incorporating the approach within the LeanXcale distributed query engine, thus allowing native scripts to be processed in parallel at data store shards. In addition, (iii) efficient optimization techniques, such as bind join, can be applied to improve the performance of selective joins. Our experimental validation evaluates the performance benefits of exploiting parallelism in combination with high expressivity and optimization.
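To make the bind join idea concrete, the following is a minimal single-process sketch in Python. It is not the CloudMdsQL or LeanXcale implementation; the function names, the `ORDERS` data, and the `fetch_orders_by_customer` wrapper are hypothetical and exist only for illustration. The sketch assumes the left input is small and selective, so that its distinct join keys can be pushed down as a filter on the right-hand data store before the join is computed.

```python
from collections import defaultdict


def bind_join(left_rows, fetch_right, left_key, right_key):
    """Join left_rows with a remote dataset, retrieving only matching right rows."""
    # Step 1: collect the distinct join-key values of the (selective) left side.
    keys = {row[left_key] for row in left_rows}
    if not keys:
        return []
    # Step 2: push the keys down to the right-hand store as a filter,
    # so that non-matching rows are never transferred.
    right_rows = fetch_right(keys)
    # Step 3: finish with an ordinary in-memory hash join.
    index = defaultdict(list)
    for r in right_rows:
        index[r[right_key]].append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[left_key], [])]


# Hypothetical stand-in for a table held in a remote data store.
ORDERS = [
    {"order_id": 1, "cust_id": 10, "total": 25.0},
    {"order_id": 2, "cust_id": 20, "total": 40.0},
    {"order_id": 3, "cust_id": 10, "total": 15.5},
    {"order_id": 4, "cust_id": 40, "total": 9.0},
]


def fetch_orders_by_customer(cust_ids):
    # Assumption: a real wrapper would rewrite the native query to include
    # a predicate on the pushed-down keys; here we simply filter a list.
    return [o for o in ORDERS if o["cust_id"] in cust_ids]


customers = [{"cust_id": 10, "name": "Ada"}, {"cust_id": 30, "name": "Lin"}]
result = bind_join(customers, fetch_orders_by_customer, "cust_id", "cust_id")
for row in result:
    print(row)
```

In this toy run only the two orders of customer 10 are retrieved from the right-hand side, instead of all four. In a parallel setting, one would expect each worker to apply the same key filter to its own shard, but that part is outside the scope of this sketch.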

Notes

  1. http://www.leanxcale.com.

  2. https://livy.apache.org.

  3. http://www.grid5000.fr.

Acknowledgements

This research has been partially funded by the European Union's Horizon 2020 Programme, project BigDataStack (Grant 779747), project INFINITECH (Grant 856632), project PolicyCLOUD (Grant 870675), by the Madrid Regional Council, FSE and FEDER, project EDGEDATA (P2018/TCS-4499), CLOUDDB project TIN2016-80350-P (MINECO/FEDER, UE), and industrial doctorate grant for Pavlos Kranas (IND2017/TIC-7829). Prof. Jose Pereira, Ricardo Vilaça, and Rui Gonçalves contributed to this work when they were with LeanXcale.

Author information

Corresponding author

Correspondence to Boyan Kolev.

About this article

Cite this article

Kranas, P., Kolev, B., Levchenko, O. et al. Parallel query processing in a polystore. Distrib Parallel Databases 39, 939–977 (2021). https://doi.org/10.1007/s10619-021-07322-5

  • DOI: https://doi.org/10.1007/s10619-021-07322-5

Keywords

  • Database integration
  • Heterogeneous databases
  • Distributed and parallel databases
  • Polystores
  • Query languages
  • Query processing