Abstract
Big Data systems have gained increasing popularity for handling the massive amounts of data that are continuously generated in our digital world. While the Hadoop framework pioneered the area of Big Data processing systems, it has clear performance limitations when processing massive amounts of structured data. In addition, many users of Big Data systems struggle in practice with the APIs and low-level programming abstractions of these systems, and would prefer to use SQL, in which they are more proficient, as a high-level declarative language to express their tasks while leaving the execution optimization details to the backend engine. Consequently, several systems have been designed and implemented to tackle these challenges by providing scalable query execution engines for processing massive structured data behind SQL interfaces. In this article, we present an extensive experimental study of four popular systems in this domain, namely, Apache Hive, Spark SQL, Apache Impala and PrestoDB. In particular, we report and analyze the performance characteristics of these systems using three different benchmarks, namely, TPC-H, TPC-DS and TPCx-BB. Finally, we report a set of insights and important lessons that we learned from conducting our experiments.
References
Abadi, D., Babu, S., Özcan, F., Pandis, I.: Sql-on-hadoop systems: tutorial. Proc. VLDB Endow. 8(12), 2050–2051 (2015)
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. Proc. VLDB Endow. 2(1), 922–933 (2009)
Ammar, K., Özsu, M.T.: Experimental analysis of distributed graph systems. PVLDB 11(10), 1151–1164 (2018)
Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., Zaharia, M.: Spark SQL: Relational Data Processing in Spark. SIGMOD, Chicago (2015)
Cao, P., Gowda, B., Lakshmi, S., Narasimhadevara, C., Nguyen, P., Poelman, J., Poess, M., Rabl, T.: From bigbench to tpcx-bb: standardization of a big data benchmark. In: Technology Conference on Performance Evaluation and Benchmarking, pp. 24–44. Springer (2016)
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache flink: stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Commun. Data Eng. 36(4), 28–38 (2015)
Chen, Y., Qin, X., Bian, H., Chen, J., Dong, Z., Du, X., Gao, Y., Liu, D., Lu, J., Zhang, H.: A study of sql-on-hadoop systems. In: Proceedings of the Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, pp. 154–166. Springer (2014)
Choi, H., Son, J., Yang, H., Ryu, H., Lim, B., Kim, S., Chung, Y.D.: Tajo: A Distributed Data Warehouse System on Large Clusters. ICDE, Oslo (2013)
Dean, J., Ghemawa, S.: MapReduce: Simplified Data Processing on Large Clusters. In: OSDI, pp. 137–150 (2004)
Floratou, A., Özcan, F., Schiefer, B.: Benchmarking sql-on-hadoop systems: Tpc or not tpc? In: Proceedings of the Workshop on Big Data Benchmarks, pp. 63–72, Springer (2014)
Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: Bigbench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1197–1208. ACM (2013)
Ghazal, A., Ivanov, T., Kostamaa, P., Crolotte, A., Voong, R., Al-Kateb, M., Ghazal, W., Zicari, R.V.: Bigbench V2: the new and improved bigbench. In: Proceedings of the 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, 19–22 April 2017, pp. 1225–1236 (2017)
Huai, Y., Chauhan, A., Gates, A., Hagleitner, G., Hanson, E.N., O’Malley, O., Pandey, J., Yuan, Y., Lee, R., Zhang, X.: Major Technical Advancements in Apache Hive. SIGMOD, Chicago (2014)
Ivanov, T., Beer, M.-G.: Performance evaluation of spark sql using bigbench. In: Big Data Benchmarking, pp. 96–116. Springer (2015)
Ivanov, T., Singhal, R.: Abench: Big data architecture stack benchmark. In: Proceedings of the Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE 2018, Berlin, Germany, 09–13 April 2018, pp. 13–16 (2018)
Karimov, J., Rabl, T., Katsifodimos, A., Samarev, R., Heiskanen, H., Markl, V.: Benchmarking distributed stream data processing systems. In: 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, 16–19 April 2018, pp. 1507–1518, (2018)
Kornacker, M., Behm, A., Bittorf, V., Bobrovytsky, T., Ching, C., Choi, A., Erickson, J., Grund, M., Hecht, D., Jacobs, M., Joshi, I., Kuff, L., Kumar, D., Leblang, A., Li, N., Pandis, I., Robinson, H., Rorke, D., Rus, S., Russell, J., Tsirogiannis, D., Wanderman-Milne, S., Yoder, M.: Impala: A Modern. Open-Source SQL Engine for Hadoop. In: Proceedings of the CIDR (2015)
Laney, D.: 3d data management: controlling data volume, velocity and variety. META Group Res. Note 6(70), 1 (2001)
Liu, Y., Guo, S., Hu, S., Rabl, T., Jacobsen, H., Li, J., Wang, J.: Performance evaluation and optimization of multi-dimensional indexes in hive. IEEE Trans. Serv. Comput. 11(5), 835–849 (2018)
Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: interactive analysis of web-scale datasets. PVLDB 3(1), 330–339 (2010)
Mesmoudi, A., Hacid, M.-S., Toumani, F.: Benchmarking sql on mapreduce systems using large astronomy databases. Distrib. Parall. Databases 34(3), 347–378 (2016)
Nambiar, R.O., Poess, M.: The making of tpc-ds. In: Proceedings of the 32nd international conference on Very large data bases, pp. 1049–1058. VLDB Endowment (2006)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD (2009)
Qin, X., Chen, Y., Chen, J., Li, S., Liu, J., Zhang, H.: The performance of sql-on-hadoop systems-an experimental study. In: 2017 IEEE International Congress on Big Data (BigData Congress), pp. 464–471. IEEE (2017)
Saha, B., Shah, H., Seth, S., Vijayaraghavan, G., Murthy, A.C., Curino, C.: Apache Tez: A unifying framework for modeling and building data processing applications. In: SIGMOD (2015)
Sakr, S.: Big Data 2.0 Processing Systems: A Survey. Springer, New York (2016)
Sakr, S., Liu, A., Fayoumi, A.G.: The family of mapreduce and large-scale data processing systems. ACM Comput. Surv. 46(1), 11 (2013)
Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: MSST (2010)
Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sarma, J.S., Murthy, R., Liu, H.: Data warehousing and analytics infrastructure at Facebook. In: SIGMOD (2010)
Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache Hadoop YARN: yet another resource negotiator. In: SOCC (2013)
White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Newton (2012)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud (2010)
Acknowledgements
This work is funded by the European Regional Development Funds via the Mobilitas Plus Programme (Grant MOBTT75).
Appendix
Configuration parameters of the Hadoop framework

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| mapreduce.framework.name | yarn | Forces the use of YARN (MapReduce 2) instead of MapReduce 1 |
| yarn.app.mapreduce.am.resource.mb | 4096 | Memory in MB allocated to the Application Master (AM) |
| yarn.app.mapreduce.am.command-opts | -Xmx3072m | Set to 80% of the memory allocated to the AM to prevent it from crashing |
| mapreduce.map.memory.mb | 4096 | Memory in MB allocated to each map task; increased due to GC limit errors |
| mapreduce.reduce.memory.mb | 8192 | Memory in MB allocated to each reduce task; increased due to GC limit errors |
| mapreduce.map.java.opts | -Xmx3072m | Set to 80% of the memory allocated to each map task |
| mapreduce.reduce.java.opts | -Xmx6144m | Set to 80% of the memory allocated to each reduce task |
| mapreduce.task.timeout | 1800000 | Increased to 1800 s (30 min) to ensure that no long-running task times out |
| yarn.nodemanager.resource.memory-mb | 62000 | Set to the maximum available memory, since only the benchmarks run on the servers |
| yarn.nodemanager.resource.cpu-vcores | 15 | Set to the available cores minus 1, leaving one core for other system processes |
| yarn.nodemanager.vmem-check-enabled | false | Prevents containers from being killed when they exceed their virtual memory |
| yarn.nodemanager.pmem-check-enabled | true | Shuts down containers that exceed the available physical memory |
| yarn.nodemanager.vmem-pmem-ratio | 3.0 | Allows more virtual memory to be allocated per unit of physical memory |
| yarn.scheduler.maximum-allocation-mb | 62000 | Set to the maximum available memory on each node |
| yarn.scheduler.minimum-allocation-mb | 1024 | Sets the minimum container allocation to 1 GB |
| yarn.scheduler.maximum-allocation-vcores | 15 | Maximum number of cores that can be assigned to any task |
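As a sketch of how such settings are applied, the MapReduce-prefixed properties belong in `mapred-site.xml` and the YARN-prefixed ones in `yarn-site.xml`; a partial example with two of the properties above:

```xml
<!-- mapred-site.xml: route MapReduce jobs through YARN and size map tasks -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>
</configuration>
```

The remaining properties follow the same `<property>` name/value pattern in their respective files; a restart of the NodeManagers and ResourceManager is typically required for YARN-level changes to take effect.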
Configuration parameters of the Hive engine

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| Metastore | MySQL | MySQL was chosen for its easy integration with Hive |
| hive.exec.parallel | true | Enables Hive to run independent job stages in parallel |
| hive.exec.parallel.thread.number | 8 | Sets the maximum number of jobs that can execute in parallel to 8 |
| hive.mapred.mode | nonstrict | Allows Hive to run queries that lack a WHERE or LIMIT clause |
| hive.strict.checks.cartesian.product | false | Allows Cartesian-product queries, which may run for a long time |
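These Hive properties can be placed in `hive-site.xml` or, as a minimal per-session sketch, set with `SET` commands before running a query:

```sql
-- Per-session equivalents of the hive-site.xml settings above
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;
SET hive.mapred.mode=nonstrict;
SET hive.strict.checks.cartesian.product=false;
```

Session-level `SET` overrides are convenient for benchmarking because they take effect immediately without restarting HiveServer2.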
Configuration parameters of the Spark SQL engine

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| spark.master | yarn | Uses YARN for resource management |
| spark.driver.memory | 3 GB | Kept moderate to leave more memory for the executors |
| spark.executor.memory | 5 GB | Sized so that more executors can be started |
| spark.sql.broadcastTimeout | 1200 | Allows long-running broadcast joins that timed out at the default of 300 s |
| spark.driver.maxResultSize | 5 GB | Lets the driver collect large intermediate results between jobs |
| spark.network.timeout | 800 s | Increases how long Spark waits for responses across the network |
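In `spark-defaults.conf` form, with the sizes rendered in Spark's own suffix syntax, the settings above would look roughly like this (a sketch, not the authors' exact file):

```
spark.master                 yarn
spark.driver.memory          3g
spark.executor.memory        5g
spark.sql.broadcastTimeout   1200
spark.driver.maxResultSize   5g
spark.network.timeout        800s
```

The same values can equivalently be passed per application via `spark-submit --conf key=value`.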
Configuration parameters of the PrestoDB engine

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| discovery-server.enabled | true | Runs the discovery service embedded in the coordinator node |
| node-scheduler.include-coordinator | true | Allows the coordinator node to also be scheduled for work as a worker |
| experimental.spill-enabled | true | Allows intermediate results to be written to disk in order to avoid out-of-memory errors |
| experimental.spiller-spill-path | File system path | A path on the local file system where spill data are written |
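In Presto, these settings go into `etc/config.properties` on the coordinator; a partial sketch, where the spill path is a placeholder to be replaced with a real local directory:

```properties
coordinator=true
discovery-server.enabled=true
node-scheduler.include-coordinator=true
experimental.spill-enabled=true
experimental.spiller-spill-path=/path/to/spill
```

Letting the coordinator also act as a worker (`node-scheduler.include-coordinator=true`) uses every node for query execution, which is common on small benchmark clusters but usually avoided on large production deployments.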
Aluko, V., Sakr, S. Big SQL systems: an experimental evaluation. Cluster Comput 22, 1347–1377 (2019). https://doi.org/10.1007/s10586-019-02914-4