Abstract
Big Data systems have gained increasing popularity for handling the massive amounts of data that are continuously generated in our digital world. While the Hadoop framework pioneered the area of Big Data processing systems, it has clear performance limitations when processing massive amounts of structured data. In addition, many users of Big Data systems struggle in practice with the APIs and low-level programming abstractions of these systems, and would prefer to use SQL, in which they are more proficient, as a high-level declarative language to express their tasks while leaving the execution optimization details to the backend engine. Consequently, several systems have been designed and implemented to tackle these challenges by providing scalable query execution engines for processing massive structured data behind SQL interfaces. In this article, we present an extensive experimental study of four popular systems in this domain, namely, Apache Hive, Spark SQL, Apache Impala and PrestoDB. In particular, we report and analyze the performance characteristics of these systems using three different benchmarks, namely, TPC-H, TPC-DS and TPCx-BB. Finally, we report a set of insights and important lessons that we learned from conducting our experiments.
References
Abadi, D., Babu, S., Özcan, F., Pandis, I.: Sql-on-hadoop systems: tutorial. Proc. VLDB Endow. 8(12), 2050–2051 (2015)
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. Proc. VLDB Endow. 2(1), 922–933 (2009)
Ammar, K., Özsu, M.T.: Experimental analysis of distributed graph systems. PVLDB 11(10), 1151–1164 (2018)
Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., Zaharia, M.: Spark SQL: Relational Data Processing in Spark. SIGMOD, Chicago (2015)
Cao, P., Gowda, B., Lakshmi, S., Narasimhadevara, C., Nguyen, P., Poelman, J., Poess, M., Rabl, T.: From bigbench to tpcx-bb: standardization of a big data benchmark. In: Technology Conference on Performance Evaluation and Benchmarking, pp. 24–44. Springer (2016)
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache flink: stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Commun. Data Eng. 36(4), 28–38 (2015)
Chen, Y., Qin, X., Bian, H., Chen, J., Dong, Z., Du, X., Gao, Y., Liu, D., Lu, J., Zhang, H.: A study of sql-on-hadoop systems. In: Proceedings of the Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, pp. 154–166. Springer (2014)
Choi, H., Son, J., Yang, H., Ryu, H., Lim, B., Kim, S., Chung, Y.D.: Tajo: A Distributed Data Warehouse System on Large Clusters. ICDE, Oslo (2013)
Dean, J., Ghemawa, S.: MapReduce: Simplified Data Processing on Large Clusters. In: OSDI, pp. 137–150 (2004)
Floratou, A., Özcan, F., Schiefer, B.: Benchmarking sql-on-hadoop systems: Tpc or not tpc? In: Proceedings of the Workshop on Big Data Benchmarks, pp. 63–72, Springer (2014)
Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: Bigbench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1197–1208. ACM (2013)
Ghazal, A., Ivanov, T., Kostamaa, P., Crolotte, A., Voong, R., Al-Kateb, M., Ghazal, W., Zicari, R.V.: Bigbench V2: the new and improved bigbench. In: Proceedings of the 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, 19–22 April 2017, pp. 1225–1236 (2017)
Huai, Y., Chauhan, A., Gates, A., Hagleitner, G., Hanson, E.N., O’Malley, O., Pandey, J., Yuan, Y., Lee, R., Zhang, X.: Major Technical Advancements in Apache Hive. SIGMOD, Chicago (2014)
Ivanov, T., Beer, M.-G.: Performance evaluation of spark sql using bigbench. In: Big Data Benchmarking, pp. 96–116. Springer (2015)
Ivanov, T., Singhal, R.: Abench: Big data architecture stack benchmark. In: Proceedings of the Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE 2018, Berlin, Germany, 09–13 April 2018, pp. 13–16 (2018)
Karimov, J., Rabl, T., Katsifodimos, A., Samarev, R., Heiskanen, H., Markl, V.: Benchmarking distributed stream data processing systems. In: 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, 16–19 April 2018, pp. 1507–1518, (2018)
Kornacker, M., Behm, A., Bittorf, V., Bobrovytsky, T., Ching, C., Choi, A., Erickson, J., Grund, M., Hecht, D., Jacobs, M., Joshi, I., Kuff, L., Kumar, D., Leblang, A., Li, N., Pandis, I., Robinson, H., Rorke, D., Rus, S., Russell, J., Tsirogiannis, D., Wanderman-Milne, S., Yoder, M.: Impala: A Modern. Open-Source SQL Engine for Hadoop. In: Proceedings of the CIDR (2015)
Laney, D.: 3d data management: controlling data volume, velocity and variety. META Group Res. Note 6(70), 1 (2001)
Liu, Y., Guo, S., Hu, S., Rabl, T., Jacobsen, H., Li, J., Wang, J.: Performance evaluation and optimization of multi-dimensional indexes in hive. IEEE Trans. Serv. Comput. 11(5), 835–849 (2018)
Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: interactive analysis of web-scale datasets. PVLDB 3(1), 330–339 (2010)
Mesmoudi, A., Hacid, M.-S., Toumani, F.: Benchmarking sql on mapreduce systems using large astronomy databases. Distrib. Parall. Databases 34(3), 347–378 (2016)
Nambiar, R.O., Poess, M.: The making of tpc-ds. In: Proceedings of the 32nd international conference on Very large data bases, pp. 1049–1058. VLDB Endowment (2006)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD (2009)
Qin, X., Chen, Y., Chen, J., Li, S., Liu, J., Zhang, H.: The performance of sql-on-hadoop systems-an experimental study. In: 2017 IEEE International Congress on Big Data (BigData Congress), pp. 464–471. IEEE (2017)
Saha, B., Shah, H., Seth, S., Vijayaraghavan, G., Murthy, A.C., Curino, C.: Apache Tez: A unifying framework for modeling and building data processing applications. In: SIGMOD (2015)
Sakr, S.: Big Data 2.0 Processing Systems: A Survey. Springer, New York (2016)
Sakr, S., Liu, A., Fayoumi, A.G.: The family of mapreduce and large-scale data processing systems. ACM Comput. Surv. 46(1), 11 (2013)
Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: MSST (2010)
Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sarma, J.S., Murthy, R., Liu, H.: Data warehousing and analytics infrastructure at Facebook. In: SIGMOD (2010)
Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache Hadoop YARN: yet another resource negotiator. In: SOCC (2013)
White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Newton (2012)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud (2010)
Acknowledgements
This work is funded by the European Regional Development Funds via the Mobilitas Plus Programme (Grant MOBTT75).
Appendix
Configuration parameters of the Hadoop framework

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| mapreduce.framework.name | yarn | Forces the use of YARN (MapReduce 2) instead of MapReduce 1 |
| yarn.app.mapreduce.am.resource.mb | 4096 | Memory in MB allocated to the Application Master (AM) |
| yarn.app.mapreduce.am.command-opts | -Xmx3072m | Set to 80% of the memory allocated to the AM to prevent it from crashing |
| mapreduce.map.memory.mb | 4096 | Memory in MB allocated to each map task; increased due to GC limit errors |
| mapreduce.reduce.memory.mb | 8192 | Memory in MB allocated to each reduce task; increased due to GC limit errors |
| mapreduce.map.java.opts | -Xmx3072m | Set to 80% of the memory allocated to each map task |
| mapreduce.reduce.java.opts | -Xmx6144m | Set to 80% of the memory allocated to each reduce task |
| mapreduce.task.timeout | 1800000 | Increased to 1800 s (30 min) to ensure that no long-running task times out |
| yarn.nodemanager.resource.memory-mb | 62000 | Set to the maximum available memory, since only the benchmarks run on the servers |
| yarn.nodemanager.resource.cpu-vcores | 15 | Set to the available cores minus 1, leaving one core for other system processes |
| yarn.nodemanager.vmem-check-enabled | false | Prevents containers from being killed when they exceed their virtual memory |
| yarn.nodemanager.pmem-check-enabled | true | Shuts down containers that exceed the available physical memory |
| yarn.nodemanager.vmem-pmem-ratio | 3.0 | Allows more virtual memory to be allocated per unit of physical memory |
| yarn.scheduler.maximum-allocation-mb | 62000 | Set to the maximum available memory on each node |
| yarn.scheduler.minimum-allocation-mb | 1024 | Sets the minimum container allocation to 1 GB |
| yarn.scheduler.maximum-allocation-vcores | 15 | Maximum number of cores that can be assigned to any task |
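As a sketch of how such settings are applied, the MapReduce-prefixed properties belong in `mapred-site.xml` and the YARN-prefixed ones in `yarn-site.xml`; a partial example with two of the properties above:

```xml
<!-- mapred-site.xml: route MapReduce jobs through YARN and size map tasks -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>
</configuration>
```

The remaining properties follow the same `<property>` name/value pattern in their respective files; a restart of the NodeManagers and ResourceManager is typically required for YARN-level changes to take effect.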
Configuration parameters of the Hive engine

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| Metastore | MySQL | MySQL was chosen for its easy integration with Hive |
| hive.exec.parallel | true | Enables Hive to run independent job stages in parallel |
| hive.exec.parallel.thread.number | 8 | Sets the maximum number of jobs that can execute in parallel to 8 |
| hive.mapred.mode | nonstrict | Allows Hive to run queries that lack a WHERE or LIMIT clause |
| hive.strict.checks.cartesian.product | false | Allows Cartesian-product queries, which may run for a long time |
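These Hive properties can be placed in `hive-site.xml` or, as a minimal per-session sketch, set with `SET` commands before running a query:

```sql
-- Per-session equivalents of the hive-site.xml settings above
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;
SET hive.mapred.mode=nonstrict;
SET hive.strict.checks.cartesian.product=false;
```

Session-level `SET` overrides are convenient for benchmarking because they take effect immediately without restarting HiveServer2.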
Configuration parameters of the Spark SQL engine

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| spark.master | yarn | Uses YARN for resource management |
| spark.driver.memory | 3 GB | Kept moderate to leave more memory for the executors |
| spark.executor.memory | 5 GB | Sized so that more executors can be started |
| spark.sql.broadcastTimeout | 1200 | Allows long-running broadcast joins that timed out at the default of 300 s |
| spark.driver.maxResultSize | 5 GB | Lets the driver collect large intermediate results between jobs |
| spark.network.timeout | 800 s | Increases how long Spark waits for responses across the network |
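In `spark-defaults.conf` form, with the sizes rendered in Spark's own suffix syntax, the settings above would look roughly like this (a sketch, not the authors' exact file):

```
spark.master                 yarn
spark.driver.memory          3g
spark.executor.memory        5g
spark.sql.broadcastTimeout   1200
spark.driver.maxResultSize   5g
spark.network.timeout        800s
```

The same values can equivalently be passed per application via `spark-submit --conf key=value`.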
Configuration parameters of the PrestoDB engine

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| discovery-server.enabled | true | Runs the discovery service embedded in the coordinator node |
| node-scheduler.include-coordinator | true | Allows the coordinator node to also be scheduled for work as a worker |
| experimental.spill-enabled | true | Allows intermediate results to be written to disk in order to avoid out-of-memory errors |
| experimental.spiller-spill-path | File system path | A path on the local file system where spill data are written |
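In Presto, these settings go into `etc/config.properties` on the coordinator; a partial sketch, where the spill path is a placeholder to be replaced with a real local directory:

```properties
coordinator=true
discovery-server.enabled=true
node-scheduler.include-coordinator=true
experimental.spill-enabled=true
experimental.spiller-spill-path=/path/to/spill
```

Letting the coordinator also act as a worker (`node-scheduler.include-coordinator=true`) uses every node for query execution, which is common on small benchmark clusters but usually avoided on large production deployments.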
Aluko, V., Sakr, S. Big SQL systems: an experimental evaluation. Cluster Comput 22, 1347–1377 (2019). https://doi.org/10.1007/s10586-019-02914-4