Big SQL systems: an experimental evaluation

Cluster Computing

Abstract

Recently, Big Data systems have been gaining popularity for handling the massive amounts of data that are continuously generated in our digital world. While the Hadoop framework pioneered the area of Big Data processing systems, it has clear performance limitations when processing massive amounts of structured data. In addition, in practice, many users of Big Data systems struggle with the APIs and low-level programming abstractions of these systems and would prefer to use SQL, in which they are more proficient, as a high-level declarative language to express their tasks, leaving the execution optimization details to the backend engine. Several systems have therefore been designed and implemented to tackle these challenges by providing scalable query execution engines that process massive structured data while supporting SQL interfaces. In this article, we present an extensive experimental study of four popular systems in this domain, namely Apache Hive, Spark SQL, Apache Impala and PrestoDB. In particular, we report and analyze the performance characteristics of these systems using three different benchmarks, namely TPC-H, TPC-DS and TPCx-BB. Finally, we report a set of insights and important lessons that we have learned from conducting our experiments.

Notes

  1. http://hadoop.apache.org/

  2. www.cio.com/article/2377070/business-intelligence/10-hot-hadoop-startups-to-watch.html

  3. https://prestodb.io/

  4. TPC-H

  5. https://github.com/DataSystemsGroupUT/Benchmarking-Big-SQL-Systems/

  6. https://cwiki.apache.org/confluence/display/Hive/LanguageManual

  7. https://parquet.apache.org/

  8. https://avro.apache.org/

  9. https://impala.apache.org/

  10. https://hbase.apache.org/

  11. http://www-01.ibm.com/software/data/infosphere/hadoop/big-sql.html

  12. https://www.ibm.com/developerworks/data/library/techarticle/dm-1110biginsightsintro/index.html

  13. http://tajo.apache.org/

  14. http://docs.openstack.org/developer/swift/

  15. http://drill.apache.org/

  16. https://cloud.google.com/bigquery/

  17. https://phoenix.apache.org/

  18. http://www.salesforce.com/

  19. http://www.tpc.org/

  20. https://issues.apache.org/jira/browse/HIVE-600

  21. https://dataworkssummit.com/berlin-2018/session/evaluation-of-tpc-h-on-spark-spark-sql-in-aloja/

  22. https://github.com/prestodb/benchto

  23. https://github.com/IBM/spark-tpc-ds-performance-test

  24. https://github.com/cloudera/impala-tpcds-kit

  25. https://github.com/hortonworks/hive-testbench

  26. https://spark.apache.org/docs/latest/ml-guide.html

  27. http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/

Acknowledgements

This work is funded by the European Regional Development Funds via the Mobilitas Plus Programme (Grant MOBTT75).

Author information

Corresponding author

Correspondence to Sherif Sakr.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Configuration parameters of the Hadoop framework

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| mapreduce.framework.name | yarn | Forces YARN and MapReduce 2 to be used instead of MapReduce 1 |
| yarn.app.mapreduce.am.resource.mb | 4096 | Memory in MB allocated to the Application Master (AM) |
| yarn.app.mapreduce.am.command-opts | -Xmx3072m | Set to 75% of the memory allocated to the AM (3072 of 4096 MB) to prevent it from crashing |
| mapreduce.map.memory.mb | 4096 | Memory in MB allocated to each map task. Increased due to GC limit errors |
| mapreduce.reduce.memory.mb | 8192 | Memory in MB allocated to each reduce task. Increased due to GC limit errors |
| mapreduce.map.java.opts | -Xmx3072m | Set to 75% of the memory allocated to each map task (3072 of 4096 MB) |
| mapreduce.reduce.java.opts | -Xmx6144m | Set to 75% of the memory allocated to each reduce task (6144 of 8192 MB) |
| mapreduce.task.timeout | 1,800,000 | Task timeout in milliseconds (1800 s). Increased to ensure no task times out while YARN is waiting |
| yarn.nodemanager.resource.memory-mb | 62,000 | Set to the maximum memory available, since only the benchmarks run on the servers |
| yarn.nodemanager.resource.cpu-vcores | 15 | Set to the available cores minus one, so that one core remains for other system processes |
| yarn.nodemanager.vmem-check-enabled | false | Prevents containers from being killed when virtual memory is exceeded |
| yarn.nodemanager.pmem-check-enabled | true | Shuts down containers that exceed the available physical memory |
| yarn.nodemanager.vmem-pmem-ratio | 3.0 | Allows more virtual memory to be allocated |
| yarn.scheduler.maximum-allocation-mb | 62,000 | Set to the maximum available memory on each node |
| yarn.scheduler.minimum-allocation-mb | 1024 | Set to 1 GB as the minimum starting memory |
| yarn.scheduler.maximum-allocation-vcores | 15 | Maximum number of cores that can be assigned to any task |
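
The memory settings above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original study: the helper names `heap_mb` and `max_containers` are hypothetical, and the container estimate assumes one vcore per container and ignores the memory taken by the Application Master.

```python
import re

def heap_mb(java_opts: str) -> int:
    """Extract the -Xmx heap size in MB from a JVM options string such as '-Xmx3072m'."""
    match = re.search(r"-Xmx(\d+)([mMgG])", java_opts)
    if match is None:
        raise ValueError(f"no -Xmx flag in {java_opts!r}")
    size, unit = int(match.group(1)), match.group(2).lower()
    return size * 1024 if unit == "g" else size

# (container memory in MB, JVM options) pairs copied from the appendix values.
settings = {
    "application master": (4096, "-Xmx3072m"),
    "map task": (4096, "-Xmx3072m"),
    "reduce task": (8192, "-Xmx6144m"),
}

# Fraction of each container that is handed to the JVM heap.
heap_fractions = {
    name: heap_mb(opts) / container_mb
    for name, (container_mb, opts) in settings.items()
}

NODE_MEMORY_MB = 62000  # yarn.nodemanager.resource.memory-mb
NODE_VCORES = 15        # yarn.nodemanager.resource.cpu-vcores

def max_containers(container_mb: int, vcores_per_container: int = 1) -> int:
    """Rough upper bound on concurrent containers per node (assumption: 1 vcore each)."""
    by_memory = NODE_MEMORY_MB // container_mb
    by_cpu = NODE_VCORES // vcores_per_container
    return min(by_memory, by_cpu)

for name, fraction in heap_fractions.items():
    print(f"{name}: heap is {fraction:.0%} of the container")
print("map containers per node:", max_containers(4096))
print("reduce containers per node:", max_containers(8192))
```

With the listed values, each heap works out to three quarters of its container, and the memory bound for 4096 MB map containers coincides with the 15 vcores reserved per node.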

Configuration parameters of the Hive engine

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| Metastore | MySQL | MySQL was chosen for its easy integration with Hive |
| hive.exec.parallel | true | Enables Hive to run jobs in parallel |
| hive.exec.parallel.thread.number | 8 | Sets the maximum number of jobs that can be executed in parallel to 8 |
| hive.mapred.mode | nonstrict | Enables Hive to run queries that do not have a WHERE or LIMIT clause |
| hive.strict.checks.cartesian.product | false | Allows Cartesian product queries, which may run for a long time |
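
One way to apply the Hive properties above is per session, using Hive's `SET key=value;` statements. The sketch below only generates those statements as text. The names `hive_settings` and `to_set_statements` are hypothetical, and the source does not say whether the authors set these values per session or in `hive-site.xml`. The metastore choice is a deployment decision rather than a session property, so it is left out.

```python
# Property values copied from the Hive section of the appendix.
hive_settings = {
    "hive.exec.parallel": "true",
    "hive.exec.parallel.thread.number": "8",
    "hive.mapred.mode": "nonstrict",
    "hive.strict.checks.cartesian.product": "false",
}

def to_set_statements(settings: dict) -> list:
    """Render each property as a HiveQL SET statement."""
    return [f"SET {key}={value};" for key, value in settings.items()]

# A preamble that could be placed at the top of a benchmark query script.
session_preamble = "\n".join(to_set_statements(hive_settings))
print(session_preamble)
```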

Configuration parameters of the Spark SQL engine

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| spark.master | yarn | Enables YARN for task management |
| spark.driver.memory | 3 GB | To allow more memory for the executors |
| spark.executor.memory | 5 GB | To allow more executors to be started |
| spark.sql.broadcastTimeout | 1200 | To allow long-running join operations, which otherwise timed out after 300 s |
| spark.driver.maxResultSize | 5 GB | To enable Spark to collect large intermediate results between jobs |
| spark.network.timeout | 800 s | Increases the time Spark waits for responses across the network |
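
The Spark SQL properties above can be passed on the command line as repeated `--conf key=value` options. The following is a minimal sketch that only builds the argument list. The size spellings `3g`, `5g` and `800s` are an assumed rendering of the "3 GB", "5 GB" and "800 s" values in the appendix, and `to_conf_args` is a hypothetical helper, not something taken from the article.

```python
import shlex

# Values from the Spark SQL section of the appendix, written in
# Spark's size/time suffix notation (an assumption about the spelling).
spark_settings = {
    "spark.master": "yarn",
    "spark.driver.memory": "3g",
    "spark.executor.memory": "5g",
    "spark.sql.broadcastTimeout": "1200",
    "spark.driver.maxResultSize": "5g",
    "spark.network.timeout": "800s",
}

def to_conf_args(settings: dict) -> list:
    """Turn a property mapping into a flat list of --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args

# Command line for the spark-sql shell; it is printed here, not executed.
command = ["spark-sql", *to_conf_args(spark_settings)]
print(shlex.join(command))
```

The same key/value pairs could equally be placed in `spark-defaults.conf`; the command-line form keeps each benchmark run self-describing.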

Configuration parameters of the PrestoDB engine

| Configuration parameter | Modified value | Explanation |
|---|---|---|
| discovery-server.enabled | true | Enabled on the coordinator node |
| node-scheduler.include-coordinator | true | Enabled on the coordinator node |
| experimental.spill-enabled | true | Allows intermediate results to be written to disk in order to avoid out-of-memory errors |
| experimental.spiller-spill-path | File system path | A path on the file system where spill data are written |
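
Presto reads such settings from a `key=value` properties file. As a minimal sketch under that assumption, the snippet below renders the four properties above as file content. The helper names are hypothetical, and the spill path is deliberately left as a placeholder because the appendix does not give the actual path.

```python
def coordinator_settings(spill_path: str) -> dict:
    """Properties from the PrestoDB section of the appendix; spill_path is supplied by the caller."""
    return {
        "discovery-server.enabled": "true",
        "node-scheduler.include-coordinator": "true",
        "experimental.spill-enabled": "true",
        "experimental.spiller-spill-path": spill_path,
    }

def render_properties(settings: dict) -> str:
    """Render a mapping as the text of a Java-style properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"

# "<spill-directory>" is a placeholder, not a value taken from the article.
properties_text = render_properties(coordinator_settings("<spill-directory>"))
print(properties_text, end="")
```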

Cite this article

Aluko, V., Sakr, S. Big SQL systems: an experimental evaluation. Cluster Comput 22, 1347–1377 (2019). https://doi.org/10.1007/s10586-019-02914-4

  • DOI: https://doi.org/10.1007/s10586-019-02914-4
