A Study of SQL-on-Hadoop Systems

  • Yueguo Chen
  • Xiongpai Qin
  • Haoqiong Bian
  • Jun Chen
  • Zhaoan Dong
  • Xiaoyong Du
  • Yanjie Gao
  • Dehai Liu
  • Jiaheng Lu
  • Huijie Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8807)

Abstract

Hadoop is now the de facto standard for storing and processing big data, covering not only unstructured data but also some structured data. As a result, providing SQL analysis functionality for the big data residing in HDFS is becoming increasingly important. Hive is a pioneering system that supports SQL-like analysis of the data in HDFS. However, the performance of Hive is not satisfactory for many applications. This has led to the rapid emergence of dozens of SQL-on-Hadoop systems that try to support interactive SQL query processing over data stored in HDFS. This paper first gives a brief technical review of recent efforts on SQL-on-Hadoop systems. We then test and compare the performance of five representative SQL-on-Hadoop systems, using queries selected or derived from the TPC-DS benchmark. Based on the results, we show that such systems can benefit further from applying many of the parallel query processing techniques that have been widely studied in traditional MPP analytical databases.

Keywords

Big data, SQL-on-Hadoop, Interactive query, Benchmark

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Yueguo Chen (1, 2)
  • Xiongpai Qin (1, 2)
  • Haoqiong Bian (1, 2)
  • Jun Chen (1, 2)
  • Zhaoan Dong (1, 2)
  • Xiaoyong Du (1, 2)
  • Yanjie Gao (1, 2)
  • Dehai Liu (1, 2)
  • Jiaheng Lu (1, 2)
  • Huijie Zhang (1, 2)

  1. Key Laboratory of Data Engineering and Knowledge Engineering, MOE, Beijing, China
  2. School of Information, Renmin University of China, Beijing, China
