The VLDB Journal, Volume 27, Issue 6, pp 797–822

Generating custom code for efficient query execution on heterogeneous processors

  • Sebastian Breß
  • Bastian Köcher
  • Henning Funke
  • Steffen Zeuch
  • Tilmann Rabl
  • Volker Markl
Regular Paper


Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall and deliver improved performance. Currently, database engines have to be manually optimized for each processor, which is a costly and error-prone process. In this paper, we propose concepts to adapt to and exploit the performance enhancements of modern processors automatically. Our core idea is to create processor-specific code variants and to learn a well-performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. Our experimental results show that the performance of code variants can differ by up to two orders of magnitude. To achieve peak performance, we generate custom code for each processor. We show that our approach finds an efficient custom code variant for multi-core CPUs, GPUs, and MICs.


Database systems, Database query processing, Query compilation, Heterogeneous processors, CPU, GPU, MIC, Code generation, Code variants, Variant optimization



We thank Tobias Behrens, Tobias Fuchs, Martin Kiefer, Manuel Renz, Viktor Rosenfeld, and Jonas Traub from TU Berlin for helpful feedback. This work was funded by the EU projects SAGE (671500) and E2Data (780245), DFG Priority Program Scalable Data Management for Future Hardware (MA4662-5) and Collaborative Research Center SFB 876, project A2, and the German Ministry for Education and Research as BBDC (01IS14013A).



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Sebastian Breß (1, 2)
  • Bastian Köcher (2)
  • Henning Funke (3)
  • Steffen Zeuch (1)
  • Tilmann Rabl (1, 2)
  • Volker Markl (1, 2)

  1. DFKI GmbH, Berlin, Germany
  2. TU Berlin, Berlin, Germany
  3. TU Dortmund, Dortmund, Germany
