
Generating custom code for efficient query execution on heterogeneous processors


Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall and deliver improved performance. Currently, database engines have to be optimized manually for each processor, which is a costly and error-prone process. In this paper, we propose concepts to automatically adapt to and exploit the performance enhancements of modern processors. Our core idea is to create processor-specific code variants and to learn a well-performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. Our experimental results show that the performance of code variants may differ by up to two orders of magnitude. To achieve peak performance, we generate custom code for each processor. We show that our approach finds an efficient custom code variant for multi-core CPUs, GPUs, and MICs.




  1.

  2. Note that this comparison is not intended to be an end-to-end measurement of system performance.




We thank Tobias Behrens, Tobias Fuchs, Martin Kiefer, Manuel Renz, Viktor Rosenfeld, and Jonas Traub from TU Berlin for helpful feedback. This work was funded by the EU projects SAGE (671500) and E2Data (780245), by the DFG Priority Program Scalable Data Management for Future Hardware (MA4662-5) and the Collaborative Research Center SFB 876 (project A2), and by the German Ministry for Education and Research as BBDC (01IS14013A).

Author information



Corresponding author

Correspondence to Sebastian Breß.


About this article


Cite this article

Breß, S., Köcher, B., Funke, H. et al. Generating custom code for efficient query execution on heterogeneous processors. The VLDB Journal 27, 797–822 (2018).



  • Database systems
  • Database query processing
  • Query compilation
  • Heterogeneous processors
  • CPU
  • GPU
  • MIC
  • Code generation
  • Code variants
  • Variant optimization