Abstract
Index joins present a case of pointer-chasing code that causes data cache misses. In principle, we can hide these cache misses by overlapping them with computation: The lookups involved in an index join are parallel tasks whose execution can be interleaved, so that, when a cache miss occurs in one task, the processor executes independent instructions from another one. Yet, the literature provides no concrete performance model for such interleaved execution and, more importantly, production systems still waste processor cycles on cache misses because (a) hardware and compiler limitations prohibit automatic task interleaving and (b) existing techniques that involve the programmer produce unmaintainable code and are thus avoided in practice. In this paper, we address these shortcomings: we model interleaved execution explaining how to estimate the speedup of any interleaving technique, and we propose interleaving with coroutines, i.e., functions that suspend their execution for later resumption. We deploy coroutines on index joins running in SAP HANA and show that interleaving with coroutines performs like other state-of-the-art techniques, retains close resemblance to the original code, and supports both interleaved and non-interleaved execution in the same implementation. Thus, we establish the first systematic and practical approach for interleaving index joins of any type.
Similar content being viewed by others
Notes
Retrieved from a profiling session of 60 s.
We have not yet investigated how to form a pipeline with variable size, so we do not provide a corresponding SPP version.
At the time of writing, Visual C++ and Clang are the only compilers that implement the technical specification for coroutines.
An interested reader can find more information about C++ coroutines in the technical specification [6] and at Lewis Baker’s excellent blog: https://lewissbaker.github.io/.
The observed speedups differ from the expected ones mainly due to rounding; group sizes cannot be floats.
References
Laudon, J., Gupta, A., Horowitz, M.: Interleaving: A multithreading technique targeting multiprocessors and workstations. In: Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VI, pp. 308–318. ACM, New York, NY, USA (1994)
Mowry, T.C., Lam, M.S., Gupta, A.: Design and evaluation of a compiler algorithm for prefetching. In: Proceedings of the fifth international conference on architectural support for programming languages and operating systems, ASPLOS V, pp. 62–73. ACM, New York, NY, USA (1992)
Chen, S., Ailamaki, A., Gibbons, P.B., Mowry, T.C.: Improving hash join performance through prefetching. ACM Trans. Database Syst. 32(3), 1–32 (2007)
Kocberber, O., Falsafi, B., Grot, B.: Asynchronous memory access chaining. PVLDB 9(4), 252–263 (2015)
Färber, F., May, N., Lehner, W., Große, P., Müller, I., Rauhe, H., Dees, J.: The SAP HANA database: an architecture overview. IEEE Data Eng. Bull. 35(1), 28–33 (2012)
Programming Languages – C++ Extensions for Coroutines. Proposed Draft Technical Specification ISO/IEC DTS 22277 (E). http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4680.pdf. Accessed 16 May 2018
Psaropoulos, G., Legler, T., May, N., Ailamaki, A.: Interleaving with coroutines: a practical approach for robust index joins. Proc. VLDB Endow. 11(2), 230–242 (2017)
Lang, H., Mühlbauer, T., Funke, F., Boncz, P.A., Neumann, T., Kemper, A.: Data blocks: hybrid OLTP and OLAP on compressed storage using both vectorization and compilation. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pp. 311–326. ACM, New York, NY, USA (2016)
Larson, P.-A., Clinciu, C., Hanson, E.N., Oks, A., Price, S.L., Rangarajan, S., Surna, A., Zhou, Q.: Sql server column store indexes. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pp. 1177–1184. ACM, New York, NY, USA (2011)
Poess, M., Potapov, D.: Data compression in Oracle. In: Proceedings of VLDB, pp. 937–947. (2003)
Raman, V., Attaluri, G., Barber, R., Chainani, N., Kalmuk, D., KulandaiSamy, V., Leenstra, J., Lightstone, S., Liu, S., Lohman, G.M., Malkemus, T., Mueller, R., Pandis, I., Schiefer, B., Sharpe, D., Sidle, R., Storm, A., Zhang, L.: DB2 with BLU Acceleration: so much more than just a column store. PVLDB 6(11), 1080–1091 (2013)
Melton, J.: Advanced SQL 1999: Understanding Object-Relational, and Other Advanced Features. Elsevier Science Inc., Amsterdam (2002)
Colgan, M.: Oracle database in-memory. Technical report, Oracle Corporation, 2015. http://www.oracle.com/technetwork/database/in-memory/overview/twp-oracle-database-in-memory-2245633.pdf. Accessed 16 May 2018
Idreos, S., Groffen, F., Nes, N., Manegold, S., Mullender, S., Kersten, M.: MonetDB: two decades of research in column-oriented database architectures. IEEE Data Eng. Bull. 35(1), 40–45 (2012)
Kemper, A., Neumann, T.: HyPer: a hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In: Proceedings of ICDE, pp. 195–206, (2011)
Larson, P.-A., Clinciu, C., Fraser, C., Hanson, E.N., Mokhtar, M., Nowakiewicz, M., Papadimos, V., Price, S.L., Rangarajan, S., Rusanu, R., Saubhasik, M.: Enhancements to SQL Server column stores. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pp. 1159–1168. ACM, New York, NY, USA (2013)
Boncz, P.A., Manegold, S., Kersten, M.L.: Database architecture optimized for the new bottleneck: Memory access. In: Proceedings of the 25th international conference on very large data bases, VLDB ’99, pp. 54–65. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. (1999)
Müller, I., Ratsch, C., Färber, F.: Adaptive string dictionary compression in in-memory column-store database systems. In: Proceedings of EDBT, pp. 283–294, (2014)
Rao, J., Ross, K.A.: Making B+-trees cache conscious in main memory. In: Proceedings of ACM SIGMOD, pp. 475–486. (2000)
Transaction Processing Performance Council. TPC-DS Benchmark Version 2.3.0. http://www.tpc.org/tpcds/. Accessed 14 August 2017
Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual, (2016)
Kocberber, O., Grot, B., Picorel, J., Falsafi, B., Lim, K., Ranganathan, P.: Meet the walkers: accelerating index traversals for in-memory databases. In: Proceedings of the 46th annual IEEE/ACM international symposium on microarchitecture, MICRO-46, pp. 468–479. ACM, New York, (2013)
Neumann, T.: Trying to Speed Up Binary Search, 2015. http://databasearchitects.blogspot.ch/2015/09/trying-to-speed-up-binary-search.html. Accessed 16 May 2018
Lam, M.D., Rothberg, E.E., Wolf, M.E.: The cache performance and optimizations of blocked algorithms. In: Proceedings of the ASPLOS (1991)
Kim, C., Chhugani, J., Satish, N., Sedlar, E., Nguyen, A.D., Kaldewey, T., Lee, V.W., Brandt, S.A., Dubey, P.: FAST: fast architecture sensitive tree search on modern CPUs and GPUs. In: Proceedings of SIGMOD, pp. 339–350, (2010)
Zhou, J., Ross, K.A.: Buffering accesses to memory-resident index structures. In: Proceedings of VLDB, pp. 405–416. (2003)
Zhou, J., Cieslewicz, J., Ross, K.A., Shah, M.: Improving database performance on simultaneous multithreading processors. In: Proceedings of VLDB, pp. 49–60. (2005)
Moura, A .L .D., Ierusalimschy, R.: Revisiting coroutines. ACM Trans. Program. Lang. Syst., 31(2), 6:1–6:31 (2009)
Conway, M.E.: Design of a separable transition-diagram compiler. Commun. ACM 6(7), 396–408 (1963)
C# Reference. https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/yield. Accessed 16 May 2018
Generators. Python Wiki. https://wiki.python.org/moin/Generators. Accessed 16 May 2018
Russinovich, M.E., Solomon, D.A., Ionescu, A.: Windows Internals, Part 1: Covering Windows Server 2008 R2 and Windows 7, 6th edn. Microsoft Press, USA (2012)
Kerrisk, M.: The Linux Programming Interface: A Linux and UNIX System Programming Handbook, 1st edn. No Starch Press, San Francisco (2010)
RethinkDB Team. Improving a large C++ project with coroutines, 2010. https://www.rethinkdb.com/blog/improving-a-large-c-project-with-coroutines/. Accessed 16 May 2018
Dybvig, R .K.: The Scheme Programming Language, 4th edn. The MIT Press, Cambridge (2009)
Kowalke, O., Goodspeed, N.: Call/cc (call-with-current-continuation): a low-level api for stackful context switching. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0534r3.pdf. Accessed 16 May 2018
Marusarz, J.: Understanding how general exploration works in Intel®VTune™Amplifier XE, 2015. https://software.intel.com/en-us/articles/understanding-how-general-exploration-works-in-intel-vtune-amplifier-xe. Accessed 16 May 2018
McCabe, T.J.: A complexity measure. IEEE Trans. Softw. Eng. 2(4), 308–320 (1976)
Jonathan, C., Minhas, U.F., Hunter, J., Levandoski, J., Nishanov, G.: Exploiting coroutines to attack the “killer nanoseconds”. PVLDB 11(11), 1702–1714 (2018)
Kiriansky, V., Xu, H., Rinard, M., Amarasinghe, S.: Cimple: instruction and memory level parallelism. ArXiv e-prints, (2018)
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Psaropoulos, G., Legler, T., May, N. et al. Interleaving with coroutines: a systematic and practical approach to hide memory latency in index joins. The VLDB Journal 28, 451–471 (2019). https://doi.org/10.1007/s00778-018-0533-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-018-0533-6