Abstract
It is well known that caching is critical for high-performance implementations: in many situations, data locality (in space and time) matters more than optimizing the number of floating-point operations. In this paper, we show evidence that, at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.
Notes
1. With \(n = 1{,}568 = 2^5 \cdot 7^2\), we choose a matrix size that is not a power of \(2\) to avoid performance artifacts due to the specific problem size.
2. The subscripts R through U are the values of the flag arguments side, uplo, trans, and diag; they distinguish the form of the operation performed by the kernel.
3. Read from the CPU’s time stamp counter through the assembly instruction rdtsc.
4. The system fluctuations cause variations of the dgeqrf timings of 0.057 % on average. With the exception of the tiny dcopys, these fluctuations are not significant.
5. By “touching”, we mean a simple read+write access to the data, e.g. .
6. The length of the list can be safely restricted to the number of kernel calls per iteration of the blocked algorithm.
7. For \(n = 2400\), the upper triangular portion of the matrix is about twice as large as the cache size.
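Several of the notes above state small, checkable facts: the factorization in note 1, the timer in note 3, the "touching" access in note 5, and the operand size in note 7. The following is a minimal Python sketch of those details, not the authors' measurement code. All function names are hypothetical. It assumes 8-byte double-precision entries and an upper triangle that includes the diagonal, and it uses `time.perf_counter_ns` as a portable stand-in for the `rdtsc` cycle counter. Interpreter overhead in Python will mask most cache effects, so the sketch illustrates the protocol rather than reproducing any measurement.

```python
import time
from array import array


def prime_factors(n):
    """Return the prime factorization of n as a {prime: exponent} dict."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def upper_triangle_bytes(n, elem_size=8):
    """Bytes in the upper triangle of an n x n matrix, diagonal included.

    Assumption: 8-byte double-precision entries.
    """
    return n * (n + 1) // 2 * elem_size


def touch(buf):
    """Read each element and write it back (one read+write access per entry).

    Hypothetical access pattern; the example given in note 5 is missing
    from the text.
    """
    for i in range(len(buf)):
        buf[i] = buf[i]


def timed(kernel, *args):
    """Run kernel(*args) once and return (result, elapsed nanoseconds).

    time.perf_counter_ns is a portable stand-in for reading rdtsc.
    """
    start = time.perf_counter_ns()
    result = kernel(*args)
    return result, time.perf_counter_ns() - start


if __name__ == "__main__":
    print(prime_factors(1568))         # factorization of the size in note 1
    print(upper_triangle_bytes(2400))  # operand size for the case in note 7
    x = array("d", [1.0] * 100_000)
    touch(x)                           # access the operand before timing it
    total, elapsed_ns = timed(sum, x)
    print(total, elapsed_ns)
```

Under these assumptions, the upper triangle for \(n = 2400\) holds \(2400 \cdot 2401 / 2 = 2{,}881{,}200\) entries, or about 23 MB. The cache size itself is not stated in the notes, so the sketch does not compare against it.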
Acknowledgments
Financial support from the Deutsche Forschungsgemeinschaft (DFG) through grant GSC 111 and the Deutsche Telekom Stiftung is gratefully acknowledged.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Peise, E., Bientinesi, P. (2015). A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science -- VECPAR 2014. VECPAR 2014. Lecture Notes in Computer Science, vol 8969. Springer, Cham. https://doi.org/10.1007/978-3-319-17353-5_21
DOI: https://doi.org/10.1007/978-3-319-17353-5_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17352-8
Online ISBN: 978-3-319-17353-5
eBook Packages: Computer Science, Computer Science (R0)