Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications
Abstract
We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we enable automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also show how a tool, given a model of the memory system, can guide these transformations to statically choose a good layout, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe up to a 10.94X speedup over the original layout, and a 1.16X performance gain in the worst case.
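To make the idea concrete, the following C99 sketch contrasts a conventional row-major stencil with the same stencil on a remapped layout. This is an illustrative example, not the paper's compiler-generated code: the tile width `TILE` and the specific remapping in `xform_idx` are hypothetical placeholders for parameters the paper's tool would derive from a memory-system model.

```c
#include <stddef.h>

/* Hypothetical tile width; a layout tool would derive this from a
 * memory-system model (e.g., DRAM burst size, channel interleaving). */
#define TILE 16

/* Conventional row-major layout.  The C99 VLA parameter declaration
 * exposes the array bounds (ny, nx) to the compiler even though the
 * arrays are dynamically allocated, which is what makes an automatic
 * layout transformation analyzable and legal. */
void stencil_row_major(int ny, int nx, float out[ny][nx], float in[ny][nx])
{
    for (int y = 1; y < ny - 1; y++)
        for (int x = 1; x < nx - 1; x++)
            out[y][x] = 0.25f * (in[y][x - 1] + in[y][x + 1] +
                                 in[y - 1][x] + in[y + 1][x]);
}

/* One possible transformed layout (an illustrative choice, not the
 * paper's exact mapping): split the fastest-varying dimension x into a
 * block index and an intra-block offset, then hoist the block index
 * outermost, i.e. [y][x] -> [x / TILE][y][x % TILE].  Neighboring
 * blocks of a row now map to widely separated addresses, spreading
 * concurrent requests across parallel memory channels and banks. */
static inline size_t xform_idx(int ny, int nx, int y, int x)
{
    (void)nx;  /* assumes nx % TILE == 0 for brevity */
    return ((size_t)(x / TILE) * (size_t)ny + (size_t)y) * TILE + (x % TILE);
}

/* The same stencil on the transformed layout.  The extra divide and
 * modulo per access are the "more complex address calculations" whose
 * overhead the abstract mentions. */
void stencil_transformed(int ny, int nx, float *out, const float *in)
{
    for (int y = 1; y < ny - 1; y++)
        for (int x = 1; x < nx - 1; x++)
            out[xform_idx(ny, nx, y, x)] = 0.25f *
                (in[xform_idx(ny, nx, y, x - 1)] + in[xform_idx(ny, nx, y, x + 1)] +
                 in[xform_idx(ny, nx, y - 1, x)] + in[xform_idx(ny, nx, y + 1, x)]);
}
```

Because the remapping is a bijection on the nx*ny index space and every access goes through the same mapping, the transformed stencil computes the same result as the row-major version; only the placement of elements in memory changes. On a GPU, where adjacent threads typically handle adjacent x values, such a remapping determines which memory channels and banks a warp's concurrent requests land in, which is the memory-level parallelism the abstract targets.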
Keywords
GPU, Parallel programming, Data layout transformation