Abstract
The Single Instruction, Multiple Data (SIMD) execution model has received renewed attention in recent years. This interest stems from the rise of graphics processing units (GPUs) as a powerful alternative for parallel computing. Many compiler optimizations have recently been proposed for this hardware, but register allocation remains largely unexplored. In this context, this paper describes a register spiller for SIMD machines that capitalizes on the opportunity to share identical data between threads. This sharing provides two benefits: first, it uses less memory, as more spilled values are shared among threads; second, it improves the access times to spilled values. We have implemented our proposed allocator in the Ocelot open source compiler, and have been able to speed up the code produced by this framework by 21%. Although we have designed our algorithm on top of a linear scan register allocator, we claim that our ideas can be easily adapted to fit the needs of other register allocators.
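As a rough illustration of the memory argument only, the following Python sketch counts spill slots under a simplified model: a spilled value that is identical across threads needs one shared slot, while a value that differs between threads needs one private slot per thread. The `Spill` type, the `uniform` flag, and both counting functions are hypothetical names introduced here; they are not taken from the allocator implemented in Ocelot, and the model ignores slot reuse and memory-bank effects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spill:
    """A spilled variable (hypothetical model, not the paper's data structure)."""
    name: str
    uniform: bool  # assumption: True if every thread holds the same value


def naive_slots(spills, threads):
    """Every spilled value gets one private slot per thread."""
    return len(spills) * threads


def shared_slots(spills, threads):
    """Uniform values get a single shared slot; the rest stay per-thread.

    Returns (shared_count, private_count).
    """
    shared = sum(1 for s in spills if s.uniform)
    private = sum(threads for s in spills if not s.uniform)
    return shared, private


# Illustrative input: two values identical across threads, one thread-dependent.
spills = [
    Spill("base_ptr", uniform=True),
    Spill("thread_offset", uniform=False),
    Spill("length", uniform=True),
]
threads = 256

shared, private = shared_slots(spills, threads)
print("naive:", naive_slots(spills, threads))
print("with sharing:", shared + private)
```

In this toy input the two uniform values collapse from 512 private slots to 2 shared ones, which is the kind of saving the abstract refers to; the actual benefit depends on how many spilled values a real analysis can prove identical across threads.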
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sampaio, D.N., Gedeon, E., Pereira, F.M.Q., Collange, C. (2012). Spill Code Placement for SIMD Machines. In: de Carvalho Junior, F.H., Barbosa, L.S. (eds) Programming Languages. Lecture Notes in Computer Science, vol 7554. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33182-4_3
DOI: https://doi.org/10.1007/978-3-642-33182-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33181-7
Online ISBN: 978-3-642-33182-4
eBook Packages: Computer Science (R0)