Abstract
In this chapter, we propose a parallel algorithm for transposing sparse matrices stored in CSR format on many-core GPUs, exploiting the GPU's computational power and memory bandwidth through CUDA parallel programming. We evaluate our code on a quad-core Intel Xeon E5507 CPU platform with an NVIDIA GeForce GTX 470 GPU. Measuring performance on inputs ranging from small to large matrices, our preliminary results scale well up to 512 threads and are promising for larger matrices.
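The chapter's CUDA kernels are not reproduced on this page. As background, the following is a minimal serial sketch of the classic counting-sort approach to CSR transposition (in the spirit of Gustavson's algorithm), which is the computation a GPU implementation parallelizes; the function name `csr_transpose` and its interface are illustrative, not the authors' code.

```python
def csr_transpose(num_rows, num_cols, row_ptr, col_idx, vals):
    """Transpose a CSR matrix via counting sort over column indices.

    Returns (t_row_ptr, t_col_idx, t_vals): the transpose in CSR form
    (equivalently, the original matrix in CSC form).
    """
    nnz = len(col_idx)
    # 1. Count nonzeros per column of A = nonzeros per row of A^T.
    t_row_ptr = [0] * (num_cols + 1)
    for c in col_idx:
        t_row_ptr[c + 1] += 1
    # 2. Prefix-sum the counts into row offsets of A^T.
    for i in range(num_cols):
        t_row_ptr[i + 1] += t_row_ptr[i]
    # 3. Scatter each nonzero of A into its slot in A^T, advancing
    #    a per-row write cursor as slots fill up.
    cursor = t_row_ptr[:-1].copy()
    t_col_idx = [0] * nnz
    t_vals = [0] * nnz
    for r in range(num_rows):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            dest = cursor[col_idx[k]]
            t_col_idx[dest] = r
            t_vals[dest] = vals[k]
            cursor[col_idx[k]] += 1
    return t_row_ptr, t_col_idx, t_vals
```

On a GPU, steps 1 and 3 become data-parallel count and scatter kernels (with atomics or segmented writes) and step 2 becomes a parallel prefix scan, which is where the thread scalability reported in the abstract comes from.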
Acknowledgments
This chapter is based upon work supported in part by Taiwan National Science Council (NSC) grants no. NSC101-2221-E-126-002 and NSC101-2915-I-126-001 and by NVIDIA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSC or NVIDIA.
Copyright information
© 2013 Springer Science+Business Media New York
Cite this paper
Weng, TH., Pham, H., Jiang, H., Li, KC. (2013). Designing Parallel Sparse Matrix Transposition Algorithm Using CSR for GPUs. In: Juang, J., Huang, YC. (eds) Intelligent Technologies and Engineering Systems. Lecture Notes in Electrical Engineering, vol 234. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6747-2_31
Print ISBN: 978-1-4614-6746-5
Online ISBN: 978-1-4614-6747-2