Two-Level Task Scheduling for Irregular Applications on GPU Platform

Li, Jing; Liu, Lei; Wu, Yuan; Feng, Xiaobing; Wu, Chengyong

doi:10.1007/s10766-015-0387-0

Published: 04 November 2015

International Journal of Parallel Programming, Volume 45, pages 79–93 (2017)

Jing Li¹,², Lei Liu¹, Yuan Wu³, Xiaobing Feng¹ & Chengyong Wu¹

Abstract

With their data-parallel design, GPUs depend on uniform work distribution to reach their full potential. Irregular applications therefore suffer serious performance degradation, because scheduling irregular tasks on a GPU is highly challenging: devising a suitable schedule requires an understanding of both the GPU architecture and the irregular application, not to mention error-prone concurrent programming. This paper proposes a two-level scheduling scheme that distributes irregular tasks and enables resource sharing on GPUs by managing tasks and threads hierarchically. We also group cache-friendly tasks to increase data reuse in the L1 cache, and we extend the scheduling to handle nested irregularities. In addition, we devise a programming framework that makes the task scheduling easier for application programmers to use. Experimental results show that our approach effectively improves the performance of six irregular applications on a typical platform, yielding a harmonic-mean speedup of \(2.1\times\) at a small scheduling cost, without placing a heavy burden on programmers.
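To make the idea of managing tasks and threads hierarchically more concrete, the following is a minimal CPU-side toy model, not the authors' algorithm or framework. It assumes a simplified cost model in which threads run in lockstep and each work item costs one step; the function names, the group and thread counts, and the task sizes are all hypothetical and chosen only for illustration. It compares a one-task-per-thread assignment with a hierarchical one in which idle thread groups pull tasks from a shared queue (level 1) and the threads of a group split each task's work items (level 2).

```python
import heapq
from math import ceil


def static_steps(task_sizes, num_threads):
    """One task per thread, threads in lockstep: each round costs its largest task."""
    return sum(
        max(task_sizes[i:i + num_threads])
        for i in range(0, len(task_sizes), num_threads)
    )


def two_level_steps(task_sizes, num_groups, threads_per_group):
    """Toy two-level schedule (illustrative assumption, not the paper's method).

    Level 1: the earliest-idle group pulls the next task from a shared queue.
    Level 2: the group's threads split that task's work items evenly.
    """
    finish = [0] * num_groups  # step at which each group becomes idle
    heapq.heapify(finish)
    for size in task_sizes:
        start = heapq.heappop(finish)  # earliest idle group takes the task
        heapq.heappush(finish, start + ceil(size / threads_per_group))
    return max(finish)


# Made-up irregular workload: one task is far larger than the others.
sizes = [1, 1, 1, 64, 2, 1, 3, 1]

# Both configurations use 8 threads in total.
print(static_steps(sizes, num_threads=8))                         # → 64
print(two_level_steps(sizes, num_groups=2, threads_per_group=4))  # → 17
```

In this toy model the single large task stalls all eight lockstep threads in the static case, whereas the hierarchical version spreads it over the four threads of one group while the other group drains the small tasks. The model ignores scheduling overhead, memory behavior, and cache effects, so the step counts say nothing about the speedups reported in the paper.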

Acknowledgments

This work is supported by the National High Technology Research and Development Program of China (2012AA010902), the National Natural Science Foundation of China (61432018), and the Innovation Research Group of NSFC (61221062).

Author information

Authors and Affiliations

SKL of Computer Architecture, Institute of Computing Technology, CAS, Beijing, China
Jing Li, Lei Liu, Xiaobing Feng & Chengyong Wu
School of Computer and Control Engineering, UCAS, Beijing, China
Jing Li
Beijing Samsung Telecom R&D Center, Beijing, China
Yuan Wu

Corresponding author

Correspondence to Jing Li.

About this article

Cite this article

Li, J., Liu, L., Wu, Y. et al. Two-Level Task Scheduling for Irregular Applications on GPU Platform. Int J Parallel Prog 45, 79–93 (2017). https://doi.org/10.1007/s10766-015-0387-0

Received: 30 March 2015
Accepted: 20 May 2015
Published: 04 November 2015
Issue Date: February 2017
DOI: https://doi.org/10.1007/s10766-015-0387-0
