Abstract
With a data-parallel design, GPUs depend on uniform work distribution to reach their full potential. Irregular applications therefore suffer serious performance degradation, because scheduling irregular tasks on a GPU is highly challenging: devising a suitable schedule requires an understanding of both the GPU architecture and the irregular application, not to mention error-prone concurrent programming. This paper proposes a two-level scheduling scheme that distributes irregular tasks and enables resource sharing on GPUs by managing tasks and threads hierarchically. We also group cache-friendly tasks to increase data reuse in the L1 cache, and we extend the scheduling to handle nested irregularities. In addition, we devise a programming framework that eases task scheduling for application programmers. Experimental results show that our approach improves the performance of six irregular applications on a typical platform, yielding a harmonic-mean speedup of \(2.1\times \) at a small scheduling cost, without placing a heavy burden on programmers.
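For reference, a harmonic-mean speedup over \(n\) applications is conventionally computed as shown below. This is the standard definition, stated here as an assumption, since the abstract does not spell out the formula; the symbols \(n\) and \(s_i\) (the speedup of application \(i\) over its baseline) are introduced only for illustration.

\[ \bar{s}_{H} = \frac{n}{\sum_{i=1}^{n} \frac{1}{s_i}}, \qquad n = 6 \text{ for the six applications evaluated.} \]

Because it averages reciprocals, the harmonic mean never exceeds the arithmetic mean and is pulled toward the smallest speedups, so a few large gains cannot dominate the reported figure.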
Acknowledgments
This work is supported by the National High Technology Research and Development Program of China (2012AA010902), the National Natural Science Foundation of China (61432018), and the Innovation Research Group of NSFC (61221062).
Cite this article
Li, J., Liu, L., Wu, Y. et al. Two-Level Task Scheduling for Irregular Applications on GPU Platform. Int J Parallel Prog 45, 79–93 (2017). https://doi.org/10.1007/s10766-015-0387-0
DOI: https://doi.org/10.1007/s10766-015-0387-0