The Journal of Supercomputing

Volume 71, Issue 6, pp 2309–2338

Architectural support for task scheduling: hardware scheduling for dataflow on NUMA systems

  • Behram Khan
  • Daniel Goodman
  • Salman Khan
  • Will Toms
  • Paolo Faraboschi
  • Mikel Luján
  • Ian Watson


To harness the compute resources of many-core systems with tens to hundreds of cores, applications have to expose parallelism to the hardware. Researchers are actively looking for program execution models that make it easier to expose parallelism and use the available resources. One common approach is to decompose a program into parallel ‘tasks’ and allow an underlying system layer to schedule these tasks to different threads. Software-only schedulers can implement various scheduling policies and algorithms that match the characteristics of different applications and programming models. Unfortunately, on large-scale multi-core systems software schedulers suffer significant overheads as they synchronize and communicate task information over deep cache hierarchies. To reduce these overheads, hardware-only schedulers like Carbon have been proposed to enable task queuing and scheduling to be done in hardware. This paper presents a hardware scheduling approach in which the structure provided to programs by task-based programming models is incorporated into the scheduler, making it aware of a task’s data requirements. This prior knowledge of a task’s data requirements allows the scheduler to place tasks better, which reduces overall cache misses and memory traffic, improving the program’s performance and power utilization. Simulations of this technique for a range of synthetic benchmarks and components of real applications show a reduction in the number of cache misses of up to 72 % for the L1 cache and 95 % for the L2 cache, and up to 30 % improvement in overall execution time compared with FIFO scheduling. This results not only in faster execution but also in up to 50 % less data transfer, reducing load on the interconnect and lowering power consumption.
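The core idea of the abstract can be illustrated with a small software sketch (this is purely illustrative and not the paper's hardware design): a locality-aware scheduler places each task on the worker whose recently touched data overlaps most with the task's declared data requirements, whereas a FIFO scheduler places tasks without regard to where their data lives. The `Worker` class, the `reads` field, and both placement functions are hypothetical names invented for this example.

```python
# Illustrative sketch of data-aware task placement vs. plain FIFO placement.
# All names here are hypothetical; the paper implements this idea in hardware.
from collections import deque


class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.cache = set()    # addresses this worker has touched recently
        self.queue = deque()  # tasks waiting to run on this worker


def fifo_place(workers, task, rr=[0]):
    # Round-robin FIFO: ignores where the task's data lives.
    w = workers[rr[0] % len(workers)]
    rr[0] += 1
    w.queue.append(task)
    return w


def locality_place(workers, task):
    # Pick the worker with the largest overlap between its cached
    # addresses and the task's declared read set, so the task is
    # likely to hit in that worker's cache.
    w = max(workers, key=lambda w: len(w.cache & task["reads"]))
    w.queue.append(task)
    w.cache |= task["reads"]  # the task's data is now warm on this worker
    return w


workers = [Worker(i) for i in range(4)]
workers[2].cache = {"a", "b", "c"}     # worker 2 already holds this data
task = {"reads": {"a", "b"}}           # task declares its data requirements
chosen = locality_place(workers, task)
```

Under FIFO placement the task would land on whichever worker is next in rotation; with the locality-aware policy it lands on worker 2, whose cache already holds the task's inputs, which is the kind of placement the paper shows reduces cache misses and interconnect traffic.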


Keywords: Scheduling · Hardware scheduling · Task-based applications · Dataflow


  1. Dagum L, Menon R (1998) OpenMP: an industry-standard API for shared-memory programming. In: IEEE Computational Science & Engineering, vol 5. IEEE Computer Society Press, Los Alamitos. doi:10.1109/99.660313
  2. Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH, Zhou Y (1995) Cilk: an efficient multithreaded runtime system. In: Proceedings of the 5th ACM SIGPLAN symposium on principles and practice of parallel programming, PPOPP ’95. ACM, New York. doi:10.1145/209936.209958
  3. Reinders J (2007) Intel threading building blocks, 1st edn. O’Reilly & Associates Inc, Sebastopol
  4. Nickolls J, Buck I, Garland M, Skadron K (2008) Scalable parallel programming with CUDA. In: ACM Queue, vol 6. ACM, New York. doi:10.1145/1365490.1365500
  5. Stone JE, Gohara D, Shi G (2010) OpenCL: a parallel programming standard for heterogeneous computing systems. In: Computing in Science & Engineering, vol 12. IEEE Computer Society Press, Los Alamitos. doi:10.1109/MCSE.2010.69
  6. Thies W, Karczmarek M, Amarasinghe SP (2002) StreamIt: a language for streaming applications. In: Proceedings of the 11th international conference on compiler construction, CC ’02. Springer-Verlag, London
  7. Jenista JC, Eom YH, Demsky B (2010) OoOJava: an out-of-order approach to parallel programming. In: Proceedings of the 2nd USENIX conference on hot topics in parallelism, HotPar’10. USENIX Association, Berkeley
  8. Perez JM, Badia RM, Labarta J (2008) A dependency-aware task-based programming environment for multi-core architectures. In: Proceedings of the 2008 IEEE international conference on cluster computing
  9. Watson I et al (2010) The TERAFLUX project. Accessed 1 Jan 2015
  10. Gurd JR, Kirkham CC, Watson I (1985) The Manchester prototype dataflow computer. In: Communications of the ACM, vol 28. ACM, New York. doi:10.1145/2465.2468
  11. Papadopoulos GM, Culler DE (1990) Monsoon: an explicit token-store architecture. In: Proceedings of the 17th annual international symposium on computer architecture, ISCA ’90. ACM, New York. doi:10.1145/325164.325117
  12. Cann D (1992) Retire Fortran?: a debate rekindled. In: Communications of the ACM, vol 35. ACM, New York. doi:10.1145/135226.135231
  13. Watson I, Woods V, Watson P, Banach R, Greenberg M, Sargeant J (1988) Flagship: a parallel architecture for declarative programming. In: Proceedings of the 15th annual international symposium on computer architecture, ISCA ’88. IEEE Computer Society Press, Los Alamitos
  14. Darlington J, Reeve M (1981) ALICE: a multi-processor reduction machine for the parallel evaluation of applicative languages. In: Proceedings of the 1981 conference on functional programming languages and computer architecture, FPCA ’81. ACM, New York. doi:10.1145/800223.806764
  15. Peyton Jones SL, Clack C, Salkild J, Hardie M (1987) GRIP: a high-performance architecture for parallel graph reduction. In: Proceedings of a conference on functional programming languages and computer architecture. Springer-Verlag, London
  16. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. In: Communications of the ACM, vol 51. ACM, New York. doi:10.1145/1327452.1327492
  17. Peng D, Dabek F (2010) Large-scale incremental processing using distributed transactions and notifications. In: Proceedings of the 9th USENIX conference on operating systems design and implementation, OSDI’10. USENIX Association, Berkeley
  18. Goodman D, Khan S, Seaton C, Guskov Y, Khan B, Lujan M, Watson I (2012) DFScala: high level dataflow support for Scala. In: Proceedings of the data-flow execution models for extreme scale computing
  19. Odersky M, Spoon L, Venners B (2008) Programming in Scala: a comprehensive step-by-step guide, 1st edn. Artima Incorporation, USA
  20. Roberts ES, Vandevoorde MT (1989) WorkCrews: an abstraction for controlling parallelism, vol 42
  21. Mohr E, Kranz DA, Halstead Jr RH (1990) Lazy task creation: a technique for increasing the granularity of parallel programs. In: Proceedings of the 1990 ACM conference on LISP and functional programming, LFP ’90. ACM, New York. doi:10.1145/91556.91631
  22. Hendler D, Shavit N (2002) Non-blocking steal-half work queues. In: Proceedings of the 21st annual symposium on principles of distributed computing, PODC ’02. ACM, New York. doi:10.1145/571825.571876
  23. Chase D, Lev Y (2005) Dynamic circular work-stealing deque. In: Proceedings of the 17th annual ACM symposium on parallelism in algorithms and architectures, SPAA ’05. ACM, New York. doi:10.1145/1073970.1073974
  24. Acar UA, Blelloch GE, Blumofe RD (2000) The data locality of work stealing. In: Proceedings of the 12th annual ACM symposium on parallel algorithms and architectures, SPAA ’00. ACM, New York. doi:10.1145/341800.341801
  25. Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. In: Communications of the ACM, vol 13. ACM, New York. doi:10.1145/362686.362692
  26. Kumar S, Hughes CJ, Nguyen A (2007) Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In: Proceedings of the 34th annual international symposium on computer architecture, ISCA ’07. ACM, New York. doi:10.1145/1250662.1250683
  27. Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. In: SIGARCH computer architecture news, vol 39. ACM, New York. doi:10.1145/2024716.2024718
  28. Horn B, Schunck B (1981) Determining optical flow. In: Artificial Intelligence, vol 17. Elsevier, London
  29. Project Gutenberg (1971)
  30. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell
  31. Lea D (2000) A Java fork/join framework. In: Proceedings of the ACM 2000 conference on Java Grande
  32. Halstead Jr RH (1984) Implementation of Multilisp: Lisp on a multiprocessor. In: Proceedings of the 1984 ACM symposium on LISP and functional programming, LFP ’84. ACM, New York. doi:10.1145/800055.802017
  33. Kwok YK, Ahmad I (1999) Static scheduling algorithms for allocating directed task graphs to multiprocessors. In: ACM Computing Surveys, vol 31. ACM, New York. doi:10.1145/344588.344618
  34. Su E, Tian X, Girkar M, Haab G, Shah S, Petersen P (2002) Compiler support of the workqueuing execution model for Intel SMP architectures. In: 4th European workshop on OpenMP
  35. Arora NS, Blumofe RD, Plaxton CG (1998) Thread scheduling for multiprogrammed multiprocessors. In: Proceedings of the 10th annual ACM symposium on parallel algorithms and architectures, SPAA ’98. ACM, New York. doi:10.1145/277651.277678
  36. Sanchez D, Yoo RM, Kozyrakis C (2010) Flexible architectural support for fine-grain scheduling. In: Proceedings of the 15th edition of ASPLOS on architectural support for programming languages and operating systems, ASPLOS XV. ACM, New York. doi:10.1145/1736020.1736055
  37. Dally W, Towles B (2003) Principles and practices of interconnection networks. Morgan Kaufmann Publishers Inc., San Francisco
  38. Yoo RM, Hughes CJ, Kim C, Chen YK, Kozyrakis C (2013) Locality-aware task management for unstructured parallelism: a quantitative limit study. In: Proceedings of the 25th annual ACM symposium on parallelism in algorithms and architectures, SPAA ’13. ACM, New York. doi:10.1145/2486159.2486175
  39. Chen S, Gibbons PB, Kozuch M, Liaskovitis V, Ailamaki A, Blelloch GE, Falsafi B, Fix L, Hardavellas N, Mowry TC, Wilkerson C (2007) Scheduling threads for constructive cache sharing on CMPs. In: Proceedings of the 19th annual ACM symposium on parallel algorithms and architectures, SPAA ’07. ACM, New York. doi:10.1145/1248377.1248396
  40. Blelloch GE, Gibbons PB (2004) Effectively sharing a cache among threads. In: Proceedings of the 16th annual ACM symposium on parallelism in algorithms and architectures, SPAA ’04. ACM, New York. doi:10.1145/1007912.1007948
  41. Blelloch GE, Gibbons PB, Matias Y (1999) Provably efficient scheduling for languages with fine-grained parallelism. In: Journal of the ACM, vol 46. ACM, New York. doi:10.1145/301970.301974

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Behram Khan (1)
  • Daniel Goodman (3)
  • Salman Khan (2)
  • Will Toms (3)
  • Paolo Faraboschi (4)
  • Mikel Luján (3)
  • Ian Watson (3)

  1. BT Research, Ipswich, UK
  2. Solarflare Communications, Irvine, USA
  3. The University of Manchester, Manchester, UK
  4. HP Labs, Palo Alto, USA
