Cache-conscious off-line real-time scheduling for multi-core platforms: algorithms and implementation

Abstract

Most schedulability analysis techniques for multi-core architectures assume a single worst-case execution time (WCET) per task, valid under all execution conditions. This assumption is too pessimistic for parallel applications running on multi-core architectures with local instruction or data caches, where the WCET of a task depends on the cache contents at the start of its execution, which in turn depend on the tasks executed immediately before it. In this paper, we propose two scheduling techniques for multi-core architectures equipped with local instruction and data caches. Both techniques schedule a parallel application modeled as a task graph and generate a static, partitioned, non-preemptive schedule that benefits from cache reuse between pairs of consecutive tasks. We propose an exact method, based on an integer linear programming formulation, and a heuristic method based on list scheduling. We demonstrate the efficiency of the techniques by implementing the resulting cache-conscious schedules on real multi-core hardware: a 16-core cluster of the Kalray MPPA-256, Andey generation. We point out issues that arise when implementing the schedules on this particular platform and propose strategies to adapt the schedules to the identified implementation factors. An experimental evaluation reveals that our proposed scheduling methods significantly reduce schedule length compared to cache-agnostic scheduling methods. Furthermore, our experiments show that, among the identified implementation factors, shared bus contention has the greatest impact.
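
To make the central idea concrete, the following minimal Python sketch compares the length of a task sequence on one core when each task has a single WCET against the length when the WCET depends on the task executed immediately before it. The task names, the cycle counts, and the helpers `wcet` and `core_length` are hypothetical and serve only as an illustration; the sketch ignores precedence constraints, idle time, and bus contention, and is not the ILP or list-scheduling method of the paper.

```python
# Hypothetical WCETs (in cycles) when a task starts with no useful cache contents.
WCET_COLD = {"A": 100, "B": 80, "C": 60}

# Hypothetical reduced WCETs when `task` runs directly after `prev` on the
# same core and can reuse what `prev` left in the local caches.
WCET_AFTER = {("A", "B"): 50, ("B", "C"): 45, ("A", "C"): 55}


def wcet(task, prev=None):
    """Context-sensitive WCET: fall back to the cold value if no reuse is known."""
    return WCET_AFTER.get((prev, task), WCET_COLD[task])


def core_length(order, cache_conscious=True):
    """Length of a back-to-back, non-preemptive sequence of tasks on one core."""
    total, prev = 0, None
    for task in order:
        total += wcet(task, prev if cache_conscious else None)
        prev = task
    return total


if __name__ == "__main__":
    for order in (["A", "B", "C"], ["A", "C", "B"]):
        print(order,
              "cache-agnostic:", core_length(order, cache_conscious=False),
              "cache-conscious:", core_length(order))
```

With these made-up numbers, the order of tasks on a core changes the schedule length only in the cache-conscious model, which is why the choice of consecutive task pairs becomes part of the scheduling problem.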

Notes

  1.

    Note that although designed for multi-core platforms, our proposed techniques can also be used on a single core to account for reuse among tasks executing on the same core.

  2.

    For the experiments, M is the sum of all tasks’ WCETs when not reusing cache contents, to ensure that M is greater than the finish time of any task.

  3.

    For the experiments, M is the sum of the worst-case execution time of all tasks adjusted for worst-case shared bus contention. Specifically, the total number of memory requests that interfere with the execution of \(\tau _j\) is equal to \(MR_{\tau _j} \times (|c| - 1)\), where \(|c|\) is the number of cores to which tasks are assigned. This ensures M will be greater than the finish time of any task.

  4.

    The rank of a task is defined as the longest path in terms of the number of nodes to reach that task from the entry task.

  5.

    A task \(\tau _j\) can execute after another task \(\tau _i\) if the sequence \(\langle \tau _i,\tau _j \rangle\) can appear in a schedule that respects the precedence constraints between tasks. This means that \(\tau _i\) is neither a direct nor an indirect successor of \(\tau _j\) in the task graph, i.e. \(\tau _i \in nSucc(\tau _j)\) according to the notation introduced in Table 1.

  6.

    Note that executing T6 directly after T1 on the same core is not a violation of precedence constraints between tasks, provided that tasks T2 to T5 are assigned to another core, and there is an idle time between the end of T1 and the start of T6.

  7.

    With those benchmarks we do not have to modify the code of tasks to have a communication buffer per pair of communicating tasks. Modifying the code of the other benchmarks to obtain the same property would be very costly.

  8.

    Note that the symbol \(WCET_{\tau _j}\) has a different meaning from the one used in Sect. 3. Here, \(WCET_{\tau _j}\) is predetermined according to the known execution order of \(\tau _j\).

  9.

    Only 15 out of the 16 cores of the cluster are used in our implementation on the Kalray MPPA, because one core is dedicated to the spawning of the tasks on the cluster cores.
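
The quantities described in Notes 2 to 5 can be computed directly from a task graph. The following Python sketch does so under stated assumptions: the diamond-shaped graph, the WCET and memory-request values, the per-request bus delay, and the function names (`rank`, `descendants`, `can_follow`, `big_m`, `big_m_contention`) are all hypothetical and do not come from the paper. In particular, the rank is assumed to count the task itself (so the entry task has rank 1), and the contention-adjusted WCET is assumed to add a fixed delay per interfering request.

```python
from functools import lru_cache

# Hypothetical diamond-shaped task graph: T1 -> {T2, T3} -> T4.
SUCC = {"T1": ["T2", "T3"], "T2": ["T4"], "T3": ["T4"], "T4": []}
PRED = {t: [p for p in SUCC if t in SUCC[p]] for t in SUCC}

# Made-up WCETs (cycles, no cache reuse) and memory request counts per task.
WCET = {"T1": 10, "T2": 20, "T3": 15, "T4": 5}
MEM_REQ = {"T1": 2, "T2": 4, "T3": 3, "T4": 1}


@lru_cache(maxsize=None)
def rank(task):
    """Note 4: nodes on the longest path from the entry task to `task`
    (assumption: the count includes `task`, so the entry task has rank 1)."""
    return 1 + max((rank(p) for p in PRED[task]), default=0)


@lru_cache(maxsize=None)
def descendants(task):
    """All direct and indirect successors of `task` in the task graph."""
    found = set()
    for s in SUCC[task]:
        found.add(s)
        found |= descendants(s)
    return frozenset(found)


def can_follow(tau_j, tau_i):
    """Note 5: tau_j may execute after tau_i unless tau_i is a direct or
    indirect successor of tau_j."""
    return tau_i != tau_j and tau_i not in descendants(tau_j)


def big_m():
    """Note 2: sum of all WCETs without cache reuse."""
    return sum(WCET.values())


def big_m_contention(n_cores, delay_per_request):
    """Note 3: sum of WCETs adjusted for worst-case bus contention.
    Each task suffers MEM_REQ[task] * (n_cores - 1) interfering requests;
    charging a fixed delay per interfering request is an assumption."""
    return sum(WCET[t] + MEM_REQ[t] * (n_cores - 1) * delay_per_request
               for t in WCET)


if __name__ == "__main__":
    print({t: rank(t) for t in SUCC})
    print(can_follow("T3", "T2"), can_follow("T1", "T4"))
    print(big_m(), big_m_contention(n_cores=2, delay_per_request=1))
```

In this example T2 and T3 are unordered, so either may follow the other on the same core, whereas T1 can never execute after T4.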

Acknowledgements

The authors would like to thank Byron Hawkins and the anonymous reviewers for their useful comments on this paper. This work was partially funded by the European Union's Horizon 2020 research and innovation program under Grant Agreement No. 688131, Project Argo (http://www.argo-project.eu/), and by PIA project CAPACITES (Calcul Parallèle pour Applications Critiques en Temps et Sûreté), Reference P3425-146781.

Author information

Corresponding author

Correspondence to Viet Anh Nguyen.

Cite this article

Nguyen, V.A., Hardy, D. & Puaut, I. Cache-conscious off-line real-time scheduling for multi-core platforms: algorithms and implementation. Real-Time Syst 55, 810–849 (2019). https://doi.org/10.1007/s11241-019-09333-z

Keywords

  • Real-time scheduling
  • Cache-conscious schedules
  • Schedule implementation
  • Multi-core architectures
  • ILP
  • Static list scheduling