Abstract
Most schedulability analysis techniques for multi-core architectures assume a single worst-case execution time (WCET) per task, taken to be valid under all execution conditions. This assumption is too pessimistic for parallel applications running on multi-core architectures with local instruction or data caches, for which the WCET of a task depends on the cache contents at the beginning of its execution, which in turn depend on the tasks that were executed immediately before the task under study. In this paper, we propose two scheduling techniques for multi-core architectures equipped with local instruction and data caches. The two techniques schedule a parallel application modeled as a task graph, and generate a static, partitioned, non-preemptive schedule that benefits from cache reuse between pairs of consecutive tasks. We propose an exact method, using an integer linear programming formulation, as well as a heuristic method based on list scheduling. The efficiency of the techniques is demonstrated through an implementation of these cache-conscious schedules on real multi-core hardware: a 16-core cluster of the Kalray MPPA-256, Andey generation. We point out implementation issues that arise when implementing the schedules on this particular platform. In addition, we propose strategies to adapt the schedules to the identified implementation factors. An experimental evaluation reveals that our proposed scheduling methods significantly reduce the length of schedules compared with cache-agnostic scheduling methods. Furthermore, our experiments show that among the identified implementation factors, shared bus contention has the greatest impact.
Notes
Note that although designed for multi-core platforms, our proposed techniques can also be used on a single core to account for reuse among tasks executing on the same core.
For the experiments, M is the sum of all tasks’ WCETs when not reusing cache contents, to ensure that M is greater than the finish time of any task.
For the experiments, M is the sum of the worst-case execution times of all tasks, adjusted for worst-case shared bus contention. Specifically, the total number of memory requests that interfere with the execution of \(\tau _j\) is equal to \(MR_{\tau _j} \times (|c| - 1)\), where |c| is the number of cores to which tasks are assigned. This ensures that M is greater than the finish time of any task.
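The two definitions of M given in the notes above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the example values, and in particular the `bus_delay` parameter (a fixed cost charged per interfering memory request) are hypothetical and are not taken from the paper, which does not state the per-request delay here.

```python
def big_m_no_contention(wcet):
    """M as the sum of all tasks' WCETs when no cache content is reused."""
    return sum(wcet.values())


def big_m_with_contention(wcet, mem_requests, n_cores, bus_delay):
    """M as the sum of WCETs adjusted for worst-case shared bus contention.

    For each task tau_j, the number of interfering memory requests is
    MR_{tau_j} * (|c| - 1). Charging a fixed `bus_delay` per interfering
    request is an assumption made for this sketch.
    """
    total = 0
    for task, cost in wcet.items():
        interfering = mem_requests[task] * (n_cores - 1)
        total += cost + interfering * bus_delay
    return total


# Hypothetical example values (cycles and request counts).
wcet = {"T1": 100, "T2": 250, "T3": 80}
mem_requests = {"T1": 10, "T2": 20, "T3": 5}

m_plain = big_m_no_contention(wcet)
m_bus = big_m_with_contention(wcet, mem_requests, n_cores=4, bus_delay=3)
```

Either value is an upper bound that is safe to use as a big-M constant only because it exceeds the finish time of every task in any schedule considered.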
The rank of a task is defined as the longest path in terms of the number of nodes to reach that task from the entry task.
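As a rough illustration of the rank definition, the sketch below computes the longest path to each task by processing the graph in topological order. The graph encoding (a dictionary mapping every task to its list of direct successors) and the convention that the entry task has rank 0 are assumptions of this sketch, not details confirmed by the paper.

```python
from collections import deque


def task_ranks(successors):
    """Longest-path rank of each task in a task graph (a DAG).

    `successors` maps every task to the list of its direct successors.
    Tasks without predecessors get rank 0 (an assumed convention).
    """
    indegree = {task: 0 for task in successors}
    for succs in successors.values():
        for s in succs:
            indegree[s] += 1

    rank = {task: 0 for task in successors}
    ready = deque(task for task, deg in indegree.items() if deg == 0)
    while ready:
        task = ready.popleft()
        for s in successors[task]:
            # Keep the longest path seen so far to reach s.
            rank[s] = max(rank[s], rank[task] + 1)
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return rank


# Hypothetical graph: T4 is reachable directly from T1 and via T2 -> T3.
graph = {"T1": ["T2", "T4"], "T2": ["T3"], "T3": ["T4"], "T4": []}
ranks = task_ranks(graph)
```

In this example T4 receives rank 3 rather than 1, because the rank follows the longest path T1, T2, T3, T4.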
A task \(\tau _j\) can possibly execute after another task \(\tau _i\) if the sequence \(\langle \tau _i,\tau _j\rangle \) may exist in a valid schedule with respect to the precedence constraints between tasks. This means that \(\tau _i\) is neither a direct nor an indirect successor of \(\tau _j\) in the task graph, i.e. \(\tau _i \in nSucc(\tau _j)\) according to the notation introduced in Table 1.
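The condition in this note can be checked with a simple graph traversal. The sketch below is a minimal illustration; the function names and the example graph are hypothetical, and the graph is encoded as a dictionary mapping every task to its direct successors.

```python
def all_successors(successors, task):
    """Set of direct and indirect successors of `task` in the task graph."""
    seen = set()
    stack = list(successors[task])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(successors[current])
    return seen


def can_execute_after(successors, tau_i, tau_j):
    """True if tau_j may run after tau_i, i.e. the sequence <tau_i, tau_j>
    does not violate precedence: tau_i must not be a direct or indirect
    successor of tau_j."""
    return tau_i != tau_j and tau_i not in all_successors(successors, tau_j)


# Hypothetical fork-join graph: T1 forks into T2..T5, which join in T6.
graph = {
    "T1": ["T2", "T3", "T4", "T5"],
    "T2": ["T6"], "T3": ["T6"], "T4": ["T6"], "T5": ["T6"],
    "T6": [],
}
t6_after_t1 = can_execute_after(graph, "T1", "T6")
t1_after_t6 = can_execute_after(graph, "T6", "T1")
t3_after_t2 = can_execute_after(graph, "T2", "T3")
```

Note that the check only concerns whether the pair may appear consecutively on a core; whether all intermediate predecessors have finished by then is a separate timing constraint of the schedule.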
Note that executing T6 directly after T1 on the same core is not a violation of precedence constraints between tasks, provided that tasks T2 to T5 are assigned to another core, and there is an idle time between the end of T1 and the start of T6.
With those benchmarks, we do not have to modify the code of tasks to have a communication buffer per pair of communicating tasks. Modifying the code of the other benchmarks to obtain the same property would be very costly.
Note that the symbol \(WCET_{\tau _j}\) has a different meaning from the one used in Sect. 3. Here, \(WCET_{\tau _j}\) is predetermined according to the known execution order of \(\tau _j\).
Only 15 out of the 16 cores of the cluster are used in our implementation on the Kalray MPPA, because one core is dedicated to the spawning of the tasks on the cluster cores.
References
Abdallah L, Jan M, Ermont J, Fraboul C (2016) Reducing the contention experienced by real-time core-to-I/O flows over a Tilera-like network on chip. In: 28th Euromicro conference on real-time systems, ECRTS 2016, Toulouse, France, July 5–8, pp 86–96
Altmeyer S, Davis RI, Indrusiak L, Maiza C, Nelis V, Reineke J (2015) A generic and compositional framework for multicore response time analysis. In: International conference on real time and networks systems, RTNS ’15, pp 129–138
Arnaud A, Puaut I (2006) Dynamic instruction cache locking in hard real-time systems. In: International conference on real-time networks and systems (RTNS), pp 1–10
Bahn JH, Yang J, Bagherzadeh N (2008) Parallel FFT algorithms on network-on-chips. In: Fifth international conference on information technology: new generations (ITNG 2008), pp 1087–1093
Becker M, Dasari D, Nikolic B, Akesson B, Nélis V, Nolte T (2016) Contention-free execution of automotive applications on a clustered many-core platform. In: 28th Euromicro conference on real-time systems, ECRTS, pp 14–24
Calandrino JM, Anderson JH (2009) On the design and implementation of a cache-aware multicore real-time scheduler. In: 21st Euromicro conference on real-time systems, pp 194–204
Carle T, Djemal M, Potop-Butucaru D, de Simone R, Zhang Z (2014) Static mapping of real-time applications onto massively parallel processor arrays. In: Proceedings of the 2014 14th international conference on application of concurrency to system design, ACSD ’14, pp 112–121
Chattopadhyay S, Roychoudhury A, Mitra T (2010) Modeling shared cache and bus in multi-cores for timing analysis. In: Proceedings of the 13th international workshop on software & compilers for embedded systems, SCOPES ’10, pp 6:1–6:10
Dasari D, Nélis V (2012) An analysis of the impact of bus contention on the WCET in multicores. In: Min G, Hu J, Liu LC, Yang LT, Seelam S, Lefèvre L (eds) 14th IEEE international conference on high performance computing and communication & 9th IEEE international conference on embedded software and systems, HPCC-ICESS 2012, Liverpool, UK, June 25–27, 2012. IEEE Computer Society, pp 1450–1457. https://doi.org/10.1109/HPCC.2012.212
Dasari D, Andersson B, Nélis V, Petters SM, Easwaran A, Lee J (2011) Response time analysis of COTS-based multicores considering the contention on the shared memory bus. In: IEEE 10th international conference on trust, security and privacy in computing and communications, TrustCom 2011, Changsha, China, 16–18 November, 2011. IEEE Computer Society, pp 1068–1075. https://doi.org/10.1109/TrustCom.2011.146
Davis RI, Burns A (2011) A survey of hard real-time scheduling for multiprocessor systems. ACM Comput Surv 43(4):35:1–35:44
Ding H, Liang Y, Mitra T (2013) Shared cache aware task mapping for WCRT minimization. In: 8th Asia and south Pacific design automation conference, ASP-DAC, pp 735–740
Dupont de Dinechin B, van Amstel D, Poulhiès M, Lager G (2014) Time-critical computing on a single-chip massively parallel processor. In: Proceedings of the conference on design, automation & test in Europe, DATE ’14, pp 97:1–97:6
Fernandez G, Abella J, Quiñones E, Rochange C, Vardanega T, Cazorla FJ (2014) Contention in multicore hardware shared resources: understanding of the state of the art. In: 14th international workshop on worst-case execution time analysis, OpenAccess series in informatics (OASIcs), pp 31–42
Geer D (2005) Industry trends: chip makers turn to multicore processors. Computer 38:11–13
Guan N, Stigge M, Yi W, Yu G (2009) Cache-aware scheduling and analysis for multicores. In: Proceedings of the seventh ACM international conference on embedded software, EMSOFT ’09, pp 245–254
Gurobi Optimization, Inc. (2015) Gurobi optimizer reference manual. Gurobi Optimization, Inc., Oregon
Hardy D, Piquet T, Puaut I (2009) Using bypass to tighten WCET estimates for multi-core processors with shared instruction caches. In: Proceedings of the 30th IEEE real-time systems symposium, RTSS, pp 68–77
Kasahara H, Narita S (1984) Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE Trans Comput 33(11):1023–1029
Kelter T, Falk H, Marwedel P, Chattopadhyay S, Roychoudhury A (2014) Static analysis of multi-core TDMA resource arbitration delays. Real-Time Syst 50(2):185–229
Kim H, de Niz D, Andersson B, Klein MH, Mutlu O, Rajkumar R (2014) Bounding memory interference delay in COTS-based multi-core systems. In: 20th IEEE real-time and embedded technology and applications symposium, RTAS 2014, Berlin, Germany, April 15–17, 2014. IEEE Computer Society, pp 145–154. https://doi.org/10.1109/RTAS.2014.6925998
Kim H, de Niz D, Andersson B, Klein MH, Mutlu O, Rajkumar R (2016) Bounding and reducing memory interference in COTS-based multi-core systems. Real-Time Syst 52(3):356–395. https://doi.org/10.1007/s11241-016-9248-1
Kwok YK, Ahmad I (1999a) Benchmarking and comparison of the task graph scheduling algorithms. J Parallel Distrib Comput 59:381–422
Kwok YK, Ahmad I (1999b) Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Comput Surv 31(4):406–471
Li YTS, Malik S (1995) Performance analysis of embedded software using implicit path enumeration. In: Proceedings of the 32nd annual ACM/IEEE design automation conference, pp 456–461
Liang Y, Ding H, Mitra T, Roychoudhury A, Li Y, Suhendra V (2012) Timing analysis of concurrent programs running on shared cache multi-cores. Real-Time Syst 48(6):638–680
Maaita A, Pont MJ (2005) Using “planned pre-emption” to reduce levels of task jitter in a time-triggered hybrid scheduler. In: Proceedings of the second UK embedded forum (Birmingham, UK), pp 18–35
Martinez S, Hardy D, Puaut I (2017) Quantifying WCET reduction of parallel applications by introducing slack time to limit resource contention. In: Proceedings of the 25th international conference on real-time networks and systems, RTNS 2017, Grenoble, France, October 04–06, 2017, pp 188–197
Nélis V, Yomsi PM, Pinho LM, Fonseca JC, Bertogna M, Quiñones E, Vargas R, Marongiu A (2014) The challenge of time-predictability in modern many-core architectures. In: 14th international workshop on worst-case execution time analysis, OpenAccess series in informatics (OASIcs), vol 39, pp 63–72
Nélis V, Yomsi PM, Pinho LM (2016) The variability of application execution times on a multi-core platform. In: 16th international workshop on worst-case execution time analysis (WCET 2016), OpenAccess series in informatics (OASIcs), pp 1–11
Nemer F, Cassé H, Sainrat P, Awada A (2007) Improving the worst-case execution time accuracy by inter-task instruction cache analysis. In: IEEE second international symposium on industrial embedded systems, SIES, pp 25–32
Nemhauser GL, Wolsey LA (1999) Integer and combinatorial optimization. Wiley interscience series in discrete mathematics and optimization. Wiley, New York
Nguyen VA, Hardy D, Puaut I (2017) Cache-conscious offline real-time task scheduling for multi-core processors. In: 29th Euromicro conference on real-time systems (ECRTS 2017), pp 14:1–14:22
Pellizzoni R, Betti E, Bak S, Yao G, Criswell J, Caccamo M, Kegley R (2011) A predictable execution model for cots-based embedded systems. In: Proceedings of the 2011 17th IEEE real-time and embedded technology and applications symposium, RTAS ’11, pp 269–279
Perret Q, Maurère P, Noulard E, Pagetti C, Sainrat P, Triquet B (2016a) Mapping hard real-time applications on many-core processors. In: Proceedings of the 24th international conference on real-time networks and systems, RTNS ’16. ACM, pp 235–244
Perret Q, Maurère P, Noulard E, Pagetti C, Sainrat P, Triquet B (2016b) Temporal isolation of hard real-time applications on many-core processors. In: 2016 IEEE real-time and embedded technology and applications symposium (RTAS), pp 37–47
Phatrapornnant T, Pont MJ (2006) Reducing jitter in embedded systems employing a time-triggered software architecture and dynamic voltage scaling. IEEE Trans Comput 55(2):113–124. https://doi.org/10.1109/TC.2006.29
Phavorin G, Richard P, Goossens J, Chapeaux T, Maiza C (2015) Scheduling with preemption delays: anomalies and issues. In: Proceedings of the 23rd international conference on real time and networks systems, RTNS ’15, pp 109–118
Potop-Butucaru D, Puaut I (2013) Integrated worst-case execution time estimation of multicore applications. In: 13th international workshop on worst-case execution time analysis, vol 30, pp 21–31
Puaut I, Decotigny D (2002) Low-complexity algorithms for static cache locking in multitasking hard real-time systems. In: Proceedings of the 23rd IEEE real-time systems symposium, pp 114–123
Puffitsch W, Noulard E, Pagetti C (2015) Off-line mapping of multi-rate dependent task sets to many-core platforms. Real-Time Syst 51(5):526–565
Rihani H, Moy M, Maiza C, Davis RI, Altmeyer S (2016) Response time analysis of synchronous data flow programs on a many-core processor. In: Proceedings of the 24th international conference on real-time networks and systems, RTNS ’16, pp 67–76
Rouxel B, Derrien S, Puaut I (2017) Tightening contention delays while scheduling parallel applications on multi-core architectures. ACM Trans Embed Comput Syst 16:164:1–164:20
Sodani A, Gramunt R, Corbal J, Kim HS, Vinod K, Chinthamani S, Hutsell S, Agarwal R, Liu YC (2016) Knights landing: second-generation Intel Xeon Phi product. IEEE Micro 36:34–46
Suhendra V, Raghavan C, Mitra T (2006) Integrated scratchpad memory optimization and task scheduling for MPSoC architectures. In: International conference on compilers, architecture and synthesis for embedded systems, CASES ’06, pp 401–410
Tendulkar P, Poplavko P, Galanommatis I, Maler O (2014) Many-core scheduling of data parallel applications using SMT solvers. In: 17th Euromicro conference on digital system design, DSD, pp 615–622
Tessler C, Fisher N (2016) BUNDLE: real-time multi-threaded scheduling to reduce cache contention. In: IEEE real-time systems symposium, RTSS, pp 279–290
Thies W, Amarasinghe S (2010) An empirical characterization of stream programs and its implications for language and compiler design. In: Proceedings of the 19th international conference on parallel architectures and compilation techniques, PACT ’10, pp 365–376
Venkatachalam V, Franz M (2005) Power reduction techniques for microprocessor systems. ACM Comput Surv 37(3):195–237
Ward BC, Thekkilakattil A, Anderson JH (2014) Optimizing preemption-overhead accounting in multiprocessor real-time systems. In: Proceedings of the 22nd international conference on real-time networks and systems, RTNS ’14, pp 235–243
Wentzlaff D, Griffin P, Hoffmann H, Bao L, Edwards B, Ramey C, Mattina M, Miao CC, Brown JF III, Agarwal A (2007) On-chip interconnection architecture of the tile processor. IEEE Micro 27:15–31
Wilhelm R, Engblom J, Ermedahl A, Holsti N, Thesing S, Whalley D, Bernat G, Ferdinand C, Heckmann R, Mitra T, Mueller F, Puaut I, Puschner P, Staschulat J, Stenström P (2008) The worst-case execution-time problem: overview of methods and survey of tools. ACM Trans Embed Comput Syst 7(3):36:1–36:53
Wilhelm R, Grund D, Reineke J, Schlickling M, Pister M, Ferdinand C (2009) Memory hierarchies, pipelines, and buses for future architectures in time-critical embedded systems. IEEE Trans CAD Integr Circ Syst 28(7):966–978
Yao G, Pellizzoni R, Bak S, Betti E, Caccamo M (2012) Memory-centric scheduling for multicore hard real-time systems. Real-Time Syst 48(6):681–715
Acknowledgements
The authors would like to thank Byron Hawkins and the anonymous reviewers for their useful comments on this paper. This work was partially funded by the European Union's Horizon 2020 research and innovation program under Grant Agreement No. 688131, Project Argo (http://www.argo-project.eu/), and by PIA project CAPACITES (Calcul Parallèle pour Applications Critiques en Temps et Sûreté), Reference P3425-146781.
Cite this article
Nguyen, V.A., Hardy, D. & Puaut, I. Cache-conscious off-line real-time scheduling for multi-core platforms: algorithms and implementation. Real-Time Syst 55, 810–849 (2019). https://doi.org/10.1007/s11241-019-09333-z
DOI: https://doi.org/10.1007/s11241-019-09333-z