Graphics processing units (GPUs) are powerful processors that can offer significant performance advantages over traditional CPUs. The last decade has seen rapid advancement in GPU computational power and generality, and recent technologies make it possible to use GPUs as co-processors to CPUs. These advantages can be great: GPUs often outperform traditional CPUs by orders of magnitude. While the motivations for developing systems with GPUs are clear, little research in the real-time systems field has addressed the integration of GPUs into real-time multiprocessor systems. We present two real-time analysis methods, each addressing real-world platform constraints, for such an integration into a soft real-time multiprocessor system, and show that a GPU can be exploited to achieve greater levels of total system performance.
Notable platforms include the Compute Unified Device Architecture (CUDA) from NVIDIA (CUDA Zone, URL http://www.nvidia.com/object/cuda_home_new.html), Stream from AMD/ATI (ATI Stream Technology, URL http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx), OpenCL from Apple and the Khronos Group (OpenCL, URL http://www.khronos.org/opencl/), and DirectCompute from Microsoft (Microsoft DirectX, URL http://www.gamesforwindows.com/en-US/directx/).
China’s new nebulae supercomputer is no. 2, URL http://www.top500.org/lists/2010/06/press-release.
Parallel computing with SciFinance, URL http://www.scicomp.com/parallel_computing/SciComp_NVIDIA_CUDA_OpenMP.pdf.
GeForce graphics processors, URL http://www.nvidia.com/object/geforce_family.html.
Intel microprocessor export compliance metrics, URL http://www.intel.com/support/processors/xeon/sb/CS-020863.htm.
CUDA community showcase, URL http://www.nvidia.com/object/cuda_apps_flash_new.html.
AMD Fusion Family of APUs, URL http://sites.amd.com/us/Documents/48423B_fusion_whitepaper_WEB.pdf.
Intel details 2011 processor features, offers stunning visuals built-in, URL http://download.intel.com/newsroom/kits/idf/2010_fall/pdfs/Day1_IDF_SNB_Factsheet.pdf.
The sample NVIDIA CUDA SDK programs were modified to use pinned memory, which prevents these memory segments from being potentially paged to disk. The use of pinned memory can significantly reduce communication overheads as the system can take advantage of direct memory access (DMA) data transfers. For example, the communication-to-execution ratio for the eigenvalue program increases to about 30% without it.
NVIDIA’s Fermi architecture allows limited simultaneous execution of kernels as long as these kernels are sourced from the same host-side context/thread. In this work, we will not consider such uses.
The GTX-295 actually provides two independent GPUs on a single card, though only one GPU was used in this work.
Some have recently speculated that the earliest-deadline-zero-laxity (EDZL) algorithm may be better suited to accounting for self-suspensions (caused, for example, by using a GPU) (Lakshmanan et al. 2010), though actionable results have yet to be presented, so better suspension accounting remains an open problem.
For performance, GPU operations may be performed asynchronously by the GPU-using job. This allows several GPU operations to be batched together and treated as a single operation, reducing the number of times the job must suspend to wait for GPU results. No changes to our task model are necessary to support this type of operation.
A window-constrained scheduling algorithm prioritizes a job by a time point contained within an interval window that also contains the job’s release and deadline.
Common workload profiles were solicited from research groups at UNC that frequently make use of CUDA. A poll was also informally taken at the NVIDIA CUDA online forums. Similar timing characteristics were later confirmed in the domain of computer vision for real-time automotive applications (Muyan-Ozcelik et al. 2011).
Please note that some graphs appear to be missing data points at lower and upper system utilization ranges. This is caused by the occasional inability to generate task sets meeting particular scenario constraints. This was usually due to the inability to generate a task set with at least two GPU-using tasks under the given constraints.
Graphs for all scenarios are available at http://www.cs.unc.edu/~anderson/papers.html.
k-exclusion locks protect a resource or resource pool, allowing up to k simultaneous accesses.
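The k-exclusion behavior described in the preceding note can be illustrated with a minimal sketch. It assumes a plain counting semaphore and is not one of the real-time locking protocols analyzed in the article: it gives no priority ordering and no bound on blocking time. The names `KExclusionLock` and `run_demo` are hypothetical and chosen only for this illustration.

```python
import threading
import time


class KExclusionLock:
    """Illustrative k-exclusion lock: at most k holders at any time.

    Assumption: built on a counting semaphore, so waiters are not
    served in priority order (unlike a real-time locking protocol).
    """

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._slots = threading.Semaphore(k)  # k free slots initially
        self._guard = threading.Lock()        # protects the counters below
        self.holders = 0                      # current number of holders
        self.max_holders = 0                  # highest value ever observed

    def __enter__(self):
        self._slots.acquire()                 # blocks while k holders exist
        with self._guard:
            self.holders += 1
            self.max_holders = max(self.max_holders, self.holders)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._guard:
            self.holders -= 1
        self._slots.release()                 # free the slot for a waiter
        return False


def run_demo(k=2, workers=6):
    """Start `workers` threads that each hold the lock briefly.

    Returns the largest number of simultaneous holders observed,
    which should never exceed k.
    """
    lock = KExclusionLock(k)

    def use_resource():
        with lock:
            time.sleep(0.01)                  # stand-in for using a pooled resource

    threads = [threading.Thread(target=use_resource) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock.max_holders
```

With k equal to the number of resources in a pool (for example, the two GPUs on a dual-GPU card), the lock admits up to k jobs at once; with k = 1 it degenerates to an ordinary mutex.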
Abhijeet G, Muni TI (2009) GPU based sparse grid technique for solving multidimensional options pricing PDEs. In: Proceedings of the 2nd workshop on high performance computational finance, pp 1–9
Aila T, Laine S (2009) Understanding the efficiency of ray traversal on GPUs. In: Proceedings of the conference on high performance graphics, pp 145–149
Baruah S (2000) Scheduling periodic tasks on uniform processors. In: Proceedings of the EuroMicro conference on real-time systems, pp 7–14
Baruah S (2004) Feasibility analysis of preemptive real-time systems upon heterogeneous multiprocessor platforms. In: Proceedings of the 25th IEEE real-time systems symposium, pp 37–46
Block A, Leontyev H, Brandenburg B, Anderson J (2007) A flexible real-time locking protocol for multiprocessors. In: Proceedings of the 13th IEEE international conference on embedded and real-time computing systems and applications, pp 47–57
Brandenburg B, Anderson J (2010) Optimality results for multiprocessor real-time locking. In: Proceedings of the 31st IEEE real-time systems symposium, pp 49–60
Calandrino J, Leontyev H, Block A, Devi U, Anderson J (2006) LITMUS^RT: A testbed for empirically comparing real-time multiprocessor schedulers. In: Proceedings of the 27th IEEE real-time systems symposium, pp 111–123
Childs S, Ingram D (2001) The Linux-SRT integrated multimedia operating system: bringing QoS to the desktop. In: Proceedings of the 7th real-time technology and applications symposium, p 135
Devi U, Anderson J (2008) Tardiness bounds under global EDF scheduling on a multiprocessor. In: Real-time systems, vol 38, pp 133–189
Dwarakinath A (2008) A fair-share scheduler for the graphics processing unit. Master’s thesis, Stony Brook University
Erickson J, Devi U, Baruah S (2010) Improved tardiness bounds for global EDF. In: Proceedings of the 22nd EuroMicro conference on real-time systems, pp 14–23
Funk S, Goossens J, Baruah S (2001) On-line scheduling on uniform multiprocessors. In: Proceedings of the 22nd IEEE real-time systems symposium, pp 183–202
Gai P, Abeni L, Buttazzo G (2002) Multiprocessor DSP scheduling in system-on-a-chip architectures. In: Proceedings of the 14th EuroMicro conference on real-time systems, pp 231–238
Harrison O, Waldron J (2008) Practical symmetric key cryptography on modern graphics hardware. In: Proceedings of the 17th conference on security symposium, pp 195–209
Kang W, Son SH, Stankovic JA, Amirijoo M (2007) I/O-aware deadline miss ratio management in real-time embedded databases. In: Proceedings of the 28th IEEE real-time systems symposium, pp 277–287
Kato S, Ishikawa Y (2009) Gang EDF scheduling of parallel task systems. In: Proceedings of the 30th IEEE real-time systems symposium, pp 459–468
Kato S, Lakshmanan K, Rajkumar R, Ishikawa Y (2011a) Resource sharing in GPU-accelerated windowing systems. In: Proceedings of the 17th IEEE real-time and embedded technology and application symposium
Kato S, Lakshmanan K, Rajkumar R, Ishikawa Y (2011b) TimeGraph: GPU scheduling for real-time multi-tasking environments. In: Proceedings of the USENIX annual technical conference
Lakshmanan K, Kato S, Rajkumar R (2010) Open problems in scheduling self-suspending tasks. In: Proceedings of the 1st international real-time scheduling open problems seminar, pp 12–13
Leontyev H, Anderson J (2009) A hierarchical multiprocessor bandwidth reservation scheme with timing guarantees. Real-Time Syst 43(1):60–92
Pellizzoni R, Lipari G (2007) Holistic analysis of asynchronous real-time transactions with earliest deadline scheduling. J Comput Syst Sci 73:186–206
Manica N, Abeni L, Palopoli L (2008) QoS support in the X11 window system. In: Proceedings of the 14th IEEE real-time and embedded technology and applications symposium, pp 103–112
Muyan-Ozcelik P, Glavtchev V, Ota JM, Owens JD (2011) Real-time speed-limit-sign recognition in an embedded system using a GPU. In: GPU Computing Gems, pp 473–496
Ong CY, Weldon M, Quiring S, Maxwell L, Hughes M, Whelan C, Okoniewski M (2010) Speed it up. IEEE Microw Mag 11(2):70–78
Pieters B, Hollemeersch CF, Lambert P, de Walle RV (2009) Motion estimation for H.264/AVC on multiple GPUs using NVIDIA CUDA. In: Applications of digital image processing XXII, vol 7443, p 74430X
Raravi G, Andersson B (2010) Calculating an upper bound on the finishing time of a group of threads executing on a GPU: a preliminary case study. In: Work-in-progress session of the 16th IEEE international conference on embedded and real-time computing systems and applications, pp 5–8
Sasinowski JE, Strosnider JK (1995) ARTIFACT: a platform for evaluating real-time window system designs. In: Proceedings of the 16th IEEE real-time systems symposium, pp 342–352
Watanabe Y, Itagaki T (2009) Real-time display on Fourier domain optical coherence tomography system using a graphics processing unit. J Biomed Opt 14, 060506
Work supported by NSF grants CNS 0834270, CNS 0834132, and CNS 1016954; ARO grant W911NF-09-0535; AFOSR grant FA9550-09-1-0549; and AFRL grant FA8750-11-1-0033.
Elliott, G.A., Anderson, J.H. Globally scheduled real-time multiprocessor systems with GPUs. Real-Time Syst 48, 34–74 (2012). https://doi.org/10.1007/s11241-011-9140-y
- Real-time systems
- Global scheduling
- Semaphore protocols
- Bandwidth servers