Cudagrind: Memory-Usage Checking for CUDA
The Memcheck tool, build on top of the popular Valgrind framework, offers a reliable way to perform memory correctness checking in arbitrary x86 programs. At runtime Valgrind dynamically transforms the program into intermediate code which is then being executed on an emulated CPU. Due to a full shadow copy of the memory used by each program the Memcheck tool is able to perform various, bitwise precise runtime checks. These include, but are not limited to, detection of memory leaks, checking the validity of memory accesses and tracking the definedness of memory regions. But despite the wide applicability of this approach, it is bound to fail when accelerator based programming models are involved. Kernels running on such devices, like NVIDIA’s Tesla series, are completely separated from the host. The memory on the device is only accessible through an API provided by the driver or from inside the kernels. Due to this indirect approach Valgrind is not able to understand, instruments or even recognize memory operations being executed on the device. Freeing Valgrind from this limitation has been the focus of the work presented here. A set of wrappers for a subset of the CUDA driver API has been introduced. These allow tracking of (de-)allocation of memory regions on the device as well as memory copy operations needed to place and retrieve data in device memory. This provides the ability to check whether memory is fully allocated during a transfer and, thanks to the host memory checking performed by Valgrind, whether the memory transferred to the device is fully defined and addressable on the host. This techniques allows detection of a number of common programming mistakes, many of which can be rather difficult to debug by other means. These wrappers, combined with Valgrind’s Memcheck tool, is being called Cudagrind.
This work was supported by the H4H project funded by the German Federal Ministry for Education and Research (grant number 01IS10036B) within the ITEA2 framework (grant number 09011).
- 1.Diamos, G.F., Kerr, A.R., Yalamanchili, S., Clark, N.: Ocelot: a dynamic optimization framework for Bulk-synchronous applications in heterogeneous systems. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, Vienna, pp. 353–364. ACM, New York (2010). http://doi.acm.org/10.1145/1854273.1854318
- 2.Farooqui, N., Kerr, A., Diamos, G., Yalamanchili, S., Schwan, K.: A framework for dynamically instrumenting GPU compute applications within GPU Ocelot. In: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, Newport Beach, pp. 9:1–9:9. ACM, New York (2011). http://doi.acm.org/10.1145/1964179.1964192
- 3.Johnson, S.C.: Lint: A C Program Checker. Computing Science Technical Report 65, Bell Laboratories, (1977)Google Scholar
- 4.Nethercote, N., Seward, J.: Valgrind: a framework for heavyweight dynamic binary instrumentation. SIGPLAN Not. 42(6), 89–100 (2007). http://doi.acm.org/10.1145/1273442.1250746
- 5.Seward, J., Nethercote, N.: Using valgrind to detect undefined value errors with bit-precision. In: Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, Anaheim, pp. 17–30. USENIX Association, Berkeley (2005)Google Scholar