Unified Memory for CUDA Beginners



", introduced the basics of CUDA programming by showing how to jot down a simple program that allotted two arrays of numbers in memory accessible to the GPU and then added them collectively on the GPU. To do this, I launched you to Unified Memory, which makes it very simple to allocate and entry information that can be utilized by code working on any processor in the system, CPU or GPU. I completed that publish with a few simple "exercises", certainly one of which inspired you to run on a recent Pascal-primarily based GPU to see what occurs. I was hoping that readers would try it and comment on the outcomes, and some of you did! I urged this for 2 causes. First, as a result of Pascal GPUs such as the NVIDIA Titan X and the NVIDIA Tesla P100 are the primary GPUs to include the Web page Migration Engine, which is hardware help for Unified Memory page faulting and migration.



The second reason is that it provides a great opportunity to learn more about Unified Memory.

Fast GPU, Fast Memory… Right!

But let's see. First, I'll reprint the results of running on two NVIDIA Kepler GPUs (one in my laptop and one in a server). Now let's try running on a really fast Tesla P100 accelerator, based on the Pascal GP100 GPU. Hmmmm, that's under 6 GB/s: slower than running on my laptop's Kepler-based GeForce GPU. Don't be discouraged, though; we can fix this. To understand how, I'll have to tell you a bit more about Unified Memory.

What Is Unified Memory?

Unified Memory is a single memory address space accessible from any processor in a system (see Figure 1). This hardware/software technology allows applications to allocate data that can be read or written by code running on either CPUs or GPUs. Allocating Unified Memory is as simple as replacing calls to malloc() or new with calls to cudaMallocManaged(), an allocation function that returns a pointer accessible from any processor (ptr in the sketch below).
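A minimal sketch of the pattern (the pointer name ptr, the element count N, and the float element type are illustrative; only cudaMallocManaged() and cudaFree() are the actual API calls):

    // Allocate N floats of Unified Memory -- accessible from CPU or GPU
    float *ptr;
    cudaMallocManaged(&ptr, N * sizeof(float));

    // ... read and write ptr from host or device code ...

    cudaFree(ptr);  // free the managed allocation when done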



When code running on a CPU or GPU accesses data allocated this way (often called CUDA managed data), the CUDA system software and/or the hardware takes care of migrating memory pages to the memory of the accessing processor. The important point here is that the Pascal GPU architecture is the first with hardware support for virtual memory page faulting and migration, via its Page Migration Engine. Older GPUs based on the Kepler and Maxwell architectures also support a more limited form of Unified Memory.

What Happens on Kepler When I Call cudaMallocManaged()?

On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made [1]. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows the pages are resident on that GPU. So, in our example, running on a Tesla K80 GPU (Kepler architecture), x and y are both initially fully resident in GPU memory.
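For reference, here is a sketch of the example program in the spirit of the previous post; the grid-stride kernel body, the 1M element count, and the launch configuration are my assumptions, while add(), x, and y come from the surrounding discussion:

    // Kernel to add the elements of two arrays (grid-stride loop)
    __global__ void add(int n, float *x, float *y)
    {
      int index = blockIdx.x * blockDim.x + threadIdx.x;
      int stride = blockDim.x * gridDim.x;
      for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
    }

    int main(void)
    {
      int N = 1 << 20;  // 1M elements
      float *x, *y;

      // Allocate Unified Memory -- accessible from CPU or GPU
      cudaMallocManaged(&x, N * sizeof(float));
      cudaMallocManaged(&y, N * sizeof(float));

      // The initialization loop: write both arrays from the CPU
      for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
      }

      // Launch add() on the GPU
      int blockSize = 256;
      int numBlocks = (N + blockSize - 1) / blockSize;
      add<<<numBlocks, blockSize>>>(N, x, y);

      // Wait for the GPU to finish before the host touches the results
      cudaDeviceSynchronize();

      cudaFree(x);
      cudaFree(y);
      return 0;
    }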



Then in the initialization loop (the host for loop in the sketch above), the CPU steps through both arrays, initializing their elements to 1.0f and 2.0f, respectively. Since the pages are initially resident in device memory, a page fault occurs on the CPU for each array page it writes to, and the GPU driver migrates that page from device memory to CPU memory. After the loop, all pages of the two arrays are resident in CPU memory.

After initializing the data on the CPU, the program launches the add() kernel to add the elements of x to the elements of y. On pre-Pascal GPUs, upon launching a kernel, the CUDA runtime must migrate all pages previously migrated to host memory or to another GPU back to the device memory of the device running the kernel [2]. Since these older GPUs can't page fault, all data must be resident on the GPU just in case the kernel accesses it (even if it won't).
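To make that launch-time behavior concrete, here are the two relevant lines from the sketch above, annotated with what happens on a pre-Pascal GPU (the annotations are the point; the code is unchanged):

    // On a pre-Pascal GPU, this launch first migrates *every* page of x and y
    // from CPU memory back to device memory, whether or not the kernel will
    // touch it -- Kepler and Maxwell cannot service page faults from the GPU.
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Kernel launches are asynchronous, so synchronize before the CPU reads y;
    // on these GPUs the CPU must not touch managed data while a kernel runs.
    cudaDeviceSynchronize();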

