GPU based techniques for deep image merging

Deep images store multiple fragments per-pixel, each of which includes colour and depth, unlike traditional 2D flat images which store only a single colour value and possibly a depth value. Recently, deep images have found use in an increasing number of applications, including ones using transparency and compositing. One step in compositing deep images is merging per-pixel fragment lists in depth order, yet little work has so far been presented on fast approaches. This paper explores GPU based merging of deep images using different memory layouts for fragment lists: linked lists, linearised arrays, and interleaved arrays. We also report performance improvements using techniques which leverage the GPU memory hierarchy by processing blocks of fragment data in fast registers, following similar techniques used to improve performance of transparency rendering. We report results for compositing from two deep images or saving the resulting deep image before compositing, as well as for an iterated pairwise merge of multiple deep images. Our results show a 2 to 6 fold improvement by combining efficient memory layout with fast register based merging.


Introduction
This paper explores time and memory performance of storing and merging deep images on the GPU using OpenGL and GLSL. We assume the deep images are stored in graphics memory, leaving broader investigation of approaches which include reading deep images from persistent storage for future work.
A topic of increasing interest [1,2], deep image compositing presents new opportunities and challenges compared to standard image compositing. Among these challenges is performance, as compositing many fragments per-pixel per-image requires more processing than just a single fragment per-pixel. GPUs are naturally suited for this task.
Merging two deep images (see Fig. 1) requires first loading them to GPU global memory, either entirely if possible, or in large blocks. Per-pixel threads then read data from both deep images, merging and compositing fragments to produce a final 2D (flat) image, or alternatively merging and saving the resulting deep image before compositing. The resulting deep image can then be used in other deep image operations, such as further merging in an iterated pairwise or k-way fashion. This paper specifically focuses on merging two deep images, either to give a merged result or for iterated pairwise merging, leaving the problem of k-way deep image merging for future work.
A simple merging approach is to step through fragment data in sorted order for both deep images, comparing fragments from each based on depth, and compositing before moving to the next fragment, using a basic linear time per-pixel stepwise merge of two sorted lists. Our approach improves on this by reading and processing blocks of data using registers.
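The stepwise merge described above can be illustrated for a single pixel with a small CPU-side sketch. This is written in Python rather than GLSL and is not the paper's shader code. It assumes each fragment is a `(depth, (r, g, b, a))` tuple with non-premultiplied colour, and that compositing uses a front-to-back "over" operator; the function names are hypothetical.

```python
def composite_over(acc, frag):
    """Blend one fragment behind the colour accumulated so far (front to back)."""
    r, g, b, a = acc
    fr, fg, fb, fa = frag[1]
    w = (1.0 - a) * fa  # remaining visibility times the fragment's opacity
    return (r + w * fr, g + w * fg, b + w * fb, a + w)


def stepwise_merge(list_a, list_b):
    """Merge two depth-sorted fragment lists for one pixel in linear time.

    Returns the merged fragment list and the flat composited colour.
    """
    merged = []
    acc = (0.0, 0.0, 0.0, 0.0)
    i = j = 0
    while i < len(list_a) or j < len(list_b):
        # Take from A when B is exhausted, or when A's next fragment is nearer.
        take_a = j >= len(list_b) or (
            i < len(list_a) and list_a[i][0] <= list_b[j][0]
        )
        if take_a:
            frag = list_a[i]
            i += 1
        else:
            frag = list_b[j]
            j += 1
        merged.append(frag)
        acc = composite_over(acc, frag)
    return merged, acc
```

On the GPU every iteration of this loop would issue a dependent read from global memory, which is the latency the block-based approach is intended to hide.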
Deep images are typically stored in graphics memory using one of two main formats for GPU processing: as per-pixel linked lists, or as linearised arrays of fragments. We explore differences between these approaches in terms of memory usage and processing time; linked lists require more memory while linearised arrays require more processing during construction. We also explore an interleaved array format which improves performance of deep image merging through better memory read coherence. We investigate performance of merging deep images in graphics memory using a stepwise approach, and an improved approach using blocks of registers. Finally we introduce a blocked interleaved array format which leverages blocked merging to give a combined 2 to 6 fold performance improvement.

Related work
Storing deep images in GPU memory as linked lists and linearised arrays has been explored in the context of transparency rendering in computer graphics [3,4]. Linked lists have been found to generally provide better performance for processing data, while linearised arrays use less memory. Using fast GPU registers for sorting deep image data was presented in Ref. [5], a concept that this work extends.
General image compositing operations were first proposed in Ref. [6], and recently expanded to deep images [2,7], using the OpenEXR format [8] for external storage. Such work focuses on how composite operations are performed. Performance of compositing deep images in memory on the GPU using different merging approaches and storage formats has to our knowledge not been presented, and is the focus of this work.

Deep image formats
Arranging fragment data into appropriate buffers in memory is critical for fast processing on the GPU. As mentioned earlier, two main approaches exist: per-pixel linked lists and linearised arrays.
Building a deep image as per-pixel linked lists requires a global atomic counter and the allocation of three storage buffers, with one integer per-pixel for the head pointers and then buffers of arbitrary size for fragment data and next pointers. The head pointers are initialised to null (0) before rendering. If the size is too small to store all fragment data, then the atomic counter is used to allocate buffers of sufficient size before re-rendering.
As geometry is rasterized, fragments are added to the data array using a global atomic counter, and linked into the corresponding pixel's list using an atomic exchange which inserts the fragment at the front of the list. The fragment's next pointer is then set to the previous head node. In this fashion fragments are continuously added to the head of the corresponding pixel's linked list. Example GLSL code for adding fragment data to a linked list, and traversing it, is shown below:
Building a deep image can be done quickly, as fragments can be written to the next available place as they are rendered or captured. Traversing a pixel's fragment list starts at the index given by the pixel's head pointer, and follows each fragment's next pointer until a null terminator is reached.
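As a CPU-side illustration of the linked-list construction and traversal just described (in Python rather than GLSL, and not the authors' shader code), the following sketch mirrors the three buffers and the counter. The class and method names are hypothetical. The plain integer `counter` stands in for the global atomic counter, and the tuple swap stands in for the atomic exchange on the head pointer.

```python
NULL = 0  # head pointers use 0 as the null terminator, as in the text


class LinkedListDeepImage:
    def __init__(self, num_pixels, capacity):
        self.heads = [NULL] * num_pixels      # one head pointer per pixel
        self.nexts = [NULL] * (capacity + 1)  # slot 0 is reserved for null
        self.data = [None] * (capacity + 1)   # fragment data, same indexing
        self.counter = 0                      # stands in for the atomic counter

    def add_fragment(self, pixel, frag):
        """Insert a fragment at the head of the pixel's list."""
        self.counter += 1                     # atomic increment on the GPU
        node = self.counter
        if node >= len(self.data):
            # Overflow: the counter keeps counting, so after the pass it
            # holds the size needed to reallocate before re-rendering.
            return False
        self.data[node] = frag
        # Atomic exchange on the GPU: publish the new head, get the old one.
        prev_head, self.heads[pixel] = self.heads[pixel], node
        self.nexts[node] = prev_head
        return True

    def traverse(self, pixel):
        """Yield the pixel's fragments, most recently added first."""
        node = self.heads[pixel]
        while node != NULL:
            yield self.data[node]
            node = self.nexts[node]
```

Because insertion is always at the head, traversal returns fragments in reverse order of insertion.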
Merging two deep images and saving the resulting merged deep image in this format reverses the order of the per-pixel fragment lists, as fragment data is always added to the head of the list. This must be accounted for in the next merge or composite step.
Figure 2 shows per-pixel fragment colours stored as linked lists using three separate buffers with blue/red/green, blue/green, and blue/red fragment colours for the bottom left, bottom right, and top left pixels respectively. Fragments for a given pixel can be anywhere in the data buffer.
Linearised arrays require only two buffers: see Fig. 3, which shows the same per-pixel fragment colours as Fig. 2. Unlike the linked-list approach in which fragment data can be anywhere, in linearised arrays, fragment data for a given pixel is coherent: it is localised with all fragments for a given pixel stored contiguously.
Fig. 2 Per-pixel blue/red/green, blue/green, and blue/red fragment colours as linked lists.
Fig. 3 Per-pixel blue/red/green, blue/green, and blue/red fragment colours as linearised arrays.
Building a deep image in this format may be summarized as follows:
• Allocate buffer of per-pixel counts, initialised to zero.
• Render geometry depths and atomically increment counts in the fragment shader in an initial rendering pass.
• Perform parallel prefix sums scan on counts to produce an array of offsets. These determine the location of each pixel's memory in the global data array.
• Allocate data buffer of size given by final offset.
• Render full geometry data in a second rendering pass; offsets are atomically incremented in the fragment shader to give the location at which each fragment is written in the data buffer.
Traversing a pixel's fragment data requires reading the index offset and number of fragments, given by subtraction from the next pixel's offset, then reading the fragment data sequentially. Example GLSL code is given below for adding fragment data to a linearised array, and traversing it; note that the same buffer is used for both counts and offsets:
Building a deep image in this format is typically slower, as it requires computing offsets from per-pixel fragment counts in a separate initial counting pass before writing fragment data in a second capturing pass. However, it requires less memory as there are no next pointers.
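The two-pass construction and the traversal described above can be sketched on the CPU as follows. This is a Python illustration under stated assumptions, not the authors' GLSL. The function names are hypothetical, and for clarity the sketch keeps a separate `cursor` list for the write positions, whereas the text reuses a single buffer for counts and offsets.

```python
from itertools import accumulate


def build_linearised(fragments, num_pixels):
    """Build (offsets, data) from (pixel, fragment) pairs in arrival order."""
    fragments = list(fragments)

    # Pass 1: count fragments per pixel (atomic increments on the GPU).
    counts = [0] * num_pixels
    for pixel, _ in fragments:
        counts[pixel] += 1

    # Prefix sum: offsets[p] is where pixel p starts; offsets[-1] is the total.
    offsets = [0] + list(accumulate(counts))
    data = [None] * offsets[-1]

    # Pass 2: write each fragment to its pixel's next free slot.
    cursor = offsets[:-1].copy()
    for pixel, frag in fragments:
        data[cursor[pixel]] = frag
        cursor[pixel] += 1  # atomic increment on the GPU
    return offsets, data


def traverse_linearised(offsets, data, pixel):
    """Return the pixel's fragments; the count is the difference of offsets."""
    return data[offsets[pixel]:offsets[pixel + 1]]
```

All fragments of a pixel end up contiguous in `data`, which is the coherence property that distinguishes this format from linked lists.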
Deep image compositing requires fragment lists in depth sorted order. As mentioned in Section 2, the currently fastest technique for deep image sorting is register-based block sort [5] which uses a sorting network of fast registers. In cases where lists are longer than the number of available per-thread registers, backwards memory allocation [9] partitions the sort into blocks. This combined approach is used for sorting deep images in this paper.
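To make the idea of a sorting network concrete, the sketch below sorts a block of exactly four fragments with the standard five-comparator network for four inputs. It is a generic Python illustration, not the implementation of Ref. [5]; the block size of four and the names are assumptions. The point is that the sequence of comparisons is fixed in advance and independent of the data, so on a GPU each element can live in a named register.

```python
# Compare-exchange pairs of the standard 5-comparator network for 4 inputs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4_by_depth(frags):
    """Sort exactly four fragments by depth (element 0 of each tuple)."""
    r0, r1, r2, r3 = frags     # four values standing in for four registers
    regs = [r0, r1, r2, r3]
    for i, j in NETWORK_4:     # fixed sequence, known at compile time
        if regs[j][0] < regs[i][0]:
            regs[i], regs[j] = regs[j], regs[i]
    return regs
```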

Memory hierarchy
Deep images can be large; on the GPU, pixels are processed per-thread in parallel. GPUs have a hierarchy of memory as shown in Fig. 4, with a large amount of relatively slow global memory, and a smaller amount of fast memory such as local memory, and then an even smaller number of very fast registers.
As stated previously, compositing deep images requires first loading or capturing them to slow global memory. Global memory has high latency, particularly as fragment reads are not necessarily coherent. A stepping approach that reads then composites before reading the next fragment in turn is highly vulnerable to this latency.
Processing data by reading blocks from slow to fast memory is an established concept, and applies to merging. One approach is to merge blocks of data by reading fragments from global memory to local memory before compositing, reducing the impact of latency. Using blocks of local memory requires copying data from global to local memory, then reading from local memory to perform the comparison and composition operations in registers.
Registers are much faster than global and local memory. GPUs typically have on the order of thousands of registers, typically 255 per-thread or core, so fragments can be read to per-thread blocks of registers directly rather than to local memory first. This has the benefit of both reducing the impact of latency, and avoiding writing to and then reading from local memory.

Register block merging
The merging operation is performed by reading blocks of data directly from global memory to fast registers, bypassing local memory. We term this approach register block merging (RBM). It is summarised in the following steps, which perform a stepwise merge, reading to blocks of registers:
The first program iterates a number of times determined at runtime, and therefore the loop cannot be unrolled at compile time. As register usage is decided at compile time, registers cannot be used in this case and local memory is used instead, as seen in the lmem0[8] local memory allocation. The use of registers requires either manual loop unrolling or use of a bounding compile-time constant, as shown by R0, R1, R2, R3 in the second example. The same unrolling technique is used when reading and merging. A block of registers can also be used when writing the merged deep image, although we found this to be faster only when using linearised arrays.
As GPUs keep all active threads resident, the number of threads that can be scheduled and executed simultaneously is limited by available per-thread resources such as local memory and registers. Rather than stalling while one thread waits on a read from global memory, the GPU executes other threads, reducing the impact of memory latency and increasing throughput. This means GPUs typically have many more active threads than available cores. Storing fragments in per-thread registers reduces the number of possible simultaneous threads. To achieve greater throughput this needs to be balanced by keeping block sizes relatively small, typically using 4 to 16 fragments.
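The control flow of register block merging for one pixel can be sketched on the CPU as below. This is a Python illustration under stated assumptions rather than the authors' GLSL: the two short lists `blk_a` and `blk_b` stand in for the per-thread register blocks, `BLOCK` is an assumed block size within the 4 to 16 range mentioned above, and fragments are tuples whose first element is depth. In a shader the blocks would need fixed, compile-time indexing to remain in registers.

```python
BLOCK = 4  # assumed block size; the text suggests 4 to 16 fragments


def read_block(frags, pos):
    """Copy up to BLOCK fragments starting at pos (one burst of global reads)."""
    return frags[pos:pos + BLOCK]


def register_block_merge(list_a, list_b):
    """Merge two depth-sorted fragment lists, reading them a block at a time."""
    merged = []
    pos_a = pos_b = 0      # next unread index in each global list
    blk_a, blk_b = [], []  # stand-ins for the two register blocks
    ia = ib = 0            # read cursors within the blocks
    while True:
        if ia == len(blk_a):  # block A is used up: refill it
            blk_a = read_block(list_a, pos_a)
            pos_a += len(blk_a)
            ia = 0
        if ib == len(blk_b):  # block B is used up: refill it
            blk_b = read_block(list_b, pos_b)
            pos_b += len(blk_b)
            ib = 0
        if not blk_a or not blk_b:
            break             # one input list is exhausted
        # Merge from the blocks until one of them runs out.
        while ia < len(blk_a) and ib < len(blk_b):
            if blk_a[ia][0] <= blk_b[ib][0]:
                merged.append(blk_a[ia])
                ia += 1
            else:
                merged.append(blk_b[ib])
                ib += 1
    # Drain the remaining block and the unread tail of the other list.
    merged.extend(blk_a[ia:])
    merged.extend(list_a[pos_a:])
    merged.extend(blk_b[ib:])
    merged.extend(list_b[pos_b:])
    return merged
```

Each `merged.append` is where a shader would either composite the fragment into the flat output colour or write it to the output deep image.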

Interleaved arrays
Instead of using linked lists and linearised arrays to store deep images, a faster technique is to use interleaved arrays. GPUs execute threads in groups, where instructions across the group are executed in lock-step. This means the first fragment for each pixel in a thread group is processed before the second fragment. Improved memory performance requires coherent memory reads for fragment data in a thread group, rather than for each individual pixel. Arranging fragment data in order of per-group reads instead of per-pixel reads improves memory coherence.
One approach is to interleave fragment data across groups for all pixels, based on the group's maximum fragment count. This requires padding each group so that all lists have the same length, consequently increasing memory requirements; this was by a factor of 2-3 for our test scenes. We instead interleave up to the shortest fragment list for any pixel in a group, with remaining fragments stored at the end of each group with no padding as shown in Fig. 5.
Building a deep image in this format is done in a similar way to the approach used for a linearised array, and requires an extra buffer for per-group minimum counts, and a buffer of per-pixel counts in addition to the offsets:
• Minimum counts are allocated as number of pixels divided by 32, initialised to zero.
• Per-pixel counts are determined in the same manner.
• Compute threads are executed for each group of 32 pixels, each thread determining and writing the minimum count of its respective group to the minimum counts buffer.
• The prefix sums scan then computes per-pixel offsets from per-pixel counts as for the linearised arrays case, and allocates a data buffer of sufficient size.
• Complete geometry is rendered in a second pass and saved as for linearised arrays, but with modified indexing.
Example GLSL code for adding fragment data to and traversing an interleaved array is given below:
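As a CPU-side illustration of the indexing described above (in Python rather than GLSL, and not the authors' shader code), the sketch below computes where fragment `k` of a pixel lives in the data buffer. The exact index arithmetic is an assumption chosen to be consistent with the text: fragments below the group minimum are interleaved across the group, and the rest are stored per pixel after the interleaved region with no padding. The function names are hypothetical.

```python
from itertools import accumulate

GROUP = 32  # assumed thread-group width, matching the 32-pixel groups above


def build_index(counts, group=GROUP):
    """Return (offsets, group_mins) for a list of per-pixel fragment counts."""
    offsets = [0] + list(accumulate(counts))
    mins = [min(counts[g:g + group]) for g in range(0, len(counts), group)]
    return offsets, mins


def interleaved_index(pixel, k, offsets, mins, group=GROUP):
    """Position in the data buffer of fragment k of the given pixel."""
    g = pixel // group
    first = g * group                             # first pixel of the group
    width = min(group, len(offsets) - 1 - first)  # last group may be partial
    lane = pixel - first
    m = mins[g]
    if k < m:
        # Interleaved part: fragment k of every lane is stored side by side.
        return offsets[first] + k * width + lane
    # Remainder: stored per pixel after the group's m * width interleaved
    # slots; earlier lanes have already placed m fragments each up there.
    return offsets[pixel] + m * (width - lane - 1) + k
```

With a group of four pixels holding 3, 2, 2, and 2 fragments, the first two fragments of every pixel occupy slots 0 to 7 in interleaved order, and the first pixel's third fragment lands in slot 8.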

Blocked interleaving
When using register block merging, coherence is further improved by interleaving fragments in blocks rather than individually when generating deep image data: see Fig. 6. This means the first block of fragments for the first pixel is written to the deep image, then the first block for the second pixel in turn. The same block size is used when building and merging the deep images. With blocked interleaving, the per-group minimum counts must be a multiple of the block size, which can result in more non-interleaved fragments stored at the end.
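The blocked layout for a single thread group can be sketched as follows. This is a Python illustration with hypothetical function names, not the authors' fragment-shader indexing, and the index arithmetic is an assumption consistent with the description above. `layout_group` produces the storage order for one group, and `blocked_index` gives the matching position of fragment `k` of a lane relative to the start of the group.

```python
def rounded_min(counts, block):
    """Group minimum count, rounded down to a multiple of the block size."""
    m = min(counts)
    return m - m % block


def layout_group(lists, block):
    """Return the storage order for one group's per-pixel fragment lists."""
    m = rounded_min([len(f) for f in lists], block)
    out = []
    for start in range(0, m, block):  # interleaved part: one block per pixel
        for frags in lists:
            out.extend(frags[start:start + block])
    for frags in lists:               # remaining fragments, no padding
        out.extend(frags[m:])
    return out


def blocked_index(lane, k, counts, block):
    """Index of fragment k of the given lane, relative to the group start."""
    width = len(counts)
    m = rounded_min(counts, block)
    if k < m:
        return ((k // block) * width + lane) * block + k % block
    return m * width + sum(c - m for c in counts[:lane]) + (k - m)
```

For four pixels with blue/red/green, blue/green, blue/red, and green/red fragments and a block size of 2, `layout_group` stores each pixel's first two fragments in turn and then the first pixel's leftover green fragment. A block size of 1 reduces to plain interleaving.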
Building a deep image in this format follows the same approach as for an interleaved array, with modified indexing in the fragment shader, as shown below:
If the final result is written back to global memory, compute threads can be executed in order of memory layout. However, if the resulting image is rasterized, then threads are instead executed in pixel rasterization order. nVidia GPUs typically rasterize pixels in a tile-based fashion, where a 2×8 tile is rasterized in a zig-zag pattern. Figure 7 shows the rasterization order for a 4×8 block of pixels; numbers represent execution order of per-pixel fragment shader threads as determined by atomic counters. Indexing pixels to more closely match the repeating 2×8 tiled raster pattern when building the deep image improves merging performance by approximately 1.5 to 2 fold for all approaches.
Fig. 6 Per-pixel blue/red/green, blue/green, blue/red, and green/red fragment colours as a blocked interleaved array with block size 2.
Fig. 7 Typical thread execution order (raster pattern) for a 4×8 block of pixels on an nVidia GPU.

Results
We compare performance of merging two deep images using three different scenes: see Fig. 8. Additionally we compare an iterated pairwise merge, where four deep images are first merged to give the two deep images shown, before merging the two resulting deep images.
The first and second scenes are the Sponza Atrium and the Powerplant, each separated into interior and exterior deep images. The third, referred to as the Hairball, is a synthetic scene of a hairball merged with a set of randomly generated spheres. These scenes are available from Ref. [10]. Not shown is another synthetic scene referred to as the Planes, which has 256 screen-aligned quads with linked list data in approximately coherent order merged with a set of randomly generated spheres. For all measurements, deep image data is arranged in raster pattern, which is 1.5-2 times faster for all test cases.
The Atrium scene typically has fewer than 20 per-pixel fragments in each deep image, while the other scenes have up to hundreds. The Atrium and Powerplant are divided mainly into interior and exterior geometry. Thus, as merging progresses, data is mainly read from one deep image then the other in turn. The Hairball and Planes have spheres randomly distributed, with memory being read more evenly across both deep images as a consequence.
The storage approaches discussed in Sections 3 and 4 were compared: linked lists (LLs), linearised arrays (LAs), interleaved linearised arrays (IAs), and blocked interleaved linearised arrays (BIAs). The results in Tables 1-3 show RBM offers up to 4-fold performance improvement in the best case and no performance penalty in the worst case, regardless of whether compositing during merging, saving the merged image or using pairwise merging. This is due to memory latency for incoherent reads being reduced by reading memory in blocks. The largest performance improvement by RBM is for linearised arrays and blocked interleaved arrays, as block memory reads are typically more coherent in these formats. RBM offers a smaller performance improvement for mostly coherent data, or when there is little data to merge, as in the Atrium scene.
RBM is more effective with blocked interleaved arrays, as memory is specifically arranged to improve this approach. Compared to the worst case approach of each scene, BIA-RBM gives a 2 to 6 fold performance improvement. This improvement is less significant in the Atrium scene where less geometry is present and thus fewer merging operations are performed.
When saving the merged deep image or using an iterated pairwise approach, linked lists are typically faster for the Atrium scene. Saving a deep image using linearised arrays, interleaved arrays, or blocked interleaved arrays requires first building an array of offsets before writing any fragment data, unlike linked lists for which next pointers and fragments are written simultaneously. When less geometry is present, the cost of first building the offsets outweighs any potential merging improvements.
Linearised arrays use less space than other formats as shown in Table 4, while interleaved arrays and blocked interleaved arrays require a little more due to the per-group minimum fragment counts, which depend on image resolution. Linked lists use the most memory in all cases, as expected, due to the next pointers. In all cases RBM and stepwise merging require no extra global memory.

Conclusions
This paper has presented RBM and shown it to be a better merging approach, and has shown blocked interleaved arrays to be a better deep image format. It has also explored and compared stepwise merging and other existing deep image formats. Interleaved deep images have little memory overhead and fast merging time due to improved memory coherence, while register block merging improves performance of merging fragment data. Combined, these approaches give a 2 to 6 fold performance improvement compared to non-interleaved stepwise merging. The interleaved arrays and blocked interleaved arrays approaches interleave fragment data based on per-group minimum fragment counts, with all remaining fragments stored in a non-interleaved linear fashion. Interleaving remaining fragment data past the minimum fragment list length without padding may offer further performance improvement. As iterated pairwise merging requires multiple writes to global GPU memory, an alternative is to use k-way merging, which we suspect may offer improved performance as it only writes to global memory once per-fragment.

Fig. 1 Merged interior and exterior Atrium deep images.

Fig. 4 Example memory hierarchy of an nVidia GPU.
Merging techniques were stepwise (S) and register block merging (RBM). The test platform was an nVidia GeForce GTX 1060, driver version 390.25. The deep images were HD (1920×1080) resolution. For each technique we report memory usage for the deep images and total merging time in milliseconds. We do not report the memory cost of RBM or stepwise merging, as these techniques do not require extra global memory. Results for compositing when merging two deep images are shown in Table 1, while those for merging and saving the resulting merged deep image are shown in Table 2. Iterated pairwise merging results are shown in Table 3 for the Atrium and Powerplant scenes, with geometry divided into two interior and two exterior deep images. Results are average times from rendering and capturing scenes as separate deep images on the GPU and then merging; merge time reported includes either compositing a flat (2D) image or saving the resulting deep image.

• Begin with two per-pixel sorted fragment lists and two register blocks, one for each deep image.
• If either register block is empty, read values from the corresponding deep image.
• Merge values in both blocks in depth order until one block is exhausted.
• Merged data is either written to an output deep image, or composited to a flat (2D) image.
• After exhausting one fragment list, merge the remaining block and fragment values from the other list.
Local variables or arrays with fixed indices known at compile time must be used in order to ensure that the GLSL compiler will store fragments in fast GPU registers. Code examples for reading fragment data from a deep image are shown below, along with the intermediate shader assembly output produced by the nVidia compiler, based on a similar example given in Ref. [5]:
// Loop limit known at compile time
for (int i = 0; i < count && i < SIZE; i++)
    data[i] = readNext(...);
produces:
...
TEMP R0, R1, R2, R3;
TEMP RC, HC;
...
SLT.S R0.y, {0, 0, 0, 0}.x, c[0].x;
MOV.U.CC RC.x, -R0.y;
IF NE.x;

Table 1 Merging time for two input deep images, compositing during merging

Table 4 Data usage for different deep image formats