An efficient implementation of one-dimensional discrete wavelet transform algorithms for GPU architectures

In this paper, the authors present several self-developed implementation variants of the Discrete Wavelet Transform (DWT) computation algorithms and compare their execution times against the commonly approved ones for representative modern Graphics Processing Units (GPUs) architectures. The proposed solutions avoid the time-consuming modulo divisions and conditional instructions used for DWT filters wrapping by proper expansion of the DWTs input data vectors. The main goal of the research is to improve the computation times for popular DWT algorithms for representative modern GPU architectures while retaining the code’s clarity and simplicity. The relations between algorithms execution time improvements for GPUs are also compared with their counterparts for traditional sequential processors. The experimental study shows that the proposed implementations, in the case of parallel realization on GPUs, are characterized by very simple kernel code and high execution time performance.


Introduction
The digital signal processing (DSP) has become an integral part of everyday life. The increasing number of electronic devices has led to a situation where almost everyone has to deal with digitally processed data. A digital signal can be a sequence 1 3 of samples extracted from a continuous signal or any discrete set of data like, e.g., image in bitmap form. However, still more often, the digital signal is directly derived from the continuous signal as its representation [1].
Nowadays, numerous computational problems are so complex that they not only require computer aided processing, but also a very efficient processing of huge data sets, often in real time [2]. Therefore, there are efforts to optimize the known algorithms both from the perspective of the processing time and the precision of the results as well. Current processors are often optimized to an extreme physical operational conditions, e.g., the widths of the electric paths are regularly close to the atomic size. This causes the need to look for new techniques to increase the computational efficiency. Increasing the frequency also meets the physical barriers. Those are only few of the reasons why parallel computing is becoming more and more popular [3,4]. The conversion of traditional, sequential computation algorithms to their parallel counterparts requires suitable implementations and poses a real challenge for software engineers involved in algorithm optimization [5].
The quality of the developed algorithms is, of course, subject to evaluation. There are different evaluation methods. Often we refer to computational complexity or time complexity. In the case of parallel devices and, in particular, graphics cards, the problem is much more complicated. The number of basic operations such as arithmetic, logic, and memory accesses does not directly affect the actual computation time. Extra parameters like versatility, simplicity, ease of implementation, ease of modification, utilization of hardware resources, achieved acceleration, complexity of communication, synchronizations, portability and others are also highly important. Among this and other reasons, time efficiency models are created for specific architectures. In the end, it often eventually turns out that only experimental studies give a real insight into performance characteristics of a chosen computational method. Software designer always tries to minimize the computational complexity, memory usage or other criteria of the overall cost [6]. However, such approaches are not always successful. It is quite often the case that a simpler algorithm with worse traditional computational complexity performs better because of the advantages resulting from its computational structure design.
In this paper, we present several optimization variants of commonly used DWT computation algorithms, namely the matrix and the lattice structure-based approaches, and compare their execution time effectiveness for both CPU and GPU implementations. The results indicate that, despite of twofold reduction in computational complexity of the lattice structure-based approach in comparison with the matrix-based method, the former algorithm performs significantly worse for large transform sizes due to its more complex computational structure when implemented on GPU.

Discrete wavelet transform
An important reason for transforming data from one form into another is the desire to analyze the features of the signal that are more visible after the transformation than in the original form. Such transformations are also sometimes called mappings.

3
An efficient implementation of one-dimensional discrete… The fundamental group of transformations is linear transformations which are one of the axioms of linear algebra. These are homomorphic mappings preserving the spatial structure. A special case of linear transformations is orthogonal ones possessing an important property of their inverse matrix being the transposition of the forward one. Computing the inverse matrix for multidimensional data can be a much more complex problem than a simple transposition for transformations. It is common to use harmonic functions as basis functions due to the fact that harmonic signals do not change their shape after passing through a linear system (only their phase and amplitude changes) [7].
The wavelet transform provides a time-frequency representation of a signal. It is similar to the Short Time Fourier Transform (STFT), but the wavelet transform uses a multi-resolution technique where different frequencies are analyzed at different resolutions. Its name is derived from the term "wavelet" meaning localized wave (Fig. 1). The energy of such signal is concentrated in time and space. Another difference between wavelet transform and Fourier transform is that Fourier transform uses waves to analyze signal, while wavelet transform uses short waves of finite energy. It can be stated that wavelet transform gives good time resolution at high frequencies and poor frequency resolution at low frequencies.
The discrete wavelet transform (DWT) is a discretized version of the continuous wavelet transform (CWT). In CWT, the signals are analyzed using a set of basis functions which relate to each other by simple scaling and translation. In the case of DWT, a timescale representation of the digital signal is obtained using digital filtering techniques. The signal is passed through filters with different cutoff frequencies at different scales [8].
The basic method of calculating the linear transformation is an algebraic approach based on multiplying a matrix by a vector of the input signal. This operation can be simply stated as where is a N-element input vector, is a N-element output vector, and is a N ×N -element matrix performing the filtering. There are infinitely many wavelets, so there is also an infinite number of possible wavelet transforms. The wavelet transform for discrete signals is specified by the following formula: where x represents the signal, and Ψ j,k is the wavelet transform kernel for which the parameters j and k represent shifts in the frequency and time domain. Wavelet transformation can be implemented using low-pass and high-pass filters. The organization of the calculations means that in each step, the high-pass filter produces detailed results, while the result of the low-pass filter depends on the scaling function. Thus, the time resolution remains at a good level for high frequencies and the frequency resolution remains at a good level for low frequencies. The number of decomposition stages depends on the size of the transform. The reconstruction process is based on the inverted version of the scheme presented in Fig. 2.
The discrete wavelet transform in direct matrix form can be implemented as a stage in the analysis of a two-channel filter bank with a finite impulse response. The two filters h and g with the number of coefficients K each compose the transformation matrix [9][10][11]: Two channel filter banks can be implemented in various ways. These are direct form, polyphase structure, lattice structure and lifting structure [12]. From the efficiency point of view, each of these has its advantages and drawbacks. For example, the lattice structure having good computational complexity cannot be used for all types of filters, and in the case of parallel implementation, its structure forces the need for synchronizations. In the case of polyphase implementations, it is easy to achieve a high degree of parallelism of computations, but the compute complexity is higher than in the case of the lattice form [1]. In the case of a lifting structure, the problem of quantization error arises, which is increased with each step. Despite this, Fig. 2 Example decomposition tree of DWT with filters h and g the lattice structure is acclaimed as an efficient tool in the implementation of two channel filter banks with finite impulse response. The computations for the first K/2 stages of lattice structure-based DWT are described by the Γ i,j operations (denoted by ' • ' in Fig. 3): where s i,j , t i,j are parameters whose values are determined during the factorization process, i = 0, 1, … , K∕2 − 1 and j = 0, 1, … , N∕2 − 1 . A single Γ i,j operation consists of two multiplications and two additions.
The wavelets used in DWT can be classified into two categories: orthogonal and biorthogonal. The coefficients of orthogonal filters are real numbers. Both filters have the same length and are not symmetric. If we mark the low-pass filter as g and the high-pass filter as h, then the following relation should be true: . For biorthogonal filters, the low-pass filter and the high-pass filter have different lengths. The low-pass filter is symmetric and the coefficients are real numbers or integers. There are many functions that can be used to prepare wavelets for the DWT. Some of the most common ones are shown in Fig. 4.
One of the earlier wavelets examples is the Haar wavelet, in today's practice though Daubechies wavelets are very common. Many wavelets are created for specific features. The field of designing wavelets is very wide and still growing. It involves both theoretical aspects of wavelets and their implementation. Wavelets are the result of work in many fields, including mathematics, physics, computer science and others but the target of research is to develop tools for describing functions in time and frequency domain at the same time. Each practical application of DWT requires wavelet bases with different properties. In practical applications, the larger the set of wavelets the better. In order to prepare the correct and useful wavelets, a set of restrictions is imposed. Among many conditions, most important are regularity, symmetry, compact support, orthogonality with polynomials of degree n. It is not possible to achieve all carious properties of wavelets at the same time because different properties may be incompatible with each other. Biorthogonal wavelets have most of the properties of orthogonal wavelets, with the advantage that they are more flexible. There are many more biorthogonal wavelets than orthogonal wavelets. Detailed rules for designing wavelets are discussed in the paper [7]. The DWT finds countless applications today, such as digital signal processing, high-speed data transmission protocols, biometric data and image compression, e.g., in JPEG2000 standard, in case of which DWT enables attaining high compression ratios with good quality of image reconstruction.

Theoretical effectiveness of selected DWT algorithms
We have chosen to base our considerations on two selected implementations of the DWT, namely the direct matrix-based approach, also called the convolutional form, and the lattice structure-based approach. The direct matrix approach and the lattice structure-based one both have features that are particularly important in the context of parallel implementations on a GPU. The convolutional form is very simple, and its matrix structure is beneficial for the GPU architecture. In contrast, the lattice form reduces computational complexity, but its staged structure generates some implementation difficulties and sequentially parallel execution. Moreover, the stepwise structure creates another obstacle for the implementation, because in the traditional approach, the kernel code for the last step should be different than all previous steps, due to the fact that the last step consists only of multiplication, while each An efficient implementation of one-dimensional discrete… previous step consists of two multiplications and two additions. This generates the need to call a special kernel code for the last stage or to use a conditional instruction in the kernel code and both solutions are clearly disadvantageous in terms of GPU computing performance. However, let's take a closer look at the theoretical time complexity of these algorithms. Computational complexity of the direct matrix-based approach to DWT calculation requires C MAT MUL = NK multiplications and C MAT ADD = N(K − 1) additions, where K is the filter length, and N is the size of wavelet transformation. Of course, these values can be lowered for a lattice structure due to the use of a factorization process. Such factorization exploits the dependencies between the filters of the two banks that are inherent in the perfect reconstruction condition imposed on the filter banks. As a result, the computational complexity of the wavelet transform using a lattice structure is characterized by a theoretical computational complexity of C LAT MUL = 1 2 N(K + 2) multiplications and C LAT ADD = 1 2 NK additions. Therefore, it can be said that from the point of view of both multiplication and addition operations, the lattice structure has twice better computational complexity. The relationship between theoretical computational complexities is particularly important in the case of sequential implementations, but in the case of parallel processors, a very precise model is necessary to translate the theoretical complexities into real computational time. Among the group of purely theoretical evaluation tools for parallel algorithms, there is a property called step-complexity [13,14]. Stepwise complexity is defined as the minimum required number of sequential steps necessary for a parallel algorithm to complete its calculations, considering the infinite amount of arithmetic processing units available during its activity. Using this definition and combining the operations of addition and multiplication, we can say that the step-complexity of the matrix algorithm is S MAT ALL = 2K − 1 , while the step-complexity of the lattice algorithm is S LAT ALL = 2K + 2 . Evaluation of parallel algorithms is a tricky issue because the number of elementary operations influencing the computational complexity and memory utilization lose their significance. Instead, important are aspects such as universality, simplicity of the algorithm, ease of implementation and possible modification, degree of utilization of available resources, acceleration achieved, cost, communication complexity, portability. Efforts to develop accurate execution time prediction models for parallel platforms have been carried out for a period many years now, but still no model has been accepted as a general parallel computing model or even GPU limited model. There are simulators, but they are expensive to use in terms of simulation time and difficult to maintain for developers as they require constant updates for new layouts. For this reason, conventional experimental research is still very popular in the case of GPGPU.

Related work
There are some different architectures for implementing a two-channel DWT filter bank. The most popular implementations are direct matrix-based structure, polyphase structure, lifting structure and lattice structure. The transformation itself is highly acclaimed and finds countless practical applications [32]. The classical DWT 1 3 algorithm is a direct implementation of a 2-channel filter bank based on building a transformation matrix [7]. Lattice factorization of two-channel orthogonal filter bank for wavelet transform has been proposed in the following papers [9,15,16]. The effectiveness of the mentioned lattice factorization against existing solutions has been tested for CPU in paper [3] and for GPU in paper [17]. The research on efficient realizations of DWT, e.g., using lattice structures is also very common for both CPU and GPU processors [18].
The proposed implementation of the DWT without wrapping, to the best of our knowledge, is a novelty in the implementation of both the lattice and matrix-based DWT algorithms. By extending the input data sets, we are able to avoid time-expensive modulo divisions and conditional instructions. All implementation details are presented in the next chapter. In addition, we present the performance of the developed implementations for a few different granularities of computations division. As shown in the research [19,20], the partitioning of a computational task directly affects general performance.
Although computational complexities of the proposed algorithms and their stepefficiencies, which we have presented in the previous chapter, can be quite easily derived, we have decided to evaluate the proposed solutions experimentally, along with validation of the achieved numerical accuracy. This comes from the fact that based on our previous experiences we know that in the case of GPU implementations, often only experimental tests can provide a real insight into the algorithms performance. However, it is still worth noting that there are many models for parallel computations, such as, e.g., the general PRAM model (Parallel Random Access Machine) which are useful in, at least coarse, estimation of parallel algorithms execution performance. Unfortunately, many of them are significantly inaccurate because of the complexity of GPUs architectures, which combine together the features of both sequential and parallel computations [5]. We here should note the existing analytical, statistical, simulation-based and hybrid models [21][22][23][24][25][26][27][28][29]. The precision of those models, depending on the type of algorithm, oscillates from about 90% upward, with a maximum prediction mismatch of up to 50%. The interested Reader can find a comprehensive review of the existing models in our paper [30] in which we have also proposed the new execution time prediction model for parallel GPU implementations of the discrete transforms computation algorithms, i.e., for the algorithms analyzed in the present work.

DWT implementations
In this section, we present our self-developed and tested implementations of the DWT computation algorithms along with their reference implementations. During our study, we have tested a significant amount of speed up techniques of the algorithms analyzed in this paper which cannot all be presented in this work for obvious reasons. Among all the variations we have prepared and tested, we have chosen those which gave the best performance increase during the tests. The improvements were made at the level of implementation, compilation and the algorithm structure. We noticed that wrapping the filters within the transformation matrix is computationally expensive so we proposed solution without wrapping for both (matrix-based and lattice) algorithms. We also tested the impact of granularity of the computations on the total time in the case of graphics cards.
For the sake of clarity, we have labeled the selected sequential implementations analyzed in this article as cpu_dwt_m2_ww_ref, cpu_dwt_m2_nw, cpu_dwt_l2_ww, cpu_dwt_l2_nw, and the parallel GPU implementations as gpu_dwt_m1_ww_ref, gpu_dwt_m2_ww, gpu_dwt_m4_ww, gpu_dwt_m2_nw, gpu_dwt_m4_nw, gpu_dwt_ l2_ww and gpu_dwt_l2_nw. The naming convention used for the implementations is designed in such a way to immediately give an intuitive hint on how a given algorithm is implemented. A name starting with "cpu" means that it is an implementation designed for CPUs, while the beginning "gpu" indicates that the implementation is designed for GPUs. The next segment, "dwt", is exactly the same for all implementations, as they all implement the DWT transformation (using different algorithms). The following segment indicates whether the implementation uses a lattice algorithm or a direct matrix approach along with the algorithm's granularity, e.g., "m4" stands for a matrix algorithm with granularity "4", "l2" stands for a lattice algorithm with granularity "2". The next segment specifies if the implementation contains wrapping, "ww" is used for implementations that wrap the DWT filters and "nw" is used for implementations that bypass the filter wrapping. At the very end of the tested implementation name, the "ref" postfix may appear to indicate that we are treating this particular implementation as a reference for the others.
Let us now present the details of the proposed implementations. For the wrapped matrix model, the input signal has a length of N samples and the transformation matrix has a size of N ×N-elements. In the case of the matrix model without wrapping, for an input signal of length N samples the transformation matrix has size N ×(N + K − 2) where K represents the filters length. The input signal is expanded by K − 2 initial samples. Likewise, for the lattice model without wrapping, the input signal must be extended by K − 2 initial samples (K defines the length of the filters). Then, because of the introduction of additional operations in each step, wrapping can be bypassed.
Parallel implementations were developed with different granularity. The granularity choice when implementing an algorithm is especially important for parallel architectures, but it is also affecting sequential program performance. For matrix-based algorithms, granularity "1" means the maximum level of parallelism, unfortunately it means at the same time the need to use the low-pass and high-pass filter selection logic. Granularity "2" eliminates the need to select the appropriate filter for even and odd indexes of the input signal, in other words, means processing of 2 rows of the matrix by a single thread, analogically, granularization "4" means processing of "4" rows of the matrix by a single thread. In the case of the lattice-based DWT algorithms, the most intuitive granularity is a multiple of "2," because every single butterfly operation always works on "2" elements of the input vector. The detailed impact of granularity selection for GPU efficiency has been tested using the Fast Fourier Transform in the following paper [19]. All the code listings, presented in the following section, use a consistent

cpu_dwt_m2_ww_ref -sequential, matrix-based, 2-point, with wrapping
We will start our analysis by presenting a chosen reference implementation. There are many examples of DWT implementations for CPUs in the literature. However, we have chosen an implementation presented in the book [7] whereby the authors proposed a pseudocode for the forward fast wavelet transform, as well as its inversed version. What's more, code was presented in a clear manner and supported by mathematical derivation of the transform formulas with finite sequence of coefficients. The authors of the book are acknowledged professionals in the field of discrete linear transformations and although their aim was not to optimize the implementation, we believe that it is an excellent starting point for optimization and further research. This implementation is representative of direct matrix approach Fig. 5a using modulo division to perform filter wrapping. We have optimized this proposal by reducing number of modulo divisions, integration of for loops and reducing the organisation cost. In our implementation, instead of two internal for loops, there is only one, where single iteration computes two elements of the output vector. The filter arrays h and g remain separated. The implementation is considered a reference implementation for the other sequential implementations. An efficient implementation of one-dimensional discrete… for ( int Nd2 = N / 2, i1 = 0 , i = 0; i < N; i += 2, i1 ++)

cpu_dwt_m2_nw -sequential, matrix-based, 2-point, without wrapping
Based on the reference implementation cpu_dwt_m2_ww_ref, we present an implementation of cpu_dwt_m2_nw. The main difference is that this implementation completely bypasses the need to wrap the filters by using the concept presented in the previous section (Fig. 5b). In fact, it is very similar to cpu_dwt_m2_ww_ref but does not have any modulo operations or conditional statements. To achieve this, the input signal had to be extended by K − 2 initial samples. Therefore, there is an additional loop in the code performing K − 2 iterations. This implementation is the first evaluator of the proposed no-wrap optimization 5(a). The filters remain separated into two arrays. Every single loop iteration computes two samples of the output vector with the use of two separated low-pass and high-pass filters.
Listing 2 Code fragment of the cpu dwt m2 nw implementation.

cpu_dwt_l2_ww -sequential, lattice, 2-point, with wrapping
The cpu_dwt_l2_ww is the first presented lattice implementation with wrapping (Fig. 6a). It has been developed based on the work of published in papers [3,9,15,16]. This implementation uses, instead of two filters, a single vector of K∕2 + 1 samples of factorized two-point base operations with tangent multipliers. Wrapping for butterfly operations out of the array size is implemented with the use of time-efficient helper variables. We will call such an implementation the wrapped one. A single butterfly operation works on two input samples. The number of steps in the structure depends on the size of the input vectors and is K/2, noting that the last step consists of multiplications only.
Listing 3 Code fragment of the cpu dwt l2 ww implementation.

3
An efficient implementation of one-dimensional discrete…

cpu_dwt_l2_nw -sequential, lattice, 2-point, without wrapping
The cpu_dwt_l2_nw is modified cpu_dwt_l2_ww lattice implementation in which wrapping has been bypassed according to the scheme Fig. 6b involving the extension of the input vector by K − 2 initial samples. Like the base implementation, the filters were factored to a single vector of K∕2 + 1 coefficients of the tangent basis operations. The final step of the structure is combined with the organization of the data in the resulting vector. This is a sequential implementation of the lattice algorithm without wrapping.
Listing 4 Code fragment of the cpu dwt l2 nw implementation. y

gpu_dwt_m1_ww_ref -parallel, matrix-based, 1-point, with wrapping
The general approach of parallel computing optimization aims to maximize the level of parallelism. For this reason, the first implementation we present and, at the same time, take as a reference for the following ones is a parallel direct matrix with 1 point granularity. This means that a single thread computes one element of the output vector, and the number of parallel threads is N. For such implementation, it is necessary to switch filters h and g for even and odd elements of the vector, respectively. Because the filters are stored in memory one by one, their selection is done with shifting the index by a constant value. The determination of the parity is done by modulo two divisions. The implementation implements a direct matrix approach with wrapping. We consider it as a reference implementation for the others analyzed later.

gpu_dwt_m2_ww -parallel, matrix-based, 2-point, with wrapping
The gpu_dwt_m2_ww is an optimization of the gpu_dwt_m1_ww_ref implementation.
Here, a reduced granularity of the computational task partitioning is used. The maximum level of parallelism is lower, it is possible to run max N/2 threads in parallel, but each thread is responsible for slightly more computations because it computes two elements of the output vector. There is no need to choose the filter. Both filters are stored in separate arrays h and g. Wrapping is implemented via "if" instruction. The implementation is direct matrix approach with wrapping.
Listing 6 Kernel code of the GPU-B implementation.

gpu_dwt_m4_ww -parallel, matrix-based, 4-point, with wrapping
The gpu_dwt_m4_ww implementation is very similar to the implementation gpu_dwt_ m2_ww. The only difference is once again the reduced level of parallelism. Here, it is possible to run N/4 threads, and each thread is responsible for yet more calculations as it determines 4 elements of the output vector. The filters are stored in separate arrays h and g. The implementation uses a direct matrix approach with wrapping via a conditional "if" instruction.

3
An efficient implementation of one-dimensional discrete… Listing 7 Kernel code of the gpu dwt m4 ww implementation.

gpu_dwt_m2_nw -parallel, matrix-based, 2-point, without wrapping
The gpu_dwt_m2_nw implementation is similar to the implementation gpu_dwt_ m2_ww. The major difference is that filter wrapping is bypassed, as in 5(b). The maximum level of parallelism is the same as for gpu_dwt_m2_ww because a max limit of N/2 parallel threads. The filters are stored in separate arrays. In contrast, there are no conditional instructions or modulo division. To make this possible, the input signal had to be extended by K − 2 of initial samples, which was done in an efficient native way using the cudaMemcpy function. This implementation is a direct matrix approach without wrapping, with granularity labeled by us as 2.
Listing 8 Kernel code of the gpu dwt m2 nw implementation.

gpu_dwt_m4_nw -parallel, matrix-based, 4-point, without wrapping
The gpu_dwt_m4_nw implementation is a direct modification of the implementation of gpu_dwt_m2_nw. The only difference is the twofold reduction in the maximum level of parallelism. The implementation works on up to N/4 threads. Filters without changes are stored in two arrays. Each single thread determines four samples of the output vector. There are no conditional instructions or modulo division. The input signal is extended by K − 2 initial samples. This implementation is a direct matrix approach without wrapping with granularity we label as 4.
Listing 9 Kernel code of the gpu dwt m4 nw implementation.

gpu_dwt_l2_ww -parallel, lattice-based, 2-point, with wrapping
The gpu_dwt_l2_ww is the second last implementation evaluated in this work. However, it is also the first parallel implementation that uses a lattice structure. In general terms, it is a parallel version of the cpu_dwt_l2_ww implementation, additionally optimized to run on GPUs. The implementation has maximum available granularity by running on N/2 parallel threads. Each thread performs a single butterfly operation. The filters are stored in the form of a single auxiliary array, in which both the low-pass and high-pass filters are properly factorized into two-point base operations with tangent multipliers. It is worth noting that the lattice structure is staged, so K/2 kernel function calls are needed. The last stage is different than others, so we added an "if" conditional instruction in the kernel code. The kernel is launched sequentially using for loop. This is a 2 point parallel implementation of a lattice structure with wrapping.

gpu_dwt_l2_nw -parallel, lattice-based, 2-point, without wrapping
The last implementation studied, gpu_dwt_l2_nw, is a modification of the gpu_dwt_ l2_ww that is similar to cpu_dwt_l2_nw, bypass wrapping as proposed in the previous section. Lattice-based kernel for GPU is utilizing N/2 parallel threads. Both h and g filters are transformed and stored in an array, in which both low-pass and highpass filters are properly factorized into two-point basis operations with tangent multiplications. The input signal is extended by K − 2 samples to bypass wrapping with the use of native on-gpu cudaMemcpy function. The kernel code is short and simple. This implementation is a parallel lattice algorithm without wrapping.

Results of experimental research
In this section, we present the experimental results of research on the effectiveness of the proposed DWT implementations on GPUs. The experiments were performed on a server equipped with a GeForce RTX 2080 Ti graphics cards with 4352 CUDA cores each, and an Intel Xeon Silver 4112 CPU processors with 4 cores (8 threads) each and 256GB of DDR4 RAM memory. Programs were written in C language using CUDA Toolkit 11.2. Presented time results are averaged from at least 100 trials. GPU was warmed up before the computations to eliminate the problem of its low performance at first run. To measure the time, we used techniques based on processor cycles. The kernel code time was measured directly on the device using the techniques proposed by the chip manufacturer, namely the "cudaEventRecord" function. Total time and CPU time were measured using "QueryPerformanceCounter" function. All implementations were designed to work on 32-bit floating point variables and provide numerically exactly the same results what was confirmed for each experiment using few error measures. The knowledge about the execution time of kernel functions is critically important for implementation optimization. There are several ways to measure kernel performance. An excellent solution is to use a dedicated profiler. This toolkit is called nvprof and collects detailed information about the timeline of CPU and GPU activity during program execution. It captures details of kernel execution, data transfers and all CUDA API calls. The tool is powerful in helping to understand many of the dependencies, such as the time required for actual computation and the time required for data transfers. For an efficient algorithm implementation, it is essential to balance between communication and computation. If the application spends more time on calculation than on data transfer, it may be possible to overlap these computations and completely hide the delay associated with data transfer. If the application spends less time on calculation than on data transfer, it is important to minimize the transfer between the host and the device. During implementation optimization, it is also possible to determine how the application stands up to the device's theoretical limits. The measured values can be compared to theoretical peak values, and it can be determined whether the application is limited by arithmetic or by memory bandwidth. The device's peak single precision floating operations per second (FLOPS) performance can be determined using the following formula: where P is FP32 peak performance, N cl is GPU clock, N gppd is number of graphics processors per device, N smpgp is number of streaming multiprocessors per graphics processor, N cpsm is number of cores per streaming multiprocessor and N ipc is number of instructions per cycle. Substituting the data of the GeForce RTX 2080 Ti graphics card read directly from device, i.e., N cl = 1545MHz , N gppd = 1 , N smpgp = 68 , N cpsm = 128 , N ipc = 1 , we get the peak performance for the tested GPU: The Peak Memory Bandwidth can be determined in the similar way using the following expression: where B is the peak memory bandwidth, N gppd is the number of graphics processors per device, N mcl is the memory clock, N mbw is the memory bus width and M is memory type multiplayer, for GDDR3 M = 2 , for GDDR5 M = 4 and for GDDR6 M = 8 . By applying device parameters to the above formula, we get: To measure the achieved percentage of utilization with respect to the theoretical maximum, we used the newest NVIDIA Nsight Compute toolset. The study clearly revealed that the highest level out of all, in case of computing SoL, achieves the gpu_dwt_m2_nw implementation with the value of 54% in peak. In the case of the memory SoL criterion, the best performing implementation is gpu_dwt_l2_nw with a value of 83% in peak. Hence, both leading implementations are non-wrapping ones. CUDA toolkit provides a very precise timing tool on the GPU side based on events. Despite the fact that the kernel call is asynchronous for the host, the cudaDe-viceSynchronize command can wait for all parallel threads to finish computations. An event in CUDA is a marker in a CUDA stream associated with a certain point in the flow of operations in that stream. It is possible to measure the elapsed time of CUDA operations marked by two events using the cudaEventElapsedTime function. This function returns the elapsed time in milliseconds between two events. The events start and stop do not need to be associated with the same CUDA stream. Such a timing technique is also widely accepted for performance studies in the case of GPUs [5].
In Table 1, we list the most important characteristics of the GPU we've used in our tests. The results in Figs. 7, 8, 9, 10, 11 and 12 show the actual computation times in milliseconds. The columns successively represent the size of the transformation formerly marked as N. The size of the input vectors has always been powers (2) P = N cl * N gppu * N smpgp * N cpsm * N ipc , P = 154MHz * 1 * 68 * 128 * 1 = 13, 447 * 10 6 MFLOPS = 13, 45 TFLOPS.  Total time includes preprocessing, postprocessing and copying data to and from the device. We consistently use the proposed names of the implementations that we discussed in detail in the previous section.

Summary of experimental research on CPU
We have tested four sequential implementations running on the CPU: cpu_dwt_m2_ ww_ref, cpu_dwt_m2_nw, cpu_dwt_l2_ww, and cpu_dwt_l2_nw. The algorithms used in cpu_dwt_m2_ww_ref and cpu_dwt_m2_nw are based on the convolutional approach, while in cpu_dwt_l2_ww and cpu_dwt_l2_nw are based on the lattice structure. We have evaluated the standard implementations with wrapping and the proposed implementations without wrapping. The implementation used in cpu_dwt_ m2_ww_ref and cpu_dwt_l2_ww performs wrapping with the use of conditional instructions or modulo division. In the case of the algorithms cpu_dwt_m2_nw and cpu_dwt_l2_nw, no wrapping was needed because the modification described earlier. The implementation cpu_dwt_m2_ww_ref is considered to be the reference implementation. All mentioned implementations were self-developed and tested. The first insight confirms the theoretical computational complexity, and the lattice structure is superior to the matrix implementation. When comparing the implementation of cpu_dwt_m2_ww_ref to cpu_dwt_l2_ww, namely, the classic matrixbased implementation to the lattice structure without wrapping, it turns out that the lattice structure is clearly faster. In fact, average advantage after all the experiments is even larger than expected since it is 2.21× , while the theoretical advantage is at a level of about 2× . The next conclusion is that the reference method cpu_dwt_m2_ ww_ref can be clearly accelerated with wrapping elimination. The implementation without wrapping is clearly faster than the implementation with wrapping. The speedup after averaging all results is 1.85× , and still the cpu_dwt_m2_nw compared to cpu_dwt_m2_ww_ref has slightly worse computational and memory complexity.
After evaluation of the lattice implementation without wrapping, cpu_dwt_l2_nw, it turns out that also in the case of the lattice implementation, bypassing the wrapping allows to speed up the computations, but speedup this time is minimal and is close to 3%. However, it is enough to make the cpu_dwt_l2_nw implementation the most efficient among all tested ones.
This part can be summarized in two points. First, for the implementation of the DWT for CPUs, the non-wrapping algorithms indicate better performance than the algorithms with wrapping. Second, the lattice algorithm without wrapping is the most efficient one tested, although its advantage over the lattice algorithm with wrapping and the optimized matrix algorithm is quite low.

Summary of experimental research on GPU
We have prepared seven parallel implementations profiled for GPUs. These implementations are gpu_dwt_m1_ww_ref, gpu_dwt_m2_ww, gpu_dwt_m4_ww, gpu_dwt_m2_nw, gpu_dwt_m4_nw, gpu_dwt_l2_ww, and gpu_dwt_l2_nw. The algorithms used in gpu_dwt_m1_ww_ref, gpu_dwt_m2_ww, gpu_dwt_m4_ww, gpu_ dwt_m2_nw, gpu_dwt_m4_nw are implementations of the traditional convolutional algorithm, while for the implementations of gpu_dwt_l2_ww and gpu_dwt_l2_nw, a lattice structure was used.
We have studied the traditional implementations with wrapping and the proposed implementations without wrapping. Implementations with wrapping are gpu_dwt_ m1_ww_ref, gpu_dwt_m2_ww, gpu_dwt_m4_ww, gpu_dwt_l2_ww and without wrapping are gpu_dwt_m2_nw, gpu_dwt_m4_nw and gpu_dwt_l2_nw. In addition, we have also studied the granularity aspect of partitioning the computational task by reducing the overall level of parallelism by charging individual threads with more computations. Thus, we came up with implementations using 1-point granularity: gpu_dwt_m1_ww_ref, 4-point granularity: gpu_dwt_m4_ww , gpu_dwt_m4_nw and 2-point granularity: gpu_dwt_m2_ww, gpu_dwt_m2_nw, gpu_dwt_l2_ww, gpu_ dwt_l2_nw. As reference approach, we have chosen the gpu_dwt_m1_ww_ref implementation because it is closest to the traditional parallel implementation of the DWT algorithm in matrix-based form.
This time, we measured the total time as well as the time of the kernel function itself. The approach of presenting only kernel time is well known there, because in future, all computations can hopefully be performed internally within the GPU. However, currently, the GPU and CPU cooperation is still the main solution. Therefore, we believe that for our implementations, it is also appropriate to present the total processing time, because, e.g., for an implementation without wrapping, the input signal has to be extended, which, although it is already performed on the GPU by simple memory copying operations, does not belong to the kernel code but is a fundamental part of the algorithm.
Looking only at the kernel code time, one can easily see a very interesting relationship. Implementations gpu_dwt_l2_ww and gpu_dwt_l2_nw which are both tested variants of the lattice algorithm have significantly lower performance than any implementation of the matrix-based algorithm. The loss of the lattice algorithm is very noticeable and what is more, it grows as the amount of processed data increases. This is due to the design of the lattice algorithm for which synchronizations after each stage are essential. Thus, the cost of organizing the computations is so large that it has a greater impact on the total time than the computations themselves. A comparison of kernel codes for the gpu_dwt_l2_ww and gpu_dwt_l2_nw, i.e., the lattice structure with and without wrapping, indicates almost identical performance which only confirms the earlier point.
In terms of matrix-based implementations, the versions with wrapping, without wrapping, with 2 point granularity, and with 4 point granularity namely gpu_dwt_ m2_ww, gpu_dwt_m2_nw, gpu_dwt_m4_ww, gpu_dwt_m4_nw, respectively, have nearly identical kernel code performance. One can see a minimal advantage of the optimized 2-point implementation without wrapping, i.e., gpu_dwt_m2_ww. In addition, changing the granularity to 4-point, which means reducing the total number of threads working in parallel, does not positively affect the computation time. Finally, it is worth mentioning that the reference implementation, i.e., gpu_dwt_m1_ ww_ref is noticeably slower than the proposed implementations with 2-point granulation. The device limits expressed in TFLOPS are most closely approached by the gpu_dwt_m4_nw implementation, this is due to the high ratio of instructions in the kernel code to execution time.
In the case of the total processing time where preprocessing, postprocessing and all copy times to and from the device are included, the situation is slightly different. First, it should be noted that the total time is shorter than the CPU computation time only for large transformations and long filters slightly exceeding 30% speedup. Of course, the advantage would be clear when using filters with a much more samples, but we focused on the practical cases, so the filters of lengths of 4, 8 and 12 points were chosen. The results are very encouraging because it is clear that after the shift to 2D signals, the use of GPU will surely be beneficial.
The first conclusion is that total times for all implementations are very close to each other. Only the 1-point (reference) implementation, i.e., gpu_dwt_m1_ww_ref is, on average, a slightly slower, however, this did not stop it from being very competitive in the case of short input signals. The 4-point granulation applied to the gpu_dwt_m4_ww and gpu_dwt_m4_nw implementations similarly to the kernel time itself did not provide a clear advantage, however again, the times are very close. When considering total time, both lattice implementations only perform better than the reference implementation. The simple matrix form, although characterised by higher computational complexity, is faster. The times of all the proposed matrix implementations are very similar. Bypassing wrapping, in each case, gives at least a minimal advantage. The dependencies gained on the CPU do not carry over to the GPU, so formulating such clear conclusions as for the CPU is not possible. There is no single best solution, but the results presented should help one to choose the most beneficial implementation for a specific computational problem.

Perspective on further research
Based on the presented conclusions, we infer that it undoubtedly seems to be an interesting direction to study the developed implementations for 2D signals. This will surely be the part of our future research on the performance of DWT implementations on GPUs. Once again, it is interesting to focus on lattice structure and matrix-based implementations, because from the computational point of view, they represent inherently different approaches in terms of their computational structures when implemented on GPUs.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/ licenses/by/4.0/.