Hybrid CPU–GPU implementation of the transformed spatial domain channel estimation algorithm for mmWave MIMO systems

Hybrid platforms combining multicore central processing units (CPU) with many-core hardware accelerators such as graphics processing units (GPU) can be smartly exploited to provide efficient parallel implementations of wireless communication algorithms for Fifth Generation (5G) and beyond systems. Massive multiple-input multiple-output (MIMO) systems are a key element of the 5G standard, involving several tens or hundreds of antenna elements for communication. Such a high number of antennas has a direct impact on the computational complexity of some MIMO signal processing algorithms. In this work, we focus on the channel estimation stage. In particular, we develop a parallel implementation of a recently proposed MIMO channel estimation algorithm, and evaluate its performance in terms of execution time both on a multicore CPU and on a GPU. The results show that some computing blocks of the algorithm are more suitable for a multicore implementation, whereas other parts are more efficiently implemented on the GPU, indicating that, on the tested platform, a hybrid CPU-GPU implementation would achieve the best performance in practical applications.


Introduction
High performance computers with parallel computing capabilities are used in multiple segments of industry [1,2]. Most of the current fastest supercomputers are built from nodes containing several multicore central processing units (CPU) and one or more graphics processing units (GPU). Many computationally expensive problems can be tackled by appropriately leveraging both kinds of processing elements, or a combination of them through hybrid CPU-GPU implementations. This requires an application-specific analysis to identify possible bottlenecks and to properly match each computational resource to the appropriate task, so that overall computational performance is maximized. In particular, digital signal processing for wireless communications is one of the fields that has benefited most from these devices, as is the case for efficient implementations of multiple-input multiple-output (MIMO) algorithms [3,4].
In current 5G and future 6G systems, the use of millimetre wave (mmWave) frequencies is one of the key technology drivers towards achieving enhanced Mobile Broadband (eMBB) services [5,6]. At mmWave frequencies, the use of beamforming techniques with highly directional beams is mandatory to increase the gain of the communication link between transmitter (Tx) and receiver (Rx). This is achieved in practice by using massive MIMO systems with several tens or hundreds of antenna elements. One of the drawbacks of massive MIMO communications is that they often require more complex signal processing techniques, which could benefit from efficient implementations. In this work, we focus on the channel estimation stage of a massive MIMO system. Recently, a Transformed Spatial Domain Channel Estimation (TSDCE) method for analog mmWave MIMO systems was proposed in Ref. [7], and it was shown to outperform other approaches based, for instance, on orthogonal matching pursuit (OMP) [8] or Discrete Fourier Transform (DFT) processing [9]. The core operations of the method are mainly based on the Fast Fourier Transform (FFT) and the Singular Value Decomposition (SVD), which are suitable for parallel processing. Although Ref. [7] shows that the complexity of TSDCE is below that of well-known channel estimation schemes, it could still benefit from a parallel implementation of its main blocks.
A first approach to efficiently implementing the TSDCE method was carried out by the authors in Ref. [10], by exploring the parallelism of the quad-core ARM Cortex-A53 processor contained in the embedded Xilinx Zynq UltraScale+ EG heterogeneous MPSoC [11]. The results of that multicore implementation allowed us to identify some constraints and bottlenecks. For instance, the multicore implementations of the FFT-based blocks were in some cases less efficient than the sequential implementation, suggesting that a GPU-based implementation could be better suited for such blocks. Furthermore, cellular base stations could in practice accommodate non-embedded devices with larger general purpose GPUs, which could run channel estimation methods such as TSDCE.

In this work, we assess the potential benefits of a hybrid CPU-GPU implementation of the TSDCE method. We now consider a general purpose platform composed of an eight-core CPU and a high-end CUDA-based GPU, which allows us not only to combine CPU and GPU code, but also to exploit the potential of the numerical algebra libraries on both sides, namely BLAS [12] and LAPACK [13] on the CPU, and cuBLAS and cuSOLVER on the GPU. The execution times on CPU and GPU of the three main computing blocks of the algorithm (two of them based on the FFT and the third on the SVD) are assessed and compared with the sequential execution time on a single CPU core. The reason for changing the platform with respect to Ref. [10] is that the latter is of an embedded nature and features a low-end Mali GPU [14] with limited floating-point performance [15,16]. The results indicate that each main computing block benefits from a different implementation, supporting the need to complement the multicore-based analysis in Ref. [10] with an overall hybrid CPU-GPU implementation assessment.
The remainder of the paper is structured as follows. Section 2 presents the basis of the TSDCE channel estimation procedure. In Sect. 3, we describe the hybrid CPU-GPU implementation of the TSDCE. Next, in Sect. 4, we detail the experimental evaluation. Finally, Sect. 5 provides some concluding remarks.

Channel model
Let us assume a single-user mmWave MIMO geometric channel [17,18], where both the Tx and Rx are equipped with a uniform linear array (ULA) of $n_t$ and $n_r$ antenna elements, respectively. $L$ denotes the number of scatterers, each one contributing a single Tx-Rx propagation path. The complex channel coefficient of each path is denoted by $\alpha_l$, $l = 1, \ldots, L$, while $\theta_l$ and $\phi_l$ stand for the Angle-of-Arrival (AoA) and Angle-of-Departure (AoD), respectively. Using the full parametric model, the channel is expressed as

$$\mathbf{H}(\boldsymbol{\eta}) = \sqrt{\frac{n_t n_r}{L}} \sum_{l=1}^{L} \alpha_l\, \mathbf{a}_r(\theta_l)\, \mathbf{a}_t^H(\phi_l),$$

where $\boldsymbol{\eta} = [\,|\alpha_1|, \angle\alpha_1, \theta_1, \phi_1, \ldots, |\alpha_L|, \angle\alpha_L, \theta_L, \phi_L\,]^T$ is the parameter vector. Note that $|\alpha_l|$ and $\angle\alpha_l$ stand for the magnitude and phase of each channel coefficient. The $\alpha_l$ are independent identically distributed (i.i.d.) random variables with distribution $\alpha_l \sim \mathcal{CN}(0, \sigma_\alpha^2)$. The AoAs ($\theta_l$) and AoDs ($\phi_l$) are drawn from a uniform distribution. The antenna array responses at the Tx and Rx, under a half-wavelength antenna separation assumption, can be expressed, respectively, as

$$\mathbf{a}_t(\phi_l) = \frac{1}{\sqrt{n_t}}\left[1,\; e^{j\pi\sin(\phi_l)},\; \ldots,\; e^{j\pi(n_t-1)\sin(\phi_l)}\right]^T,$$

$$\mathbf{a}_r(\theta_l) = \frac{1}{\sqrt{n_r}}\left[1,\; e^{j\pi\sin(\theta_l)},\; \ldots,\; e^{j\pi(n_r-1)\sin(\theta_l)}\right]^T.$$
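To make the model concrete, the following minimal C++ sketch builds the ULA steering vectors and the channel matrix defined above. All function and variable names are illustrative (they do not come from Ref. [7]), and the normalization follows the half-wavelength expressions just given.

```cpp
// Minimal sketch of the geometric mmWave channel model above:
// H = sqrt(n_t*n_r/L) * sum_l alpha_l * a_r(theta_l) * a_t(phi_l)^H.
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// ULA steering vector with half-wavelength spacing:
// a(x) = (1/sqrt(n)) [1, e^{j*pi*sin(x)}, ..., e^{j*pi*(n-1)*sin(x)}]^T
std::vector<cd> steering(double angle, int n) {
    std::vector<cd> a(n);
    for (int k = 0; k < n; ++k)
        a[k] = std::polar(1.0 / std::sqrt((double)n), M_PI * k * std::sin(angle));
    return a;
}

// Channel matrix H (n_r x n_t, row-major) for L paths with complex
// gains alpha, AoAs theta and AoDs phi.
std::vector<cd> channel(const std::vector<cd>& alpha,
                        const std::vector<double>& theta,
                        const std::vector<double>& phi,
                        int n_r, int n_t) {
    const int L = (int)alpha.size();
    const double g = std::sqrt((double)n_r * n_t / L);
    std::vector<cd> H(n_r * n_t, cd(0.0, 0.0));
    for (int l = 0; l < L; ++l) {
        auto ar = steering(theta[l], n_r);
        auto at = steering(phi[l], n_t);
        for (int i = 0; i < n_r; ++i)
            for (int j = 0; j < n_t; ++j)
                H[i * n_t + j] += g * alpha[l] * ar[i] * std::conj(at[j]);
    }
    return H;
}
```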

The problem of estimating the channel $\mathbf{H}(\boldsymbol{\eta})$ thus becomes the problem of estimating the parameters in $\boldsymbol{\eta}$. MmWave channels are highly sparse, as demonstrated by measurements [19]. Hence, the $L$ paths are likely to be well separated from each other, which simplifies the channel estimation task.

Pilot-based training phase
A step prior to the estimation of the channel using TSDCE or similar methods is the open-loop pilot-based training phase, in which a set of possible Tx and Rx directional beams is tested. More specifically, the beam search space is given by a codebook comprising $P$ and $Q$ codewords/directions at the Tx and Rx sides, respectively. An observation matrix is formed after transmitting a pilot symbol through the $Q \times P$ direction combinations. The process is as follows: a pilot symbol $s$, known by both Tx and Rx, is transmitted and received through subsets of $P \le N_{\max}$ and $Q \le N_{\max}$ spatial directions, respectively, where $N_{\max}$ denotes the maximum number of angle quantization levels. This parameter is limited because realistic phase shifters with finite angle resolution are assumed. In the analog beamforming case, the Tx and Rx have only one radiofrequency chain, so the beamforming and combining operations are carried out in the analog domain [7].
Beamforming/combining vectors are calculated to match the channel response [20], so $\mathbf{f}_p = \mathbf{a}_t(\bar{\phi}_p)$ for $p = 0, 1, \ldots, P-1$, and $\mathbf{w}_q = \mathbf{a}_r(\bar{\theta}_q)$ for $q = 0, 1, \ldots, Q-1$. For each $\{q, p\}$ direction pair, the received signal is

$$y_{q,p} = \sqrt{\rho}\, \mathbf{w}_q^H \mathbf{H}(\boldsymbol{\eta})\, \mathbf{f}_p\, s + \mathbf{w}_q^H \mathbf{n}_{q,p},$$

where $\rho \in \mathbb{R}^+$ is the transmit power. The noise term $\mathbf{n}_{q,p} \sim \mathcal{CN}(\mathbf{0}, \boldsymbol{\Sigma}_n)$ is a complex additive white Gaussian noise vector with covariance $\boldsymbol{\Sigma}_n = \sigma_n^2 \mathbf{I}_{n_r}$, where $\mathbf{I}_{n_r}$ stands for the $n_r \times n_r$ identity matrix. The symbol $s$ is set to 1 during training for the sake of simplicity. Hence, the system signal-to-noise ratio is given by $\rho/\sigma_n^2$. The observation matrix $\mathbf{Y} \in \mathbb{C}^{Q \times P}$, with entries $[\mathbf{Y}]_{q,p} = y_{q,p}$, is obtained after transmitting the pilot symbols through the $Q \times P$ directions:

$$\mathbf{Y} = \tilde{\mathbf{Y}}(\boldsymbol{\eta}) + \mathbf{N}.$$

The noise matrix $\mathbf{N} \in \mathbb{C}^{Q \times P}$ contains i.i.d. $\mathcal{CN}(0, \sigma_n^2)$ elements, and $\tilde{\mathbf{Y}}(\boldsymbol{\eta}) \in \mathbb{C}^{Q \times P}$ encodes the channel parameter vector $\boldsymbol{\eta}$.
Elements in the observation matrix can be rearranged to separate the effect of the different path components:

$$\tilde{\mathbf{Y}}(\boldsymbol{\eta}) = \sum_{l=1}^{L} \tilde{\mathbf{Y}}^{(l)}(\boldsymbol{\eta}_l).$$

In the above expression, the observation matrix is written as a sum of path contributions $\tilde{\mathbf{Y}}^{(l)}(\boldsymbol{\eta}_l) \in \mathbb{C}^{Q \times P}$, each one depending on a per-path parameter vector $\boldsymbol{\eta}_l = [\,|\alpha_l|, \angle\alpha_l, \theta_l, \phi_l\,]^T$. This formulation paves the way for the implementation of the TSDCE method.
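As an illustration of the training sweep, the following hedged C++ sketch fills the $Q \times P$ observation matrix entry by entry (with $s = 1$). The uniform angular grids for $\bar{\theta}_q$ and $\bar{\phi}_p$ and the noise generation are assumptions made for this example, and steering() is the helper from the previous sketch.

```cpp
// Sketch of the pilot-based training sweep: fills the observation matrix
// with y_{q,p} = sqrt(rho) * w_q^H * H * f_p + noise (pilot s = 1).
#include <cmath>
#include <complex>
#include <random>
#include <vector>

using cd = std::complex<double>;

std::vector<cd> steering(double angle, int n);  // defined in the previous sketch

std::vector<cd> observation(const std::vector<cd>& H,   // n_r x n_t, row-major
                            int n_r, int n_t, int Q, int P,
                            double rho, double sigma_n) {
    std::mt19937 rng(0);
    // Each of the real/imaginary parts has variance sigma_n^2 / 2.
    std::normal_distribution<double> gauss(0.0, sigma_n / std::sqrt(2.0));
    std::vector<cd> Y(Q * P);
    for (int q = 0; q < Q; ++q) {
        // Combining vector w_q = a_r(theta_bar_q) on an assumed uniform grid.
        auto w = steering(-M_PI / 2 + M_PI * q / Q, n_r);
        for (int p = 0; p < P; ++p) {
            // Beamforming vector f_p = a_t(phi_bar_p), same assumption.
            auto f = steering(-M_PI / 2 + M_PI * p / P, n_t);
            cd acc(0.0, 0.0);                   // w^H * H * f
            for (int i = 0; i < n_r; ++i) {
                cd hf(0.0, 0.0);
                for (int j = 0; j < n_t; ++j) hf += H[i * n_t + j] * f[j];
                acc += std::conj(w[i]) * hf;
            }
            Y[q * P + p] = std::sqrt(rho) * acc + cd(gauss(rng), gauss(rng));
        }
    }
    return Y;
}
```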

TSDCE method
TSDCE is an iterative algorithm that works on the observation matrix to estimate the parameters of the channel. Fig. 1 shows the main steps, or building blocks, of the TSDCE. It also illustrates the input and output signals of the most relevant steps through an example in which the observation matrix shows the presence of three significant paths in the channel (i.e. $L = 3$). The process starts with the estimation of the parameters of the largest-gain path component ($l = 1$). First, a two-dimensional inverse FFT (2D-IFFT) is applied to the observation matrix. Then, a cropping procedure extracts the upper-left submatrix, which contains the relevant information. The remaining paths are estimated with the same procedure, except that the cropped matrix is updated by successive interference cancellation to remove the contribution of the already estimated paths. By computing the SVD of the resulting matrix, an estimate of the contribution of the current path component is obtained as a rank-one approximation by means of the dominant singular value. In the next step, a 2D sample autocorrelation function (2D-ACF) is applied owing to its denoising properties. The phase angles of the elements of the resulting matrix contain the information for estimating $\theta_l$ and $\phi_l$. Finally, as discussed in Ref. [7], spatial frequency estimation is carried out in a four-stage process, which also allows the path complex coefficient to be estimated (both $|\hat{\alpha}_l|$ and $\angle\hat{\alpha}_l$). Note that the most computationally demanding blocks of the TSDCE algorithm are highlighted in grey in Fig. 1. A structural sketch of the overall iteration is given below.
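The following C++ skeleton reflects the data flow of Fig. 1 only; every helper signature is a placeholder for the corresponding building block described above, not the reference implementation of Ref. [7].

```cpp
// Structural sketch of the TSDCE iteration in Fig. 1; helpers are placeholders.
#include <complex>
#include <vector>

using Matrix = std::vector<std::complex<double>>;      // flattened matrix
struct PathEstimate { double mag, phase, aoa, aod; };  // |alpha|, angle(alpha), theta, phi

Matrix ifft2(const Matrix& Y);                // 2D-IFFT of the observation matrix
Matrix crop(const Matrix& C);                 // keep the upper-left submatrix
Matrix cancel(const Matrix& Cbar,             // successive interference cancellation
              const std::vector<PathEstimate>& found);
Matrix rank_one(const Matrix& Cprime);        // SVD, keep dominant singular value
Matrix acf2(const Matrix& R1);                // denoising 2D sample autocorrelation
PathEstimate freq_estimate(const Matrix& A);  // four-stage spatial frequency estimation

std::vector<PathEstimate> tsdce(const Matrix& Y, int L) {
    Matrix Cbar = crop(ifft2(Y));             // transform + crop, done once
    std::vector<PathEstimate> paths;
    for (int l = 0; l < L; ++l) {             // largest-gain path first
        Matrix Cprime = cancel(Cbar, paths);  // remove already-estimated paths
        Matrix R1     = rank_one(Cprime);     // current path contribution
        Matrix A      = acf2(R1);             // phase angles carry theta_l, phi_l
        paths.push_back(freq_estimate(A));    // estimate eta_l for this path
    }
    return paths;
}
```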

Hybrid CPU-GPU TSDCE implementation
The implementation of the TSDCE method used a desktop computer with an NVIDIA GeForce RTX 3060 GPU and an eight-core Intel Core i7-9700F CPU running at 3.0 GHz with 4 GB of DDR4-2666 RAM, although only 4 of the 8 cores have been used, as in Ref. [10]. Each CPU core is equipped with a 32 KiB L1 data cache, a 32 KiB L1 instruction cache and a 256 KiB L2 cache, and the cores share a 12 MB L3 cache. The maximum memory bandwidth is 41.6 GB/s. The GPU is based on the recent NVIDIA Ampere architecture and gathers a total of 3584 CUDA cores spread across 28 streaming multiprocessors.

We tackled the TSDCE implementation by focusing on the three most computationally demanding individual blocks: IFFT, SVD and 2D-ACF (see Fig. 1), with the aim of exploring an efficient hybrid solution. The GPU implementation used the cuBLAS and cuSOLVER libraries for linear algebra, and the cuFFT library for the FFT. The 2D-ACF block has been implemented mainly through an IFFT, a matrix conjugate and a matrix product. The implementation on the multicore CPU has been done through the OpenMP programming model with the analogous linear algebra libraries, namely BLAS [12] and LAPACK [13], and the FFTW library [21]. The specific routines used for the multicore CPU implementation were already reported in Ref. [10]. Table 1 shows the specific routines used for the GPU implementation and a description of their usage. We detail next which functions have been used for each of the blocks under study (a sketch of the 2D-ACF block follows this list):

- IFFT block: the implementation of the IFFT block requires only the cufftPlanMany and cufftExecZ2Z routines.
- SVD block: first, the cublasZgeam routine is called to obtain the transpose of the interference-cancelled matrix. Then, its SVD is computed using the cusolverDnZgesvd_bufferSize and cusolverDnZgesvd routines. Finally, to calculate the rank-one approximation after the SVD, the cublasZgemm routine is called.
- 2D-ACF block: to implement this block, first an FFT is obtained through the cufftPlan2d and cufftExecZ2Z routines. Next, as an intermediate step, an element-wise matrix product is performed. Finally, the inverse FFT is applied to the result of the previous step.
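The following minimal CUDA sketch shows how the 2D-ACF block could be composed from the routines named above, using the Wiener–Khinchin identity $\mathrm{ACF}(\mathbf{X}) = \mathrm{IFFT2}\big(\mathrm{FFT2}(\mathbf{X}) \odot \mathrm{FFT2}(\mathbf{X})^*\big)$. The kernel name and launch configuration are illustrative, and error checking is omitted for brevity.

```cpp
// Hedged sketch of the 2D-ACF block on the GPU via cuFFT.
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <cufft.h>

// Element-wise F .* conj(F) = |F|^2, stored as a complex value.
__global__ void mag2(cuDoubleComplex* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] = cuCmul(f[i], cuConj(f[i]));
}

// In-place 2D-ACF of a rows x cols complex matrix resident on the GPU.
void acf2_gpu(cuDoubleComplex* d_x, int rows, int cols) {
    cufftHandle plan;
    cufftPlan2d(&plan, rows, cols, CUFFT_Z2Z);

    // Forward 2D FFT (in place).
    cufftExecZ2Z(plan, d_x, d_x, CUFFT_FORWARD);

    // Element-wise product with the conjugate.
    int n = rows * cols;
    mag2<<<(n + 255) / 256, 256>>>(d_x, n);

    // Inverse 2D FFT; cuFFT leaves the result unnormalized, so a 1/n
    // scaling (e.g. via cublasZdscal) would follow here.
    cufftExecZ2Z(plan, d_x, d_x, CUFFT_INVERSE);

    cufftDestroy(plan);
}
```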
Table 1 also describes the cublasCreate, cublasZdscal, cublasSetMatrix, cublasGetMatrix and cublasZcopy routines of the cuBLAS library, which perform auxiliary steps such as initialization, copies and data transfers, and which have also been used in the SVD and 2D-ACF blocks as needed. A minimal illustration of this auxiliary pattern is given below.
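As a usage example of these auxiliary routines, the following hedged snippet uploads a matrix, scales it on the GPU (e.g. the $1/n$ normalization after an inverse FFT) and downloads the result. The function name and the column-major layout are assumptions of this example.

```cpp
// Illustrative host-side pattern for the auxiliary cuBLAS routines in
// Table 1: handle creation, host-to-device upload, scaling, download.
#include <cuComplex.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

void scale_on_gpu(cuDoubleComplex* h_A, int rows, int cols, double factor) {
    cublasHandle_t handle;
    cublasCreate(&handle);                       // library initialization

    cuDoubleComplex* d_A;
    cudaMalloc(&d_A, sizeof(cuDoubleComplex) * rows * cols);

    // Upload the column-major rows x cols matrix to the device.
    cublasSetMatrix(rows, cols, sizeof(cuDoubleComplex), h_A, rows, d_A, rows);

    // Scale all elements by a real factor (e.g. the 1/n IFFT normalization).
    cublasZdscal(handle, rows * cols, &factor, d_A, 1);

    // Download the result back to the host.
    cublasGetMatrix(rows, cols, sizeof(cuDoubleComplex), d_A, rows, h_A, rows);

    cudaFree(d_A);
    cublasDestroy(handle);
}
```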

Experiments
The experimental evaluation assumes equal values for the parameters $P$, $Q$, $n_r$ and $n_t$ for the sake of simplicity (note that in more realistic systems these parameters may differ), which implies that the size of the observation matrix is directly $n_r \times n_t$. The evaluated values of this common dimension are 16, 32, 64, 128 and 256. Although the TSDCE algorithm also depends on the number of path components $L$, the evaluations in this work focus on the execution time needed to estimate one of the paths; the results are therefore independent of $L$. In order to better assess the performance enhancement of the GPU and multicore CPU implementations, the speedup measure has been used, defined as the quotient of the execution time with a single CPU core over the execution time with the GPU or with several CPU cores (2, 3 or 4). For each block under analysis and each processing element, 30 time measurements have been recorded and sorted in ascending order; the final speedup value has been calculated from the average of the 5 central values to avoid bias due to outliers.

The execution time of the IFFT block is assessed in Fig. 2, which represents the speedup for the IFFT block using 2, 3 or 4 CPU cores and using the GPU as the number of antennas increases. Note that these results depend exclusively on $P$ and $Q$. It can be observed that the GPU achieves the largest speedup values in all cases, around 2. The maximum speedup is 2.75 and appears for $n_r = 256$, indicating that in massive MIMO systems the IFFT block largely benefits from a GPU implementation. In fact, the speedup values of the multicore implementations are in several cases below 1, an effect that was already observed in Ref. [10] for the implementation of the 2D-ACF block, which is also based on the FFT.
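In LaTeX form, the speedup reported in Figs. 2, 3 and 4 for a processing element $k$ (2, 3 or 4 CPU cores, or the GPU) is computed from the trimmed means of the 30 sorted measurements $t_{(1)} \le \cdots \le t_{(30)}$ described above:

```latex
% Trimmed mean of the 30 sorted execution-time measurements
% (the 5 central order statistics), and the resulting speedup
% of processing element k over the sequential single-core run.
\bar{t} = \frac{1}{5}\sum_{i=13}^{17} t_{(i)},
\qquad
S_k = \frac{\bar{t}_{\mathrm{seq}}}{\bar{t}_k}
```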
The speedup results for the SVD block are shown in Fig. 3. Note that the SVD execution time depends mainly on $n_r$ and $n_t$. As can be seen, in this case the speedup is larger for the multicore CPU options than for the GPU, regardless of the number of cores, with a maximum of 1.6, i.e. lower than that of the IFFT block. It is worth noting that the GPU speedup is below 1, which implies that the execution of the SVD on the GPU is inefficient and slower than the single-core CPU implementation.
Finally, Fig. 4 shows the speedup for the 2D-ACF block as a function of the number of antennas. The speedup values are in all cases higher with the GPU, showing a remarkable increase with respect to the CPU speedup values for $n_r \ge 64$. For instance, when $n_r = 256$, the GPU speedup is larger than 4, in contrast to only around 1.5 on the CPU. Since the 2D-ACF is based on FFT processing, it is also interesting to compare its speedup results with those of the IFFT block in Fig. 2. For $n_r \le 64$, the GPU speedups follow similar trends for both blocks, with larger speedup values in the IFFT case. This difference may come from the data transfers between GPU and CPU required by the additional operations of the 2D-ACF. However, when $n_r \ge 128$, the GPU speedup for the 2D-ACF block is much higher than for the IFFT. As the number of antennas grows, the time spent on data transfers in the 2D-ACF becomes negligible within the overall execution time, which becomes dominated by the algebraic operations on large matrices.

Conclusion
In this paper, a hybrid CPU-GPU implementation of a transformed spatial domain MIMO channel estimation algorithm has been presented, with the aim of finding the optimal match between each computational resource and each computing block of the algorithm. The results have been obtained in terms of execution time for a single-path channel estimation and are reported as the speedup over a single-core implementation. Under the assumption of equal antenna and beam search space dimensions, it was observed that the multicore resources are less efficient for the FFT-related computing blocks, showing speedup values below 1 in several cases, as already happened in a previous work focused exclusively on a multicore implementation. Regarding the SVD-related computations, the multicore options are more advantageous than GPU processing, confirming that each of the three main computing blocks of the algorithm could benefit from a different implementation type. In future work, the general multi-path case will be considered in order to explore the impact of the iterative nature of the algorithm on the optimal management of the computational resources.
Author contributions All authors contributed equally to this work.

Availability of data and materials No additional data or materials are available.

Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.