Performance analysis of a millimeter wave MIMO channel estimation method in an embedded multi-core processor

The emerging Multi-Processor System-on-Chip (MPSoC) technology, which combines heterogeneous computing with the high performance of field programmable gate arrays (FPGAs), is a promising platform for a large number of applications, including wireless communications and vehicular technology. In this specific application context, when multiple-input multiple-output (MIMO) scenarios are considered, the system usually has to manage a large number of communication links among sensors and antennas involving different vehicles and users. Millimeter wave (mmWave) communications are one of the key technology enablers toward achieving high data rates in beyond-5G (B5G) systems. Communication at these frequency bands usually involves the use of large antenna arrays, which often demand high computational resources. One of the candidate platforms able to manage a huge number of communication links is the Xilinx Zynq UltraScale+ EG heterogeneous MPSoC, which is composed of a dual-core ARM Cortex-R5, a quad-core ARM Cortex-A53, a graphics processing unit (GPU) and a high-end FPGA. This work analyzes the computational performance required by a recent mmWave MIMO channel estimation algorithm on a platform of this kind. As a first approach, we focus on the performance that can be achieved with the quad-core ARM Cortex-A53. To this end, we use the BLAS and LAPACK libraries for numerical linear algebra. The results show that our reference implementation is able to manage a large MIMO communication system with 256 antennas without exhausting the platform resources.


Introduction
In the last decade, parallel systems have been employed in all segments of the industry in the form of multicore processors and many-core hardware accelerators [1][2][3]. Digital signal processing for wireless communications is one of the fields that has benefited most from these devices, for instance, to efficiently implement algorithms enabling multiple-input multiple-output (MIMO) communications [4,5].
Millimeter wave (mmWave) communications are one of the key technology enablers toward achieving high data rates in 5G [6] and future 6G systems [7]. The use of these frequency bands relies on beamforming techniques with highly directional beams that increase the gain of the communication link between transmitter (Tx) and receiver (Rx). This gain is achieved in practice by using massive MIMO systems, with a high number of antenna elements (in the order of several tens or hundreds). The use of MIMO systems with a large number of antennas further increases the complexity of many signal processing algorithms such as channel estimation, which could benefit from computationally efficient implementations.
The Xilinx Zynq UltraScale+ MPSoC platform offers high levels of heterogeneity and parallelism since it is composed of four different processing elements: a dual-core ARM Cortex-R5, a quad-core ARM Cortex-A53, a low-end graphics processing unit (GPU) and a high-end FPGA. Therefore, this MPSoC offers multiple levels of parallelism, although properly leveraging its computational resources is a challenging task that is currently being studied with basic operations such as matrix multiplication [8]. This work goes a step further and aims to implement and evaluate a MIMO mmWave channel estimation algorithm on a platform of the above kind, analyzing the capability of the device to handle the high computational demands of large communication systems. Specifically, we tackle a novel implementation of the transformed spatial domain channel estimation (TSDCE) method for analog mmWave MIMO systems proposed in [9]. It is an iterative algorithm which performs, as core operations, singular value decompositions (SVDs) and discrete Fourier transforms (DFTs) to estimate the channel from the available observations. Since both the SVD and the DFT can be computationally demanding as the number of antennas in the system grows, the aim of this work is to find a TSDCE implementation exhibiting a proper trade-off between performance and complexity.
This paper is structured as follows. Section 2 offers a brief overview of channel estimation theory for millimeter wave MIMO communications. In Sect. 3, we briefly describe the computational resources of the Xilinx Zynq UltraScale+ MPSoC. Next, in Sect. 4, we detail the implementation process and analyze the influence of the quad-core ARM Cortex-A53 on the performance. Finally, Sect. 5 provides some concluding remarks.

Channel model
Let us consider an mmWave MIMO channel characterized by $L$ paths as in [10, 11], where the angle-of-arrival (AoA) and angle-of-departure (AoD) of path $l$ are denoted by $\phi_l$ and $\theta_l$, respectively. Each channel path is also affected by a complex gain coefficient $\alpha_l$. The full parametric channel model can be characterized as the sum of the $L$ path contributions, each one determined by the gain $\alpha_l$ and by the Rx and Tx antenna array responses evaluated at $\phi_l$ and $\theta_l$. We consider a system where both the Tx and the Rx use a uniform linear array of antennas for beamforming and combining, with $n_t$ and $n_r$ antennas, respectively, and half-wavelength antenna separation. In the channel model considered in this work, the gains $\alpha_l$ are independent identically distributed (i.i.d.) random variables with distribution $\alpha_l \sim \mathcal{CN}(0, \sigma^2/L)$, while the AoA ($\phi_l$) and AoD ($\theta_l$) are drawn from a uniform distribution over $[0, 2\pi]$.
Hence, the problem of estimating the channel can be addressed by finding the parameters of its $L$ paths, i.e., their gains, AoAs and AoDs. Considering that measurements at mmWave have demonstrated that the channel at these frequencies is highly sparse [12], in practice the $L$ paths are likely to be well separated from each other, which simplifies the channel estimation task.
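As an illustration of the channel model above, the following minimal sketch draws one random channel realization. Several details are assumptions made only for this example and are not taken from the paper: the function names `ula_response` and `random_channel`, the cosine convention and unit-norm scaling of the array response, and the arrangement of the channel as an $n_r \times n_t$ matrix.

```python
import cmath
import math
import random


def ula_response(n, angle):
    """Unit-norm response of an n-element half-wavelength ULA.

    Assumption: phase of element k is pi * k * cos(angle).
    """
    scale = 1.0 / math.sqrt(n)
    return [scale * cmath.exp(1j * math.pi * k * math.cos(angle)) for k in range(n)]


def random_channel(n_r, n_t, num_paths, sigma2=1.0, rng=None):
    """Sum of num_paths rank-one contributions alpha * a_r(aoa) * a_t(aod)^H.

    Gains are i.i.d. CN(0, sigma2 / num_paths); angles are uniform in [0, 2*pi].
    Returns the n_r x n_t matrix and the list of (alpha, aoa, aod) per path.
    """
    rng = rng or random.Random()
    # A CN(0, v) variable has variance v / 2 in each of its real and imaginary parts.
    std = math.sqrt(sigma2 / (2.0 * num_paths))
    H = [[0j] * n_t for _ in range(n_r)]
    paths = []
    for _ in range(num_paths):
        alpha = complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
        aoa = rng.uniform(0.0, 2.0 * math.pi)
        aod = rng.uniform(0.0, 2.0 * math.pi)
        a_r = ula_response(n_r, aoa)
        a_t = ula_response(n_t, aod)
        for i in range(n_r):
            for j in range(n_t):
                H[i][j] += alpha * a_r[i] * a_t[j].conjugate()
        paths.append((alpha, aoa, aod))
    return H, paths
```

With a single path, every entry of the returned matrix has the same magnitude, which is one way to see that each path contributes a rank-one term fully described by three parameters.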

Pilot-based training phase
In order to estimate the parameters of the channel paths, a pilot-based training phase is carried out first. A pilot symbol $s$, known by the Tx and the Rx, is transmitted and received through a subset of $P \le N_{\max}$ and $Q \le N_{\max}$ spatial directions, respectively. $N_{\max}$ stands for the maximum number of angle quantization levels, which is limited because realistic phase shifters with finite angle resolution are assumed.
As in [9], the Tx and Rx have only one radio-frequency chain, and hence the beamforming and combining operations are carried out in the analog domain. As in previous works [11], we particularize the beamforming/combining vectors to match the channel response, i.e., the beamforming vector for direction $p = 0, 1, \ldots, P-1$ is the Tx array response $\mathbf{a}_t(\bar{\theta}_p)$ and the combining vector for direction $q = 0, 1, \ldots, Q-1$ is the Rx array response $\mathbf{a}_r(\bar{\phi}_q)$, where $\bar{\theta}_p$ and $\bar{\phi}_q$ are quantized angles. For each $\{q, p\}$ direction pair, the received signal consists of the pilot, sent with a positive real transmit power, plus a noise term. The noise is a complex additive white Gaussian $1 \times n_r$ vector with zero mean and covariance $\sigma_n^2 \mathbf{I}_{n_r}$, where $\mathbf{I}_{n_r}$ denotes the $n_r \times n_r$ identity matrix. Then, the system signal-to-noise ratio is given by the ratio of the transmit power to $\sigma_n^2$. We assume, for simplicity, that the symbol $s$ is set to 1 during training. After the pilot transmission through the $Q \times P$ directions, a $Q \times P$ observation matrix is obtained, formed by a signal term plus a noise matrix. The noise matrix, in $\mathbb{C}^{Q \times P}$, contains i.i.d. $\mathcal{CN}(0, \sigma_n^2)$ elements, and the signal term, also in $\mathbb{C}^{Q \times P}$, encodes the channel parameter vector. The effect of the different path components can be separated to write the observation matrix as a sum of $L$ path contributions in $\mathbb{C}^{Q \times P}$, each one depending on the parameter vector of its own path.

TSDCE method
TSDCE is an iterative channel estimation algorithm whose main steps are summarized in Fig. 1. Starting from the most powerful path component of the channel ($l = 1$), the first step applies the two-dimensional inverse DFT (2D-IDFT) to the observation matrix obtained in the training phase, resulting in a new, transformed matrix. Next, a cropping step is performed to extract the upper-left submatrix containing the informative part of the transformed matrix. Then, an estimate of the contribution corresponding to the most powerful path component is obtained by performing the SVD of the cropped matrix and keeping the rank-one approximation given by the dominant singular value. The following step applies the 2D sample autocorrelation function (ACF) to exploit its denoising properties. The phase angles of the elements of the resulting matrix contain the information needed to estimate the AoA and the AoD of path $l$.
Fig. 1 Steps of the TSDCE method
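The first three steps described above (2D-IDFT, cropping and rank-one approximation) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inverse DFT is a naive implementation rather than FFTW, and the dominant singular component is obtained by power iteration, which is swapped in here for the LAPACK SVD so that the example runs with the Python standard library alone.

```python
import cmath
import math
import random


def idft(x):
    """Naive inverse DFT of a sequence, O(n^2), for illustration only."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]


def idft2(Y):
    """2D inverse DFT: 1D inverse DFT along the rows, then along the columns."""
    rows = [idft(row) for row in Y]
    cols = [idft([rows[i][j] for i in range(len(rows))]) for j in range(len(rows[0]))]
    return [[cols[j][i] for j in range(len(cols))] for i in range(len(rows))]


def crop(M, n_rows, n_cols):
    """Keep the upper-left n_rows x n_cols submatrix."""
    return [row[:n_cols] for row in M[:n_rows]]


def rank_one_approx(M, iters=100, seed=0):
    """Dominant singular value and rank-one approximation of a nonzero matrix M.

    Uses power iteration on M^H M (an assumption of this sketch, not LAPACK).
    """
    n_rows, n_cols = len(M), len(M[0])
    rng = random.Random(seed)
    # Random start, so that it is not orthogonal to the dominant singular vector.
    v = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_cols)]
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(M[i][j].conjugate() * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(abs(c) ** 2 for c in v))
        v = [c / norm for c in v]
    # With unit-norm v, the vector M v equals sigma_1 * u_1.
    mv = [sum(M[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(abs(c) ** 2 for c in mv))
    approx = [[mv[i] * v[j].conjugate() for j in range(n_cols)] for i in range(n_rows)]
    return sigma, approx
```

A typical use would chain the three calls, for instance `rank_one_approx(crop(idft2(Y), n_r, n_t))` for an observation matrix `Y`; the crop size shown here is a placeholder, since the size of the informative submatrix is defined in [9].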
In a final stage, spatial frequency estimation is performed. Following the discussion in [9], the frequency estimation can be obtained through a four-step process: unwrapping the phases, estimating their slopes by weighted least squares (WLS), designing the weights for the WLS optimization problem and estimating the path complex coefficient (both its magnitude $|\hat{\alpha}_1|$ and its phase $\angle\hat{\alpha}_1$).
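The phase unwrapping and WLS slope estimation steps can be illustrated in one dimension with the sketch below. It is not the procedure of [9]: the function names are hypothetical, the fitted model is assumed to be a straight line with an intercept, and the weights are left as an input because their design is specific to the original method.

```python
import cmath
import math


def unwrap(phases):
    """Remove 2*pi jumps so that consecutive phases differ by at most pi."""
    out = [phases[0]]
    for p in phases[1:]:
        d = p - out[-1]
        d -= 2.0 * math.pi * round(d / (2.0 * math.pi))
        out.append(out[-1] + d)
    return out


def wls_slope(phases, weights):
    """Weighted least-squares slope of phases[k] versus the index k."""
    sw = sum(weights)
    mean_k = sum(w * k for k, w in enumerate(weights)) / sw
    mean_p = sum(w * p for p, w in zip(phases, weights)) / sw
    num = sum(w * (k - mean_k) * (p - mean_p)
              for k, (p, w) in enumerate(zip(phases, weights)))
    den = sum(w * (k - mean_k) ** 2 for k, w in enumerate(weights))
    return num / den


# Example: recover the frequency of a noiseless complex exponential.
true_freq = 2.5  # radians per sample, below pi so that unwrapping is unambiguous
wrapped = [cmath.phase(cmath.exp(1j * (0.3 + true_freq * k))) for k in range(16)]
estimate = wls_slope(unwrap(wrapped), [1.0] * 16)
```

With uniform weights the estimator reduces to ordinary least squares; non-uniform weights allow noisier phase samples to count less in the fit.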

Exploring the Xilinx Zynq Ultrascale+ MPSoC
Nowadays, high-end FPGAs contain not only programmable logic but also a variety of components that provide an outstanding computational capacity. This work is based on the Xilinx Zynq UltraScale+ MPSoC family [13], which is subdivided into three subfamilies intended for different application fields. The CG subfamily is intended for optimized industrial applications such as IoT, motor control and sensors. The EG subfamily is intended for aerospace and defense applications, and also covers 5G communications and cloud computing. The EV subfamily is intended for high-definition video applications. All devices contain a dual-core ARM Cortex-R5 [14] and an ARM Cortex-A53 application processor [15], which is dual-core in the CG subfamily and quad-core in the EG and EV subfamilies. The EV and EG subfamilies also have a low-end Mali GPU [16], and only the EV subfamily adds an H.264/H.265 video codec. In all MPSoC devices of the Zynq UltraScale+ family the system is subdivided into two main parts:
• PS (Processing System), which contains all the microprocessors. On the one hand, the quad-core ARM Cortex-A53 implements the ARMv8-A 64-bit instruction set and runs at a frequency of up to 1.5 GHz. We access these cores in two different ways: (1) by using the Xilinx Software Development Kit (SDK) [17] and (2) by using the PetaLinux operating system [18], which allows us to customize, build and deploy embedded Linux solutions on Xilinx processing systems. On the other hand, the dual-core ARM Cortex-R5, which belongs to the family of 32-bit RISC ARMv7-R processors, can be used in lockstep mode (micro-synchronized dual execution) or in split mode (each core works in parallel, executing different tasks).
• PL (Programmable Logic), which contains the FPGA fabric.
Inside the device, multiple interconnection options between PL and PS are available, enabling high-speed data transfers suitable for the most demanding applications. Figure 2 shows a general scheme of the Xilinx Zynq UltraScale+ EG subfamily including its main components.
Since platforms of this kind will share their computational resources among different applications running at the same time, this work focuses on the main resource, the quad-core ARM Cortex-A53, which can be easily configured to exploit parallelism among different operations regardless of the type of operations or instructions to be executed.

Multi-core-based implementation
The main goal of this paper is to evaluate the computational time required by the execution of the TSDCE method. To this end, we leverage software libraries for numerical linear algebra, namely LAPACK (Linear Algebra Package) [24] and BLAS (Basic Linear Algebra Subprograms) [25], which are used for carrying out SVDs and matrix multiplications. Another computationally expensive block of the algorithm is the 2D-ACF block (see Fig. 1), which is based on the FFT and the 2D-IDFT. These computations are performed in our code via the implementation available in the FFTW library for ARM architectures [26]. All of the previous libraries offer multi-thread support. The data type used in the FFT, in the SVD and in the rest of the blocks that require complex numbers is double complex. For the rest of the data, double or int have been used, depending on the specific case. Table 1 shows the routines that have been used for implementing the TSDCE method [9].
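To make the relation between the ACF block and the FFT concrete, the sketch below computes a one-dimensional sample autocorrelation through a forward and an inverse DFT. It is a simplified illustration rather than the implementation evaluated in this paper: it is one-dimensional instead of 2D, it uses a naive DFT in place of FFTW, and the function names and the unnormalized ACF definition are assumptions.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive DFT (or inverse DFT) of a sequence, O(n^2), for illustration only."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [sum(x[k] * cmath.exp(sign * 2j * math.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [v / n for v in out] if inverse else out


def acf_via_dft(x):
    """Sample ACF r[m] = sum_k x[k + m] * conj(x[k]) for lags m = 0 .. n - 1."""
    n = len(x)
    padded = list(x) + [0j] * n            # zero-padding avoids circular wrap-around
    power = [abs(s) ** 2 for s in dft(padded)]
    return dft(power, inverse=True)[:n]    # inverse DFT of the power spectrum


# Example: a complex exponential with frequency pi / 2 radians per sample.
r = acf_via_dft([1, 1j, -1])
lag_one_phase = cmath.phase(r[1])
```

In this example the phase at lag one is the frequency of the input exponential, which is the property exploited when the phases of the ACF are used for frequency estimation.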

Experiments
The algorithm has been tested for several values of the parameters involved. Although in practical systems the number of transmitting antennas can differ from the number of receiving antennas, for the sake of simplicity we considered $n_t = n_r$, leading to square matrices for processing. In particular, the number of antennas took the values 16, 32, 64, 128 and 256. Regarding the number of channel paths $L$, we considered a realistic mmWave scenario where the number of paths is usually low, below $L = 10$. Regarding the number of quantized angles in transmission and reception, $P$ and $Q$, respectively, we set them equal to the number of antennas, as in [9]. The TSDCE algorithm has two blocks that dominate the overall computational cost and that will be assessed independently: the SVD and the 2D-ACF (based on the FFT). Starting with the SVD, the runtime of each execution depends mainly on the values of $n_t$ and $n_r$ and not on the number of paths; thus, the value of $L$ does not affect the runtime of a single SVD. Figure 3 shows the speedup of an iteration of the SVD block for various values of $n_t = n_r$. Some improvement is obtained when increasing the number of cores for $n_t$ values greater than 64. As expected, the speedup when using 2, 3 or 4 cores, defined as the ratio of the sequential execution time with a single core to the execution time with multiple cores, is more noticeable as the values of $n_t$ and $n_r$ increase. The maximum achieved speedup is equal to 2, and it is obtained for the 256-antenna case. Figure 4 shows the speedup of one iteration of the 2D-ACF block as a function of the number of transmitting/receiving antennas. The 2D-ACF curves, based on FFT processing, show a similar trend to that of the SVD implementation. The FFTW3 library offers a speedup for values of $n_t$ greater than 128; thus, an improvement is obtained by increasing the number of cores for those cases.
However, it is important to note that, unlike for the SVD, the speedup obtained with 4 cores is practically the same as with 3 cores, so in practice this block should be parallelized with up to 3 cores. Furthermore, the maximum speedup achieved is 1.6, which is lower than for the SVD block. The speedup for the implementation of the complete TSDCE algorithm using 2, 3 or 4 cores for different numbers of antennas and channel paths is shown in Fig. 5. It is worth noting that the numbers of SVD and 2D-ACF executions in each algorithm run both depend on $L$. More specifically, for each execution of the TSDCE algorithm, the SVD block is run $L - 1$ times, whereas the 2D-ACF block is run $L^2$ times. It can be seen in Fig. 5 that speedups higher than 1 are only achieved for numbers of antennas greater than or equal to 64, with a maximum speedup of around 1.6 for the 4-core implementation with 256 antennas. Since the system is still far from achieving the maximum speedup of 4, this result shows that there is still room for improving the overall algorithm implementation, for instance, by optimizing the data structures used in order to reduce the number of read/write accesses to memory.
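The reported speedups can be put in perspective with a short calculation. The sketch below computes the speedup as the single-core time divided by the multi-core time, the corresponding parallel efficiency, and the parallel fraction that Amdahl's law would imply. Amdahl's law is not used in this paper; it is introduced here only as a simplifying assumption, and the function names are hypothetical.

```python
def speedup(t_sequential, t_parallel):
    """Single-core execution time divided by multi-core execution time."""
    return t_sequential / t_parallel


def efficiency(s, cores):
    """Fraction of the ideal speedup (equal to the number of cores) achieved."""
    return s / cores


def amdahl_parallel_fraction(s, cores):
    """Parallel fraction f implied by Amdahl's law, S = 1 / ((1 - f) + f / cores)."""
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / cores)


# Complete TSDCE algorithm: speedup of about 1.6 with 4 cores and 256 antennas.
eff = efficiency(1.6, 4)                 # about 0.4
frac = amdahl_parallel_fraction(1.6, 4)  # about 0.5
```

Under this simplified model, a speedup of 1.6 on 4 cores corresponds to roughly half of the single-core runtime being parallelized, which is consistent with the observation that there is still room for improvement.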

Conclusions
In this paper, we have explored the capabilities of the quad-core ARM Cortex-A53 that is present in the Xilinx Zynq UltraScale+, a candidate embedded system to carry out tasks in 5G systems. We have used this system to tackle a novel implementation of the transformed spatial domain channel estimation (TSDCE) method inside a MIMO communication system and to assess how well the proposed implementation exploits the quad-core parallelism. We evaluated the execution times of the two most computationally demanding blocks (SVD and 2D-ACF) using different numbers of threads, as well as the performance of the complete TSDCE algorithm. We observed that our implementation can manage a MIMO system composed of 256 antennas without exhausting the embedded system, which indicates that platforms of this kind are promising for carrying out real-time communication tasks. However, we also detected that, although performance improves when a higher number of cores is used, there is still a bottleneck in the data structures that prevents the maximum speedup from being reached, especially for systems with fewer than 64 antennas. To cope with this, as future work, it would be of interest to explore heterogeneous implementations of the method, combining the multi-core resources with other resources such as the GPU or the FPGA.
Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License.