# Hardware-accelerated design space exploration framework for communication systems

## Authors

- Kock, M., Hesselbarth, S., Pfitzner, M. et al.

DOI: 10.1007/s10470-013-0127-6

- Cite this article as:
- Kock, M., Hesselbarth, S., Pfitzner, M. et al. Analog Integr Circ Sig Process (2014) 78: 557. doi:10.1007/s10470-013-0127-6

## Abstract

The efficient hardware implementation of signal processing algorithms requires a rigorous characterization of the interdependencies between system parameters and hardware costs. Pure software simulation of bit-true implementations of algorithms with high computational complexity is prohibitive because of the excessive runtime. Therefore, we present a field-programmable gate array (FPGA) based hybrid hardware-in-the-loop design space exploration (DSE) framework combining high-level tools (e.g. MATLAB, C++) with a System-on-Chip (SoC) template mapped onto FPGA-based emulation systems. This combination significantly accelerates the design process and the characterization of highly optimized hardware modules, and it helps to quantify the interdependencies between system parameters and hardware costs. The achievable emulation speedup using bit-true hardware modules is a key factor enabling the optimization of complex signal processing systems with Monte Carlo approaches, which are infeasible in pure software simulation due to the large required stimuli sets. The framework supports a divide-and-conquer approach through a flexible partitioning of complex algorithms across the system resources on different layers of abstraction, facilitating an efficient split of the design process among different teams. The presented framework comprises a generic state-of-the-art SoC infrastructure template, a transparent communication layer including MATLAB and hardware interfaces, module wrappers and DSE facilities. The hardware template is synthesizable for a variety of FPGA-based platforms. Implementation and DSE results are presented for two case studies from the different application fields of synthetic aperture radar image processing and interference alignment in communication systems.

### Keywords

Design space exploration (DSE) · Emulation · Fixed-point arithmetic · Synthetic aperture radar (SAR) · Interference alignment (IA)

## 1 Introduction

Design space exploration (DSE) for signal processing hardware can be conducted using models on a variety of abstraction levels, usually resulting in a trade-off between implementation effort, model accuracy and simulation speed. Models on high abstraction levels are typically quicker to implement and verify and provide shorter execution times than their lower level counterparts, thus allowing the coverage of a larger design parameter space at an early design stage [10]. Assessing low-level design parameters, like the implications of bit-width precision across all stages of a system, towards the target hardware implementation requires bit-true models in the DSE process. This contribution focuses on the exploration of low-level DSE parameters, with a focus on the design and verification time and the simulation speed of bit-true models.

Field-programmable gate array (FPGA) based rapid prototyping systems are widely used in algorithm research and development. Combined with automatic and semi-automatic high-level description to hardware description language (HDL) code generation tools, they facilitate quick hardware deployment invaluable for proof-of-concept studies. However, the hardware efficiency achievable by this design flow is often limited and insufficient for resource limited applications and the verification of application-specific integrated circuit (ASIC) designs. Techniques from FPGA-based ASIC verification and rapid prototyping are combined in this project for the bit-true DSE of highly optimized hardware architectures.

Design time is a limited resource, thus a high design efficiency is an important goal. In a typical implementation scenario for complex designs, high level reference models are used. These models consist of several modules to be integrated. The choice of modules to be optimized is often based on profiling results, with those modules contributing significantly to the overall resource requirements being chosen for optimization. This leads to a hybrid design consisting of a mixture of high level modules and highly optimized modules, running on hardware ranging from general purpose processors, application-specific instruction-set processors, FPGA-based rapid prototyping systems and dedicated hardware accelerators.

Signal processing reference algorithms are typically implemented using floating-point data types using high level software tools. However, fixed-point or non-standard floating-point data types are preferred for optimized dedicated hardware implementations due to their higher area and power efficiency. The bit widths for both data types can be chosen arbitrarily in dedicated hardware implementations, even separately for each processing element (PE). Migrating to fixed-point arithmetic opens a large parameter space to be optimized, which is infeasible for pure software simulation due to the large data sets required for numerical characterization. Common metrics for algorithms used in communication systems are the resulting overall system performance in terms of throughput, signal-to-noise ratio (SNR), error probability etc.

The goals of the presented framework (*UEMU*) are to provide:

- a DSE environment for the characterization of bit-true models
- a hybrid SW/HW co-simulation environment
- a synthesizable state-of-the-art System-on-Chip (SoC) infrastructure for ASIC and FPGA targets
- a platform-independent abstraction layer for design tools

Section 2 gives a brief overview of related work. The main contributions of this article are as follows. The DSE process using bit-true models for computationally intense signal processing systems is presented in Sect. 3. A state-of-the-art SoC infrastructure template and development framework is presented in Sect. 4.

Two case studies from different areas of signal processing are presented in Sects. 5 and 6. The underlying applications and algorithms of both case studies are briefly sketched in their respective chapters.

## 2 Related work

FPGA-based emulation systems are commonly used for ASIC design verification, simulation acceleration and rapid prototyping. While the underlying hardware platforms exhibit similar characteristics, the deployed design flow greatly varies depending on the application.

A typical rapid prototyping design flow allows the integration of hardware accelerators, e.g. signal processing modules into a high-level MATLAB/Simulink system model for simulation acceleration. The hardware description is usually generated from high-level models. While facilitating a quick hardware deployment in algorithm research, proof-of-concept studies and first hardware demonstrators, the resulting design efficiency may not be suitable for production. DSEs at an early design stage often build on FPGA acceleration for a coverage of a larger design space with coarse accuracy. The automatically generated hardware or parts thereof serve as a starting point for an optimized implementation.

An FPGA-based design and verification framework is presented in [4] as a software and hardware co-emulation approach used in a case study designing a hardware soft-input soft-output multiple-input multiple-output (MIMO) sphere decoder. Efficient emulator utilization and an optimized simulation speedup during the design and verification phases are achieved by coupling a hardware emulation system with a parallel computing cluster. Test stimuli are generated by software and multiplexed to the design under test by the framework, classifying it as a simulation accelerator and hardware testbed. A similar testbed approach is presented in [9], with an implementation case study for an IEEE 802.11n MIMO baseband transceiver.

To the best of our knowledge, there are no publications on efficient hardware architectures for the digital baseband processing in interference alignment (IA) systems with goals similar to the case study presented in Sect. 6. In [13], a rapid-prototyping based realtime testbed involving FPGA acceleration is presented as a proof-of-concept demonstrator. Offline processing of captured data is used in the evaluation of IA under real-world conditions in [8]. Both publications use common off-the-shelf hardware with a focus on algorithm exploration, providing few details on the implemented digital baseband signal processing hardware.

Depending on the development strategy, two main design approaches can be distinguished: first, cycle-true or event-triggered simulation and verification; second, data-flow based verification by comparing algorithm output at different stages. For cycle-true simulation, different approaches can be found in the literature that exploit FPGA-based emulation systems to accelerate the simulation speed.

For example, Moraes et al. [14] present a generic FPGA emulation framework which links a synthesized design under verification with the software simulator. While improving the simulation and verification speed on signal level, the framework does not target the successive implementation of signal processing subtasks in the context of an existing reference algorithm. A similar cycle-accurate approach is proposed by Del Valle et al. [7]: an FPGA-based emulation framework is used for the characterization of multi-processor SoC architectures, focusing on porting complex algorithms to embedded systems. In terms of a DSE, it is designed for the evaluation and analysis of characteristic processor features, e.g. the memory hierarchy and the influence of different processing cores.

In contrast to cycle accurate approaches, FPGA-based emulation systems are used to accelerate the development, verification and execution of digital signal processing tasks inside a given reference software algorithm. Thereby, the main focus is not the extraction of single signals or performance counters but a comparison of the overall algorithmic output. A commercial tool flow for mapping complex algorithms to FPGAs is the BEEcube Platform Studio [2]. Distinct signal processing tasks can be implemented by using provided library elements or directly implemented in HDL. Those elements can be associated with MATLAB/Simulink components and allow for successive implementation of the reference algorithm. However, data exchange between MATLAB and the emulated hardware is limited.

The *UEMU* framework concept presented here extends existing hardware-in-the-loop approaches by supporting hardware engineers during the concurrent implementation of algorithms at early design stages. To this end, the framework provides a flexible interface between MATLAB and the emulation system at arbitrary stages of the reference algorithm. This allows hardware modules to be implemented and verified easily in the context of the whole target algorithm. Data transfer can be initiated transparently from MATLAB via Ethernet, leaving the control flow at software level. Communication with software HDL simulation tools is fully integrated into the software and hardware interfaces provided by *UEMU*. This enables live communication between software, emulated hardware and HDL simulation in a closed loop, leading to a significant HDL simulation speedup for partitioned system configurations. Tools are provided for dynamic on-line design parameter sweeps by instrumentation of the optimized RTL hardware models.

## 3 Signal processing systems DSE

The process of designing complex digital electronic circuits offers a large variety of options to the designer. There are many valid implementations that fulfill the specification, but they differ in certain properties, e.g. silicon area, power efficiency, flexibility, testability and design effort. These parameters and properties span the so-called design space. They cannot be chosen independently from each other; for example, flexibility and power efficiency are often contradictory requirements. At an early design stage, the designer has to choose important design parameters that will eventually determine the final implementation properties. Most of the properties, like the resulting power dissipation, cannot be chosen directly, but are rather the result of many design choices: for example, hardwired architectures tend to be more power efficient than programmable architectures.

A DSE establishes relations between possible points in the design space, ultimately leading to cost functions [3] modeling the quantitative relation between design parameters and the resulting hardware properties and resource requirements. These models serve as a basis for important design decisions in an early design phase.

Certain parameters are of special interest in the domain of wireless communication platforms. The limited power budget in mobile devices puts hard constraints on the power efficiency, requiring power optimization across all layers of algorithm development, design implementation and semiconductor technology. This often conflicts with the demand for flexibility, which is another important requirement due to rapidly evolving communication standards. Flexibility and programmability is also required to keep pace with shorter product lifecycles to enable the re-use of the same hardware platform across multiple product generations. Also motivated by economical reasons, the total design time is of great importance.

The hardware resource requirements of wireless communication algorithms can be reduced by exploiting their numerical properties and finding appropriate approximations. Examples include word length limited fixed-point number representation, substituting bit-true elementary mathematical functions by numerical approximations, but also heuristics used in communication specific signal processing blocks like MIMO decoding and forward error correction (FEC). In contrast to highly precise computations available in high-level algorithm development environments, resource limited communication system implementations have to deal with limited accuracy in each successive block.

Typically, these approximations trade accuracy for timing improvements, area and power savings, and can often be assessed in terms of the resulting bit error rate, throughput and related quantities of the overall processing chain. However, to facilitate a systematic approach to fine-grained optimization, the quantitative impact of an approximation on the overall system performance has to be evaluated for a huge number of points in the design space. Though these relations may be coarsely estimated from high-level models, this approach cannot cover all effects present in an implementation containing approximations. The highly nonlinear characteristics found in many signal processing blocks like MIMO decoding and FEC lead to interdependencies between the approximation parameters and the system level performance degradation that are hard to predict analytically. The strategy adopted in this paper is to enable the characterization of the above mentioned approximations over a wide parameter range by instrumentation of the optimized RTL code. For example, modules for the dynamic adaptation of word lengths at runtime can be inserted into key signals of interest. This simple instrumentation is carried out by hand on the RTL source code level. These modules can be globally configured to be transparent for the final production synthesis.
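As an illustration of such word-length instrumentation, the following Python sketch (a software analogue, not the framework's actual RTL modules) models a quantization point whose word length can be swept at runtime:

```python
import numpy as np

def quantize(x, int_bits, frac_bits):
    """Bit-true quantization of a float array to a fixed-point grid.

    Models a runtime-configurable word-length instrumentation point:
    values are rounded to a resolution of 2**-frac_bits and saturated
    to the two's-complement range of an (int_bits + frac_bits)-bit word.
    """
    scale = 2.0 ** frac_bits
    lo = -2.0 ** (int_bits - 1)
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

# Sweep the fractional word length at runtime and observe the error.
signal = np.linspace(-0.9, 0.9, 1000)
for frac_bits in (4, 8, 14):
    err = np.max(np.abs(signal - quantize(signal, 2, frac_bits)))
    print(frac_bits, err)  # error drops by about one bit per added bit
```

Sweeping `frac_bits` over a stimuli set corresponds to the on-line parameter sweeps mentioned above; in the framework this happens in instrumented RTL rather than in software.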

Deriving comprehensive cost models using Monte Carlo methods requires visiting a significantly larger number of points in the design space compared to existing heuristically driven parameter optimization approaches covered by existing FPGA-based simulation acceleration systems. The achievable simulation speedup is a key factor enabling the characterization and optimization of complex communication systems using Monte Carlo approaches which are infeasible for pure software simulation due to the large required stimuli sets. Long simulation times in the order of days or even weeks heavily hinder the design process, as designing VLSI circuits is an iterative process and simulations typically have to be re-run several times during the design phase. Thus, increasing simulation speed by a few orders of magnitude significantly increases the designer’s efficiency.

## 4 Hybrid emulation framework

### 4.1 Scope of the framework

In the course of this work an FPGA-based hybrid hardware-in-the-loop research and DSE framework was created. It allows the combination of high-level tools (e.g. MATLAB/Simulink and C/C++) with optimized hardware modules.

Target device and commercial tool independence is achieved through abstraction layers and common interfaces provided in software and hardware. This significantly simplifies the adaptation of existing designs to new targets: only the framework has to be extended to a new platform once, while all designs linking against the common interfaces do not require any modifications. Furthermore, the designer is relieved from creating and maintaining design- and target-specific infrastructure. Thus, the suggested design guideline is to use the interfaces provided by *UEMU* whenever possible and to avoid adding target-specific modifications.

The proposed *UEMU* framework complements automated high level hardware generation tools and libraries like Xilinx CoreGen and MATLAB to HDL toolboxes by providing a portable industry-grade SoC infrastructure suitable for both debug and verification as well as ASIC deployment. Although these tools are not integral parts of the framework, integrating generated HDL code and netlists is common practice for third-party intellectual property cores and uncritical functions.

The overall structure of *UEMU*-based designs is depicted in Fig. 1.

The framework is based on a SoC-centric approach, which also makes it suitable for use in an ASIC technology design flow and thus enables the test, debugging and characterization of signal processing modules in their target environment.

### 4.2 Framework description

The framework comprises a host PC, a software library providing a transparent communication application programming interface (API), an FPGA-based emulation system, fully synthesizable VHDL SoC infrastructure, dedicated accelerators, and processor soft-cores.

A subsystem template abstracts from emulation system specific infrastructure, making the user designed core signal processing modules independent of the underlying emulation system. During algorithm development, computationally intensive signal processing modules can be implemented in Verilog or VHDL and added to the subsystem. Remaining processing modules may continue to run as high-level models, enabling a divide-and-conquer implementation and verification approach. This allows signal processing modules to be split and run distributed on a highly heterogeneous signal processing system, enabling a fine-grained module-wise migration from high-level software reference to optimized production quality hardware. All resources are accessible by software and hardware, providing flexible partitioning and migration of processing tasks between high-level software, embedded software and dedicated hardware modules.

### 4.3 Emulation system abstraction approach

The VHDL SoC infrastructure comprises an open core protocol (OCP) multi-layer bus, an Ethernet DMA interface, SDRAM controllers, on-chip memories, standard RISC soft-core processors and a massively parallel parameterizable VLIW ASIP [17]. It has been adapted to and tested on the following emulation systems, ranging from high-end to portable platforms.

The high-performance BEEcube BEE4 [2] rapid prototyping system incorporates four Xilinx Virtex-6 LX550T FPGAs each featuring 4 GB DDR3-PC1066 SDRAM, Gigabit Ethernet, 20 Gbps QSFP+ interface, FMC-HPC expansion connector, and PCIExpress ×8 slot.

The mid-range Xilinx Virtex-6 LX240T ML605 Evaluation Kit [23] is equipped with 4 GB DDR3-PC800 SDRAM, Gigabit Ethernet, SFP interface, FMC-HPC and FMC-LPC expansion connectors, and PCIExpress ×4 slot.

For each emulation system an abstraction layer has been developed that takes care of board-specific interfaces, e.g. RAM and Ethernet. The abstraction layer also implements the hardware interfaces for the software communication API and an OCP multi-layer bus between the RAM controllers, on-chip memory, and the board-independent user design. This abstraction allows the same user design to be used on each of the supported emulation systems without any modification.

The Xilinx Virtex-7 VC707 and the Digilent ZedBoard with the Xilinx Zynq 7Z020 programmable SoC are currently being evaluated and will be included in the framework.

## 5 Case study: SAR

In the field of remote sensing and security applications, a variety of different active and passive sensor systems can be found. SAR is an active radar technology which is used to generate high resolution images by the repetitive transmission and reception of electromagnetic pulses. The image formation process is based on the relative movement between the sensor and an observed area (Doppler effect). Typically, SAR is used on airborne platforms but can also be found in a variety of stationary security scanners [11, 20]. In contrast to electro-optical and infra-red sensors, complex digital signal processing algorithms need to be applied to the SAR raw data to obtain the final image. In addition, state-of-the-art SAR systems provide very high data rates. For these reasons, efficient hardware architectures with high processing performance are mandatory.

### 5.1 System signal model

The transmitted pulse is a linear frequency-modulated chirp, given by Eq. 1. Here, *w*_{r}(*t*_{r}) defines the range amplitude, *f*_{c} is the carrier radio frequency (RF) and *K*_{r} = *B*_{bb}/*T*_{r} is the range sweep rate for a given pulse duration *T*_{r} and baseband bandwidth *B*_{bb}. Because of the propagation speed *c* of an electromagnetic wave, the range time variable *t*_{r} is often referred to as fast-time.

The pulse is transmitted every 1/*f*_{prf} seconds, whereby *f*_{prf} denotes the pulse repetition frequency (PRF). Each reflector covered by the beam footprint is illuminated several times by the advancing antenna at different angles of incidence. The relative movement between reflector and antenna contributes to the Doppler modulation. The two-way distance *R*_{sl}(*t*_{a}) between an arbitrary azimuth position *x* = *v* *t*_{a} and a reflector can be expressed as a function of azimuth time. For typical application scenarios (effective azimuth antenna opening angle *θ*_{a} ≪ 20°), a Taylor series expansion is commonly used to approximate the square-root dependency, as given in Eq. 2.

Using *R*_{sl}(*t*_{a}) from Eq. 2, the Doppler frequency modulation of the azimuth signal component can now be expressed as a function of azimuth time *t*_{a} (Eq. 3).

A squint angle *θ*_{s} induces a positive shift of the Doppler spectrum. For the zero-squint case, the Doppler frequency *f*_{a}(*t*_{a}) is centered at *f*_{dc} = 0, which is typical for stationary applications. The azimuth sweep rate *K*_{a} defines the variation of the Doppler frequency as a function of azimuth time *t*_{a}, which is often referred to as slow-time because of the sensor propagation speed *v* ≪ *c*.

By combining the linear range FM pulse from Eq. 1 with the Doppler modulation in Eq. 3, the two-dimensional SAR impulse response *h*_{imp}(*t*_{a}, *t*_{r}) of a single point reflector is given by Eq. 4. In detail, *w*_{r} and *w*_{a} are the time delayed amplitudes of the received signal. The first exponential term defines the Doppler modulation that results from the range distance variation *R*_{sl}(*t*_{a}), the second term represents the delayed chirp from Eq. 1. It has to be noted that this expression assumes that the RF component of the transmitted pulse (*s*_{RF} = cos(2π *f*_{c}*t*_{r})) has been removed before analog to digital conversion (ADC) as part of a quadrature demodulation process.

Based on the impulse response *h*_{imp} of a single point reflector, the signal model of a ground surface with an arbitrary reflectivity distribution can be expressed as a two-dimensional convolution of the ground reflectivity *g*(*t*_{a}, *t*_{r}) and *h*_{imp}(*t*_{a}, *t*_{r}). An additional noise component *n*(*t*_{a}, *t*_{r}) is inserted as it is present in all practical systems.

The purpose of a SAR processor is to recover the ground reflectivity *g*(*t*_{a}, *t*_{r}) from the measured baseband signal *s*_{bb}. This deconvolution process is challenging because the impulse response *h*_{imp} is both range and azimuth dependent and has a range-varying migration of signal energy as a result of the range distance variation *R*_{sl}(*t*_{a}). This migration effect is often referred to as range cell migration (RCM) and has to be corrected during the image formation process.
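The signal model above can be exercised numerically. The following Python sketch (array sizes and the stand-in impulse response are arbitrary choices for illustration, not the range/azimuth-dependent response of Eq. 4) forms *s*_{bb} as a two-dimensional convolution plus noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground reflectivity g(t_a, t_r): a few isolated point reflectors.
g = np.zeros((64, 64))
g[20, 30] = 1.0
g[40, 10] = 0.5

# Stand-in impulse response h_imp(t_a, t_r); the real one is the
# range- and azimuth-dependent chirp response described in the text.
h = np.ones((8, 8)) / 64.0

# s_bb = g ** h + n (two-dimensional convolution plus noise),
# implemented here via FFT-based circular convolution for brevity.
S = np.fft.fft2(g) * np.fft.fft2(h, s=g.shape)
s_bb = np.real(np.fft.ifft2(S)) + 0.01 * rng.standard_normal(g.shape)
print(s_bb.shape)  # (64, 64)
```

Recovering `g` from `s_bb` is exactly the deconvolution problem the SAR processor has to solve, complicated in practice by the range dependence of the impulse response.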

### 5.2 SAR image reconstruction using the range-Doppler algorithm (RDA)

After demodulation and A/D conversion, the sensor raw data is compressed in range by means of a precalculated replica of the transmitted chirp pulse. After an azimuth Fourier transform, subsequent processing is performed in the so-called range-Doppler domain, i.e. the azimuth-frequency/range-time domain. Before the range dependent azimuth compression can be applied, the range varying cell migration has to be corrected by means of an interpolation. This step, commonly referred to as range cell migration correction (RCMC), straightens out the curved reflector trajectories so that they run parallel to the azimuth frequency axis. The final inverse azimuth Fourier transform completes the image formation process.
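The processing chain above can be summarized as an FFT-based skeleton. The following Python sketch only shows the data flow; the actual matched filters and the RCMC interpolator are application-specific and left as placeholders:

```python
import numpy as np

def rda(raw, range_replica, azimuth_mf, rcmc):
    """Structural skeleton of the range-Doppler algorithm steps."""
    # 1) Range compression: matched filtering with the chirp replica.
    rc = np.fft.ifft(np.fft.fft(raw, axis=1) * range_replica, axis=1)
    # 2) Azimuth FFT into the range-Doppler domain.
    rd = np.fft.fft(rc, axis=0)
    # 3) Range cell migration correction (interpolation along range).
    rd = rcmc(rd)
    # 4) Azimuth compression and final inverse azimuth FFT.
    return np.fft.ifft(rd * azimuth_mf, axis=0)

# Dummy run with pass-through filters just to exercise the data flow.
raw = np.random.default_rng(1).standard_normal((32, 64)) + 0j
img = rda(raw, np.ones(64), np.ones((32, 1)), lambda x: x)
print(np.allclose(img, raw))  # identity filters reproduce the input
```

In the hardware mapping described below, steps 1, 3 and 4 correspond to the FFT, interpolation and matched filter PEs, respectively.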

### 5.3 FPGA-based RDA hardware implementation

As mentioned before, SAR image generation has to cope first with the inherent algorithm complexity and second with the increasing sensor data rates of state-of-the-art SAR sensor systems. Especially compact SAR systems demand an efficient hardware architecture which offers high throughput rates at moderate power consumption. Besides these system limitations, flexibility in terms of algorithm mapping as well as processing precision demands have to be taken into account.

A key challenge during the hardware development process is the choice of an appropriate trade-off between precision demands and hardware resource allocation. As this task requires a multi-dimensional parameter decision, the design space covers a large set of possible data path configurations and therefore poses practical limitations in terms of simulation time. Previous work has shown that for an exemplary image dimension of 16k × 8k = 128 MPixel, the overall software simulation time for a bit-true data path model ranges from ~80 min (single core) to ~7 min (16 cores) [18]. In terms of required simulation time per output image, this bit-true model achieves about 8 runs per hour. Considering even a small design space with three parameters of 16 possible values each, the resulting set of 16^{3} = 4,096 permutations would require 512 h of simulation time, which exceeds practical limits by orders of magnitude. To cover large parameter combinations in reasonable time, hardware based emulation is mandatory.
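The run-time estimate can be reproduced with the figures quoted above (about 8 runs per hour, i.e. 7.5 min per run on 16 cores):

```python
# Figures quoted in the text: ~8 runs per hour for the bit-true
# software model on 16 cores, i.e. 7.5 min per 128-MPixel image.
runs = 16 ** 3                  # 3 parameters, 16 values each
minutes_per_run = 60.0 / 8      # 7.5 min per run
total_hours = runs * minutes_per_run / 60
print(runs, total_hours)        # 4096 512.0
```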

Efficient use of hardware resources can be obtained by switching from floating- to fixed-point arithmetic, which induces limitations in terms of precision and dynamic range. To evaluate the influence of hardware related parameters, the reference algorithm has first been examined concerning the maximum achievable precision boundaries. For this purpose, a SAR sensor data set (near-field bike scan, provided by Ruhr-University Bochum) acquired with an 80 GHz (25 GHz effective bandwidth) frequency modulated continuous wave sensor [20] has been chosen.

This analysis yields a fixed-point configuration *Q*_{fp} = [*Q*_{i}, *Q*_{f}] = [2, 14], which separates the total word length of 16 bit into a 2-bit integer and a 14-bit fractional component.
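The numeric range implied by this format follows directly from the word-length split; a quick Python check:

```python
Q_i, Q_f = 2, 14             # Q_fp = [2, 14], i.e. a 16-bit word
lsb = 2.0 ** -Q_f            # resolution of the fractional part
lo = -2.0 ** (Q_i - 1)       # two's-complement lower bound
hi = 2.0 ** (Q_i - 1) - lsb  # largest representable value
print(lo, hi, lsb)           # -2.0 1.99993896484375 6.103515625e-05
```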

Based on the *UEMU* framework presented in Sect. 4, the RDA has been mapped to the Xilinx ML605 Evaluation Board. To this end, a new subsystem has been implemented which includes specific accelerators for the SAR image processing case. Besides the FFT, which has been implemented as a hybrid floating-point radix-2^{3} decimation in time [12], two additional modules realize the remaining processing steps of the RDA: first, a matched filter (MF) PE which can be configured for range compression, motion compensation (MoCom), and range adaptive azimuth compression; second, a finite impulse response filter based interpolation PE which is used for azimuth MoCom and RCMC. While the reference algorithm is running on the host PC in MATLAB, interactive switching between the software and hardware domains is used for DSE and analysis tasks.

The MF PE employs a CORDIC module computing *e*^{jϕ} = cos(ϕ) + *j* sin(ϕ), which is used to convert the range-dependent phase component into a complex-valued signal. The PE interfaces to the OCP bus via two separate read/write master ports, while a single slave port is used to access the configuration register file (green). Additional multiplexer structures switch the signal components depending on the current use case, i.e. azimuth compression MF, MoCom or range compression MF. The accumulator realizes the range dependence of the azimuth MF phase component.

The number of CORDIC iterations is set to *I*_{max} = 14, which equals the fractional part *Q*_{f} of the aforementioned fixed-point configuration *Q*_{fp} = [2, 14]. The corresponding CORDIC rotation angles form the sequence atan(2^{−i}) with *i* = [0, 1, …, *I*_{max} − 1], yielding about one additional bit of output precision per iteration. Increasing the number of iterations beyond *I*_{max} does not contribute to the output precision, as the binary representation of the remaining rotation angles equals zero. Figure 11 depicts the influence of a reduced word length on the average PSNR: for 9 bit and above, no additional gain in focusing precision can be achieved.

As mentioned at the beginning, software simulation time exceeds practical limits even for small design spaces. Based on the *UEMU* framework, the current hardware implementation requires on average 22 s per run for 16k × 4k = 64 MPixel, including all Ethernet transfers and data type conversions between the host PC running MATLAB and the Xilinx ML605 emulator at all stages (4× ETH WR/RD @ 2.5 s). This results in an average throughput of 163 runs per hour. If only the output result is needed, intermediate data exchange and conversion is avoided (1× ETH WR/RD @ 2.5 s), which decreases the overall processing time to 7 s. The processing performance thus increases by another factor of about 3, equaling 514 runs per hour. Compared to the aforementioned bit-true software model approach (128 MPixel/7 min ≈ 0.3 MPixel/s), a speed-up of 30 is achieved. Besides the potential speed-up of an emulator based approach, the development and verification time for a bit-true software model is no longer needed.

## 6 Case study: IA

High throughput wireless communication systems including LTE, ECMA-368 (WiMedia) and IEEE 802.11ac are built around sophisticated digital signal processing algorithms. Among the research goals for future communication standards are higher spectral efficiency towards the Shannon limit, higher energy efficiency and increased data rates. Naturally, these benefits come at the price of higher computational complexity. The demand for flexible realtime hardware platforms capable of delivering the required huge number of operations per second at a severely limited power and silicon area budget has led to the development of specialized hardware platforms for software defined radio (SDR) applications.

As a case study using the *UEMU* framework, the implementation and characterization of IA algorithms is presented in this section. Implementation results for 3-user 2 × 3 MIMO antenna selection IA are presented in Sect. 6.2. In Sect. 6.5, a parameterized hardware complexity estimation of K-user iterative minimum mean square error (MMSE) IA is presented, identifying it as a demanding candidate for an optimized hardware implementation and DSE.

### 6.1 IA system model

For an IA solution to exist, the numbers of transmit antennas \(N_t\) and receive antennas \(N_r\) each have to be larger than the number of transmitted data streams *d* per user. With precoding matrices \(\boldsymbol{V}_j\) and transmit symbols \(\boldsymbol{x}_j\), the signal received by receiver *i* is

\(\boldsymbol{y}_i = \boldsymbol{H}_{ii} \boldsymbol{V}_i \boldsymbol{x}_i + \sum_{j \neq i} \boldsymbol{H}_{ij} \boldsymbol{V}_j \boldsymbol{x}_j + \boldsymbol{n}_i,\)

with \(\boldsymbol{H}_{ij}\) denoting the channel from transmitter *j* to receiver *i* and \(\boldsymbol{n}_i\) being the noise picked up by receiver *i*. \(\boldsymbol{H}_{ii}\) are the desired channels, all \(\boldsymbol{H}_{ij}, i \neq j,\) convey interference. In the case of flat fading, IA can be applied to discrete-time signals.
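A minimal NumPy sketch of this flat-fading interference channel model; the dimensions, random channel draw and precoders are illustrative stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, Nt, Nr, d = 3, 2, 3, 1     # users, tx/rx antennas, streams (illustrative)

# Random flat-fading channels: H[i, j] is the channel from transmitter j
# to receiver i. H[i, i] are the desired channels, the rest is interference.
H = rng.standard_normal((K, K, Nr, Nt)) + 1j * rng.standard_normal((K, K, Nr, Nt))

# Unit-norm precoders V_j (random orthonormal columns as placeholders).
V = [np.linalg.qr(rng.standard_normal((Nt, d))
                  + 1j * rng.standard_normal((Nt, d)))[0] for _ in range(K)]
x = [rng.standard_normal((d, 1)) for _ in range(K)]   # transmit symbols
n = [0.01 * rng.standard_normal((Nr, 1)) for _ in range(K)]  # receiver noise

# y_i = H_ii V_i x_i + sum_{j != i} H_ij V_j x_j + n_i
y = [sum(H[i, j] @ V[j] @ x[j] for j in range(K)) + n[i] for i in range(K)]
```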

### 6.2 3-User zero-forcing IA

Closed-form zero-forcing IA solutions are known for the *K* = 3 user case [5]. Therein, \(\mathrm{eig}(\boldsymbol{E})\) denotes an arbitrary Eigenvector of a matrix \(\boldsymbol{E}\) and \(\mathrm{null}(\cdot)\) is the nullspace. Note that the choice of the Eigenvector has an impact on the overall system performance.

Our implementation supports antenna selection based IA. Here, the transmitters are equipped with three antennas, of which only two are to be actively used for data transmission. The IA algorithm picks the best antenna combination as explained below. This leads to an increased channel orthogonality for the chosen channels at a reduced number of RF front ends.

Antenna selection can be performed at the transmitter, the receiver or both ends. By leaving the worst spatial modes unused, the number of required RF front ends and the power consumption are reduced significantly at a moderate penalty in sum data rate, compared to a system using all antennas.

The selection of the best out of *n* antenna combinations can be formulated as a maximization of the selection metric over all users *k* for a given antenna combination *i* (Eq. 17). Equation 17 is solved exhaustively by visiting all *n* antenna combinations. The precoding and decoding matrices \({\varvec{V}}\) and \({\varvec{U}}\) are computed for every combination of antennas according to Eqs. 10–16.
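The exhaustive search can be sketched as follows; `metric` and the per-combination matrix computation are placeholders for the paper's selection metric *C* and Eqs. 10–16, not the actual implementation:

```python
from itertools import combinations, product

import numpy as np

def best_combination(H_full, metric, Ntp=3, Nta=2, K=3):
    """Exhaustively visit all antenna combinations (cf. Eq. 17).

    H_full[i][j] is the Nr x Ntp channel from transmitter j to receiver i;
    `metric` scores the K x K set of reduced channels (a placeholder for
    the paper's antenna-selection metric C)."""
    tx_sets = list(combinations(range(Ntp), Nta))   # 2-out-of-3 per transmitter
    best, best_score = None, -np.inf
    for combo in product(tx_sets, repeat=K):        # e.g. 3^3 = 27 combinations
        # Reduce each channel to the columns of the active transmit antennas.
        Hsel = [[H_full[i][j][:, list(combo[j])] for j in range(K)]
                for i in range(K)]
        score = metric(Hsel)
        if score > best_score:
            best, best_score = combo, score
    return best, best_score
```

A toy metric such as the sum of desired-channel Frobenius norms can be plugged in for experimentation; the real metric additionally involves the Eigenvector computation of Table 1.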

The IA processing is split into two domains:

- (1)
channel rate processing (CRP) and

- (2)
symbol rate processing (SRP).

Based on a full set of given channel estimations, the CRP computes the best antenna combination and the corresponding precoding- and decoding-matrices \({\varvec{V}}\) and \({\varvec{U}}\) for all devices. On the transmitter side, the SRP distributes each transmit symbol to both antennas by multiplication with \({\varvec{V}}\). The two received antenna symbols are linearly combined into a single symbol by multiplication with \({\varvec{U}}^H\) on the receiver side.

OFDM is chosen as the modulation scheme, resulting in separate \({\varvec{V}}\) and \({\varvec{U}}\) for each subcarrier.
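The SRP described above amounts to one small matrix product per subcarrier at each link end. A minimal NumPy sketch, assuming *d* = 1 stream, random stand-in matrices and an identity channel for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
M, Nt, Nr, d = 128, 2, 2, 1           # OFDM subcarriers, active antennas, streams

# Per-subcarrier precoders/decoders as delivered by the CRP (random stand-ins).
V = rng.standard_normal((M, Nt, d)) + 1j * rng.standard_normal((M, Nt, d))
U = rng.standard_normal((M, Nr, d)) + 1j * rng.standard_normal((M, Nr, d))
s = rng.standard_normal((M, d, 1))    # one transmit symbol per subcarrier

tx = V @ s                            # SRP transmitter: distribute s to antennas
r = tx                                # identity channel for illustration only
s_hat = np.conj(np.transpose(U, (0, 2, 1))) @ r   # SRP receiver: combine with U^H
```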

### 6.3 Closed-form IA computational complexity

The resource requirements of an optimized, efficient integer implementation of the antenna-selection IA algorithm are presented in this section, based on FPGA implementation results. Target systems include SDR platforms, FPGAs and ASICs.

The antenna selection metric *C* is computed for both Eigenvectors. All intermediate matrices can be independently scaled by arbitrary scalars without affecting the antenna decision or \({\varvec{V}}\) and \({\varvec{U}}\). Exploiting this makes the cost of all involved 2 × 2 matrix inversions negligible and allows intermediate matrices to be block-normalized by shifting, i.e. extracting a common power of 2 from all matrix elements. This results in reduced integer word lengths and thus reduced hardware costs. Table 1 summarizes the number of required real-valued mathematical base operations for antenna selection and the computation of \({\varvec{V}}\) and \({\varvec{U}}\) per antenna combination and subcarrier, without a final normalization step of \({\varvec{V}}\) and \({\varvec{U}}\). Complex multiplications are composed of three real multiplications, three additions and two subtractions; INVSQRT denotes the reciprocal square root [21].

Table 1: Operation counts for the computation of *C* per antenna combination *i* and subcarrier

| OP | ADD | MUL | SQRT | INVSQRT |
|---|---|---|---|---|
| Matrix mult. | 696 | 348 | 0 | 0 |
| Eigenvectors | 15 | 8 | 3 | 0 |
| Metric score | 46 | 82 | 6 | 2 |
| #OPC | 757 | 438 | 9 | 2 |

To keep the total transmit power constant, the precoding matrices \({\varvec{V}}\) of the chosen antenna combination need to be normalized, adding three ADD, eight MUL and one INVSQRT operations (#OPN) per transmitter and subcarrier. The above analysis implies that, in general, the implementation cost is dominated by the multiplications in terms of silicon area and power consumption. As an example, a typical 128 subcarrier OFDM system with two out of three antennas chosen per transmitter and 1 ms IA processing latency requires ~861 million MUL/s for realtime operation.

For the *K* = 3 users case with \(N_{t,a} = 2\) active transmit antennas used out of \(N_{t,p}\) physical antennas per transmitter and \(N_{r,a} = 2\) receive antennas used out of \(N_{r,p}\) antennas per receiver, the number of combinations to be inspected per subcarrier is

\(n = \binom{N_{t,p}}{N_{t,a}}^K \cdot \binom{N_{r,p}}{N_{r,a}}^K.\)

The IA computation has to complete within the processing interval \(T_0\). Assigning relative operation costs \(\alpha_i\) to each operation type \(\mathrm{OP}_i\), the total computational cost for *M* subcarriers becomes

\(C = \frac{M}{T_0} \left( n \sum_i \alpha_i \, \#\mathrm{OPC}_i + K \sum_i \alpha_i \, \#\mathrm{OPN}_i \right).\)
### 6.4 Hardware cost estimation

For a system with \(N_{t,p} = 3\) antennas, *M* = 128 subcarriers and \(T_0 = 1\,\mathrm{ms}\), the total IA costs are estimated to be \(C = 1.875\,\mathrm{GOPS}\), using the relative silicon area costs of 16-bit arithmetic operations listed below.

Relative silicon area costs of 16-bit arithmetic operations

| OP | ADD | MUL | SQRT | INVSQRT |
|---|---|---|---|---|
| α | 0.108 | 1 | 1.73 | 3 |
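The 1.875 GOPS figure can be reproduced from the operation counts of Table 1, the normalization overhead #OPN and the relative costs α; the grouping of combination and normalization terms below, and the combination count *n* = 27 for \(N_{t,p} = 3\), are our reading of the cost model:

```python
# Total IA cost estimate: K = 3 users, Ntp = 3 -> n = 3^3 = 27 combinations
# per subcarrier (assumed reading), M = 128 subcarriers, T0 = 1 ms.
alpha = {"ADD": 0.108, "MUL": 1.0, "SQRT": 1.73, "INVSQRT": 3.0}
opc = {"ADD": 757, "MUL": 438, "SQRT": 9, "INVSQRT": 2}   # per combination (Table 1)
opn = {"ADD": 3, "MUL": 8, "SQRT": 0, "INVSQRT": 1}       # normalization, per transmitter

K, n, M, T0 = 3, 27, 128, 1e-3
per_subcarrier = (n * sum(alpha[o] * opc[o] for o in alpha)
                  + K * sum(alpha[o] * opn[o] for o in alpha))
C = M * per_subcarrier / T0
print(f"C = {C / 1e9:.3f} GOPS")     # -> C = 1.875 GOPS
```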

For the configuration above, the original MATLAB algorithm takes 3.63 s on an Intel Xeon 2.4 GHz CPU running MATLAB R2012a for the computation of the optimal antenna combination \(\hat i\) and its corresponding precoding and decoding matrices \({\varvec{V}}_k\) and \({\varvec{U}}_k\) from a set of channel information *H*. The FPGA implementation created in this case study achieves realtime operation, requiring 380 µs at 100 MHz clock frequency on a Xilinx Virtex-6 LX550T FPGA in a BEE4 emulation system. Thus, the achieved speedup is 9,553.

By using the *UEMU* approach, the design effort of building a dedicated bit-true software simulation model has been saved. The required numerical precision could be determined by instrumentation of the highly optimized hardware modules. The resulting DSE hardware acceleration allows the coverage of a much larger design parameter space compared to software simulation. Furthermore, the significant speedup allows the repeated execution of a DSE over a large parameter space, which accelerates the cycle times of the iterative hardware design process.

### 6.5 Cost estimation for iterative K-user IA

For the general *K* > 3 user case, no closed-form solutions are known, but iterative algorithms exist. In this section, we present implementation complexity estimates for the MMSE IA algorithm of [22].

In each iteration, a Lagrangian multiplier \(\lambda_k \ge 0\) is computed to satisfy the transmit power constraint \(\| {\varvec{V}}_k \|_2^2 \le 1\) by Newton iteration.
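The Newton search for the power-constraint multiplier can be sketched as follows; the precoder structure \(V(\lambda) = (A + \lambda I)^{-1} b\), with `A` and `b` standing in for the interference-plus-noise term and the effective desired channel, is our reading of the MMSE IA update in [22]:

```python
import numpy as np

def mmse_multiplier(A, b, tol=1e-12, iters=50):
    """Newton iteration for the multiplier lam >= 0 such that
    ||V(lam)||_2^2 <= 1, with V(lam) = inv(A + lam*I) @ b and A Hermitian PSD.

    A and b are placeholders for the interference-plus-noise term and the
    effective desired channel of the MMSE IA precoder update."""
    sig, Q = np.linalg.eigh(A)            # A = Q diag(sig) Q^H
    c = np.abs(Q.conj().T @ b) ** 2       # |b| in the eigenbasis, squared
    f = lambda lam: np.sum(c / (sig + lam) ** 2) - 1.0   # ||V(lam)||^2 - 1
    if f(0.0) <= 0.0:                     # power constraint already inactive
        return 0.0
    lam = 0.0                             # f is convex decreasing: Newton from
    for _ in range(iters):                # the left converges monotonically
        df = -2.0 * np.sum(c / (sig + lam) ** 3)
        step = f(lam) / df
        lam -= step
        if abs(step) < tol:
            break
    return lam
```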

The high computational complexity of iterative IA algorithms makes their characterization by software simulation infeasible. For bit-true models, using hardware acceleration is mandatory to cover a sufficiently large design space.

## 7 Summary and conclusion

We have presented an FPGA-based framework for the design of highly optimized hardware modules. Leveraging two case studies from different fields of signal processing, the framework has proved to be suitable for the DSE of computationally demanding algorithms and architectures. By building on infrastructure synthesizable for ASICs and FPGAs, the same highly optimized hardware modules used for ASIC synthesis are instrumented and used for the emulation-based DSE on FPGAs. Therefore, building a bit-true software simulation model for the hardware characterization is not required. The infrastructure provided by the framework is optimized as an industry-grade SoC template for ASIC synthesis. Strict interface definitions within the framework are the key to portability and support of a range of emulation platforms with varying complexity. Combined with the simplified design partitioning approach, parallel development and verification of distinct system modules can be carried out on a number of different emulation platforms, resulting in an accelerated system integration. Seamless multi-target support for mid- to high-end emulation systems and a flexible simulation/emulation partitioning approach have proven to be valuable for concurrent development and for reducing the required number of expensive hardware emulation systems.

The SAR implementation case study has revealed the precision requirements for SAR processing and the influence of the numerical precision in distinct processing steps on the picture quality. As the presented framework provides a powerful communication infrastructure, the design effort could be focused on the core signal processing algorithms. By simple instrumentation, the highly optimized hardware modules have been used for a DSE, saving the design effort of building a dedicated bit-true software simulation model. Furthermore, the higher emulation speed compared to pure software simulation significantly extended the coverable design space.

In the second case study, hardware costs for dedicated IA implementations have been studied. IA is a potential candidate to increase the throughput and spectral efficiency in wireless communication systems. The high computational effort of IA algorithms combined with the energy constraints of mobile applications makes the use of dedicated hardware accelerators and their rigid optimization mandatory. Again, the DSE simulation runtime would be prohibitive for pure software simulation. The required design effort for instrumenting the hardware modules for a DSE was significantly lower compared to setting up a bit-true software simulation model.

The case studies presented in this paper have shown that for complex signal processing tasks, a comprehensive DSE within reasonable time is possible by using such instrumented hardware modules.