Analog Integrated Circuits and Signal Processing, Volume 78, Issue 3, pp 557–571

Hardware-accelerated design space exploration framework for communication systems

Case studies in synthetic aperture radar and interference alignment processing


  • M. Kock
    • Institute of Microelectronic Systems, Leibniz Universität Hannover
  • Sebastian Hesselbarth
    • Institute of Microelectronic Systems, Leibniz Universität Hannover
  • Martin Pfitzner
    • Institute of Microelectronic Systems, Leibniz Universität Hannover
  • Holger Blume
    • Institute of Microelectronic Systems, Leibniz Universität Hannover

DOI: 10.1007/s10470-013-0127-6

Cite this article as:
Kock, M., Hesselbarth, S., Pfitzner, M. et al. Analog Integr Circ Sig Process (2014) 78: 557. doi:10.1007/s10470-013-0127-6


The efficient hardware implementation of signal processing algorithms requires a rigorous characterization of the interdependencies between system parameters and hardware costs. Pure software simulation of bit-true implementations of computationally complex algorithms is prohibitive because of excessive runtimes. We therefore present a field-programmable gate array (FPGA) based hybrid hardware-in-the-loop design space exploration (DSE) framework that combines high-level tools (e.g. MATLAB, C++) with a System-on-Chip (SoC) template mapped onto FPGA-based emulation systems. This combination significantly accelerates the design process and the characterization of highly optimized hardware modules, and it helps to quantify the interdependencies between system parameters and hardware costs. The achievable emulation speedup with bit-true hardware modules is a key factor enabling the optimization of complex signal processing systems using Monte Carlo approaches, which are infeasible in pure software simulation due to the large required stimuli sets. The framework supports a divide-and-conquer approach through flexible partitioning of complex algorithms across the system resources on different layers of abstraction, which makes it easy to split the design process among different teams. The presented framework comprises a generic state-of-the-art SoC infrastructure template, a transparent communication layer including MATLAB and hardware interfaces, module wrappers, and DSE facilities. The hardware template is synthesizable for a variety of FPGA-based platforms. Implementation and DSE results are presented for two case studies from the different application fields of synthetic aperture radar image processing and interference alignment in communication systems.


Keywords: Design space exploration (DSE) · Emulation · Fixed-point arithmetic · Synthetic aperture radar (SAR) · Interference alignment (IA)

1 Introduction

Design space exploration (DSE) for signal processing hardware can be conducted using models on a variety of abstraction levels, usually resulting in a trade-off between implementation effort, model accuracy and simulation speed. Models on high abstraction levels are typically quicker to implement and verify and provide shorter execution times than their lower-level counterparts, thus allowing a larger design parameter space to be covered at an early design stage [10]. Assessing low-level design parameters, such as the implications of bit-width precision across all stages of a system, towards the target hardware implementation requires bit-true models. This contribution focuses on the exploration of low-level DSE parameters, with an emphasis on the design and verification time and the simulation speed of bit-true models.

Field-programmable gate array (FPGA) based rapid prototyping systems are widely used in algorithm research and development. Combined with automatic and semi-automatic high-level description to hardware description language (HDL) code generation tools, they facilitate quick hardware deployment, which is invaluable for proof-of-concept studies. However, the hardware efficiency achievable with this design flow is often limited and insufficient for resource-limited applications and for the verification of application-specific integrated circuit (ASIC) designs. In this project, techniques from FPGA-based ASIC verification and rapid prototyping are combined for the bit-true DSE of highly optimized hardware architectures.

Design time is a limited resource, so high design efficiency is an important goal. In a typical implementation scenario for complex designs, high-level reference models are used. These models consist of several modules to be integrated. The choice of modules to be optimized is often based on profiling results, with those modules contributing significantly to the overall resource requirements being chosen for optimization. This leads to a hybrid design consisting of a mixture of high-level modules and highly optimized modules, running on hardware ranging from general-purpose processors and application-specific instruction-set processors to FPGA-based rapid prototyping systems and dedicated hardware accelerators.

Signal processing reference algorithms are typically implemented using floating-point data types using high level software tools. However, fixed-point or non-standard floating-point data types are preferred for optimized dedicated hardware implementations due to their higher area and power efficiency. The bit widths for both data types can be chosen arbitrarily in dedicated hardware implementations, even separately for each processing element (PE). Migrating to fixed-point arithmetic opens a large parameter space to be optimized, which is infeasible for pure software simulation due to the large data sets required for numerical characterization. Common metrics for algorithms used in communication systems are the resulting overall system performance in terms of throughput, signal-to-noise ratio (SNR), error probability etc.
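As a toy illustration of this word-length trade-off (not part of the framework; all numbers are illustrative), the following Python sketch quantizes a floating-point signal to a signed fixed-point grid and measures the resulting SNR, which grows by roughly 6 dB per additional fractional bit:

```python
import numpy as np

def to_fixed_point(x, int_bits, frac_bits):
    """Quantize a float array to signed fixed-point with the given
    integer/fractional bit split (round-to-nearest, saturating)."""
    scale = 2.0 ** frac_bits
    lo = -2.0 ** (int_bits - 1)                  # most negative representable value
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale     # most positive representable value
    q = np.round(x * scale) / scale              # round onto the fixed-point grid
    return np.clip(q, lo, hi)

def quantization_snr_db(x, xq):
    """SNR of the quantized signal relative to the float reference."""
    noise = x - xq
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))

# Quantize a sine wave with increasing fractional word length
t = np.linspace(0, 1, 1024, endpoint=False)
x = 0.9 * np.sin(2 * np.pi * 5 * t)
for frac_bits in (6, 8, 10):
    xq = to_fixed_point(x, int_bits=2, frac_bits=frac_bits)
    print(frac_bits, round(quantization_snr_db(x, xq), 1))
```

Sweeping `frac_bits` over a bit-true model of every processing element is exactly the kind of parameter space the text describes, which is why exhaustive software simulation quickly becomes infeasible.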

The main goals of the presented unified emulation framework (UEMU) are to provide:
  • a DSE environment for the characterization of bit-true models

  • a hybrid SW/HW co-simulation environment

  • synthesizable state of the art System-on-Chip (SoC) infrastructure for ASIC and FPGA targets

  • a platform-independent abstraction layer for design tools

Section 2 gives a brief overview of related work. The main contributions of this article are as follows. The DSE process using bit-true models for computationally intense signal processing systems is presented in Sect. 3. A state-of-the-art SoC infrastructure template and development framework is presented in Sect. 4.

Two case studies from different areas of signal processing are presented in Sects. 5 and 6. The underlying applications and algorithms of both case studies are briefly sketched in their respective sections.

2 Related work

FPGA-based emulation systems are commonly used for ASIC design verification, simulation acceleration and rapid prototyping. While the underlying hardware platforms exhibit similar characteristics, the deployed design flow greatly varies depending on the application.

A typical rapid prototyping design flow allows the integration of hardware accelerators, e.g. signal processing modules, into a high-level MATLAB/Simulink system model for simulation acceleration. The hardware description is usually generated from high-level models. While facilitating quick hardware deployment in algorithm research, proof-of-concept studies and first hardware demonstrators, the resulting design efficiency may not be suitable for production. DSEs at an early design stage often build on FPGA acceleration to cover a larger design space with coarse accuracy. The automatically generated hardware, or parts thereof, then serves as a starting point for an optimized implementation.

An FPGA-based design and verification framework is presented in [4] as a software and hardware co-emulation approach used in a case study designing a hardware soft-input soft-output multiple-input multiple-output (MIMO) sphere decoder. Efficient emulator utilization and an optimized simulation speedup during the design and verification phases are achieved by coupling a hardware emulation system with a parallel computing cluster. Test stimuli are generated by software and multiplexed to the design under test by the framework, classifying it as a simulation accelerator and hardware testbed. A similar testbed approach is presented in [9], with an implementation case study for an IEEE 802.11n MIMO baseband transceiver.

To the best of our knowledge, there are no publications on efficient hardware architectures for digital baseband processing in interference alignment (IA) systems with goals similar to the case study presented in Sect. 6. In [13], a rapid-prototyping based real-time testbed involving FPGA acceleration is presented as a proof-of-concept demonstrator. Offline processing of captured data is used to evaluate IA under real-world conditions in [8]. Both publications use common off-the-shelf hardware with a focus on algorithm exploration, providing few details on the implemented digital baseband signal processing hardware.

Depending on the development strategy, two main design approaches can be distinguished: first, cycle-true or event-triggered simulation and verification; second, data-flow based verification by comparing algorithm output at different stages. For cycle-true simulation, different approaches can be found in the literature that exploit FPGA-based emulation systems to accelerate simulation.

For example, Moraes et al. [14] present a generic FPGA emulation framework which links a synthesized design under verification with the software simulator. While improving the simulation and verification speed at signal level, the framework does not target the successive implementation of signal processing subtasks in the context of an existing reference algorithm. A similar cycle-accurate approach is proposed by Del Valle et al. [7]: an FPGA-based emulation framework is used for the characterization of multi-processor SoC architectures, focusing on porting complex algorithms to embedded systems. In terms of DSE, it is designed for the evaluation and analysis of characteristic processor features, e.g. the memory hierarchy and the influence of different processing cores.

In contrast to cycle-accurate approaches, FPGA-based emulation systems can be used to accelerate the development, verification and execution of digital signal processing tasks inside a given reference software algorithm. Here, the main focus is not the extraction of single signals or performance counters but a comparison of the overall algorithmic output. A commercial tool flow for mapping complex algorithms to FPGAs is the BEEcube Platform Studio [2]. Distinct signal processing tasks can be implemented using provided library elements or directly in HDL. Those elements can be associated with MATLAB/Simulink components and allow for a successive implementation of the reference algorithm. However, data exchange between MATLAB and the emulated hardware is limited.

The UEMU framework concept presented here extends existing hardware-in-the-loop approaches by supporting hardware engineers during the concurrent implementation of algorithms at early stages. To this end, the framework provides a flexible interface between MATLAB and the emulation system at arbitrary stages of the reference algorithm, allowing hardware modules to be implemented and verified easily in the context of the whole target algorithm. Data transfers can be initiated transparently from MATLAB via Ethernet, leaving the control flow at software level. Communication with software HDL simulation tools is fully integrated into the software and hardware interfaces provided by UEMU. This enables live communication between software, emulated hardware and HDL simulation in a closed loop, leading to a significant HDL simulation speedup for partitioned system configurations. Tools are provided for dynamic on-line design parameter sweeps by instrumentation of the optimized RTL hardware models.

3 Signal processing systems DSE

The process of designing complex digital electronic circuits offers a large variety of options to the designer. There are many valid implementations that fulfill the specification, but they differ in certain properties, e.g. silicon area, power efficiency, flexibility, testability and design effort. These parameters and properties span the so-called design space. They cannot be chosen independently of each other; for example, flexibility and power efficiency are often contradictory requirements. At an early design stage, the designer has to choose important design parameters that will eventually determine the final implementation properties. Most properties, like the resulting power dissipation, cannot be chosen directly but are rather the result of many design choices: for example, hardwired architectures tend to be more power efficient than programmable architectures.

A DSE establishes relations between possible points in the design space, ultimately leading to cost functions [3] modeling the quantitative relation between design parameters and the resulting hardware properties and resource requirements. These models serve as a basis for important design decisions in an early design phase.

Certain parameters are of special interest in the domain of wireless communication platforms. The limited power budget of mobile devices puts hard constraints on power efficiency, requiring power optimization across all layers of algorithm development, design implementation and semiconductor technology. This often conflicts with the demand for flexibility, which is another important requirement due to rapidly evolving communication standards. Flexibility and programmability are also required to keep pace with shorter product lifecycles and to enable the re-use of the same hardware platform across multiple product generations. Also for economic reasons, the total design time is of great importance.

The hardware resource requirements of wireless communication algorithms can be reduced by exploiting their numerical properties and finding appropriate approximations. Examples include word length limited fixed-point number representation, substituting bit-true elementary mathematical functions by numerical approximations, but also heuristics used in communication specific signal processing blocks like MIMO decoding and forward error correction (FEC). In contrast to highly precise computations available in high-level algorithm development environments, resource limited communication system implementations have to deal with limited accuracy in each successive block.

Typically, these approximations trade accuracy for timing improvements, area and power savings, and they can often be assessed in terms of the resulting bit error rate, throughput and related quantities of the overall processing chain. However, a systematic approach to fine-grained optimization requires evaluating the quantitative impact of an approximation on the overall system performance for a huge number of points in the design space. Though these relations may be coarsely estimated from high-level models, such estimates cannot cover all effects present in an implementation containing approximations. The highly nonlinear characteristics found in many signal processing blocks like MIMO decoding and FEC lead to interdependencies between approximation parameters and system-level performance degradation that are hard to predict analytically. The strategy adopted in this paper is to enable the characterization of the above-mentioned approximations over a wide parameter range by instrumentation of the optimized RTL code. For example, modules for the dynamic adaptation of word lengths at runtime can be inserted at key signals of interest. This simple instrumentation is carried out by hand at the RTL source code level. These modules can be globally configured to be transparent for the final production synthesis.

Deriving comprehensive cost models using Monte Carlo methods requires visiting a significantly larger number of points in the design space compared to existing heuristically driven parameter optimization approaches covered by existing FPGA-based simulation acceleration systems. The achievable simulation speedup is a key factor enabling the characterization and optimization of complex communication systems using Monte Carlo approaches which are infeasible for pure software simulation due to the large required stimuli sets. Long simulation times in the order of days or even weeks heavily hinder the design process, as designing VLSI circuits is an iterative process and simulations typically have to be re-run several times during the design phase. Thus, increasing simulation speed by a few orders of magnitude significantly increases the designer’s efficiency.

4 Hybrid emulation framework

4.1 Scope of the framework

In the course of this work an FPGA-based hybrid hardware-in-the-loop research and DSE framework was created. It allows the combination of high-level tools (e.g. MATLAB/Simulink and C/C++) with optimized hardware modules.

Target-device and commercial-tool independence is achieved through abstraction layers and common interfaces provided in software and hardware. This significantly simplifies the adaptation of existing designs to new targets: the framework has to be extended to a new platform only once, while designs linking against the common interfaces do not require any modifications. Furthermore, the designer is relieved from creating and maintaining design- and target-specific infrastructure. The suggested design guideline is thus to use the interfaces provided by UEMU whenever possible and to avoid adding target-specific modifications.

The proposed UEMU framework complements automated high level hardware generation tools and libraries like Xilinx CoreGen and MATLAB to HDL toolboxes by providing a portable industry-grade SoC infrastructure suitable for both debug and verification as well as ASIC deployment. Although these tools are not integral parts of the framework, integrating generated HDL code and netlists is common practice for third-party intellectual property cores and uncritical functions.

Starting from a high-level algorithm description, more and more sub-modules and algorithms can be seamlessly transferred to hardware and verified within full system context. Moreover, signal processing tasks may be dynamically moved between software and hardware processing modules during runtime. This allows for design, optimization and verification of efficient signal processing modules for computationally demanding algorithms, e.g. next-generation wireless communication systems or on-board synthetic aperture radar (SAR) processing as presented in Sects. 6 and 5, respectively. The translation of critical processing blocks into optimized RTL hardware descriptions is commonly carried out by hardware designers, as the resource efficiency of hand-written RTL code is still superior to its tool-generated counterpart. Uncritical blocks may be generated using high-level synthesis tools and imported into UEMU based designs (Fig. 1).
Fig. 1

Multi-dimensional design space (Source [15])

A typical generic cascaded data processing chain consisting of three blocks is shown in Fig. 2(a). The top row depicts the logical data flow through separate algorithm blocks A to C. Below, the corresponding data flow in a pure software simulation of the system is given. Figure 2(b) shows a possible data flow through the proposed hybrid DSE system. Signal processing blocks B and C have been implemented in hardware, whereas block A is not yet available. Thus, algorithm block A is executed in software, with the results being transferred to the hardware emulation system. Software block B processing is disabled and replaced by hardware block B. Both the software and hardware versions of block C run in parallel on the same input data generated by hardware block B. The configuration shown for block C is especially useful for verification purposes. Assuming that block B constitutes the bottleneck of the pure software simulation, the shown hybrid configuration achieves a significant speedup even without block A being available in hardware. This approach enables more comprehensive DSEs in an earlier design phase. Partitioning the hardware design process across teams is simplified by blockwise verification in the full system context.
Fig. 2

UEMU hybrid emulation flow. a Software-only reference. b Partial hardware implementation and verification (HW/SW co-emulation/simulation)
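The parallel execution of block C in software and hardware can be mimicked in a few lines of Python. In this sketch the "hardware" block is merely the floating-point reference with its output rounded to a fixed-point grid, purely to illustrate the blockwise comparison; it is not one of the framework's actual modules:

```python
import numpy as np

def block_b_sw(x):
    """Floating-point software reference for block B (illustrative)."""
    return np.abs(np.fft.fft(x))

def block_b_hw(x, frac_bits=10):
    """Stand-in for the emulated fixed-point hardware block B: the same
    computation with its output rounded to a fixed-point grid."""
    scale = 2.0 ** frac_bits
    return np.round(block_b_sw(x) * scale) / scale

# Blockwise verification: feed both versions the same stimuli and compare.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
max_err = np.max(np.abs(block_b_sw(x) - block_b_hw(x)))
print(max_err <= 2.0 ** -11)   # round-to-nearest error is at most half an LSB
```

In the real flow, the comparison happens against the emulated hardware output transferred back over Ethernet, but the verification logic is the same: identical stimuli, bounded deviation.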

The framework is based on an SoC-centric approach that also makes it suitable for use in an ASIC technology design flow, thus enabling test, debugging and characterization of signal processing modules in their target environment.

4.2 Framework description

The framework comprises a host PC, a software library providing a transparent communication application programming interface (API), an FPGA-based emulation system, fully synthesizable VHDL SoC infrastructure, dedicated accelerators, and processor soft-cores.

The host PC is used for executing high-level MATLAB or C/C++ algorithms and is connected to the emulation system via Gigabit Ethernet. Software API libraries provide unified, transparent communication between MATLAB, C/C++, embedded software and the hardware running on the emulation system. The same API interfaces are also available for digital simulation via the ModelSim foreign language interface, effectively providing a simulation and verification environment at minimal extra effort. The framework block diagram is shown in Fig. 3.
Fig. 3

UEMU hybrid emulation framework block diagram with exemplary algorithm mapping to SW tasks and HW modules
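The idea of a transparent communication API can be sketched as follows. The class and method names below are assumptions chosen for illustration, not the framework's actual interface; a real backend would transfer the data to the FPGA over Gigabit Ethernet, while this stub simply loops it back in memory:

```python
class EmulatorLink:
    """Hypothetical sketch of a transparent communication API in the
    spirit of the UEMU software library. A production backend would speak
    to the emulation system over Gigabit Ethernet (or to a ModelSim
    session); this stub keeps the data in a local dictionary so the
    calling code stays identical regardless of the backend."""

    def __init__(self):
        self._mem = {}                 # address -> list of words

    def write(self, addr, data):
        """Write a block of words to an emulator address."""
        self._mem[addr] = list(data)

    def read(self, addr, length):
        """Read back a block of words from an emulator address."""
        return self._mem[addr][:length]

# The caller is oblivious to whether hardware or a stub answers:
link = EmulatorLink()
link.write(0x1000, [1, 2, 3, 4])
print(link.read(0x1000, 4))
```

Swapping the stub for an Ethernet-backed implementation would leave the high-level algorithm code unchanged, which is the point of the abstraction layer described above.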

A subsystem template abstracts from emulation system specific infrastructure, making the user designed core signal processing modules independent of the underlying emulation system. During algorithm development, computationally intensive signal processing modules can be implemented in Verilog or VHDL and added to the subsystem. Remaining processing modules may continue to run as high-level models, enabling a divide-and-conquer implementation and verification approach. This allows signal processing modules to be split and run distributed on a highly heterogeneous signal processing system, enabling a fine-grained module-wise migration from high-level software reference to optimized production quality hardware. All resources are accessible by software and hardware, providing flexible partitioning and migration of processing tasks between high-level software, embedded software and dedicated hardware modules.

4.3 Emulation system abstraction approach

The VHDL SoC infrastructure comprises an open core protocol (OCP) multi-layer bus, an Ethernet DMA interface, SDRAM controllers, on-chip memories, standard RISC soft-core processors and a massively parallel parameterizable VLIW ASIP [17]. It has been adapted to and tested on each of the following three different emulation systems ranging from high-end to portable systems.

The high-performance BEEcube BEE4 [2] rapid prototyping system incorporates four Xilinx Virtex-6 LX550T FPGAs each featuring 4 GB DDR3-PC1066 SDRAM, Gigabit Ethernet, 20 Gbps QSFP+ interface, FMC-HPC expansion connector, and PCIExpress ×8 slot.

The mid-range Xilinx Virtex-6 LX240T ML605 Evaluation Kit [23] is equipped with 4 GB DDR3-PC800 SDRAM, Gigabit Ethernet, SFP interface, FMC-HPC and FMC-LPC expansion connectors, and PCIExpress ×4 slot.

The third supported emulation system is the MCPA board [1] that has been developed at the Institute of Microelectronic Systems (IMS) as a portable RISC/FPGA demonstrator (see Fig. 4). It comprises a Xilinx Virtex-5 LX220T FPGA with up to 1 GB DDR2 SDRAM, Gigabit Ethernet, 1.65 Gbps channel link interfaces, and an Intel IXP460 ARM XScale RISC processor. The IXP460 runs a Linux 2.6 kernel with the standard software development toolchains available. It is fully integrated in the environment and can be used for control and signal processing tasks.
Fig. 4

FPGA-based emulation system developed at IMS

For each emulation system an abstraction layer has been developed that takes care of board-specific interfaces, e.g. RAM and Ethernet. The abstraction layer also implements the hardware interfaces for the software communication API and an OCP multi-layer bus between the RAM controllers, on-chip memory, and the board-independent user design. This abstraction allows the same user design to be used on each of the supported emulation systems without any changes.

The Xilinx Virtex-7 VC707 and the Digilent ZedBoard with the Xilinx Zynq 7Z020 programmable SoC are currently being evaluated and will be included in the framework.

5 Case study: SAR

In the field of remote sensing and security applications, a variety of different active and passive sensor systems can be found. SAR is an active radar technology used to generate high-resolution images by repetitive transmission and reception of electromagnetic pulses. The image formation process is based on the relative movement between the sensor and an observed area (Doppler effect). Typically, SAR is used on airborne platforms but can also be found in a variety of stationary security scanners [11, 20]. In contrast to electro-optical and infra-red sensors, complex digital signal processing algorithms need to be applied to the SAR raw data to obtain the final image. In addition, state-of-the-art SAR systems deliver very high data rates. For these reasons, efficient hardware architectures with high processing performance are mandatory.

5.1 System signal model

To achieve high bandwidths and moderate peak transmit power typical SAR systems use linear frequency modulated (FM) waveforms, given by
$$ s_r(t_r) = w_r(t_r)\cos\{2\pi f_c t_r + \pi K_r t_r^2\}, \quad -T_r/2 \le t_r \le T_r/2 $$
where wr(tr) defines the range amplitude, fc is the carrier radio frequency (RF) and Kr = Bbb/Tr is the range sweep rate for a given pulse duration Tr and baseband bandwidth Bbb. Because of the propagation speed c of an electromagnetic wave, the range time variable tr is often referred to as fast-time.
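The pulse of Eq. 1 is straightforward to generate numerically. The Python sketch below uses the baseband form (fc = 0, rectangular envelope wr) and illustrative parameter values that are not taken from the paper:

```python
import numpy as np

# Linear FM pulse from Eq. 1 in baseband form (f_c = 0, rectangular
# envelope w_r); all parameter values are illustrative.
T_r = 10e-6              # pulse duration [s]
B_bb = 50e6              # baseband bandwidth [Hz]
K_r = B_bb / T_r         # range sweep rate [Hz/s]
fs = 4 * B_bb            # sampling rate [Hz]

N = int(T_r * fs)                     # samples per pulse
t_r = (np.arange(N) - N // 2) / fs    # fast-time axis, -T_r/2 .. T_r/2
s_r = np.cos(np.pi * K_r * t_r ** 2)  # the chirp itself

# The instantaneous frequency K_r * t_r sweeps over +/- B_bb / 2.
print(N, s_r[N // 2])
```

At tr = 0 the phase is zero, so the center sample equals 1; the instantaneous frequency grows linearly with fast-time, which is the defining property of the linear FM waveform.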
As the sensor advances in azimuth direction, as depicted in Fig. 5, subsequent pulses are transmitted every 1/fprf seconds, where fprf denotes the pulse repetition frequency (PRF). Each reflector covered by the beam footprint is illuminated several times by the advancing antenna at different angles of incidence. The relative movement between reflector and antenna gives rise to the Doppler modulation. The two-way distance Rsl(ta) between an arbitrary azimuth position x = v·ta and a reflector can be expressed as a function of azimuth time. For typical application scenarios (effective azimuth antenna opening angle θa ≪ 20°), a Taylor series expansion is commonly used as an approximation for the square-root dependency, as given in Eq. 2.
$$ R_{sl}(t_a) = \sqrt{R_{0,sl}^2 + (v t_a)^2} \approx R_{0,sl} + \frac{v^2 t_a^2}{2 R_{0,sl}} $$
Fig. 5

SAR system model geometry (side-looking-radar)
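The accuracy of the Taylor approximation in Eq. 2 can be checked numerically. The geometry values below are illustrative assumptions, not figures from the paper:

```python
import numpy as np

# Numerical check of the Taylor approximation in Eq. 2 for an illustrative
# airborne geometry.
R0 = 5000.0                           # closest-approach range R_0,sl [m]
v = 100.0                             # sensor velocity [m/s]
t_a = np.linspace(-1.0, 1.0, 201)     # azimuth (slow) time [s]

exact = np.sqrt(R0 ** 2 + (v * t_a) ** 2)       # true two-way distance
approx = R0 + (v * t_a) ** 2 / (2 * R0)         # parabolic approximation
max_err = np.max(np.abs(exact - approx))
print(max_err < 1e-3)   # worst-case error well below a millimeter here
```

For this geometry the parabolic term captures the range history to sub-millimeter accuracy, which is why the expansion is adequate for the small opening angles stated above.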

When approaching the reflector the Doppler frequency is positive, whereas receding results in a negative frequency. By using the approximate definition of the range variation Rsl(ta) in Eq. 2 the Doppler frequency modulation of the azimuth signal component can now be expressed as a function of azimuth time ta.
$$ f_a(t_a) = f_{dc} + K_a t_a = \frac{2 v \cdot \sin(\theta_s)}{\lambda} - \frac{2 v^2}{\lambda R_{0,sl}}\cdot t_a, \quad -\frac{L}{2v} \le t_a \le \frac{L}{2v} $$
From Eq. 3, it can be seen that a positive instantaneous squint angle θs induces a positive shift of the Doppler spectrum. For the zero-squint case, the Doppler frequency fa(ta) is centered at fdc = 0, which is typical for stationary applications. The azimuth sweep rate Ka defines the variation of the Doppler frequency as a function of azimuth time ta, which is often referred to as slow-time because the sensor velocity v is much smaller than the propagation speed c.
$$ \begin{aligned} h_{imp}(t_a,t_r) =\; & w_r \left( t_r - 2R_{sl}(t_a)/c \right) w_a \left( t_a - t_{a,dc} \right)\\ & \times e^{-j4\pi R_{sl}(t_a) / \lambda}\\ & \times e^{j \pi K_r \left( t_r - 2R_{sl}(t_a)/c \right)^2} \end{aligned} $$

By combining the linear range FM pulse from Eq. 1 with the Doppler modulation in Eq. 3, the two-dimensional SAR impulse response himp(ta, tr) of a single point reflector is given by Eq. 4. In detail, wr and wa are the time-delayed amplitudes of the received signal. The first exponential term describes the Doppler modulation that results from the range distance variation Rsl(ta); the second term represents the delayed chirp from Eq. 1. Note that this expression assumes that the RF component of the transmitted pulse (sRF = cos(2π fc tr)) has been removed before analog-to-digital conversion (ADC) as part of a quadrature demodulation process.

Based on the impulse response himp of a single point reflector, the signal model of a ground surface with an arbitrary reflectivity distribution can be expressed as a two-dimensional convolution of the ground reflectivity g(ta, tr) and himp(ta, tr). An additional noise component n(ta, tr) is inserted, as it is present in all practical systems.
$$ s_{bb}(t_a,t_r) = g(t_a,t_r) * h_{imp}(t_a,t_r) + n(t_a,t_r) $$

The purpose of a SAR processor is to recover the ground reflectivity g(ta, tr) from the measured baseband signal sbb. This deconvolution is challenging because the impulse response himp is both range and azimuth dependent and exhibits a range-varying migration of signal energy as a result of the range distance variation Rsl(ta). This migration effect is often referred to as range cell migration (RCM) and has to be corrected during the image formation process.

5.2 SAR image reconstruction using the range-Doppler algorithm (RDA)

To solve the deconvolution problem in Eq. 5, a variety of SAR focusing algorithms have been developed, which can roughly be separated into time-domain and frequency-domain algorithms. By applying the time-domain matched filter (MF) in the frequency domain, the time-domain discrete convolution can be implemented as a complex multiplication in the frequency domain [16].
$$ u(t) = s(t) * h(t) = {\mathcal{F}}^{-1}\{ {\mathcal{F}}\{s(t)\} \cdot {\mathcal{F}}\{h(t)\}\} = {\mathcal{F}}^{-1}\{S(f) \cdot H(f)\} $$
The operator \(\mathcal{F}\) denotes a discrete Fourier transform and \(\mathcal{F}^{-1}\) the inverse operation. Especially for the large transform lengths typical in the SAR context, fast Fourier transform (FFT) algorithms offer a runtime-efficient implementation of the fast convolution task. Algorithms which take advantage of frequency-domain processing are the well-known RDA, the chirp-scaling and the wavenumber-domain algorithm; for a detailed description and comparison in terms of runtime efficiency and achievable focusing precision see [6]. A typical signal flow for the RDA is depicted in Fig. 6, which involves two-dimensional matched filtering and interpolation to solve the deconvolution problem in Eq. 5. In case of an airborne sensor platform, deviations from the ideal linear flight path have to be measured and corrected in an additional motion compensation (MoCom) stage to achieve high spatial resolutions.
Fig. 6

RDA SAR processing signal flow
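The fast-convolution identity of Eq. 6 can be verified numerically with a short Python sketch. Note that the DFT computes a circular convolution, so the direct reference below is circular as well:

```python
import numpy as np

# Numerical check of the fast-convolution identity in Eq. 6, using the
# circular convolution that the DFT actually computes.
rng = np.random.default_rng(1)
N = 256
s = rng.standard_normal(N)
h = rng.standard_normal(N)

# Direct circular convolution: (s * h)[k] = sum_n s[n] h[(k - n) mod N]
direct = np.array([np.sum(s * np.roll(h[::-1], k + 1)) for k in range(N)])

# Frequency-domain implementation: one multiplication per frequency bin
fast = np.fft.ifft(np.fft.fft(s) * np.fft.fft(h)).real

print(np.allclose(direct, fast))
```

The direct form costs O(N²) operations, the FFT-based form O(N log N), which is what makes frequency-domain matched filtering attractive for the long transform lengths of SAR processing.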

After demodulation and A/D conversion, the sensor raw data is compressed in range by means of a precalculated replica of the transmitted chirp pulse. After an inverse Fourier transform, subsequent processing is performed in the so-called range-Doppler domain, i.e. in the azimuth-frequency/range-time domain. Before the range-dependent azimuth compression can be applied, the range-varying cell migration has to be corrected by means of an interpolation. This step, commonly referred to as range cell migration correction (RCMC), straightens out the curved reflector trajectories so that they run parallel to the azimuth-frequency axis. The final inverse azimuth Fourier transform completes the image formation process.
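The range-compression step can be illustrated with a minimal Python sketch: a delayed baseband chirp echo is matched-filtered against the precalculated replica via the frequency domain, and the compressed energy peaks at the echo delay. All parameter values are illustrative:

```python
import numpy as np

# Minimal illustration of range compression by matched filtering.
N = 1024
K = 2.0e11                    # sweep rate [Hz/s] (illustrative)
fs = 1.0e8                    # sampling rate [Hz]
t = np.arange(N) / fs

# Transmitted replica: complex baseband chirp, 5 us long
chirp = np.exp(1j * np.pi * K * t ** 2) * (t < 5e-6)
delay = 200                   # echo delay in samples
echo = np.roll(chirp, delay)  # idealized point-target echo

# Matched filter = correlation with the replica, done as a fast convolution
compressed = np.fft.ifft(np.fft.fft(echo) * np.conj(np.fft.fft(chirp)))
peak = int(np.argmax(np.abs(compressed)))
print(peak)   # the compressed energy peaks at the echo delay
```

The long chirp collapses into a narrow peak at the target's range position, which is exactly what the precalculated-replica stage in Fig. 6 accomplishes on real sensor data.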

5.3 FPGA-based RDA hardware implementation

As mentioned before, SAR image generation has to cope, first, with the inherent algorithm complexity and, second, with the increasing sensor data rates of state-of-the-art SAR sensor systems. Especially compact SAR systems demand an efficient hardware architecture offering high throughput at moderate power consumption. Besides these system limitations, flexibility in terms of algorithm mapping as well as processing precision demands have to be taken into account.

A key challenge during the hardware development process is the choice of an appropriate trade-off between precision demands and hardware resource allocation. As this task requires a multi-dimensional parameter decision, the design space covers a large set of possible data path configurations, which imposes practical limitations in terms of simulation time. Previous work has shown that for an exemplary image dimension of 16k × 8k = 128 MPixel, the overall software simulation time for a bit-true data path model ranges from ~80 min (single core) to ~7 min (16 cores) [18]. In terms of required simulation time per output image, this bit-true model achieves about 8 runs per hour. Considering even a small design space of three parameters with 16 possible values each, the resulting 16³ = 4,096 permutations would require 512 h of simulation time, exceeding practical limits by orders of magnitude. To cover large parameter combinations in reasonable time, hardware-based emulation is mandatory.
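The runtime estimate above follows directly from the quoted figures; as a quick sanity check (using the ~7.5 min per run implied by 8 runs per hour):

```python
# Design-space runtime estimate from the figures quoted above:
# 3 parameters with 16 values each, ~7.5 min per bit-true run (16-core host).
permutations = 16 ** 3                  # 4,096 data path configurations
hours = permutations * 7.5 / 60         # minutes -> hours
assert permutations == 4096
assert hours == 512.0
```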

Efficient use of hardware resources can be obtained by switching from floating- to fixed-point arithmetic, which induces limitations in terms of precision and dynamic range. To evaluate the influence of hardware-related parameters, the reference algorithm has first been examined concerning the maximum achievable precision boundaries. For this purpose, a SAR sensor data set (near-field bike scan, provided by Ruhr-University Bochum) acquired with an 80 GHz (25 GHz effective bandwidth) frequency-modulated continuous wave sensor [20] has been chosen.

The first signal processing step that influences the overall processing result is the ADC, which is limited to 16 bit for the present data set. The successive reduction of the ADC precision from 16 to 2 bit in Fig. 7 shows the effect in terms of peak signal-to-noise ratio (PSNR) compared to the non-quantized reference input. The minimum block PSNR (dotted line) can be interpreted as the worst-case 16-pixel-diameter area of the final image, while the average image PSNR (solid line) is used as a metric for the overall image quality. For further analysis, a typical ADC precision of 14 bit has been chosen, which results in an upper PSNR boundary of 65 dB for subsequent processing. In addition to the ADC output precision, the influence of intermediate quantization, varied from 24 to 2 bit, has been evaluated in floating-point. The result, depicted in Fig. 8, shows that the PSNR saturates at 65 dB, which is the aforementioned boundary set by the ADC output precision. An interesting effect can be observed for the worst-case minimum PSNR, which decreases significantly compared to the average PSNR.
Fig. 7

ADC quantization (MATLAB floating-point)
Fig. 8

Processing quantization (MATLAB floating-point)
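The quantization sweeps of Figs. 7 and 8 can be mimicked in software. The following sketch is our own illustration, not the reference implementation: the function names are hypothetical, a simple uniform quantizer over [-1, 1) stands in for the ADC model, and the PSNR is computed against the unquantized reference:

```python
import numpy as np

def quantize(x, bits):
    """Uniform quantizer to `bits` bits over the signal range [-1, 1)."""
    levels = 2 ** (bits - 1)
    return np.clip(np.round(x * levels), -levels, levels - 1) / levels

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB against a reference signal."""
    mse = np.mean(np.abs(ref - test) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 4096)
# PSNR rises with the ADC word length, mirroring the sweep in Fig. 7
assert psnr(x, quantize(x, 14)) > psnr(x, quantize(x, 2))
```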

Besides the technical evaluation based on the image PSNR, a qualitative evaluation always has to be taken into account when comparing SAR images. Therefore, Fig. 9 depicts a set of quantization levels marked as red dotted lines in Fig. 8. For image PSNR values in the order of 30 dB and below, Fig. 9(d) illustrates the severe defocusing effect. To avoid these distortions in the final image, a word length of 16 bit (b) is chosen for the hardware data path configuration, which is close to the reference image (a). Further results refer to an internal fixed-point representation \(Q_{fp} = [Q_i, Q_f] = [2, 14]\), which separates the total word length of 16 bit into a 2-bit integer and a 14-bit fractional component.
Fig. 9

Focusing result for varying processing quantization (MATLAB floating-point)
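The \(Q_{fp} = [2, 14]\) representation can be illustrated with a small conversion helper (a sketch; the helper names are hypothetical and saturation behavior is assumed):

```python
def to_q2_14(x):
    """Convert a float to the Q[2,14] fixed-point format used in the data
    path: a 16-bit two's complement word with 2 integer and 14 fractional
    bits, covering [-2, 2) with a resolution of 2^-14."""
    frac_bits = 14
    raw = int(round(x * (1 << frac_bits)))
    lo, hi = -(1 << 15), (1 << 15) - 1   # saturate to the 16-bit range
    return max(lo, min(hi, raw))

def from_q2_14(raw):
    """Interpret a Q[2,14] integer word as a float."""
    return raw / (1 << 14)

# 1.0 maps to 2^14; the round-trip error is below one LSB (2^-14)
assert to_q2_14(1.0) == 16384
assert abs(from_q2_14(to_q2_14(0.7071)) - 0.7071) < 2 ** -14
```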

Based on the UEMU framework presented in Sect. 4, the RDA has been mapped to the Xilinx ML605 Evaluation Board. To this end, a new subsystem has been implemented which includes specific accelerators for the SAR image processing case. Besides the FFT, which has been implemented as a hybrid floating-point radix-2³ decimation-in-time architecture [12], two additional modules realize the remaining processing steps of the RDA: first, a MF PE which can be configured for range compression, MoCom, and range-adaptive azimuth compression; second, a finite impulse response filter based interpolation PE which is used for azimuth MoCom and RCMC. While the reference algorithm is running in MATLAB on the host-PC, interactive switching between the soft- and hardware domain is used for DSE and analysis tasks.

All PEs can be accessed and configured via a set of user registers for two reasons: first, to enable flexible adaptation of SAR processing parameters depending on the current sensor configuration; second, to switch between characteristic data path configurations without re-synthesis. As the azimuth MF makes the major contribution to the final focusing quality, this PE is discussed in detail as a proof of concept for the proposed emulation-based DSE approach. The MF architecture, depicted in Fig. 10, involves an iterative, pipelined CORDIC algorithm as described in [19] to evaluate Euler's formula \(e^{j\phi} = \cos(\phi) + j\sin(\phi)\), which is used to convert the range-dependent phase component into a complex-valued signal. The PE interfaces to the OCP bus via two separate read/write master ports, while a single slave port is used to access the configuration register file (green). Additional multiplexer structures switch signal components depending on the current use case, i.e. azimuth compression MF, MoCom, or range compression MF. The accumulator realizes the range dependence of the azimuth MF phase component.
Fig. 10

Azimuth compression MF hardware architecture

In the first instance, the influence of the data path configuration is considered, whereby the input, output, and internal CORDIC look-up table (LUT) precisions can be chosen by the user. In this configuration, the maximum number of iterations is set to \(I_{\max} = 14\), which equals the fractional part \(Q_f\) of the aforementioned fixed-point configuration \(Q_{fp} = [2, 14]\). The corresponding CORDIC rotation angles form the sequence \(\arctan(2^{-i})\) with \(i = 0, 1, \ldots, I_{\max} - 1\), each iteration adding about one bit of output precision. Increasing the number of iterations beyond \(I_{\max}\) does not contribute to the output precision, as the binary representations of the remaining rotation angles are zero. In Fig. 11, the influence of a reduced word length on the average PSNR is depicted. The plot shows that no additional gain in focusing precision is achieved for word lengths of 9 bit and above.
Fig. 11

CORDIC word length configuration (ML605 hardware emulation)
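Purely for illustration, a floating-point rotation-mode CORDIC with \(I_{\max} = 14\) iterations can be sketched as follows; the fixed-point word length effects studied in Figs. 11 and 12 are not modeled here:

```python
import math

def cordic_sincos(phi, iterations=14):
    """Rotation-mode CORDIC evaluating e^{j*phi} = cos(phi) + j*sin(phi).

    Each iteration rotates by +-atan(2^-i), adding roughly one bit of
    output precision, as in the azimuth MF phase-to-complex conversion.
    """
    # Precompute the aggregate gain of the micro-rotations
    K = 1.0
    for i in range(iterations):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0, 0.0, phi
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # Simultaneous update of both coordinates (shift-and-add rotation)
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x * K, y * K   # (cos(phi), sin(phi))

c, s = cordic_sincos(0.5)
assert abs(c - math.cos(0.5)) < 1e-3 and abs(s - math.sin(0.5)) < 1e-3
```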

Considering the involved hardware elements, these results have a significant impact on the target hardware resource allocation. The CORDIC input precision directly influences the amount of RAM required to store the reference phase. On the other hand, the PSNR results for the CORDIC LUT and output precision can be used to reduce the number of registers (input data FIFO and output registers for each CORDIC element, CE). The next parameter which significantly influences the overall hardware resources as well as the delay of the PE is the number of involved CEs. Figure 12 combines two parameters: the number of effective CORDIC stages and the output precision of the entire CORDIC.
Fig. 12

CORDIC stages versus output precision (ML605 hardware emulation)

As mentioned at the beginning, software simulation time exceeds practical limits even for small design spaces. Based on the UEMU framework, the current hardware implementation requires 22 s on average per run for 16k × 4k = 64 MPixel, including all Ethernet transfers and data type conversions between the host-PC running MATLAB and the Xilinx ML605 emulator at all stages (4× ETH WR/RD @ 2.5 s). This results in an average throughput of 163 runs per hour. If only the final output is needed, intermediate data exchange and conversion can be avoided (1× ETH WR/RD @ 2.5 s), which decreases the overall processing time to 7 s. The processing performance thus increases by another factor of 3, i.e. 514 runs per hour. Compared to the aforementioned bit-true software model (128 MPixel/7 min ≈ 0.3 MPixel/s), a speed-up of 30 is achieved. Besides the potential speed-up of an emulator-based approach, the development and verification time for a bit-true software model is no longer needed.

6 Case study: IA

High-throughput wireless communication systems such as LTE, ECMA-368 (WiMedia), and IEEE 802.11ac are built around sophisticated digital signal processing algorithms. Among the research goals for future communication standards are higher spectral efficiency towards the Shannon limit, higher energy efficiency, and increased data rates. Naturally, these benefits come at the price of higher computational complexity. The demand for flexible realtime hardware platforms capable of delivering the required huge number of operations per second within a severely limited power and silicon area budget has led to the development of specialized hardware platforms for software defined radio (SDR) applications.

As a case study using the UEMU framework, the implementation and characterization of IA algorithms is presented in this section. Implementation results for 3-user 2 × 3 MIMO antenna selection IA are presented in Sect. 6.2. In Sect. 6.5, a parameterized hardware complexity estimation of K-user iterative minimum mean square error (MMSE) IA is presented, identifying it as a demanding candidate for an optimized hardware implementation and DSE.

6.1 IA system model

IA is a transmission technique applicable to the K-user interference channel that can be used to increase the sum capacity of a multiuser communication system. At high SNR, the sum capacity scales linearly with the number of users. A prominent manifestation is the combination of IA and multiuser MIMO orthogonal frequency-division multiplexing (OFDM) systems as shown in Fig. 13. Here, the transmitters apply linear precoding matrices \(\boldsymbol{V}\) to their transmitted signals in order to align all interference at the receivers into a lower-dimensional subspace of the available receive space. The receivers extract the interference-free subspace by multiplication with the decoding matrix \(\boldsymbol{U}\). In general, the numbers of transmit antennas Nt and receive antennas Nr each have to be larger than the number of transmitted data streams d for a solution to exist.
Fig. 13

Multi-user 2 × 3 MIMO system with transmitter-side antenna selection

The received signal \(\boldsymbol{y}_i\) at each receiver i is
$$ \boldsymbol{y}_i = \boldsymbol{H}_{ii} \boldsymbol{V}_i \boldsymbol{s}_i + \sum\limits_{j \ne i} \boldsymbol{H}_{ij} \boldsymbol{V}_j \boldsymbol{s}_j + \boldsymbol{n}_i $$
with \(\boldsymbol{s}_j\) being the transmit data at transmitter j and \(\boldsymbol{n}_i\) being the noise picked up by receiver i. The \(\boldsymbol{H}_{ii}\) are the desired channels; all \(\boldsymbol{H}_{ij}\) with \(j \ne i\) convey interference. In the case of flat fading, IA can be applied to discrete-time signals.
For a sufficient number of antennas, perfect IA can be achieved for \(\boldsymbol{V}\) and \(\boldsymbol{U}\) satisfying:
$$ \boldsymbol{U}_i^H \boldsymbol{H}_{ij} \boldsymbol{V}_j = 0 \quad \forall i \neq j $$
$$ rank(\boldsymbol{U}_i^H \boldsymbol{H}_{ii} \boldsymbol{V}_i) \ge d\quad \forall i $$
where d is the number of data streams per user.
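The received-signal equation above translates directly into code. The following NumPy sketch uses hypothetical dimensions (Nr = 3, Nt = 2, d = 1, matching the antenna counts of the case study) and random placeholder channels:

```python
import numpy as np

def received_signal(H, V, s, n, i):
    """Received signal y_i = H_ii V_i s_i + sum_{j != i} H_ij V_j s_j + n_i.

    H[i][j] is the Nr x Nt channel from transmitter j to receiver i,
    V[j] the Nt x d precoder, s[j] the d x 1 data, n[i] the Nr x 1 noise.
    """
    K = len(V)
    y = H[i][i] @ V[i] @ s[i] + n[i]
    for j in range(K):
        if j != i:
            y = y + H[i][j] @ V[j] @ s[j]
    return y

# Hypothetical 3-user setup with Nr = 3, Nt = 2 and d = 1 data stream
rng = np.random.default_rng(1)
K, Nr, Nt, d = 3, 3, 2, 1
H = [[rng.normal(size=(Nr, Nt)) for _ in range(K)] for _ in range(K)]
V = [rng.normal(size=(Nt, d)) for _ in range(K)]
s = [rng.normal(size=(d, 1)) for _ in range(K)]
n = [rng.normal(size=(Nr, 1)) for _ in range(K)]
y0 = received_signal(H, V, s, n, 0)
```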

6.2 3-User zero-forcing IA

Closed-form solutions for the zero-forcing case in Eq. 8 are known for the K = 3 user case as [5]
$$ \boldsymbol{E} = (\boldsymbol{H}_{31})^{-1} \boldsymbol{H}_{32} (\boldsymbol{H}_{12})^{-1} \boldsymbol{H}_{13} (\boldsymbol{H}_{23})^{-1} \boldsymbol{H}_{21} $$
$$ \boldsymbol{V}_1 = \nu (E) $$
$$ \boldsymbol{V}_2 = (\boldsymbol{H}_{32})^{-1} \boldsymbol{H}_{31} \boldsymbol{V}_1 $$
$$ \boldsymbol{V}_3 = (\boldsymbol{H}_{23})^{-1} \boldsymbol{H}_{21} \boldsymbol{V}_1 $$
$$ \boldsymbol{U}_1 = \mathrm{null}\left((\boldsymbol{H}_{13} \boldsymbol{V}_3)^H\right) $$
$$ \boldsymbol{U}_2 = \mathrm{null}\left((\boldsymbol{H}_{21} \boldsymbol{V}_1)^H\right) $$
$$ \boldsymbol{U}_3 = \mathrm{null}\left((\boldsymbol{H}_{31} \boldsymbol{V}_1)^H\right) $$
where ν(E) denotes an arbitrary eigenvector of E and \(\mathrm{null}(\cdot)\) the nullspace. Note that the choice of the eigenvector has an impact on the overall system performance.
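Under the assumptions of square, invertible channel matrices and a single data stream per user, the closed-form solution above can be sketched in NumPy. The nullspace is taken from the SVD, and the eigenvector choice is left arbitrary, as noted above:

```python
import numpy as np

def zero_forcing_ia(H):
    """Closed-form 3-user zero-forcing IA.

    H[i][j] is the square MIMO channel from transmitter j+1 to
    receiver i+1 (0-based indexing of the 1-based equations).
    """
    inv = np.linalg.inv
    # Nullspace basis of a d x N matrix: right-singular vector(s) of the
    # vanishing singular values
    null = lambda A: np.linalg.svd(A)[2].conj().T[:, -1:]
    E = inv(H[2][0]) @ H[2][1] @ inv(H[0][1]) @ H[0][2] @ inv(H[1][2]) @ H[1][0]
    # Pick an arbitrary eigenvector of E (the choice affects performance)
    _, vec = np.linalg.eig(E)
    V1 = vec[:, :1]
    V2 = inv(H[2][1]) @ H[2][0] @ V1
    V3 = inv(H[1][2]) @ H[1][0] @ V1
    U1 = null((H[0][2] @ V3).conj().T)
    U2 = null((H[1][0] @ V1).conj().T)
    U3 = null((H[2][0] @ V1).conj().T)
    return (V1, V2, V3), (U1, U2, U3)

# Random complex 2x2 channels; all cross links should be nulled
rng = np.random.default_rng(3)
H = [[rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
      for _ in range(3)] for _ in range(3)]
V, U = zero_forcing_ia(H)
```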

Our implementation supports antenna selection based IA. Here, the transmitters are equipped with three antennas, of which only two are to be actively used for data transmission. The IA algorithm picks the best antenna combination as explained below. This leads to an increased channel orthogonality for the chosen channels at a reduced number of RF front ends.

The antenna selection can be done at either the transmitter, receiver or both ends. Basically, this leads to a reduced number of required RF front ends and reduced power consumption compared to a system using all antennas. By leaving the worst modes unused, the resulting RF hardware complexity is reduced significantly at a moderate penalty in sum data rate.

The problem of finding the optimum antenna combination \(\hat i\) from a set of n combinations can be formulated as
$$ \hat i = \arg\max\limits_{i=1\dots n} \sum\limits_{k=1}^{K} C(\boldsymbol{V}_{k,i}, \boldsymbol{U}_{k,i}) $$
with \(C({\varvec{V}}, \boldsymbol{U})\) being a metric rating a set of matrices \({\varvec{U}}\) and \({\varvec{V}}\). \({\varvec{V}}_{k,i}\) and \({\varvec{U}}_{k,i}\) are the precoding and decoding matrices of user k for a given antenna combination i. Equation 17 is solved by visiting all n antenna combinations. The precoding and decoding matrices \({\varvec{V}}\) and \({\varvec{U}}\) are computed for every combination of antennas according to Eqs. 10, 11, 12, 13, 14, 15, and 16.
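The exhaustive search over all n combinations can be sketched as follows. The metric C shown here is a hypothetical stand-in (a simple matrix norm), not the rate metric used in the case study:

```python
import numpy as np

def select_antennas(combinations, metric):
    """Exhaustive antenna selection: visit all n combinations and pick
    the one maximizing the summed per-user metric C.

    `combinations[i]` is a list of (V_k, U_k) pairs for all K users of
    antenna combination i.
    """
    scores = [sum(metric(V, U) for V, U in combo) for combo in combinations]
    return int(np.argmax(scores))

# Hypothetical stand-in metric: rate a combination by the Frobenius
# norm of U^H V (the case study uses a rate-based metric instead)
metric = lambda V, U: np.linalg.norm(U.conj().T @ V)
```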
From an implementation point of view, the IA algorithm constitutes the two-step process of
  1. channel rate processing (CRP) and
  2. symbol rate processing (SRP).
Based on a full set of given channel estimations, the CRP computes the best antenna combination and the corresponding precoding- and decoding-matrices \({\varvec{V}}\) and \({\varvec{U}}\) for all devices. On the transmitter side, the SRP distributes each transmit symbol to both antennas by multiplication with \({\varvec{V}}\). The two received antenna symbols are linearly combined into a single symbol by multiplication with \({\varvec{U}}^H\) on the receiver side.

OFDM is chosen as the modulation scheme, resulting in separate \({\varvec{V}}\) and \({\varvec{U}}\) for each subcarrier.

6.3 Closed-form IA computational complexity

The resource requirements of an optimized, efficient integer implementation of the antenna-selection IA algorithm are presented in this section, based on FPGA implementation results. Target systems include SDR platforms, FPGAs, and ASICs.

As the SRP is straightforward, this section focuses on the costs of the 3-user 2 × 2 MIMO processing, consisting of matrix inversions, matrix multiplications, eigenvector computation and normalization; see Eqs. 10, 11, 12, 13, 14, 15, and 16. The metric C is computed for both eigenvectors. All intermediate matrices can be independently scaled by arbitrary scalars without affecting the antenna decision or \({\varvec{V}}\) and \({\varvec{U}}\). Exploiting this makes the cost of all involved 2 × 2 matrix inversions negligible and allows intermediate matrices to be block-normalized by shifting, i.e. extracting a common power of 2 from all matrix elements. This results in reduced integer word lengths and thus reduced hardware costs. Table 1 summarizes the number of required real-valued mathematical base operations for antenna selection and the computation of \({\varvec{V}}\) and \({\varvec{U}}\) per antenna combination and subcarrier, without the final normalization step of \({\varvec{V}}\) and \({\varvec{U}}\). Complex multiplications are composed of three real multiplications, three additions and two subtractions; INVSQRT denotes the reciprocal square root [21].
Table 1

Operation counts for the computation of C per antenna combination i and subcarrier

Rows: matrix mult.; metric score
To keep the total transmit power constant, the chosen antenna combination’s precoding matrices \({\varvec{V}}\) need to be normalized, resulting in three ADD, eight MUL and one INVSQRT additional operations #OPN per transmitter and subcarrier. The above analysis implies that in general, the implementation cost is dominated by the multiplications in terms of silicon area and power consumption. As an example, a typical 128 subcarrier OFDM system with two antennas chosen out of three antennas per transmitter and 1 ms IA processing latency requires ~861 million MUL/s for realtime operation.

For the K = 3 users case with Nt,a = 2 active transmit antennas used out of Nt,p physical antennas per transmitter and Nr,a = 2 receive antennas used out of Nr,p antennas per receiver, the number of combinations to be inspected per subcarrier is
$$ n = \left( \binom{N_{t,p}}{N_{t,a}} \cdot \binom{N_{r,p}}{N_{r,a}} \right)^{K} = 27 $$
For realtime operation, the maximum allowable latency is defined to be T0. Assigning relative operation costs αi to each operation type OPi, the total computational cost for M subcarriers becomes
$$ C = \frac{M}{T_0} \cdot \left( n \cdot \sum\limits_{i \in \mathrm{OP}}{\alpha_i \cdot \#\mathrm{OPC}_{\rm i}} + K \cdot \sum\limits_{i \in \mathrm{OP}}{\alpha_i \cdot \#\mathrm{OPN}_{\rm i}} \right) $$
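Eqs. 18 and 19 translate directly into code. The operation counts and relative costs used below are placeholders, since the concrete values are given in Tables 1 and 2:

```python
from math import comb

def num_combinations(K, Nt_p, Nt_a, Nr_p, Nr_a):
    """Antenna combinations to inspect per subcarrier (Eq. 18)."""
    return (comb(Nt_p, Nt_a) * comb(Nr_p, Nr_a)) ** K

def total_cost(M, T0, n, K, alpha, ops_c, ops_n):
    """Total computational cost per second (Eq. 19).

    `alpha`, `ops_c` and `ops_n` map an operation type to its relative
    cost, its per-combination count and its per-user normalization count.
    """
    per_subcarrier = (n * sum(alpha[i] * ops_c[i] for i in alpha)
                      + K * sum(alpha[i] * ops_n[i] for i in alpha))
    return M / T0 * per_subcarrier

# 3 users, transmitter-side selection of 2 out of 3 antennas,
# both receive antennas always active:
assert num_combinations(K=3, Nt_p=3, Nt_a=2, Nr_p=2, Nr_a=2) == 27
```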

6.4 Hardware cost estimation

Using α as relative silicon area costs, the total silicon area implementation cost of an architecture without resource sharing can be estimated from Eqs. 18 and 19. The relative areas α of 16-bit arithmetic operations for an ASIC implementation based on [19] are given in Table 2. The relative cost \(\alpha_\mathrm{MUL}\) of a multiplier is defined to be 1. For a system using antenna selection at the transmitter only, with \(N_{t,p} = 3\) antennas, M = 128 subcarriers and \(T_0 = 1\,\mathrm{ms}\), the total IA cost is estimated to be C = 1.875 GOPS.
Table 2

Relative silicon area costs of 16-bit arithmetic operations

For the configuration above, the original MATLAB algorithm takes 3.63 s on an Intel Xeon 2.4 GHz CPU running MATLAB R2012a for the computation of the optimal antenna combination \(\hat i\) and its corresponding precoding and decoding matrices \({\varvec{V}}_k\) and \({\varvec{U}}_k\) from a set of channel information H. The FPGA implementation created in this case study achieves realtime operation, requiring 380 µs at 100 MHz clock frequency on a Xilinx Virtex-6 LX550T FPGA in a BEE4 emulation system. Thus, the achieved speedup is 9,553.

By using the UEMU approach, the design effort of building a dedicated bit-true software simulation model has been saved. The required numerical precision could be determined by instrumentation of the highly optimized hardware modules. The resulting DSE hardware acceleration allows the coverage of a much larger design parameter space compared to software simulation. Furthermore, the significant speedup allows the repeated execution of a DSE over a large parameter space, which accelerates the cycle times of the iterative hardware design process.

6.5 Cost estimation for iterative K-user IA

For the general K > 3 user case, no closed-form solutions are known, but iterative algorithms exist. In this section, we present implementation complexity estimates for the MMSE IA algorithm of [22].

The MMSE-IA algorithm starts with arbitrary precoding matrices \({\varvec{V}}_k\), then iteratively updates the decoding and precoding matrices \({\varvec{U}}_k\) and \({\varvec{V}}_k\) according to Eqs. 20 and 21 until convergence. The Lagrange multiplier λk ≥ 0 is computed to satisfy \(\| {\varvec{V}}_k \|_2^2 \le 1\) by Newton iteration.
$$ \boldsymbol{U}_k = \left(\sum\limits_{j=1}^{K}{\boldsymbol{H}_{kj} \boldsymbol{V}_j \boldsymbol{V}_j^H \boldsymbol{H}_{kj}^H + \sigma^2 \boldsymbol{I}} \right)^{-1} \boldsymbol{H}_{kk} \boldsymbol{V}_k $$
$$ \boldsymbol{V}_k = \left(\sum\limits_{j=1}^{K}{\boldsymbol{H}_{jk}^H \boldsymbol{U}_j \boldsymbol{U}_j^H \boldsymbol{H}_{jk} + \lambda_k \boldsymbol{I}} \right)^{-1} \boldsymbol{H}_{kk}^H \boldsymbol{U}_k $$
The number of required iterations is data-dependent. Each iteration step requires matrix multiplications, a pseudo-inverse, and Newton iterations. Figure 14 summarizes the estimated number of operations for the computation of a set of \({\varvec{V}}\) and \({\varvec{U}}\) matrices, based on well-known optimized hardware implementations. Compared to the closed-form 2 × 2 IA implementation presented in Sect. 6.2, the number of operations of the iterative approach is increased by a factor of ~60.8.
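A minimal NumPy sketch of one update sweep according to Eqs. 20 and 21 is given below. For illustration, the Lagrange multipliers λ_k are taken as fixed inputs instead of being found by the Newton iteration described above:

```python
import numpy as np

def mmse_ia_iteration(H, V, sigma2, lam):
    """One MMSE-IA update sweep (Eqs. 20 and 21) for K users.

    H[k][j] is the channel from transmitter j to receiver k; lam[k] is
    the Lagrange multiplier, here supplied as a fixed value rather than
    solved for by Newton iteration.
    """
    K = len(V)
    Nr, Nt = H[0][0].shape
    U = []
    for k in range(K):
        A = sigma2 * np.eye(Nr, dtype=complex)
        for j in range(K):
            A += H[k][j] @ V[j] @ V[j].conj().T @ H[k][j].conj().T
        U.append(np.linalg.solve(A, H[k][k] @ V[k]))
    Vn = []
    for k in range(K):
        B = lam[k] * np.eye(Nt, dtype=complex)
        for j in range(K):
            B += H[j][k].conj().T @ U[j] @ U[j].conj().T @ H[j][k]
        Vn.append(np.linalg.solve(B, H[k][k].conj().T @ U[k]))
    return U, Vn

# K = 4 users, 2x2 channels, unit-norm starting precoders
rng = np.random.default_rng(7)
K = 4
H = [[rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
      for _ in range(K)] for _ in range(K)]
V = [v / np.linalg.norm(v)
     for v in (rng.normal(size=(2, 1)) + 1j * rng.normal(size=(2, 1))
               for _ in range(K))]
U, Vn = mmse_ia_iteration(H, V, sigma2=0.1, lam=[0.1] * K)
```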
Fig. 14

Number of operations for iterative MMSE IA (four iterations)

The high computational complexity of iterative IA algorithms makes their characterization by software simulation infeasible. For bit-true models, using hardware acceleration is mandatory to cover a sufficiently large design space.

7 Summary and conclusion

We have presented an FPGA-based framework for the design of highly optimized hardware modules. Leveraging two case studies from different fields of signal processing, the framework has proved suitable for the DSE of computationally demanding algorithms and architectures. By building on infrastructure synthesizable for ASICs and FPGAs, the same highly optimized hardware modules used for ASIC synthesis are instrumented and used for the emulation-based DSE on FPGAs. Therefore, building a bit-true software simulation model for the hardware characterization is not required. The infrastructure provided by the framework is optimized as an industry-grade SoC template for ASIC synthesis. Strict interface definitions within the framework are the key to portability and to the support of a range of emulation platforms of varying complexity. Combined with the simplified design partitioning approach, parallel development and verification of distinct system modules can be carried out on a number of different emulation platforms, resulting in accelerated system integration. Seamless multi-target support for mid- to high-end emulation systems and a flexible simulation/emulation partitioning approach have proven valuable for concurrent development and reduce the need for expensive hardware emulation systems.

The SAR implementation case study has revealed the precision requirements for SAR processing and the influence of the numerical precision in distinct processing steps on the picture quality. As the presented framework provides a powerful communication infrastructure, the design effort could be focused on the core signal processing algorithms. By simple instrumentation, the highly optimized hardware modules have been used for a DSE, saving the design effort of building a dedicated bit-true software simulation model. Furthermore, the higher emulation speed compared to pure software simulation significantly extended the coverable design space.

In the second case study, hardware costs for dedicated IA implementations have been studied. IA is a potential candidate for increasing the throughput and spectral efficiency of wireless communication systems. The high computational effort of IA algorithms, combined with the energy constraints of mobile applications, makes the use of dedicated hardware accelerators and their rigorous optimization mandatory. Again, the DSE runtime would be prohibitive for pure software simulation. The design effort required for instrumenting the hardware modules for a DSE was significantly lower than that of setting up a bit-true software simulation model.

The case studies presented in this paper have shown that for complex signal processing tasks, a comprehensive DSE within reasonable time is possible by using such instrumented hardware modules.

Copyright information

© Springer Science+Business Media New York 2013