# Local Interpolation-based Polar Format SAR: Algorithm, Hardware Implementation and Design Automation


DOI: 10.1007/s11265-012-0720-4

Cite this article as: Zhu, Q., Berger, C.R., Turner, E.L. et al. J Sign Process Syst (2013) 71: 297. doi:10.1007/s11265-012-0720-4


## Abstract

In this paper we present a local interpolation-based variant of the well-known polar format algorithm (PFA) used for synthetic aperture radar (SAR) image formation. We develop the algorithm to match the capabilities of the application-specific logic-in-memory processing paradigm, which off-loads lightweight computation directly into the SRAM and DRAM. Our proposed algorithm performs filtering, an image perspective transformation, and a local 2D interpolation, and supports partial and low-resolution reconstruction. We implement our customized SAR grid interpolation logic-in-memory hardware in advanced 14 nm silicon technology. Our high-level design tools allow application designers to instantiate various optimized design choices to fit their image processing and hardware needs. Our simulation results show that the logic-in-memory approach has the potential to enable substantial improvements in energy efficiency without sacrificing image quality.

### Keywords

Synthetic aperture radar · Logic in memory · Chip generator

## 1 Introduction

This idea stems from recent studies of sub-20 nm CMOS design, which indicate that memory and logic circuits can be implemented together using a small set of well-characterized pattern constructs [5, 6]. Our early silicon experiments in a commercial 14 nm SOI CMOS process demonstrate that this construct-based design enables logic and memory bitcells to be placed in much closer proximity to each other without yield or hotspot pattern concerns. While such patterning appears more restrictive in order to accommodate the physical realities of 14 nm CMOS, the ability to make the patterns the only required hard IP allows us to efficiently and affordably customize the SRAM blocks. More importantly, it enables the synthesis (not just compilation) of customized memory blocks with user control of flexible SRAM architectures and therefore facilitates *smart memory compilation*.

Advances in this chip design methodology give rise to the application-specific logic-in-memory (LiM) computational paradigm, which moves part of a program’s computation directly into the memory but keeps the usual memory interface. It is easy to program, as all computational operations are hidden behind the memory abstraction. LiM builds on the idea of earlier processing-in-memory [7]; however, it puts only simple logic, instead of actual processing cores, right into the memory structures. Moreover, it requires application-specific logic to reach the desired energy savings. Thus, it is more specialized than the processor-in-memory idea [7, 8]. On the architectural level, the logic-enhanced memories look like normal memories to the CPU, but perform extra (and cheap) operations on the stored data before returning the requested data item to the CPU.

Design automation is required for handling the increased complexity of memory-logic-mixing hardware accelerators and the intricacies of cutting edge and next-generation silicon technology. Physical implementation of our logic and memory-mixing hardware is enabled by the *smart memory compiler* [5, 6]. Further, we build application-specific high-level design tools using the Genesis2 design tool [9, 10]. The combination of these tools enables designers to perform design space exploration at reasonable effort to optimize their designs for energy budgets, image reconstruction quality, and performance.

The major restriction of logic-in-memory is that only localized neighborhood data access can be implemented efficiently; algorithms requiring stride-like data access patterns (e.g., the fast Fourier transform, FFT) are prohibitively expensive to implement. Therefore, algorithms need to be adapted to match the constraints of the logic-in-memory paradigm.

*Related Work*

Synthetic aperture radar is essentially “taking a photo with radar”, where a plane’s flight path synthesizes a large antenna. A radar mounted on a plane repeatedly sends pulses to the scene patch and records the reflections, rotating the antenna to aim at the same scene center for all pulses. The image is formed by computing the inverse 2D FFT of the recorded data. However, the data is sampled on a polar grid, and the polar format algorithm (PFA) needs to first convert these polar samples into rectangular samples (i.e., polar-to-rectangular re-gridding), so that a standard FFT can be applied for image formation. Without this conversion, a computationally infeasible non-uniform Fourier transform would have to be applied [1]. The polar-to-rectangular conversion is often done separably (first processing all rows and then all columns of the data), for example using FFT-based upsampling followed by picking the nearest neighbor to the actual grid points of interest [2, 11]. The reliance on FFTs makes this approach computationally intensive; moreover, it requires non-local computation due to the well-known FFT data access pattern. An algorithm for logic-in-memory cannot rely on FFTs but requires local computation; thus we need to develop a localized variant of polar-to-rectangular re-gridding.

There are other relevant hardware accelerators for gridding algorithms. For example, [12, 13] present an FPGA accelerator for gridding in Non-uniform FFTs. Their work targets a broader set of applications, regardless of the data acquisition method, i.e., the sampling of source points can be completely arbitrary. In contrast, we focus on the image re-gridding from polar format to rectangular format; specifically with large radian spatial frequency and small coherent integration angular intervals. The prior knowledge from the application allows us to build a dedicated hardware that is particularly optimized to our specific needs. On the other hand, their work demonstrated a complete system solution on an FPGA platform. The purpose of this paper is not to deliver a complete system solution but to implement the kernel part of the re-gridding algorithm to demonstrate the potential of LiM design methodology. Therefore, we narrow our scope to the on-chip data processing and storage.

While FPGAs and GPUs are also good alternatives as hardware accelerators to speed up compute-intensive sections of applications [14, 15], an ASIC is still 10 to 100 times more power efficient than FPGA and GPU alternatives [16]. In addition, modern FPGAs contain “hard” blocks such as block memories whose functionality and sizes are fixed. They are hard to customize at fine granularity, which is an essential part of our approach. For example, [12] proposed a multi-port local memory (MPLM) to solve the limited memory bandwidth/port problem for parallel pixel access. Our rectangular-access smart memory architecture has similar functionality to the MPLM; however, we move one step further and realize parallel data access by embedding “intelligent” functionality into the traditional interleaved multi-bank memory organization, allowing multiple memory subbanks to share one common memory periphery. In other words, we customize the traditional memory architecture in an unusual way to reduce the overhead that exists in multi-banking memory systems. Our LiM approach provides a novel, regular pattern-construct-based ASIC solution targeting sub-22 nm technology nodes, demonstrating the possibility of re-designing algorithms and re-architecting the hardware to match advanced technology capabilities and achieve dramatic performance improvements that were not possible with general-purpose computing or configurable hardware computing.

*Contribution*

The main contribution of this paper is the derivation of an algorithm for performing SAR polar format re-gridding interpolation in the LiM paradigm, together with the necessary design automation tool chain to implement our proposed algorithm in advanced silicon technology. We combine filtering, geometric transformations, and localized 2D interpolation to provide a virtual rectangular 2D memory address space that is overlaid on the polar grid and performs the necessary interpolation on demand. Enabled by this on-demand interpolation, our system further provides partial image reconstruction, allowing for reconstructing both low-resolution thumbnails and high-resolution patches at considerably reduced energy cost.

This paper is an extended version of our previous papers that appeared in the proceedings of HPEC [3] and ICASSP [4]. While these previous papers mostly focus on the algorithmic side of our approach, this paper also presents the details of the hardware implementation and the design automation framework. More importantly, we show how to leverage the proposed design framework to co-optimize the algorithm, architecture and circuit design to achieve the maximum performance and energy efficiency.

## 2 Localized SAR PFA Algorithm

In this section we discuss our localized interpolation-based re-gridding algorithm that underlies our approach.

### 2.1 Local Interpolation Based Polar Reformatting

The measurements of the radar reflectivity function that are performed by the radar sensor during the plane’s flight are taken on partial polar annuli, which need to be converted to outputs on a Cartesian grid before FFT-based image formation. Assuming a signal of the necessary smoothness, the points in the rectangular grid are similar to their neighboring elements of the polar annulus in both the range and cross-range dimensions. Given the high noise in radar data, we use simple local interpolations (e.g., nearest neighbor, bilinear or bicubic) to perform re-gridding, as opposed to the usual FFT-based upsampling. In Section 5 we show that this can indeed be done without significant loss of end-to-end accuracy.
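As a concrete illustration, the first-order (bilinear) case reduces to a handful of multiply-adds per output point. The following is a minimal Python sketch; the function name and signature are ours for illustration, not the paper's implementation:

```python
def bilinear(f00, f10, f01, f11, dx, dy):
    """Bilinear interpolation of one output point from its four
    neighboring measurements in the transformed coordinate system.

    f00..f11: samples at the cell corners; dx, dy in [0, 1) are the
    fractional distances of the output point from the (0, 0) corner.
    """
    top = f00 * (1 - dx) + f10 * dx       # interpolate along x, top edge
    bottom = f01 * (1 - dx) + f11 * dx    # interpolate along x, bottom edge
    return top * (1 - dy) + bottom * dy   # interpolate along y
```

Nearest-neighbor and bicubic variants differ only in how many neighbors are weighted, which is why the interpolation order is a natural hardware parameter.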

*Coordinate Transformation*

The main idea underlying our approach is to perform a coordinate transformation that converts the polar grid into a rectangular grid while the original rectangular grid is warped, and then perform interpolation in the transformed space. This allows us to apply standard 2D surface interpolation for polar data to rectangular data reformatting, which has the potential of being efficient in logic-in-memory as no transcendental function needs to be evaluated, neither for the coordinate transformation nor for the interpolation in the transformed space.

*Geometric Approximations*

Our localized grid interpolation is based on several geometric approximations. Firstly, as mentioned, we approximate the polar annulus by quadrilateral tiles (Fig. 3) so that a simple quadrilateral-to-quadrilateral four-corner perspective transformation can be used. Secondly, we assume that the measurement points are evenly distributed on a rectangular grid after the transformation. These approximations could result in distortions in the reconstructed image. As shown in Fig. 3, an accurate approximation is achieved if the radian spatial frequency lower bound \((R_L)\) is large enough and the coherent integration angular interval \((\Theta )\) is small enough, which is true for most SAR applications. Therefore, an effective solution is to tile the image into small enough parts and perform the geometric approximation on each tile. We tile the output image in the Cartesian grid and find the minimum subset of the polar annulus that contains the corresponding rectangular tile. The resulting distortion is smaller than the intrinsic distortion of perfect SAR image reconstruction.
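The four-corner mapping can be illustrated with the standard projective transform from the unit square to an arbitrary quadrilateral. The sketch below uses hypothetical helper names and floating point; the paper's hardware realizes the equivalent computation in fixed-point logic:

```python
def unit_square_to_quad(corners):
    """Coefficients of the projective map taking the unit square
    (u, v) in [0,1]^2 to an arbitrary quadrilateral.

    corners: [(x0,y0), (x1,y1), (x2,y2), (x3,y3)] for
    (u,v) = (0,0), (1,0), (1,1), (0,1) respectively.
    Returns (a, b, c, d, e, f, g, h) with
      x = (a*u + b*v + c) / (g*u + h*v + 1)
      y = (d*u + e*v + f) / (g*u + h*v + 1)
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    dx1, dx2, dx3 = x1 - x2, x3 - x2, x0 - x1 + x2 - x3
    dy1, dy2, dy3 = y1 - y2, y3 - y2, y0 - y1 + y2 - y3
    if dx3 == 0 and dy3 == 0:          # affine special case
        g = h = 0.0
    else:
        den = dx1 * dy2 - dx2 * dy1
        g = (dx3 * dy2 - dx2 * dy3) / den
        h = (dx1 * dy3 - dx3 * dy1) / den
    a, b, c = x1 - x0 + g * x1, x3 - x0 + h * x3, x0
    d, e, f = y1 - y0 + g * y1, y3 - y0 + h * y3, y0
    return a, b, c, d, e, f, g, h

def apply_map(coef, u, v):
    """Evaluate the projective map at one (u, v) point."""
    a, b, c, d, e, f, g, h = coef
    w = g * u + h * v + 1.0
    return (a * u + b * v + c) / w, (d * u + e * f and f or f + e * v) / w
```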

### 2.2 SAR Image Partial Reconstruction

Our system supports two partial reconstruction scenarios: (1) *low-resolution full-size image display*, and (2) *high-resolution partial-size image display*.

*Thumbnail Reconstruction*

In the first scenario, we get a quick overall view of the whole image without the fine-scale details (a thumbnail). This coarse reconstruction corresponds to multiplying the data in Fourier space (the original data) with a mask that attenuates the high-frequency components. Only data elements that correspond to the low-frequency components are interpolated, and computations for high-frequency components are omitted. A much smaller 2D inverse FFT can be used afterwards, saving a substantial number of operations.
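A minimal Python/NumPy sketch of the idea, assuming the frequency data is already centered with `fftshift`; the function name and normalization are our own choices:

```python
import numpy as np

def thumbnail(freq, factor):
    """Low-resolution full-scene reconstruction: keep only the central
    low-frequency block of the centered 2D spectrum (an implicit
    low-pass mask) and run a smaller inverse 2D FFT.

    freq: square 2D spectrum with DC in the middle (fftshift-ed).
    factor: linear downscaling factor of the thumbnail.
    """
    n = freq.shape[0]
    m = n // factor
    lo = n // 2 - m // 2
    block = freq[lo:lo + m, lo:lo + m]           # crop = mask + shrink
    img = np.fft.ifft2(np.fft.ifftshift(block))  # much smaller IFFT
    return img / factor**2                       # keep amplitude comparable
```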

*Zoom-in Reconstruction*

In the second scenario we reconstruct only a small portion of the image (however, at full resolution). This can be seen as multiplication by a mask in the spatial domain zeroing everything but the region of interest, or equivalently, as decimation filtering in the frequency space [18]. Filtering is necessary for image anti-aliasing and the filter decimation factor corresponds to the proportion of the image area to be reconstructed in space. Using Fourier identities we can reconstruct sub-patches of an image at arbitrary position with arbitrary size. In the implementation we rely on the combination of a CIC (cascaded integrator-comb) and short FIR (finite impulse response) filter for decimation. The CIC filter requires no multiplications and its simple hardware implementation can be easily integrated with the logic-in-memory interpolation, however, accuracy requires us to use some FIR filtering.
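The positioning step can be illustrated with the Fourier shift identity. A sketch under our own naming conventions; the subsequent CIC/FIR decimation is omitted here:

```python
import numpy as np

def recenter(freq, dx, dy):
    """Fourier shift identity: multiplying the 2D spectrum by a linear
    phase ramp translates the image by (dx, dy) samples, so any region
    of interest can be moved to the scene center before decimation.

    freq: square 2D spectrum with DC at index (0, 0) (no fftshift).
    """
    n = freq.shape[0]
    k = np.fft.fftfreq(n)   # frequencies in cycles per sample
    ramp = np.exp(-2j * np.pi * (k[:, None] * dy + k[None, :] * dx))
    return freq * ramp
```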

*Computational Cost Savings*

In standard polar format algorithms using FFT-based upsampling for re-gridding, grid interpolation is the most computationally intensive portion, as it involves two FFTs per segment/secant for each range/crossrange line [2, 11]. In our local interpolation approach, all interpolation-related FFT/IFFT operations are avoided. The proposed grid interpolation admits an economical hardware implementation. Moreover, these operations are computed locally in the memory and therefore consume much less energy compared with in-CPU computing. For partial reconstruction, additional in-memory computation for decimation filters is required. However, the chosen CIC filter involves only eight adders and eight storage registers for any decimation factor. Under partial reconstruction, the inverse 2D-FFT size is reduced, which saves unnecessary operations and thus energy. Thus, our approach has a huge potential for operation and energy savings. We will evaluate practically achievable savings in Section 5.

## 3 Hardware Implementation

In this section we will describe the hardware implementation details of our proposed LiM-based SAR polar reformatting and partial reconstruction algorithm.

### 3.1 Interpolation Memory Implementation

The core operation in our approach is 2D interpolation (bilinear, biquadratic, bicubic), which is used after the perspective transformation to calculate the values of the tentative outputs from the neighboring measurements and the interpolation distances in the transformed coordinate system. To implement the interpolation operations efficiently, we design a LiM block called *interpolation memory*. Interpolation memory holds function values at evenly spaced, non-contiguous memory addresses, and the integrated logic performs polynomial interpolation operations on each read reference for locations that do not hold data. Thus, these interpolation memory blocks contain a seed table that stores the known function values, and compute “in-between” values on the fly. The block has a larger memory read address space than write address space. Interpolation memory is a very general LiM building block that can benefit many signal and image processing algorithms [17, 19–21].

*Memory Access Logic*

Given an \(n\)-bit read address, the *2D interpolation memory* returns the corresponding pixel value at that location, which is actually interpolated from its neighboring measurements in the original polar grid. Internally, the input address is split into two parts: the higher \(k\) bits are used to address the measurement points in the original polar grid, and the lower \(r=n-k\) bits specify the distances between the evaluated output point and its nearest neighboring measurements. The output pixel values are weighted approximations of the neighboring measurements, with the weights set by the interpolation distances. The number of nearest-neighbor memory references to be considered is determined by the interpolation order. This power-of-2 indexing mechanism is applicable to most problems of interest, and it largely simplifies the hardware implementation.
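The address split can be modeled in a few lines. Below is a first-order (linear) 1D sketch with hypothetical names; the hardware generalizes this to 2D and to higher interpolation orders:

```python
def interp_read(seed, addr, r):
    """Model of one read from a 1D interpolation memory.

    seed: the seed table of stored function values.
    addr: the read address; its upper bits index the seed table,
          its lower r bits are the fixed-point interpolation distance.
    """
    index = addr >> r                          # upper k bits: table entry
    dist = (addr & ((1 << r) - 1)) / (1 << r)  # lower r bits: fraction in [0, 1)
    # linear case: weight the two nearest stored samples
    return seed[index] * (1 - dist) + seed[index + 1] * dist
```

Because the fractional part is a power-of-2 fixed-point value, the weight computation reduces to shifts and adds in hardware.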

*Interpolation Logic*

### 3.2 Rectangular-Access Smart Memory

*Decoder Sharing for Multiple Memory Banks*

Traditionally, this parallel memory access is accomplished by distributing data across multiple memory banks so that, for any consecutive access, all data elements are retrieved from different banks without conflict. Using multiple SRAM banks incurs high overhead, since every memory bank requires its own decoder logic. With logic-in-memory it is possible to build multi-bank memories that share parts of the decoder logic by exploiting the known access pattern.

We exploit the fact that we always read a constant number of consecutive elements per cycle for each interpolation. The core observation is that, after address decoding, the activated wordlines of all memory banks are always adjacent to each other. Based on that, it is possible to optimize the multi-banking memory system to save periphery overhead. We employ a customized multi-banking SRAM design topology [23], which provides around 50 % area and power savings compared with a traditional multi-banking memory design. However, such a customized memory requires careful circuit design, sizing and layout, which is a significant design cost if it cannot be automated.

*Single Cycle Rectangular Block Access*

We define the functionality of the memory to support one-clock-cycle rectangular access of \(2^a\times 2^b\) data points from a \(2^m\times 2^n\) 2D data array. The input of the memory system is the top-left coordinate of the accessed rectangular block \((x_{[m-1:0]}, y_{[n-1:0]})\) and the outputs are all the data points inside the rectangular block. For bicubic interpolation, we have \(a=b=2\).

*Implementation*

The main idea is to let \(2^b\) memory banks in each memory block share a modified \(X\)-decoder by using the same method described in [23]. The \(X\)-decoder is specifically designed to activate two adjacent wordlines simultaneously. That is, when one block wordline is asserted, the next block wordline is also asserted by the OR gate operation of every two adjacent wordline signals. Another \(Y\)-decoder is used to select one of the two activated wordlines for each memory bank with the AND operations. Each memory bank word holds \(2^c\) data points but each time only one data point of them is required. A column MUX is designed to select one data element for each memory bank and the column MUX is controlled by the lower \(b+c\) bits of address \(y\)\((y_{[b+c-1:0]})\).

As shown in Fig. 7, both the first wordline (\(WL[0]\)) and the second wordline (\(WL[1]\)) are initially activated by the \(X\)-decoder, but the \(Y\)-decoder further selects \(WL[1]\) for bank \(0\) and \(WL[0]\) for the other three banks. After the column MUX, block \(0\) outputs the data series ‘\(8-5-6-7\)’, which is then reordered to ‘\(5-6-7-8\)’. With some simple logic for data reordering, the smart memory outputs the required \(2^a\times 2^b\) data points in order simultaneously. As shown in Fig. 7, the distribution of address bits to each memory component is parameterized. By specifying these parameters, the resulting memory architecture can be precisely determined.

Compared with the conventional multi-banking memory design, the amount of memory bank periphery circuits is reduced from \(2^{a+b}\) to \(2^a\). As is observed in Fig. 7, the resulting memory architecture has the embedded logic gates (e.g. the AND gates) tightly integrated with the memory cells, and each logic gate communicates with its local memory cells. The hardware synthesis of these novel smart memories will be presented in Section 4.
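The bank-mapping and reorder logic can be modeled in software. The sketch below is our own simplified 1D model of one row access with interleaved banks; it reproduces the ‘\(8-5-6-7\)’ to ‘\(5-6-7-8\)’ reordering example of Fig. 7:

```python
def rect_row_access(mem, y, nbanks=4):
    """Model of one row access in the rectangular-access memory.

    Element at address a lives in bank a % nbanks, so the nbanks
    consecutive elements starting at any offset y come from distinct
    banks in one cycle; a reorder stage restores sequential order.
    Returns (bank-order outputs, reordered outputs).
    """
    bank_out = []
    for b in range(nbanks):
        # the single address in [y, y + nbanks) that bank b holds
        addr = y + ((b - y) % nbanks)
        bank_out.append((addr, mem[addr]))
    out = [None] * nbanks
    for addr, val in bank_out:   # reorder network
        out[addr - y] = val
    return [v for _, v in bank_out], out
```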

### 3.3 Image Perspective Transformation

*Perspective Transformation*

*Division*

As we can see, the perspective transformation mostly involves simple arithmetic logic such as additions and multiplications. Although a division operation is also required, we observe that the denominator is a linear function of the \(u\) and \(v\) coordinates. Therefore, for the term \(1/({{a_{13}}u+{a_{23}}v+{a_{33}}})\), we can first evaluate its value at the four corners, that is, \((u=0, v=0)\), \((u=0, v=1)\), \((u=1, v=0)\), \((u=1, v=1)\), and then compute the values at other locations by bilinear interpolation from the four corners. This way we convert the division to a bilinear interpolation and a multiplication, leading to negligible accuracy loss.
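A sketch of this division-free evaluation follows; the helper name is hypothetical, and in the actual design the four corner reciprocals would be computed once per tile rather than per call:

```python
def recip_bilinear(a13, a23, a33, u, v):
    """Approximate 1/(a13*u + a23*v + a33) on the unit tile by
    bilinearly interpolating exact reciprocals evaluated at the four
    corners. Accurate when the denominator varies slowly over the
    tile, which holds for small tiles."""
    r00 = 1.0 / a33                    # (u, v) = (0, 0)
    r10 = 1.0 / (a13 + a33)            # (u, v) = (1, 0)
    r01 = 1.0 / (a23 + a33)            # (u, v) = (0, 1)
    r11 = 1.0 / (a13 + a23 + a33)      # (u, v) = (1, 1)
    return ((1 - u) * (1 - v) * r00 + u * (1 - v) * r10
            + (1 - u) * v * r01 + u * v * r11)
```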

*Implementation in LiM*

The whole geometric transformation logic is embedded into the memory boundary together with the 2D interpolation logic. From the user’s point of view, the resulting LiM block is a normal memory that stores the pixel values at rectangular grid points and returns the requested pixel value on command. However, internally it actually stores the polar grid measurements in the physical memory and has the application-specific logic computation embedded in the memory boundary. Therefore, the LiM block provides a virtual rectangular 2D memory address space that is overlaid on the polar grid and performs the necessary logic operations inside the memory abstraction.

### 3.4 Frequency Filter

The most important arithmetic operation for SAR image partial reconstruction is filtering, which enables us to implement partial image reconstruction for both low-resolution thumbnails and high-resolution scene patches in logic-in-memory. We rely on simple Fourier transform identities to translate phase shifts in frequency space into time-domain displacements [18]. Using these identities, we can zoom in on any region of interest.

*CIC and FIR Filters*

A wide range of decimation factors is required for different problem sizes with different display resolutions and zoom factors. A straightforward implementation of finite impulse response (FIR) filters becomes too expensive for the long tap lengths required to maintain accuracy. To include the filters in the logic-in-memory device, the hardware implementation must be as simple as possible. A finely-tuned combination of FIR and cascaded integrator-comb (CIC) filters can be implemented very efficiently in logic-in-memory. After evaluating the accuracy-cost decimation filter design space, we use an FIR polyphase filter for low decimation factors (for instance, 2 or 4) and CIC filters for high decimation factors (for instance, 8, 16, 32, 64, or 128). CIC filters are chosen because they require no multipliers and no intermediate storage, and the same filter design can easily be used for a wide range of decimation factors by adding a scaling circuit and minimally changing the filter timing. However, a CIC compensation filter, usually implemented as an FIR inverse-sinc filter, is required to compensate for the non-flat passband and wide transition region of high-decimation-factor CIC filters. It is applied after the decimation, so it adds little cost.
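The eight-adder/eight-register figure quoted earlier is consistent with a four-stage CIC (four integrators plus four combs). Below is a behavioral Python sketch of such a decimator, our own model rather than the hardware implementation, with the CIC DC gain of \(R^N\) removed by a final scaling:

```python
def cic_decimate(x, R, N=4):
    """Behavioral N-stage CIC decimator with differential delay 1:
    N cascaded integrators at the input rate, decimation by R, then
    N cascaded combs at the output rate. Multiplier-free in hardware;
    here the DC gain R**N is normalized out at the end."""
    for _ in range(N):                  # integrator cascade (input rate)
        acc, y = 0, []
        for s in x:
            acc += s
            y.append(acc)
        x = y
    x = x[::R]                          # rate change: keep every R-th sample
    for _ in range(N):                  # comb cascade (output rate)
        prev, y = 0, []
        for s in x:
            y.append(s - prev)
            prev = s
        x = y
    return [s / R**N for s in x]
```

For a constant input the output settles to the input value once the filter's impulse response (length \(N(R-1)+1\) input samples) is fully inside the data.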

## 4 Design Automation Framework

We now discuss the design trade-off space and our design automation tools.

### 4.1 Design Trade-off Analysis

The SAR image formation process requires the choice of a series of problem parameters, and each parameter setting leads to a different hardware implementation. In addition, as the major components of the system, both interpolation and filtering are trade-off problems in terms of performance/accuracy/cost.

*Accuracy vs. Interpolation Order*

*Filter Design*

The parametrization of the filter specifications also gives rise to a large design space. For example, the transition region of a non-ideal filter will result in added distortion at the image edge. Therefore, the narrower the transition region, the better the edge quality, but the higher the filter degree and thus the hardware cost. In essence, we can reconstruct a slightly larger image and disregard the boundary region to enable the use of a lower-quality (and computationally cheaper) filter. In our implementation, we use a region-of-interest (ROI) parameter to specify the ratio of the used image center area to the overall image area, which determines the transition region (rolloff factor) of the filter.

This shows that different design decisions will result in different tradeoffs. The combination of these design choices constitutes a huge design space. Further, exploring the design tradeoff space requires customized memory designs, which are traditionally prohibitively expensive. Thus, a strong design automation tool is required to make the hardware synthesis feasible.

### 4.2 LiM Design Framework

Our design framework provides designers with a graphical user interface to select application functionality and parameters and then generates synthesizable RTL designs for a specified functionality. Free or un-specified parameters can be optimized by the system. A designer then evaluates the obtained designs and can explore the design space and optimize the design for the application by varying the parameters. The design framework consists of the tool frontend which is built from the architectural chip generation infrastructure Genesis [10, 24] and the tool backend that is built from the pattern-construct based smart memory compiler [5, 6].

*Genesis Chip Generator*

The frontend of the design tool chain is a standalone design tool framework named Genesis [10, 24, 25]. It is responsible for application interfacing, design optimization and efficient RTL generation. Genesis is a framework that simplifies the construction of highly parameterized IP blocks. Unlike existing HDLs, which calcify any existing flexibility at instantiation, Genesis leaves low-level optimization “knobs” free even after aggregation into bigger IP blocks, allowing them to be set and optimized later in the design process. To achieve that, Genesis enables hardware designers to simultaneously code in two interleaved languages when creating a chip module: a target language (SystemVerilog) to describe the behavior of the hardware, and a meta-language (Perl) to decide what hardware to use for given specs (see the left part of Fig. 8b). The net result is that Genesis enabled us to design an entire family of LiM designs all at once. After the parameterized design is complete, there is still the matter of controlling all the parameters; they can be set explicitly by the user or automatically by optimization tools. The generator mechanism provides a standardized way, via an XML form, for optimization tools to make design decisions for the various parameters throughout the design hierarchy. Genesis classifies parameters into three groups. First, an inherited or constrained parameter is one that is inherited from, or constrained by, decisions made in other modules in the design hierarchy (e.g., interface bit width). The second type is the free parameter, whose value can be freely assigned by the system; it is best to allow an optimization engine to set the value that maximizes performance under a given power or area constraint. The third type is the architectural parameter, which changes the function or behavior of the module; these parameters must be set by the application designer. An inheritance priority rule in Genesis determines the assignment/overwrite policy for parameter values.

*Smart Memory Compiler*

The automated design framework discussed so far is capable of mapping application specifications to optimized RTL. Equally important, a smart backend of the design tool chain is required to efficiently co-synthesize logic and memory (the right part of Fig. 8b). Generic SRAM compilers enable automatic SRAM IP creation based on user specifications, but they “compile” memory blocks from a set of pre-determined SRAM hard IP components (e.g., bitcells and peripheral circuits). This compilation strategy not only limits the possibility of application-specific customization but also hinders comprehensive design space exploration, leading to sub-optimal IP. We have been exploring opportunities for the synthesis (not just compilation) of customized logic-in-memory blocks in a commercial sub-20 nm CMOS process and have developed a smart memory design and synthesis methodology. The smart memory is composed of a group of memory arrays, peripheral circuits, and application-specific random logic implementing a special function. The major step in the design of smart memory is to co-optimize logic, memory and process. In order to predictably print the tight pitches at extreme nodes, the design rules require an extremely regular and gridded design, which makes logic and memory co-design easier; to this end we have created a bitcell-compliant, area-efficient unidirectional logic fabric. This methodology allows us to remove any distinction between pushed memory design rules and logic design rules. Therefore, the customized memory periphery is synthesized using lithographically compliant unidirectional standard cells, which can be mapped together with the memory onto a small set of pre-characterized layout pattern constructs [5, 6]. Lithographic compliance between the co-designed logic and memory ensures sub-20 nm manufacturability of LiM circuits.

The architectural frontend and physical backend are combined to build an end-to-end LiM design framework [3, 4, 26]. Its input is the design specification and its output is ready-to-use hardware (RTL, GDS, .lib, .lef). When generating a specified design point, our framework also reports the area, power and latency and sends them back to the frontend user interface, from which the designer can evaluate the resulting design and revise the design specs if necessary. Our LiM framework allows an application designer to generate optimized “silicon” templates by simply tuning the “knobs”.

*User Interface Illustration*

The architectural parameters (e.g., *data precision* and *interpolation order*) are set by the application designer. In our example in Fig. 9, the selected operation is the reformatting of a \(256\times 256\) polar grid array to a rectangular grid array, using bilinear interpolation. The interpolation resolution is set to 8 bits. To achieve this, a 2D bilinear interpolation memory with a \(256\times 256\) physical memory size and a \(2\times 2\) rectangular access size is required; this is a separate LiM design tool we built, which here acts as a sub-module of the image reformatting tool. Constrained by the higher-level image reformatting tool, its parameters are shown in the right part of Fig. 9b. This interpolation memory contains a second-level sub-module: a \(2\times 2\) rectangular access memory for supplying \(2\times 2\) pixel blocks to its higher-level bilinear interpolation memory module. When satisfied with the parameters, the user simply clicks the “Submit Changes” button, and the tool generates the dedicated hardware description in Verilog.

As seen in the example, we are building a LiM tool that is hierarchically composed from lower-level LiM design tools. All of these submodules provide users with hierarchical graphical tools to design instances of the algorithm, with the capability of exploring the design space to trade off cost and performance.

## 5 Experimental Results

In this section we evaluate our logic-in-memory based SAR implementation for accuracy, performance, as well as computational and energy cost. We use our design tool to automatically synthesize the hardware for measurement and build an architectural model to simulate the algorithm.

*Consecutive Access Smart Memory Evaluation*

*Accuracy and Hardware Cost Evaluation*

*Energy Efficiency*

## 6 Conclusion

Advances in integrated circuit design enable the energy-saving logic-in-memory paradigm, which moves a part of the computation directly into the memory array. This cutting-edge design methodology requires the redesign of well-known algorithms to match its performance characteristics. In this paper we derive a logic-in-memory variant of the polar format algorithm used in SAR image formation, which achieves accuracy comparable to the traditional FFT-based polar format algorithm while requiring much less processing energy. Our algorithm further supports partial image reconstruction. We provide the necessary design automation tool chain to enable users to study the design trade-offs in the energy and performance space. Our experimental results show substantial energy savings at the same accuracy level.

## Acknowledgments

The authors acknowledge the support of the C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity.