The constant advances in IC technologies have introduced new challenges for implementations and design methodologies: higher integration levels allow more complex systems to be implemented, but implementations often face strict constraints on power consumption. These challenges are present in signal processing systems, implying the need to improve design methods and find more efficient algorithm-architecture optimizations. This special issue contains a selection of recent papers on the design and implementation of signal processing systems, ranging from circuit-level architectures to scheduling methods and from application-specific architectures to implementations on many-core systems.

In Data Center Switch for Load Balanced Fat-Trees, Lai and Chiu demonstrate a fault-tolerant switch IC operating at a maximum rate of 5.8 Gbps per channel. This work employs a load-balanced fat-tree architecture that does not consume all of its bandwidth even under heavy traffic. Even when links are broken or switches are faulty under heavy traffic load, bandwidth remains available in every connection pattern, and alternative paths are provided to re-route the traffic. Evaluations of the network's tolerance to link and switch faults are given to support the presented idea, and a 4 × 4 Banyan-type switch IC is developed as the commodity switch for building fault-tolerant fat-tree data center networks.

Lee and Sung propose a cell-to-cell interference (CCI) cancellation technique for multi-level cell (MLC) NAND flash memory in their paper Least Squares Based Coupling Cancellation for MLC NAND Flash Memory with a Small Number of Voltage Sensing Operations. Their two-step algorithm consists of training followed by interference removal, both performed during the page read operation. A least-squares adaptive CCI canceller is developed and optimal quantization schemes are studied. Experimental results show a significant BER improvement even with a small number of voltage sensing operations.

In A Fast Recursive Algorithm and Architecture for Pruned Bit-Reversal Interleavers, Mansour describes an algorithm and architecture for implementing interleavers used in communications applications such as error-correcting codes (e.g., turbo codes) and bit-interleaved coded modulation. A mathematical formulation for constructing flexible-length interleavers is presented along with a study of permutation statistics. Practical examples of parallel interleaver implementations are provided.

In Highly Parallelable Bidimensional Median Filter for Modern Parallel Programming Models, Sánchez and Rodríguez present efficient parallel implementation methods for median filtering. The authors implement their previous work on the parallel ccdf-based median filter (PCMF) on a graphics processing unit (GPU) and show that the proposed median filtering algorithm is efficient and can outperform other generic median filters for the GPU. The proposed algorithm is implemented in three parallel programming models: SIMD Intel, multi-core Intel with SIMD, and SIMT (CUDA). Additionally, they make use of the salt-and-pepper noise model to improve the image reconstruction quality with a small performance impact.

Pan, Zheng, Tian, Yan, and Huan consider a distributed architecture for particle filters in their paper Hierarchical Resampling Algorithm and Architecture for Distributed Particle Filters. The authors propose decomposing the resampling process into two hierarchies, which avoids a tricky particle redistribution procedure. A residual cumulative resampling method is proposed, which effectively allows the resampling step to be pipelined. The proposed approach provides the same accuracy as centralized resampling methods. The authors have implemented the architecture on an FPGA incorporating eight processing elements. The prototype shows the potential of the method.

Wavefront parallel processing (WPP) coding in the High Efficiency Video Coding (HEVC) standard is considered in Parallel HEVC Decoding on Multi- and Many-Core Architectures: A Power and Performance Analysis by Chi, Alvarez-Mesa, Lucas, Juurlink, and Schierl. The paper shows how WPP can be implemented efficiently on multi-core and many-core architectures. The authors propose an overlapped wavefront method, which allows several picture partitions as well as multiple pictures to be processed in parallel. The paper presents a performance and power analysis of an optimized HEVC decoder on 4-core, 8-core, and 36-core systems. The experiments show that the proposed method also improves single-threaded performance.

Du, Wang, Zhuge, Hu, and Sha consider replacing DRAM as main memory with modern non-volatile memories (NVMs) in their paper Efficient Loop Scheduling for Chip Multiprocessors with Non-Volatile Main Memory. Because a write in an NVM is a significantly more costly operation than a read, the authors propose a method to reduce write activity in parallel loops. The method is based on rotation with maximum bipartite matching, and it reduces the number of writes compared to traditional rotation algorithms. The authors have carried out simulations with DSPStone benchmarks showing savings of over 30% in writes. Based on this, the lifetime of NVMs can be expected to double.

A novel schedule model called the scalable schedule tree (SST) is proposed by Wang, Shen, Wu, and Bhattacharyya in Parameterized Scheduling of Topological Patterns in Signal Processing Dataflow Graphs. They exploit the concept of topological patterns for efficient representation of repetitive substructures in dataflow graphs. The paper proposes a formal design method for specifying the topological patterns and deriving parameterized schedules from such patterns. The proposed method ensures deterministic system behavior. The authors have integrated SST support into the DIF framework, applied the method to two case studies, and generated code for a GPU and a general-purpose processor. The results show that the method provides a formal path from scalable application analysis to systematic exploration of implementation trade-offs.

In Pedestrian Navigation Based on Inertial Sensors, Indoor Map, and WLAN Signals, Leppäkoski, Collin, and Takala propose an accurate indoor positioning method that uses pedestrian dead reckoning (PDR) based on microelectromechanical systems (MEMS) sensors, fused with WLAN signals, an indoor map, or both. The MEMS sensor unit includes a heading gyro and a 3D accelerometer. For the data fusion, they employ two nonlinear Bayesian filters: a complementary extended Kalman filter (CEKF) and a particle filter. The processing load of the map information is reduced by an appropriate prior sectioning so that only some of the obstacles need to be checked by the algorithm. The results obtained with different combinations of the available sensor information are compared. This work supports accurate positioning indoors, where satellite signals, e.g., GPS, are severely degraded or not available at all.

Zhu, Berger, Turner, Pileggi, and Franchetti present a hardware architecture for synthetic aperture radar (SAR) image formation in their paper Local Interpolation-based Polar Format SAR: Algorithm, Hardware Implementation and Design Automation. The architecture is based on a logic-in-memory approach. A design automation framework consisting of a chip generator and smart memory compiler is described. Experimental results show significant energy savings.

We would like to thank all of the authors of this special issue for their contributions. We would also like to thank the anonymous reviewers for their efforts in ensuring the quality of the papers. We also extend our appreciation to C. Clark for her help in setting up this issue. We hope that you enjoy the special issue and find the articles informative and useful.