Current embedded systems are increasingly used to support high-performance applications. This is due to the spread of these systems in mobile devices and to the large number of services they are expected to provide.

To support the execution of these applications, several architectures based on CPU, GPU or FPGA have been developed and are still under investigation. When the application demands both performance and flexibility, architectures based on several types of execution resources are highly beneficial. In this context, designers define architectures that include all the necessary resources on the same chip, known as a “Multiprocessor System on a Chip” (MPSoC).

Image processing is one of the major applications in the embedded domain and is computationally demanding. Image processing for medical, automotive, and video compression applications is the main topic addressed by the authors of the Design and Architecture for Image and Signal Processing (DASIP) conference. This special issue presents several papers on this general topic, together with papers that deal with implementation complexity and explore the opportunities offered by CPU, GPGPU, FPGA and ASIC implementations. Comparisons among these technologies are also presented in an attempt to identify the best implementation for each application. Because of the complexity of the applications and architectures, research on implementation methodologies is also addressed in this special issue. The main objective is to provide designers with efficient methodologies and tools that can help during the exploration of different implementation opportunities.

In the following sections, the guest editors provide a brief description of each paper presented in this special issue. We hope to have provided JRTIP readers with a good collection of papers, and that these selected papers will be a source of inspiration for future work.

1 Application papers

H.264 is a good example of an application that requires a large amount of computation, and several architectures can be explored to implement it efficiently. For example, high parallelization of the application on GPU is addressed in the paper from Youngmin Yi, “An Efficient Parallelization Technique for x264 Encoder on Heterogeneous Platforms Consisting of CPUs and GPUs”. Another study, which proposes a flexible execution of the H.264 algorithm, is discussed in Wajdi Elhamzi’s paper, “An Efficient FPGA Implementation of a Configurable Motion Estimation for H.264 Video Coding”. This paper explores the implementation of low-cost algorithms using a Xilinx V6 FPGA.

Image processing operations such as binarization are often used to reduce complexity by extracting selected image features. This topic is covered in the paper authored by Naeem Abbasi, “Modified Stable Euler-Number Algorithm Implementation for Real-Time Image Binarization”, where the authors propose an efficient FPGA implementation based on a pipelined architecture to address this type of computation.

The biomedical domain is also computationally demanding, and an FPGA device is a good choice for implementing a parallelized algorithm that must execute in real time. The paper provided by Fan Yang, “Flexible VLIW processor based on FPGA for efficient embedded real-time image processing”, targets this challenge.

Video surveillance is increasingly being deployed in public locations in order to detect specific problems and provide assistance in critical situations. To provide this type of service, image processing algorithms that extract the background and moving objects are of high importance. In the paper written by Mateusz Komorkiewicz, “Real-time background generation and foreground object segmentation for high definition color video stream in FPGA device”, the authors propose a technique based on an advanced background model that includes the color and texture of images. They demonstrate their approach on HD color video with an FPGA implementation supporting real-time execution.
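To give a concrete idea of what background generation and foreground segmentation involve, the following is a minimal sketch of the simplest textbook variant: an exponential running average of grayscale frames followed by a per-pixel threshold. It is not the color and texture model proposed in the paper; the function names, the blending factor `alpha` and the `threshold` value are assumptions chosen only for illustration.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background estimate (running average).

    alpha is an assumed blending factor, not a value taken from the paper.
    """
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30.0):
    """Mark a pixel as foreground (1) when it differs strongly from the background."""
    return [
        [1 if abs(f - b) > threshold else 0 for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


# Tiny 2x3 grayscale example with two bright "moving object" pixels.
background = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
frame = [[12, 200, 9], [10, 11, 180]]

mask = foreground_mask(background, frame)          # segment first
background = update_background(background, frame)  # then adapt the model
```

Every pixel is processed independently of its neighbors, which is one reason this family of algorithms maps well onto pipelined hardware such as an FPGA.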

Security in the automotive domain is an important topic, and car manufacturers have been introducing Advanced Driver Assistance Systems (ADAS) to support drivers. One important subject concerns the capability to detect road signs in order to alert drivers. The paper from Chokri Souani, “Efficient algorithm for automatic road sign recognition and its hardware implementation”, proposes a hardware implementation based on FPGA circuits and provides an interesting trade-off between computation speed and recognition capability.

Security and authentication are very important features of current embedded systems and are mainly intended to protect confidential user information. In this context, there is an increasing interest in techniques that can help to protect personal data. One technique consists of verifying the authenticity of users by checking their fingerprints. The paper written by Rosario Arjona et al., “A Hardware Solution for Real-Time Intelligent Fingerprint Acquisition”, proposes an efficient algorithm implemented on a low-cost embedded system, which is capable of supporting different types of fingerprint sensors.

2 Implementation papers

Implementing applications is a difficult challenge, in particular when these applications need substantial computation resources. However, several possible architectures exist, offering different levels of flexibility, performance, and power consumption. Several papers in this special issue address different architectures including ASIC, FPGA, CPU and GPGPU as execution resources.

When the system is composed of several nodes connected to a central node, the energy needed to transfer information can be drastically reduced by performing efficient computation directly in the sensor node. The paper by Z. Cihan Taysi, “In-Situ Image Processing Capabilities of ARM-based Microcontrollers”, proposes implementing basic algorithms for image detection, recognition and tracking on an ARM processor, which is a widely used architecture for low-power sensors.

Sometimes, the classical implementation of an algorithm cannot provide sufficient computational performance. In this case, designers can deploy implementations using dedicated devices or an ASIC implementation. This is the case for the article from Alireza Behrad, “VLSI Implementation of Star Detection and Centroid Calculation Algorithms for Star Tracking Applications”, which addresses the computation involved in tracking a large number of stars.
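As an illustration of what a centroid calculation means here, the sketch below computes an intensity-weighted centroid over a small pixel window, a common way to estimate a star position with sub-pixel accuracy. It is a generic software formulation, not the VLSI design of the paper; the function name `weighted_centroid` and its `threshold` parameter are assumptions made for this example.

```python
def weighted_centroid(window, threshold=0):
    """Return the intensity-weighted centroid (x, y) of a pixel window.

    Pixels at or below the assumed threshold are ignored as background.
    Returns None when no pixel exceeds the threshold.
    """
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for y, row in enumerate(window):
        for x, value in enumerate(row):
            if value > threshold:
                total += value
                sum_x += x * value
                sum_y += y * value
    if total == 0:
        return None
    return (sum_x / total, sum_y / total)


# A symmetric 3x3 blob: the centroid falls on the central pixel.
star = [
    [0, 1, 0],
    [1, 4, 1],
    [0, 1, 0],
]
centroid = weighted_centroid(star)
```

The computation reduces to three running sums and two divisions per star, which suggests why it lends itself to a dedicated hardware datapath.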

Several papers propose to implement image processing algorithms on a specific FPGA. Texture processing, aimed at high-performance extraction of this specific image feature, is addressed in the paper written by Asadollah Shahbahrami, “High Performance Implementation of Texture Features Extraction Algorithms using FPGA Architecture”.

To process images captured by some medical devices, high-performance computation algorithms are usually necessary. This is the case in tomography analysis, which reconstructs images from a large amount of signal data. To perform such computation in a parallelized way, two types of architecture can be explored: FPGA and GPU. Matthias Birk’s paper, “A Comprehensive Comparison of GPU and FPGA-based Acceleration of Reflection Image Reconstruction for 3D Ultrasound Computer Tomography”, addresses this topic and provides comparisons between the two architectures.

Extracting information from an image in order to reduce the quantity of data to be processed, and finding an efficient implementation on an FPGA circuit, is the topic of the paper presented by Sara Granado, “On-chip semi-dense representation map for dense visual features driven by attention processes”. This paper presents an efficient hardware implementation leading to real-time execution.

An FPGA is not only a configurable circuit; it can also be used as a complete solution to implement an MPSoC system when the application demands both the high computational performance of hardware execution and the flexibility of software execution.

The paper written by Xiofang Wang, “Hardware–Software Optimizations of Reconfigurable Multi-Core Processors for Floating-Point Computations of Large Sparse Matrices”, shows that a hardware-software co-design approach can lead to significant improvements when implementing a computation-intensive application. A Virtex 4 FPGA is used in this work, and a 17% speedup is obtained for the computation of large sparse matrices.
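For readers less familiar with sparse matrix computations, the following sketch shows a matrix-vector product over the compressed sparse row (CSR) format, a widely used storage scheme that keeps only the non-zero entries. The editorial does not state which format or kernel the paper uses, so this is only a generic illustration; the names `csr_matvec`, `values`, `col_index` and `row_ptr` are assumptions.

```python
def csr_matvec(values, col_index, row_ptr, x):
    """Multiply a CSR-encoded sparse matrix by a dense vector x.

    values    : non-zero entries, stored row by row
    col_index : column of each entry in values
    row_ptr   : row r owns entries row_ptr[r] .. row_ptr[r + 1] - 1
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_index[k]]
        y.append(acc)
    return y


# CSR encoding of the 3x3 matrix
#   [[2, 0, 0],
#    [0, 0, 3],
#    [1, 0, 4]]
values = [2.0, 3.0, 1.0, 4.0]
col_index = [0, 2, 0, 2]
row_ptr = [0, 1, 2, 4]

result = csr_matvec(values, col_index, row_ptr, [1.0, 2.0, 3.0])
```

The indirect access `x[col_index[k]]` makes memory behavior irregular, which is typically what makes sparse floating-point kernels a candidate for combined hardware and software optimization.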

Detection of specific lines is often used to detect objects in images. Several algorithms exist to support such line detection, and their efficient implementation on different types of execution resources remains an active research topic. In Markéta Dubská’s paper, “Real-Time Detection of Lines using Parallel Coordinates and CUDA”, the authors propose to implement the Hough transform on GPU by offering a new organization of the Hough basis kernel. The proposed solution enables real-time detection for simple binary images or for more complex color images.
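The voting principle behind the Hough transform can be sketched in a few lines. The example below uses the classical (theta, rho) line parameterization on the CPU, not the parallel-coordinates formulation or the CUDA kernel organization of the paper; the function name `hough_votes` and the very coarse angular resolution `n_theta=4` are assumptions chosen to keep the example small.

```python
import math
from collections import Counter


def hough_votes(points, n_theta=4):
    """Accumulate votes in (theta index, rounded rho) bins.

    Each edge point votes for every line x*cos(theta) + y*sin(theta) = rho
    that passes through it, for n_theta evenly spaced angles in [0, pi).
    """
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho))] += 1
    return votes


# Four edge pixels lying on the vertical line x = 2.
points = [(2, 0), (2, 1), (2, 2), (2, 3)]
votes = hough_votes(points)

# The bin with the most votes identifies the detected line:
# theta index 0 (theta = 0) and rho = 2, i.e. the line x = 2.
best_bin, best_count = votes.most_common(1)[0]
```

Since every point votes independently, the accumulation step is naturally parallel, which helps explain the interest in GPU implementations.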

GPUs are widely deployed in desktop computers and have been widely considered for digital signal processing applications. Indeed, this type of application can exploit this platform and even multi-GPU architectures. However, one specific point that needs to be managed to support these applications is the communication and data exchange between the different processing elements of the GPU. In the paper by Silvain Huet, “Efficient implementation of data flow graphs on multi-GPU clusters”, the authors discuss a high-level model based on the Data Flow Graph and propose to reduce the communication overhead during the execution of the application.

3 Methodology papers

Implementing algorithms in embedded systems generally requires time-consuming effort from designers to verify functionality and to ensure that constraints are met. To help designers, methodologies and tools are developed to support application and architecture descriptions. In the paper written by Francesca Palumbo, “The Multi-Dataflow Composer Tool: Generation of On-the-Fly Reconfigurable Platforms”, the authors examine a platform that generates a runtime description of a multi-application system from its data flow descriptions. Implementations of image and video computation on both ASIC and FPGA are presented to demonstrate the capability of the platform.

The efficiency of an application implementation on a specific architecture depends on all required development phases. The first step consists of describing the application in order to help the designer in the exploration phase. For this purpose, a Data Flow description is widely used and is often taken as the input to methodologies and tools. In the paper written by Endri Bezati, “High-Level Dataflow Design of Signal Processing Systems for reconfigurable and multi-core heterogeneous platforms”, the authors demonstrate that Data Flow programming can be used as the basis of a unified methodology to support heterogeneous systems composed of reconfigurable (FPGA) and software (CPU) execution resources.

Another way to help designers with implementation consists of using High Level Synthesis (HLS) tools, which accept a high-level description of an application as input. The paper provided by Carlo Colodro-Conde, “A practical evaluation of the performance of the Impulse CoDeveloper HLS tool for implementing large-kernel 2-D filters”, presents results and comparisons of HLS synthesis with classical implementations. This paper shows that significant reductions in design time and design effort can be obtained with an acceptable reduction in performance.

The complexity of embedded systems is increasing, and designers often use simulations to explore their designs and to verify the functionality of the applications. The problem becomes more complex when the application is distributed among several execution resources that are connected by specific interfaces. In the paper authored by Sébastien LeNours, “Performance evaluation of an automotive distributed architecture based on a high speed power line communication protocol using a transaction-level modeling approach”, the authors provide a simulation environment that supports transaction-level modeling, thus providing information that helps the designer tune the architecture to the application requirements.