1 Introduction

With the explosive growth of vision technologies, including sensors, computing platforms and displays, the role of digital images in our society has never been more important. Almost all modern systems rely on computer vision, whether for security and safety applications, such as video surveillance [1] and autonomous driving assistance systems [2], or for diagnostic and investigative purposes, as in medicine [3] and satellite imagery [4]. In these systems, sensed images are collected and processed with the aim of providing synthetic information about the surrounding scene and/or making decisions automatically, without human intervention. It is then clear that, to cope with the numerous difficulties that may occur in analyzing real scenes, the vision technology embedded within such systems must include several complex computational stages. The resulting video pipeline is required to process the input image step by step while ensuring a frame rate and power consumption adequate to the target application [5]. Such constraints may be satisfied through hardware accelerators implemented in either Field Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC) technologies, which enable parallelism and pipelining over software-based CPU counterparts [6, 7].

In most applications [1,2,3,4], recognizing particular patterns and/or detecting specific objects within the scene is crucial for the proper functioning of the system. However, raw digital images captured through sensors may be affected by disturbances, which could impact the output produced by intermediate processing stages and distort the final result. As an example, poor illumination of the scene and/or high temperatures of the acquisition electronic circuits modify the actual value of digital pixels by adding a Gaussian-distributed noise component [8]. In this case, preliminary elaborations are mandatory to enhance the quality of images and make them suitable for processing by the subsequent pipeline. Among the established pre-processing techniques, bilateral filtering (BF) [9] is one of the most popular thanks to its ability to smooth the image while preserving edges. It has been successfully exploited in conjunction with other elaboration steps to perform detail enhancement [10], image up-sampling [11] and tone mapping [12], to cite just a few applications. BF combines spatial and intensity information from neighboring pixels to filter the image through adaptive weights based on the Gaussian distribution. Depending on the target task (e.g., denoising, texture smoothing, etc.), the desired level of filtering can be adjusted through some tuning parameters. However, the relatively high computational complexity of BF, due to exponentiation and division operations, makes integrating this function non-trivial in most of the above-cited contexts [1,2,3,4].

In the recent past, significant efforts have been spent on hardware-oriented approximation strategies [13,14,15,16,17,18] that reduce the computational load of BF without compromising the achievable quality. Most prior art [13,14,15,16] replaces the exponential function with look-up tables (LUTs) storing a discrete set of approximate weights that are pre-computed for a certain number of fitting points chosen at design time. When adopted for image denoising, as demonstrated by [13,14,15,16], such an approach provides competitive trade-offs in terms of image quality and hardware characteristics, but it shows poor flexibility under different noise conditions. As is well known, BF is effective in denoising an image when the noise variance is known in advance [17, 18]. In such a case, the tuning parameters may be properly adjusted depending on the currently estimated noise, meaning that the approximate weights, and consequently the LUT contents, must be updated as well. The self-adaptable high throughput BF (SAHTBF) architecture presented in [17] provides this capability, introducing additional circuitry to estimate the noise variance and to reload the new set of weights within on-chip memories. To avoid runtime updating operations, our previous work [18] used LUTs to store approximate weights pre-computed through appropriate piecewise linear functions. The latter were conceived to allow scaling the address used to access the LUTs according to the current noise level, thus selecting the most suitable set of weights. However, as detailed in the following, these solutions introduce an overhead in terms of maximum running frequency and dynamic power consumption.

In this work, we present a new hardware-oriented approximation strategy that exploits simple mathematical transformations to reduce the computational complexity of BF. The main contributions of the paper are summarized as follows:

  1. We provide a comprehensive overview of the most recent applications exploiting BF. The objective of this study is to demonstrate not only the effectiveness of a traditional technique like BF in modern vision technologies, but also the requirements that a hardware implementation must meet in terms of throughput, energy consumption and flexibility, depending on the operating scenario.

  2. We propose a new approach for BF that, in contrast to prior art [13,14,15,16,17,18], avoids pre-computation of weights and LUT-based processing. The proposed method allows weights to be computed at runtime, depending on the currently set tuning parameters. Moreover, it simplifies the filtering operation by converting some multiplications into simple shift operations.

  3. We design a custom architecture based on the novel BF approach and implement it on the Xilinx Zynq XC7Z020 FPGA device. For comparison with competitors, different implementations of the proposed architecture have been characterized to process images with resolutions of 256 × 256, 512 × 512 and 3268 × 2448 pixels.

  4. We evaluate the accuracy impact of the proposed approximate BF technique with reference to three different applications: denoising, texture smoothing for edge detection and high-dynamic-range (HDR) tone mapping.

The remainder of this paper is organized as follows. Background on BF and a review of its most recent applications in computer vision tasks and hardware implementations are provided in Sect. 2. The proposed BF strategy and the quality evaluation for different vision applications are presented in Sects. 3 and 4, respectively. Section 5 describes the proposed hardware design and the obtained implementation results. Finally, conclusions are drawn in Sect. 6.

2 Background and related works

2.1 Bilateral filtering

BF convolves the input image with weights computed on the basis of both the geometric and intensity distances between neighboring pixels. From a mathematical point of view, BF operates as reported in (1). I(x, y) and O(x, y) are the input and output pixels at coordinate (x, y), respectively; Ω is the k × k filter window centered at I(x, y); Ws(i, j) and Wr(i, j), defined in (2a) and (2b), are the spatial and range coefficients at the generic position (i, j) within Ω.

$$O\left( {x, y} \right) = \frac{{\mathop \sum \nolimits_{(i, j) \in \Omega } I(i, j) \times Ws(i, j) \times Wr(i, j)}}{{\mathop \sum \nolimits_{(i, j) \in \Omega } Ws(i, j) \times Wr(i, j)}}$$
(1)
$$Ws(i, j) = e^{{ - \frac{{\left( {x - i} \right)^{2} + \left( {y - j} \right)^{2} }}{{2 \times \sigma s^{2} }}}}$$
(2a)
$$Wr(i, j) = e^{{ - \frac{{|I\left( {x, y} \right) - I\left( {i, j} \right)|^{2} }}{{2 \times \sigma r^{2} }}}}$$
(2b)

It must be noted that the calculation of the weights, depending on the Gaussian distribution, involves the standard deviation parameters σs and σr, which are used as tuning parameters to adjust the level of filtering. In particular, σs is set based on the filter window size k, so that \(\sqrt {\left( {x{-}i} \right)^{2} + \left( {y - j} \right)^{2 } } \le 3 \times \sigma s\) [19]. Conversely, σr has to be properly chosen based on the task to be performed. As an example, in the case of denoising, its value depends on the current noise standard deviation σn and it is chosen so that σr = 3 × σn [20].

With reference to (1)–(2), it can be noted that the complexity of BF is mainly due to the high number of exponentiation and multiplication operations needed to compute the filter weights. While the spatial coefficients may be considered constant once k is set, the range coefficients require extra computational resources to perform exponentiation at runtime for different pixel intensity differences and σr values.
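As a concrete reference, the direct formulation (1)–(2) can be sketched in a few lines of Python. This is an illustrative floating-point implementation for grayscale images; the function name and NumPy usage are ours, not the authors' code.

```python
import numpy as np

def bilateral_filter(img, k=5, sigma_s=1.0, sigma_r=30.0):
    """Reference bilateral filter following (1)-(2), float arithmetic."""
    r = k // 2
    pad = np.pad(img.astype(np.float64), r, mode="edge")
    # Spatial kernel Ws is fixed once k and sigma_s are set.
    ii, jj = np.mgrid[-r:r + 1, -r:r + 1]
    ws = np.exp(-(ii**2 + jj**2) / (2 * sigma_s**2))
    out = np.empty_like(img, dtype=np.float64)
    H, W = img.shape
    for y in range(H):
        for x in range(W):
            win = pad[y:y + k, x:x + k]
            # Range kernel Wr is recomputed per pixel from the squared
            # intensity distance to the window center, as in (2b).
            wr = np.exp(-(win - pad[y + r, x + r])**2 / (2 * sigma_r**2))
            w = ws * wr
            out[y, x] = (win * w).sum() / w.sum()
    return out
```

The per-pixel recomputation of `wr` is exactly the cost the rest of the paper seeks to remove.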

2.2 Applications

BF was first proposed in 1998 as an effective technique to smooth grayscale images. Since then, its application in the most varied computer vision fields has gone through constant evolution, either by modifying the original algorithm [21,22,23,24,25] or by combining it with other image processing techniques [10,11,12, 26,27,28,29,30,31,32]. In the first case, worth mentioning are the modified BF techniques that include the Adaptive BF (ABF) [21] and the Joint BF (JBF) [24]. Such approaches were proposed to overcome limitations occurring in specific tasks. As an example, the ABF provides the ability to smooth the image and to sharpen the edges by adding an offset to (2b) and by making the σr parameter adaptable across different windows. This property is very useful in restoring images affected by Gaussian noise. Conversely, the JBF was proposed to address the issues related to the use of flash lights in photography by acquiring two images of the same scene, e.g., with and without flash, and then computing the range coefficients using one of them as guidance [24].

From an application point of view, BF [10] and its variants [21, 24] are adopted as intermediate steps in several complex computer vision tasks. Figure 1 summarizes the most recent applications involving such an operation, along with the main requirements from a system perspective. With reference to its original application, i.e., image denoising, BF is very effective in processing images characterized by many edges and contours. Some relevant examples include low-dose computed tomography (CT) images [26] and biometric images of iris, fingerprint and finger vein that are commonly used for authentication [28,29,30]. In such cases, one or more BF stages are fundamental to attenuate noise and to improve the performance of the subsequent edge detection step [28,29,30]. Besides its denoising capability, BF has also demonstrated its effectiveness in the context of medical image fusion [27], which is often adopted to decompose the inputs coming from different diagnostic apparatuses, such as CT and magnetic resonance imaging (MRI), and to extract as many details as possible from each source.

Fig. 1 Overview of most recent applications of BF

Structure-preserving smoothing plays a crucial role in several imaging applications, like HDR tone mapping [12] and video abstraction [31], and it can also be implemented by BF [10]. As an example, the tone mapping method presented in [12] relies on processing the HDR input image to extract the base and detail layers through a BF-based decomposition process. Such layers are then properly merged to obtain a low dynamic range (LDR) image, using as scaling and compression factors the tuning parameters resulting from the BF stage.

Wennersten et al. [32] also propose the use of BF to reduce ringing artifacts due to video compression. In this case, it acts as a post-processing step performed on decoded samples after the inverse transform, which improves the quality of the reconstructed images as well as the spatial and temporal prediction of subsequent blocks. More recently, a bilateral up-sampling network has been presented in [11] to perform single-image super-resolution with arbitrary scaling factors. The idea arises from the observation that the range component of BF takes the difference between pixel intensities into account, which allows up-sampling weights to be learned adaptively for different image regions and avoids over-smoothed super-resolution results.

Two important observations arise from this overview. First, regardless of the application, BF requires the tuning parameters to be configured at runtime depending on the current operating conditions. This is fundamental in: denoising, to attenuate only the currently estimated additive noise; texture smoothing, to ensure a sharpening level that varies across different image regions based on the image content; compression artifact removal, where σr and σs must be set according to the way the compression has been performed; and image up-sampling, for which σr and σs are learnable parameters of the model. Second, all the cited applications have to meet tight constraints in terms of frame rate, energy consumption or both. However, BF is a computationally intensive technique, and the exponential function needed to calculate the range coefficients on the fly accounts for almost the entire processing time [33]. Furthermore, it should be considered that the hardware cost of a single exponentiation operator is about 6 times that of a fixed-point multiplier unit [34]. It is then clear that, in all those operating scenarios where BF is used in conjunction with less complex pixel-level processing, its contribution becomes dominant. This is, for instance, the case of the HDR tone mapping [12] and image fusion [27] applications, where the layers obtained by the BF-based decomposition process are subsequently elaborated through a weighted addition. In some other applications, like those relying on neural networks [11, 26], BF layers are used in combination with traditional convolutional layers. In such a case, while the convolutional layers perform filtering through constant weights, the BF layers show higher computational complexity for the same input volume size.

In the above-mentioned scenarios, custom hardware accelerators are needed to meet the speed/energy requirements. In such cases, straightforward implementations of the bilateral filter most often do not satisfy the specifications. Therefore, innovative and specific design strategies are highly desirable to reduce the computational complexity of BF, thus meeting the energy dissipation constraints.

2.3 Hardware implementations

As discussed above, the operations involved in BF, including exponentiation and division, make its hardware implementation quite challenging. State-of-the-art FPGA-based designs [13,14,15,16,17,18] all refer to the denoising application and rely on a straightforward implementation of (1)–(2), using approximate weights that are pre-computed and stored within LUTs. As observed in [13, 14], since the spatial component of BF is an isotropic Gaussian filter, pixels within the k × k window can be grouped to exploit this symmetry and optimize the hardware circuit. Then, while [13] performs parallel operations on the groups, thus outputting one pixel per clock cycle, [14] serializes the process over more stages to reduce the amount of hardware resources. Starting from the observation that the computational complexity of BF is proportional to the size of the filter window, the architecture presented in [15] relies on a constant-time implementation where (1)–(2) are expressed as a series of box filterings. However, since such a solution is based on storing intermediate pixels on-chip, the resulting hardware design limits the size of the images that can be processed. Conversely, the BF circuit proposed in [16] allows high-resolution images to be processed at a high frame rate. There, the LUTs containing 31 approximate filter weights are loaded at design time, which means that no runtime content update is required in case of different noise conditions.

The implementation presented in [17] relies on the architecture illustrated in Fig. 2. There, a stream buffer is responsible for caching input pixels through single-port on-chip RAMs that are accessed by scanning the memory addresses through a dedicated control unit. The segment creator (SC) blocks properly group incoming pixels to form segments that can be processed in parallel by the subsequent elaboration units. For each segment, in particular, the corresponding range filter module operates by: computing the absolute value of the difference I(x, y) − I(i, j); upscaling the result by 256; using it to access the LUT that stores the approximate range coefficients; and calculating the product P(i, j) = I(i, j) × Wr(i, j). The resulting sets of Wr and P values are sent to the multiply-and-accumulate and spatial filter blocks, respectively. The former is responsible for calculating the denominator in (1); to this aim, it embeds constant coefficients corresponding to the spatial Ws contribution and accumulates the products between the Wr and Ws coefficients. The latter, instead, implements the convolution operation to produce the numerator in (1), taking into account the isotropic property of the Gaussian kernel. Finally, the divider stage outputs the filtered pixel.

Fig. 2 Schematic of the BF design presented in [17]

It is worth noting that the architecture depicted in Fig. 2 can be made adaptive to different σr through the following modifications [17]: a circuit that estimates the new σr and computes updated weights at runtime has to be inserted, and the LUTs must be provided with auxiliary circuitry to enable writing operations. As demonstrated in [17], in comparison with the static counterpart, the resulting adaptive SAHTBF architecture utilizes 2.5 and 2.2 times more LUTs and flip-flops, respectively, and consumes 68% more power. To cope with the need for an adaptive BF circuit while avoiding excessive area overhead, our previous proposal [18] adopts piecewise linear functions, with discrete values stored within LUTs. There, runtime configurability is provided by using the currently estimated σr to properly scale the address used to access the LUTs, thus allowing the most suitable set of weights to be read. In this implementation, the additional circuit used to compute the addresses penalizes the critical path delay, leading to a running frequency 12% lower than that of the static BF circuit characterized in [18].

In contrast to the state-of-the-art [13,14,15,16,17,18], the BF approach presented here reduces the computational complexity of the exponentiation and multiplication operations and allows the range coefficients to be changed at runtime, avoiding any LUT-based processing.

3 The proposed BF approach

The proposed method introduces two modifications to the traditional BF formulation. The first one originates from the observation that, according to (2b), the range coefficients require the calculation of the squared difference between I(x, y) and I(i, j) and of the variance σr². With the aim of reducing the number of arithmetic operations, we investigated the effect of replacing (2b) with (3), which instead uses the absolute difference between I(x, y) and I(i, j) as the intensity distance and the standard deviation σr in the denominator.

$$Wra\left( {i, j} \right) = e^{{ - \frac{{|I\left( {x, y} \right) - I\left( {i, j} \right) |}}{2 \times \sigma r}}}$$
(3)

As an example, Fig. 3 graphically compares the Wr and Wra functions for σr = 30. It can be observed that the two functions begin to differ more evidently for argument values higher than 50. In Appendix A, we report a mathematical formulation of the worst-case error obtained by replacing Wr with Wra in (1), in order to assess the impact of the proposed strategy on a single filtered pixel. A comprehensive evaluation referred to real application scenarios, i.e., image denoising, edge detection and HDR tone mapping, is instead reported in Sect. 4.
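The two kernels are easy to compare numerically. The following sketch (ours, assuming 8-bit intensity distances) evaluates (2b) and (3) over the full argument range for σr = 30:

```python
import numpy as np

sigma_r = 30.0
d = np.arange(256, dtype=np.float64)       # |I(x,y) - I(i,j)| for 8-bit pixels
wr = np.exp(-d**2 / (2 * sigma_r**2))      # exact range kernel, Eq. (2b)
wra = np.exp(-d / (2 * sigma_r))           # proposed approximation, Eq. (3)
```

Both curves start at 1 and decay monotonically; `wra` decays more slowly at large distances, which is where the two functions visibly diverge in Fig. 3.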

Fig. 3 Plot of the range filter functions computed through (2b) and (3)

The second approximation proposed here is based on the mathematical transformation of the exponential function \(e^{z}\) reported in (4).

$$\log _{2} (e^{z} ) = \frac{{\ln (e^{z} )}}{{\ln (2)}} = \frac{z}{{\ln (2)}}$$
(4a)
$$e^{z} = 2^{{{\raise0.7ex\hbox{$z$} \!\mathord{\left/ {\vphantom {z {\ln (2)}}}\right.\kern-0pt} \!\lower0.7ex\hbox{${\ln (2)}$}}}}$$
(4b)

Then, by introducing the weight Wg(i, j) = Ws(i, j) × Wra(i, j) and replacing the exponentiation operations with (4b), the approximate output pixel Oa(x, y) and the weight Wg(i, j) can be computed as given in (5) and (6).

$$Oa\left( {x, y} \right) = \frac{{\mathop \sum \nolimits_{(i, j) \in \Omega } I(i, j) \times Wg(i, j)}}{{\mathop \sum \nolimits_{(i, j) \in \Omega } Wg(i, j)}}$$
(5)
$$Wg\left( {i, j} \right) = 2^{{ - \frac{{\left( {x - i} \right)^{2} + \left( {y - j} \right)^{2} }}{{2 \times \sigma s^{2} \times \ln (2)}}}} \times 2^{{ - \frac{{\left| {I\left( {x, y} \right) - I\left( {i, j} \right)} \right|}}{2 \times \sigma r \times \ln (2)}}}$$
(6)

The latter can be rewritten as shown in (7) by using the ag(i, j) term as defined in (8).

$$Wg\left( {i, j} \right) = 2^{{ - ag\left( {i, j} \right) }}$$
(7)
$$ag\left( {i, j} \right) = Cs\left( {i, j} \right) + |I\left( {x, y} \right) - I\left( {i, j} \right)| \times Cr$$
(8a)
$$Cs\left( {i, j} \right) = \frac{{\left( {x - i} \right)^{2} + \left( {y - j} \right)^{2} }}{{2 \times \ln \left( { 2} \right) \times \sigma s^{2} }}$$
(8b)
$$Cr = \frac{1}{{2 \times \ln \left( { 2} \right) \times \sigma r}}$$
(8c)

Since Wg(i, j) is now represented as a power-of-two term of the form \(2^{-ag(i,j)}\), with ag(i, j) always positive, the multiplication in the numerator of (5) can be approximated by appropriately right-shifting the pixel I(i, j). To this purpose, the \(ag\left(i,j\right)\) term is truncated to its integer part, as detailed in Sect. 5. The impact of these approximations is assessed in the following case studies.
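Putting (5)–(8) together, the shift-based filter can be sketched as follows. This is our illustrative model, not the hardware description: the clamp of ag to 31 assumes the 5-bit integer representation used in the hardware design of Sect. 5, and the denominator keeps the exact power-of-two weights.

```python
import numpy as np

def approx_bf(img, k=5, sigma_s=1.0, sigma_r=30.0):
    """Shift-based approximate BF per (5)-(8): Wg(i,j) = 2^-ag(i,j),
    with ag truncated to its integer part so that the numerator
    product I(i,j) * Wg(i,j) becomes a simple right shift."""
    r = k // 2
    ii, jj = np.mgrid[-r:r + 1, -r:r + 1]
    cs = (ii**2 + jj**2) / (2 * np.log(2) * sigma_s**2)  # Cs(i,j), Eq. (8b)
    cr = 1.0 / (2 * np.log(2) * sigma_r)                 # Cr, Eq. (8c)
    pad = np.pad(img.astype(np.int64), r, mode="edge")
    out = np.empty(img.shape, dtype=np.float64)
    H, W = img.shape
    for y in range(H):
        for x in range(W):
            win = pad[y:y + k, x:x + k]
            d = np.abs(win - pad[y + r, x + r])          # |I(x,y) - I(i,j)|
            ag = np.floor(cs + d * cr).astype(np.int64)  # Eq. (8a), truncated
            ag = np.minimum(ag, 31)                      # 5-bit integer part (our assumption)
            num = float((win >> ag).sum())               # shifts replace multiplications
            den = (2.0 ** (-ag)).sum()                   # sum of 2^-ag weights, Eq. (5)
            out[y, x] = num / den
    return out
```

Note that truncating both ag and the shifted pixels biases individual terms slightly downward; Sect. 4 quantifies the resulting quality impact.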

4 Quality evaluation on different vision applications

In order to assess the image quality achieved by the new approximate BF strategy, it has been applied to three different imaging applications: denoising, edge detection and HDR tone mapping. For each of them, a comprehensive study has been conducted on appropriate image datasets, comparing the quality results with those of the accurate counterpart.

4.1 Image denoising

Following the evaluation methodology adopted in prior works [15,16,17,18], benchmark images from the Miscellaneous USC-SIPI dataset [35] were corrupted with additive Gaussian noise having standard deviations σn ranging from 5 to 60. Then, accurate and approximate BF implementations with filter sizes k = 3, 5, 7, 11 (corresponding to σs = 0.5, 1, 2, 3 [18]) and σr = 3 × σn have been characterized by means of the PSNR and structural similarity (SSIM) [36] metrics. Table 1 summarizes the obtained results, which are averaged over the entire dataset for each (k, σn) combination. Accurate results have been obtained by means of the software routine imbilatfilt provided by MATLAB.
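For reference, the PSNR metric used throughout this section follows the standard definition below; this helper is ours, while SSIM is considerably more involved and is typically taken from an existing library implementation.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio for 8-bit images (standard definition)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)
```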

Table 1 PSNR/SSIM results for the denoising task (best results in bold)

First of all, such an analysis demonstrates that the proposed approximation technique achieves denoising capability very close to, and sometimes even better than, that shown by the accurate approach. The plots in Fig. 4 summarize this behavior, reporting the PSNR and SSIM relative errors introduced by the proposed BF with respect to the accurate counterpart for different (k, σn) configurations. The highest negative error of the PSNR (SSIM) is about 2.2% (3.3%), which confirms the ability of the novel approximate method to denoise images without compromising the quality. Moreover, the positive relative errors experienced in several cases suggest that even more aggressive approximations could be investigated to further reduce the BF computational complexity.

Fig. 4 Relative errors of PSNR and SSIM introduced by the proposed approximate approach over the accurate BF

Furthermore, the above results provide interesting indications on the behavior to be expected from different filter sizes according to the level of noise in the corrupted image. As shown in Table 1, the accurate and approximate filters with k = 5 perform better than the other configurations for low σn values, whereas they become less effective for σn higher than 20. As an example, in the case of the approximate BF at σn = 60, moving from the k = 3 to the k = 11 configuration improves the PSNR and SSIM by up to 40.9% and 3 times, respectively.

Finally, for comparison with other approximate techniques, quality tests have also been performed on the reduced dataset used in [19, 25]. It includes the following test images: Barbara, Boat, Bridge, Couple, Goldhill, Lake, Lenna, Lighthouse, Peppers and Plane. Results plotted in Fig. 5 show that the proposed strategy achieves competitive performance compared to the software implementation [25], while experiencing a PSNR (SSIM) quality degradation of 4% (5%), on average, with respect to the hardware counterpart [18]. As discussed in more detail in Sect. 5.2, this accuracy penalty is the price to pay for hardware implementations with improved area, energy and speed characteristics.

Fig. 5 Relative errors of PSNR and SSIM of the proposed approach with respect to [25] and [18] on the reduced dataset

4.2 Edge detection

Directly applying an edge detector to an image with textures results in many false edges. In such a case, BF is typically used to suppress textures, along with noise and fine details [10]. In the following analysis, benchmark images from the Barcelona Images for Perceptual Edge Detection Dataset (BIPED) [37] were first converted from RGB to grayscale. The resulting frames were then processed through four approaches: a simple Canny edge detector (CED), accurate BF + CED, approximate BF [18] + CED and proposed BF + CED. The CED has been implemented through the MATLAB routine edge. SSIM results, obtained by comparing the output of each approach with the edge map ground truth available in the dataset, were averaged over the entire dataset. In the case of the simple CED, the SSIM stands at 0.379. For the remaining composite approaches, different configurations have been explored, with σr ranging from 10 to 60 and σs = 3, 5, 7, 11. Results are shown in Fig. 6. In general, BF appears to improve the edge detection task, regardless of how it is implemented. It can also be observed that, as expected, the larger the BF spatial component σs, the higher the SSIM. As an important remark, Fig. 6 highlights that the proposed approximate BF + CED behaves very similarly to the accurate counterpart, while exhibiting improved SSIM over the [18]-based implementation: for σs = 11 and σr = 60, the quality reduction with respect to the accurate counterpart is 3.9% for the proposed implementation and 9.4% for [18].
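The composite BF + edge-detection pipeline can be sketched as follows. This is our illustrative stand-in: the final gradient-threshold step only mimics the role of the Canny detector, which additionally performs smoothing, non-maximum suppression and hysteresis thresholding.

```python
import numpy as np

def bf_then_edges(img, k=3, sigma_s=0.5, sigma_r=30.0, thresh=30.0):
    """Texture-suppressing BF followed by a crude gradient-magnitude
    edge map (a stand-in for the Canny detector used in the paper)."""
    r = k // 2
    ii, jj = np.mgrid[-r:r + 1, -r:r + 1]
    ws = np.exp(-(ii**2 + jj**2) / (2 * sigma_s**2))
    pad = np.pad(img.astype(np.float64), r, mode="edge")
    sm = np.empty_like(img, dtype=np.float64)
    H, W = img.shape
    for y in range(H):
        for x in range(W):
            win = pad[y:y + k, x:x + k]
            w = ws * np.exp(-(win - pad[y + r, x + r])**2 / (2 * sigma_r**2))
            sm[y, x] = (win * w).sum() / w.sum()
    gy, gx = np.gradient(sm)              # image gradients after smoothing
    return np.hypot(gx, gy) > thresh      # boolean edge map
```

Because BF preserves strong intensity steps while flattening textures, true object boundaries survive the smoothing and cross the gradient threshold, while texture-induced false edges do not.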

Fig. 6 SSIM results for (a) accurate BF + CED; (b) [18] + CED; (c) proposed BF + CED

4.3 HDR tone mapping

Acquisition of HDR images is a hot topic in several applications coping with outdoor scenes. In a typical scenario, where including special HDR sensors is not affordable, standard cameras are used to capture a sequence of images with different exposure times, which are then merged into an HDR frame for subsequent elaborations. Among the latter, considerable importance is given to the tone mapping step, which aims at compressing the HDR information into an LDR space. In such a case, BF is used to decompose the luminance component of the HDR image into base and detail layers [12, 38]. These layers are then merged again after proper compression and scaling of the base layer, performed using σs and σr factors computed on the basis of statistical information.

For this case study, the dataset provided by [39], which includes 15 HDR images, has been adopted. The scope is to evaluate the performance achievable by using the proposed approximate BF in place of the accurate counterpart within the Durand tone mapping operator [38]. To this aim, we used the tone-mapped image quality index (TMQI) proposed in [39] as the objective quality assessment metric. The latter considers both the structural fidelity and the statistical naturalness of the resulting LDR image, producing values in the range from 0 to 1, with TMQI = 1 meaning the highest possible quality. Table 2, which compares the TMQI for each image, highlights that an average 2.37% loss is experienced by our proposal with respect to the original Durand tone mapping process [38].
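The Durand-style decomposition described above can be sketched as follows. This is our simplified reading of [38]: BF splits the log10-luminance into base and detail layers, and only the base is compressed; the function name, parameter values and the cap on the compression factor are our assumptions.

```python
import numpy as np

def durand_tm(lum, k=5, sigma_s=2.0, sigma_r=0.4, target_range=4.0):
    """Durand-style tone mapping sketch: BF on log10-luminance yields
    the base layer; detail = log - base; only the base is compressed."""
    logl = np.log10(np.maximum(lum, 1e-6))
    r = k // 2
    ii, jj = np.mgrid[-r:r + 1, -r:r + 1]
    ws = np.exp(-(ii**2 + jj**2) / (2 * sigma_s**2))
    pad = np.pad(logl, r, mode="edge")
    base = np.empty_like(logl)
    H, W = logl.shape
    for y in range(H):
        for x in range(W):
            win = pad[y:y + k, x:x + k]
            w = ws * np.exp(-(win - pad[y + r, x + r])**2 / (2 * sigma_r**2))
            base[y, x] = (win * w).sum() / w.sum()
    detail = logl - base
    # Compress only the base layer; never expand (scale capped at 1).
    scale = min(1.0, target_range / max(base.max() - base.min(), 1e-6))
    out = 10.0 ** (base * scale + detail - base.max() * scale)
    return out
```

Because the detail layer is added back uncompressed, local contrast survives while the overall dynamic range is squeezed into the LDR budget.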

Table 2 TMQI results for the HDR tone mapping task

4.4 Output image samples

Figure 7 summarizes some output samples obtained from the analyzed case studies. With reference to the denoising results illustrated in Fig. 7a, the insets report zoomed details and highlight the high similarity between the accurate and approximate BF outputs. Referring to the edge detection application, Fig. 7b shows that directly applying the CED to the input image produces many texture components along the street, which lead to as many false edges. When BF is used, such components are removed, generating edge maps closer to the ground truth regardless of the level of approximation introduced. Finally, for the more complex HDR tone mapping task, the output images shown in Fig. 7c reveal that the proposed approximate BF slightly influences the way colors are spread across the image and its naturalness. However, overall it does not compromise the structural fidelity of the original scene and still allows details to be clearly distinguished.

Fig. 7 Some output images produced in this study for: (a) denoising; (b) edge detection; (c) HDR tone mapping

5 Hardware implementation

5.1 Architecture design

The approximation strategy presented above has been exploited in the design of a novel hardware architecture, described at the register transfer level (RTL) of abstraction using the very high speed integrated circuit hardware description language (VHDL). The proposed strategy avoids the usage of pre-loaded LUTs, saves an appreciable number of multipliers by transforming multiplications into simple shift operations and adapts its filtering capability at runtime to different values of σr. Figure 8 illustrates the top-level schematic of the implemented BF architecture. Input pixels are streamed in raster order to the Buffer Line, consisting of k − 1 FIFOs and k × k registers that, after the initial latency, form a new Ω window centered at I(x, y) at each clock cycle. The k × k pixels arranged as described above are transferred to the Processing Element (PE) Array, which includes (k × k) PE instances: in order to compute the positive argument ag(i, j), the generic (i, j)-th PE receives the current Cr, the central pixel I(x, y) within Ω, and the corresponding Cs(i, j) and I(i, j) values from the Reg Array and Buffer Line blocks, respectively. This organization of the PE Array is illustrated in Fig. 9a. It is worth noting that the position (i, j) corresponding to the center of Ω, i.e., PE(x, y), always outputs ag = 1. For the remaining positions, the PEs implement (8a) by operating on pixels represented as 8-bit unsigned integers and on Cr (Cs) values represented as unsigned 16-bit fixed-point numbers in the Q1.15 (Q3.13) format. Thereby, the result of each PE is represented as a Q5.15 number. Finally, in order to convert the fixed-point results into the integer format, the Slice block visible in Fig. 9a extracts the 5 bits of interest and outputs the ag terms.
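The PE fixed-point arithmetic can be modeled in software to check bit-level behavior. The bit widths below follow the text (8-bit pixels, Cr in Q1.15, Cs in Q3.13, result in Q5.15 with the integer part sliced off), while the rounding of the quantized coefficients and the saturation of the slice output at 31 are our assumptions.

```python
def pe_ag(center, pixel, cs, cr):
    """Fixed-point model of one PE: ag = Cs(i,j) + |I(x,y)-I(i,j)| * Cr."""
    cr_q = int(round(cr * 2**15))       # Cr quantized to Q1.15
    cs_q = int(round(cs * 2**13))       # Cs quantized to Q3.13
    d = abs(int(center) - int(pixel))   # 8-bit pixel absolute difference
    ag_q = (cs_q << 2) + d * cr_q       # align Q3.13 to Q5.15 and accumulate
    return min(ag_q >> 15, 31)          # Slice: keep the 5-bit integer part
```

The left shift by 2 aligns the 13 fractional bits of Cs to the 15 fractional bits of the Q5.15 accumulator, so the final right shift by 15 drops all fractional bits, which is exactly the truncation introduced in Sect. 3.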

Fig. 8 Top-level design of the proposed BF architecture. The modules PE Array, Decoder and Barrel Shifter include (k × k) − 1 instances that operate on the k × k pixels of the current window in parallel

Fig. 9

(a) PE Array organization (PE outputs are unsigned 20-bit fixed-point numbers with 5 bits for the integer part, i.e., Q5.15). (b) PE internal design (the difference operation is implemented within the DSP slice through the pre-adder stage, thus avoiding further external resources)

The internal circuit of the PE is illustrated in Fig. 9b. It was designed to fully exploit the resources available inside the DSP slice, avoiding, as much as possible, external logic that could lengthen the computation delay. Indeed, the DSP pre-adder calculates the difference between I(x, y) and I(i, j), while an external carry chain concurrently extracts the sign information, which is then propagated toward the multiplier stage through a multiplexer receiving the 16-bit Cr coefficient and its 2's complement as inputs. It is worth pointing out that the Cr coefficient is already known at the beginning of the frame elaboration. Finally, the 16-bit coefficient Cs(i, j) is added to generate the corresponding ag term. This approach bounds the impact of the absolute value calculation on the whole architecture and differs substantially from straightforward traditional approaches [17].
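The sign-handling trick can be illustrated in a few lines: rather than computing the absolute value of the difference and then multiplying, the signed pre-adder output is multiplied by Cr or its 2's complement, selected by the sign bit. The function and argument names below are illustrative, not taken from the design.

```python
def pe_range_product(center, pixel, cr_pos, cr_neg):
    """Sketch of the PE's sign handling: the signed difference from the
    DSP pre-adder is multiplied by Cr or its 2's complement (cr_neg must
    equal -cr_pos), selected by the sign bit extracted in parallel by a
    separate carry chain. The result equals cr_pos * |center - pixel|."""
    diff = center - pixel                    # DSP pre-adder output (signed)
    coeff = cr_neg if diff < 0 else cr_pos   # multiplexer driven by the sign bit
    return diff * coeff                      # always non-negative
```

This keeps the absolute-value computation off the critical path, since both multiplexer inputs are available before the frame elaboration starts.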

As shown in Fig. 8, the set of k × k outputs produced by the PE Array is sent to two processing branches. The first implements the numerator term in (7) and consists of (k × k) − 1 Barrel Shifters, whose internal design is depicted in Fig. 10a. In particular, the Barrel Shifter associated with the generic (i, j) position receives the 5-bit ag term, as well as the pixel of the original window Ω, properly delayed to synchronize its arrival with the output produced by the PE Array. Through a conventional multiplexer-based structure, such a pixel is right-shifted by a number of positions that depends on the current ag value. It is important to note that, given the RTL description in Fig. 10a, the synthesis tool performs optimizations to pack more logic stages within a single LUT, thus minimizing the delay of this component. Finally, the central pixel of the window is right-shifted by one position, to account for the division by 2. The k × k outputs produced in parallel by the Barrel Shifters are then accumulated through the Adder Tree block depicted in Fig. 10b, thus computing the dvd result.
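A multiplexer-based barrel shifter of this kind reduces to one conditional shift stage per bit of ag, which is what lets the synthesis tool pack stages into LUTs. A minimal behavioral model, assuming five stages for the 5-bit ag term:

```python
def barrel_shift_right(pixel, ag, stages=5):
    """Behavioral model of a multiplexer-based right barrel shifter.
    Each stage is a 2-to-1 mux controlled by one bit of ag: stage s
    either passes the value through or shifts it right by 2**s, so the
    cascade realizes a shift by the full ag amount."""
    value = pixel
    for s in range(stages):
        if (ag >> s) & 1:       # mux select = bit s of ag
            value >>= (1 << s)  # conditional shift by 1, 2, 4, 8, 16
    return value
```

Because right shifts compose additively, the cascade is exactly equivalent to `pixel >> ag` for any ag representable in the given number of stages.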

Fig. 10

(a) Barrel shifter designed to compute the numerator term in (5). (b) Adder tree circuit

The second branch shown in Fig. 8 is responsible for calculating the denominator term of (5). In this case, the generic ag(i, j) value feeds a decoder circuit, organized as depicted in Fig. 11, which calculates the power-of-two term using ag(i, j) as argument. A second Adder Tree accumulates such power-of-two values to compute \({\sum }_{(i,j)\in \Omega }Wg(i,j)\). It is worth noting that, thanks to this characteristic, most bits of the adder tree operands are 0, so the accumulation exhibits reduced dynamic switching activity. Finally, the Divider block visible in Fig. 8, implemented through a restoring-based architecture, processes the dvd and dvr data and outputs the filtered pixel.
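For reference, restoring division of the kind used in the Divider block proceeds one quotient bit per iteration, undoing (restoring) the trial subtraction whenever it underflows. A minimal sketch, with the operand width as an assumption:

```python
def restoring_divide(dvd, dvr, bits=20):
    """Restoring-division sketch: produces one quotient bit per iteration.
    The partial remainder is shifted left, the next dividend bit is brought
    down, and a trial subtraction of the divisor decides the quotient bit;
    on underflow the remainder is left unchanged (the 'restore' step)."""
    rem, quo = 0, 0
    for i in range(bits - 1, -1, -1):
        rem = (rem << 1) | ((dvd >> i) & 1)  # bring down next dividend bit
        trial = rem - dvr                    # trial subtraction
        if trial >= 0:
            rem = trial                      # keep the subtraction
            quo |= 1 << i                    # quotient bit = 1
        # else: restore (rem unchanged), quotient bit = 0
    return quo, rem
```

In hardware this maps naturally to a subtractor plus a multiplexer per iteration, pipelined or iterated over clock cycles depending on the throughput target.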

Fig. 11

Decoder circuit designed to compute the denominator term in (5)

As discussed in the following sub-section, the proposed BF approach results in reduced computational complexity with respect to the state-of-the-art counterparts [15,16,17,18]. The main difference arises from the replacement of the exp-based range/spatial coefficients with power-of-two terms. This strategy serves two purposes. First, since the weights are computed online by the PE Array, no set of pre-computed values has to be stored, which makes the proposed architecture genuinely flexible to different noise conditions. Second, all the multiplications involved in (5) have been replaced with simpler shift operations.

5.2 Results

In accordance with prior works [15,16,17,18], the proposed architecture has been tailored to elaborate digital images at different M × N resolutions (i.e., 256 × 256, 512 × 512 and 3268 × 2448) with a filter window of size k = 5. The competitor designs [15, 16, 18] were re-implemented under the same implementation setup, using the Xilinx Vivado 2021.2 Software Development Tool. The layout resulting from the implementation of the proposed 3268 × 2448 architecture on the Xilinx Zynq XC7Z020 FPGA device is illustrated in Fig. 12. It can be noted that a compact layout is achieved without imposing any synthesis/implementation directives, except for the clock constraint. Therefore, even when elaborating high-resolution images, the proposed implementation saves a significant area of the FPGA chip, also enabling the acceleration of other processing stages.

Fig. 12

Layout obtained by implementing the new BF architecture tailored to process 3268 × 2448 sized input images

Table 3 collects results in terms of: number of occupied LUTs, flip-flops (FFs), Digital Signal Processing (DSP) slices and Block RAMs (BRAMs); maximum running frequency and throughput in Megapixels per second (Mps); energy dissipation, evaluated as the average consumption per pixel extracted from post-layout simulations; and energy efficiency (Mps/mW). For all the designs except [17], the power consumption (mW) was computed using the Xilinx Vivado Power Report tool, whose power estimation was based on the Switching Activity Interchange Format (SAIF) file generated by post-implementation simulations over M × N 8-bit random values.
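The two derived metrics follow directly from the measured power and throughput; note that mW divided by Mps conveniently yields nJ per pixel. A small helper, with the input values in the usage below being placeholders rather than figures from Table 3:

```python
def efficiency_metrics(power_mw, throughput_mps):
    """Derive the per-pixel energy and the energy efficiency from the
    measured power (mW) and throughput (Megapixels/s). Units check:
    mW / (Mpixel/s) = (1e-3 J/s) / (1e6 pixel/s) = 1e-9 J/pixel = nJ/pixel."""
    energy_per_pixel_nj = power_mw / throughput_mps
    efficiency_mps_per_mw = throughput_mps / power_mw
    return energy_per_pixel_nj, efficiency_mps_per_mw

# Placeholder usage: a design dissipating 100 mW at 50 Mps
energy_nj, eff = efficiency_metrics(100.0, 50.0)  # -> 2.0 nJ/pixel, 0.5 Mps/mW
```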

Table 3 Hardware characterization and comparison

It is worth noting that, among the compared designs, only [17, 18] and the one proposed here support σr values varying at runtime. The new approximate BF architecture operating on 256 × 256 input images exhibits a throughput and an energy efficiency ~ 37 and ~ 19 times higher than [15], respectively. While the former advantage is mainly due to the parallel action of the PE Array, in contrast to the recursive approach used in [15], the almost halved energy requirement per pixel is achieved as a consequence of the approximation presented here.

The new BF architecture also exhibits significant improvements over the more competitive designs presented in [16, 18]. As an example, the 512 × 512 implementation outperforms [18]: it is 19.6% faster, consumes 14.7% less energy, and saves 25% of DSP slices. Among the high-resolution implementations, the circuit presented in [17] exhibits the highest speed; indeed, at parity of adaptability to σr, i.e., its SAHTBF circuit, it achieves a throughput ~ 33% higher than the proposed one, but dissipates 2.8× more energy. This result confirms the improved capability of our proposal in supporting a variable σr with lower energy/area overhead.

Finally, to allow a broad comparison with the hardware implementations [15,16,17,18], which reported PSNR and structural similarity (SSIM) results obtained by applying their approximate BF to the Lena image benchmark corrupted by Gaussian noise with standard deviation σn ranging from 5 to 30, specific quality tests were performed. The results plotted in Fig. 13 demonstrate that the approximate strategy proposed here does not introduce significant quality penalties with respect to [16, 18] and outperforms [15, 17] in most cases.

Fig. 13

Comparison of PSNR and SSIM metrics achieved by prior FPGA-based designs on the Lena image benchmark (σs = 1)

6 Conclusions

This work presented a new approximation strategy for the hardware-friendly implementation of BF architectures. In contrast to previous works, the proposed technique allows the filtering action to be adapted on the fly, avoiding any architectural modification or table update. Moreover, it considerably reduces the computational complexity, removing exponentiation calculations and replacing some products with simple shift operations. Implementation results obtained at parity of image resolution and technology demonstrate that a BF circuit designed as proposed here systematically reduces logic resources and improves energy efficiency over state-of-the-art competitors. Thanks to its runtime adaptability, the proposed approximate BF can be easily integrated within different imaging applications that require tuning the filtering parameters according to the current operating scenario. As a proof of concept, we demonstrated its effectiveness over three computer vision tasks: image denoising, edge detection and HDR tone mapping. The obtained results are very promising and suggest the suitability of the proposed implementation for several real-time and low-energy applications.