Introduction

Motivation

In image-centered academic and industrial applications, blur is an undesired but ubiquitous phenomenon caused mainly by defocus and motion. Image blur inherently degrades perceptual quality and damages usable information. Consequently, detecting the blurry region in a given image, i.e., blur detection, is increasingly valued and studied in many computer vision tasks such as deblurring [1, 2], quality assessment [3, 4], saliency detection [5] and depth estimation [6, 7].

Along with the rapid development of image processing and artificial intelligence techniques over the past decade, many efforts have been devoted to blur detection. Traditionally, blur detection methods tend to characterize image blur by handcrafting specific features in domains such as gradient [8], frequency [9], intensity [10] and sequency [11]. These traditional methods are effective in simple scenes but far less effective in complex ones, because handcrafted features based on low-level knowledge can hardly capture the high-level semantics or context required for understanding complex scenes. Recently, deep learning techniques typified by the convolutional neural network (ConvNet) have become one of the most attractive methodologies and achieved remarkable results in many computer vision fields [12], including blur detection. To date, quite a few ConvNet-based blur detection methods [13,14,15,16,17,18,19,20,21,22,23,24] have been successfully developed. They generally outperform traditional methods owing to their prominent capability of capturing high-level semantic features. Despite the great improvement, however, these deep learning-based methods are still challenged by the following two problems.

Overweight model

Existing ConvNet-based blur detection methods concentrate on well-designed and sufficiently deep architectures, leading to increasingly complicated models. Although complicated ConvNets usually provide higher accuracy than simple ones, the number of model parameters easily reaches dozens of millions [25] or even hundreds of millions [26]. However, in many real-world applications such as intelligent robots and self-driving vehicles, vision tasks have to be implemented on computation-limited devices [27]. Such an overweight model with many parameters and high floating-point operations (FLOPs) requires huge computational resources for training and deployment, which is extremely difficult to satisfy on embedded, mobile and edge devices. Therefore, in view of this practical requirement, a computationally economical yet high-performance model is what real-world blur detection tasks really need.

Rough boundary

Current state-of-the-art blur detection methods depend on massive convolutional layers to extract high-level semantic features. With such learning, the model focuses more and more on regions with high response values [28] and thereby fails to capture residual details, especially around the boundary. Consequently, the detection result suffers from rough and inaccurate boundaries between blurry and sharp regions. This degrades the performance of blur detection models and harms subsequent image processing tasks such as deblurring, especially in complicated scenes. As can be seen in Fig. 1, several recent top-class methods [13, 16, 17, 18, 21, 29, 30] show unsatisfactory blur detection results with rough boundaries when facing a complex scene containing both a low-contrast in-focus foreground and deceptive background clutter. Therefore, how to guarantee fine boundaries during blurry region segmentation is another central concern of the blur detection model.

Fig. 1

A typical challenging case containing low-contrast in-focus foreground (marked by red rectangle) and deceptive background clutters (marked by blue rectangle). In this case, nearly all recent deep learning-based methods such as DHDE [14], BTBFRR [17], DMENet [44], BTBCRL [18], CENet [29], EFENet [19] and DeFusionNet [22] have poor performance

Contribution

In this paper, we seek to tackle the above concerns in quest of a favorable accuracy–complexity trade-off in the blur detection task.

Concise architecture

For the purpose of efficiency, we follow an explicit philosophy with three guidelines: (1) applying depth-wise convolutions to lighten the complexity; (2) using large convolutional kernels to enlarge the receptive fields; (3) designing bottleneck-style structures to rebuild the feature representation capability. Following these guidelines, two novel inverted bottleneck blocks based on large depth-wise convolutions are built to serve as the core of the network.

Morphological priors

The key to refining the boundary between blurry and sharp regions, we believe, lies in exploiting and fusing more morphological priors during learning. From this perspective, both a region-concerned prior for complementation and an edge-concerned prior for guidance are embedded into the model in hierarchical and attentional manners.

Our contributions are summarized as follows.

  • We propose a Hierarchical Edge-guided Region-complemented Network (HER-Net) to detect possible blurry regions from images. Particularly, HER-Net has a very concise but effective architecture attributed to its large depth-wise convolutions, novel olive-shaped and pear-shaped inverted bottlenecks, parallel decoders and inbuilt morphological priors.

  • Region-concerned and edge-concerned morphological priors are hierarchically exploited to refine the boundary. Concretely, we propose a reverse-region spatial attention (RrSA) to mine the complementary affinities between blurry and sharp regions so as to enrich the residual details around the boundary. In addition, we design an edge spatial attention (ESA) to guide the edge-concerned cues to reinforce the features related to the boundary.

  • Our method shows a superior accuracy–complexity trade-off in the blur detection task. Experimental results on benchmark datasets demonstrate that HER-Net achieves better detection accuracy with finer boundaries as well as a more than 72% reduction in parameters compared to competitive methods. Such superiority in accuracy, efficiency and lightweight design can inspire further studies for real-world blur detection applications.

Related works

Handcrafted feature-based methods

Blur detection methods using traditional techniques have been well studied. Basically, traditional methods prefer to handcraft specific features under explicit principles.

The most popular features used in blur detection are handcrafted from the perspective of gradient and frequency, following the principle that the sharp region has stronger gradients or frequencies than the blurry one in an image. Along this direction, researchers have proposed many handcrafted feature-based methods. Su et al. [31] developed a technique to detect the blurry region by examining pixel-wise singular values and gradient distribution of alpha channel. Shi et al. [32] engineered a set of blur features including image gradient, Fourier transformation and data-driven local filters to discriminate in-focus and out-of-focus regions. Tang et al. [33] mined the log-averaged spectrum residual to measure the image blur, which was refined based on gradient and color similarities between neighbor pixels. Xu et al. [34] estimated the spatially varying defocus blur by exploiting patch-level maximum ranks in gradient domain. Golestaneh et al. [35] utilized high-frequency DCT coefficients of gradient magnitudes at multiple scales to compute the blur map. Liu et al. [9] proposed to combine region-based frequency information and edge-based linear information by regression tree fields to estimate the defocus blur.

Besides the widely used gradient-based and frequency-based features, some other principles are also effective. For instance, Yi et al. [10] designed a blur detection framework based on local binary patterns. Wang et al. [36] utilized Walsh–Hadamard coefficients in sequency domain to detect the image blur and segment the blurry region; on this basis, Liang et al. [11] further introduced a noise-robust blur detection framework via sequency spectrum truncation.

For traditional blur detection approaches, handcrafting a number of features in empirical and manual manners is a very common practice. However, these handcrafted features can hardly capture high-level semantics and contextual information. Therefore, traditional methods could work well in simple cases but often underperform when dealing with complex scenes.

Deep learning-based methods

The past decade has witnessed the rapid growth of deep learning in the realm of computer vision. Motivated by such success, researchers have begun to apply deep learning techniques like ConvNets to blur detection.

Initially, ConvNet was simply used as a feature extractor or patch-level classifier to coordinate with traditional techniques. For example, Park et al. [13] made the first attempt at unifying handcrafted features and deep features to estimate the defocus blur via a series of elaborate filters. Huang et al. [14] predicted the patch-level blur likelihood map through a six-layer ConvNet and fused multi-scale results into a refined one. Zeng et al. [15] built a local blur metric based on deep features assisted by their principal components.

Recently, researchers have gone further in deep feature exploitation and fusion, which continuously improves the performance of blur detection. Zhao et al. [16, 17] constructed a bottom–top–bottom fully convolutional network to integrate multi-scale features and further optimized the architecture via a multi-stream mode and cascaded residual learning. In their follow-up studies, two diversity-boosting deep ensemble networks [18] and an image-scale-symmetric cooperative network [19] were successively proposed for defocus blur detection and achieved state-of-the-art performance. Tang et al. [20, 21] built and updated a defocus blur detection framework, which recurrently fused and refined multi-scale features in a cross-layer manner. Zhai et al. [26] boosted blur detection performance by hierarchically aggregating low-level detail features, high-level semantics and global context information. Karaali et al. [23] estimated defocus blur by weight-sharing ConvNets that distinguish depth edges from pattern edges.

The attention mechanism, an effective tool for reweighting features, is also utilized by recent methods to facilitate learning. Tang et al. [22] implanted channel attention into a bi-directional residual refining network to select more discriminative features for defocus blur. Inspired by the nested network, Guo et al. [25] embedded small U-shaped networks and channel attention into an end-to-end ConvNet to detect the blurry region. Jiang et al. [24] introduced channel-wise and spatial-wise attention to obtain more discriminative features and suppressed unreliable predictions by a generative adversarial training strategy.

Overall, it can be seen that deep learning-based blur detection methods do outperform traditional methods by a large margin. This superiority is attributed to the exploitation of different-level information, such as the bottom–top–bottom manner [17] and the alternate cross-layer manner [21]. Nonetheless, two major problems, i.e., overweight models (even up to hundreds of megabytes) and rough boundaries (especially in complicated scenes), still challenge recent deep learning-based approaches.

As to the former problem, since an overweight model is hardly appropriate in many real-world applications, a concise yet effective model is required. In this paper, we follow a philosophy of designing bottleneck-style structures around depth-wise convolutions and large kernels in order to achieve lightweight yet expressive feature representation. As to the latter problem, since the rough boundary of the detected blurry region can severely impede the success of subsequent tasks like deblurring, morphological priors are taken into account to refine the boundary. Li et al. [37] designed two symmetric branches to jointly estimate the blur probabilities of both in-focus and out-of-focus pixels. Lin et al. [38] introduced the reverse attention [28] to capture in-focus and out-of-focus features. Inspired by them, we propose the RrSA mechanism to exploit complementary morphological affinities between blurry and sharp regions, aiming to pay coequal attention to blurry and sharp pixels around the boundary. Besides, we also propose the ESA mechanism to further emphasize edge information for a finer boundary.

Proposed algorithm

Pipeline

The overall architecture of our HER-Net is illustrated in Fig. 2, consisting of a backbone encoder, two parallel decoders (i.e., a region-aware decoder and an edge-aware decoder), the reverse-region complementation connection and the edge guidance connection. Each of these parts is divided into four stages from low level (large scale, small channel dimension) to high level (small scale, large channel dimension). The resolution of feature maps remains the same within each stage. Detailed configurations are given in Table 1.

Fig. 2

Overall architecture of HER-Net (best viewed in color). Stages 1 to 4 denote levels from low (large scale, small channel dimension) to high (small scale, large channel dimension). Detailed structures of OLIVE, PEAR, RrSA, ESA, Stem, Head, and Dense Transition are illustrated in Figs. 3, 4, and 5

Table 1 Configurations of HER-Net

First, the backbone encoder progressively extracts multi-level features from the original image and its input pyramid. Next, these encoded feature maps are passed through a dense transition block and then fed into the two parallel decoders. Among them, the region-aware decoder is the main decoder designed for blurry region segmentation, which is the major task of this paper. The edge-aware decoder is the auxiliary decoder used to emphasize the boundary. The reverse-region complementation connection, inlaid with a reverse-region spatial attention (RrSA) mechanism, is analogous to a skip connection between the encoder and the main decoder. In addition, between the auxiliary decoder and the main decoder, we build another skip connection inlaid with an edge spatial attention (ESA) mechanism. Lastly, the main loss \(\mathcal{L}_{\text{main}}\) (between the output of the main decoder and the ground truth) and the auxiliary loss \(\mathcal{L}_{\text{aux}}\) (between the output of the auxiliary decoder and the edge of the ground truth) are hierarchically computed to supervise the entire network across all levels.

Backbone encoder

The backbone encoder, constructed as a lightweight but effective feature extractor, contains three designs: multi-input pyramid, OLIVE block, and dense transition block.

  1. (1)

    Multi-input pyramid It is well known that low-level spatial information is very helpful for searching and locating a region in an image. To preserve sufficient low-level information while the encoder extracts deeper and deeper semantic cues, we design a multi-input pyramid structure. First, three consecutive max-pooling operations are applied to the original image with resolution I to generate an input pyramid with decreasing scales {I, 1/2I, 1/4I, 1/8I}. Second, each input is passed through the stem cell and then hierarchically concatenated into the corresponding encoder stage along the channel dimension. The stem cell can be formulated as

    $$ {\varvec{S}} = {\text{Conv}}^{1 \times 1} \left( {{\text{Conv}}^{d9 \times 9} \left( {\text{input}} \right)} \right), $$
    (1)

    where d9 × 9 denotes a depth-wise convolution with a kernel size of 9 × 9 and 1 × 1 denotes a point-wise convolution. Specifically, the d9 × 9 kernel in the stem cell has a stride of 2, while other kernels in HER-Net default to a stride of 1. The stem cell is intended not only to down-sample the input image to reduce its inherent redundancy [39], but also to expand the channels from 3 (RGB) to an appropriate dimension. It should also be noted that each convolutional layer is sequentially followed by batch normalization (BN) and a rectified linear unit (ReLU) unless otherwise indicated. The proposed multi-input pyramid enables the encoder to keep enough low-level information taken directly from the original image.

  2. (2)

    OLIVE block To achieve more expressive feature extraction while keeping low complexity, we propose a novel olive-shaped inverted bottleneck structure (named OLIVE) based on the classic inverted bottleneck from MobileNetV2 [40] by employing larger convolutional kernels and transferring large kernels to low-dimension space. As shown in Fig. 3a, the OLIVE block includes two d9 × 9, two 1 × 1 and one identity shortcut, which can be formulated as

    $$ {\varvec{O}} = {\text{Conv}}^{d9 \times 9} \left( {{\text{Conv}}^{1 \times 1} \left( {{\text{Conv}}^{1 \times 1} \left( {{\text{Conv}}^{d9 \times 9} \left( {\varvec{F}} \right)} \right)} \right)} \right) \oplus {\varvec{F}}, $$
    (2)

    where ⊕ denotes the addition operation, F is the input feature map from the previous layer, and O is the output feature map of the OLIVE block.

    The design motivation of OLIVE is depicted in Fig. 4 and can be detailed as follows: (a) we base the OLIVE block on the classic inverted bottleneck centered on depth-wise convolutions for the purpose of efficiency; (b) considering that depth-wise convolution reduces complexity but can also weaken accuracy, we replace the original d3 × 3 kernel with a d9 × 9 kernel for a larger receptive field so as to lift the accuracy; (c) in order to save parameters, we transfer the d9 × 9 kernel from the high-dimension space to the low-dimension side; (d) we add one more d9 × 9 kernel at the other low-dimension side to boost the representation capability. Meanwhile, most ReLUs are removed and only one ReLU, located in the high-dimensional space between the two 1 × 1 kernels, is retained to reduce information loss, since nonlinear activation in low-dimensional space easily drives feature values to zero and hence loses much helpful information [40]. From the perspective of parameters, replacing d3 × 3 with d9 × 9 (Fig. 4a → 4b) increases the parameters of the OLIVE block by 13.8%. Subsequently, moving d9 × 9 to the low-dimension side (Fig. 4b → 4c) significantly decreases the parameters by 10.4%, and then adding one more d9 × 9 (Fig. 4c → 4d) induces a small rise of 5.2% in parameters.

  3. (3)

    Dense transition block. To further improve feature reuse for multi-level information fusion without increasing the model parameters, inspired by [41], we design a densely connected context fusion structure as a transition block between the encoder and the parallel decoders, named the dense transition block. As illustrated in Fig. 3b, multi-level features are aggregated by concatenating all previous feature maps as the input of the present layer. In this block, the kernel sizes of depth-wise convolutions are 3, 5, 7, and 9, followed by 1 × 1 convolutions with reduction factors of 1/2, 1/3, 1/4, and 1/5, respectively, to control the channel dimension. Therefore, the dense transition block can be defined as

    $$ {\varvec{D}}_{1} = {\text{Conv}}^{1 \times 1} \left( {\left[ {{\varvec{F}};\text{Conv}^{d3 \times 3} \left( {\varvec{F}} \right)} \right]} \right), $$
    (3)
    $$ \varvec{D}_{i} = \text{Conv}^{1 \times 1} \left( {\left[ {\varvec{F}; \varvec{D}_{1} ; \ldots ;\text{Conv}^{d(2i + 1) \times (2i + 1)} \left( {\varvec{D}_{i - 1} } \right)} \right]} \right), $$
    (4)

    where [;] denotes the concatenation operation and Di is the feature map of the ith layer in the dense transition block (i = 1, 2, 3, 4).

    Overall, a concise yet effective backbone encoder is built by stacking four OLIVE blocks corresponding to the four stages, inserting separate down-sampling layers (d9 × 9 with stride 2) between stages, and equipping it with the multi-input pyramid and the dense transition block; a minimal code sketch of these components is given after this list.
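To make the encoder structure concrete, the following PyTorch-style sketch shows minimal versions of the stem cell (Eq. 1), the OLIVE block (Eq. 2) and the dense transition block (Eqs. 3–4). The bottleneck expansion factor and the exact BN/ReLU placement beyond the rules stated above are assumptions; the actual channel configuration follows Table 1, which is not reproduced here.

```python
import torch
import torch.nn as nn


def dwconv(c, k, stride=1, act=True):
    # depth-wise convolution followed by BN (and optionally ReLU)
    layers = [nn.Conv2d(c, c, k, stride=stride, padding=k // 2, groups=c, bias=False),
              nn.BatchNorm2d(c)]
    if act:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)


def pwconv(cin, cout, act=True):
    # point-wise (1 x 1) convolution followed by BN (and optionally ReLU)
    layers = [nn.Conv2d(cin, cout, 1, bias=False), nn.BatchNorm2d(cout)]
    if act:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)


class Stem(nn.Module):
    # Eq. (1): S = Conv^{1x1}(Conv^{d9x9}(input)), the d9x9 using stride 2
    def __init__(self, cout):
        super().__init__()
        self.body = nn.Sequential(dwconv(3, 9, stride=2), pwconv(3, cout))

    def forward(self, x):
        return self.body(x)


class OLIVE(nn.Module):
    # Eq. (2): O = Conv^{d9x9}(Conv^{1x1}(Conv^{1x1}(Conv^{d9x9}(F)))) + F.
    # Large depth-wise kernels sit in the low-dimension space; only one ReLU,
    # placed between the two 1x1 layers (high-dimension space), is retained.
    def __init__(self, c, expansion=4):        # expansion factor is an assumption
        super().__init__()
        hidden = c * expansion
        self.body = nn.Sequential(
            dwconv(c, 9, act=False),
            pwconv(c, hidden, act=True),       # the single retained ReLU
            pwconv(hidden, c, act=False),
            dwconv(c, 9, act=False),
        )

    def forward(self, x):
        return self.body(x) + x                # identity shortcut (addition)


class DenseTransition(nn.Module):
    # Eqs. (3)-(4): each layer concatenates all previous maps plus a depth-wise
    # conv of the latest one (kernels 3, 5, 7, 9); the 1x1 conv with reduction
    # factors 1/2, 1/3, 1/4, 1/5 restores the channel dimension to c.
    def __init__(self, c):
        super().__init__()
        self.dws = nn.ModuleList([dwconv(c, k) for k in (3, 5, 7, 9)])
        self.pws = nn.ModuleList([pwconv((i + 2) * c, c) for i in range(4)])

    def forward(self, f):
        feats = [f]
        for dw, pw in zip(self.dws, self.pws):
            feats.append(pw(torch.cat(feats + [dw(feats[-1])], dim=1)))
        return feats[-1]
```

The separate down-sampling layers between stages are simply `dwconv(c, 9, stride=2)` in this notation.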

Fig. 3

Basic blocks of the backbone encoder. a The OLIVE block. b The dense transition block where the inset number denotes the reduction factor of channels for each convolutional layer

Fig. 4

Design motivation of the OLIVE block. a The original inverted bottleneck [40]. In (b), d3 × 3 is replaced with d9 × 9 for larger receptive field. In (c), d9 × 9 is moved to the low-dimension side for saving parameters. In (d), we add one more d9 × 9 to boost the feature representation capability and retain only one ReLU located in high-dimension space between two 1 × 1 to reduce information loss

Parallel decoders (region-aware and edge-aware)

In HER-Net, feature maps from the encoder are fed into two parallel but interconnected decoders, i.e., the region-aware decoder (main) and the edge-aware decoder (auxiliary). The former is designed to decode the feature map level by level to segment the blurry region, and the latter detects the boundary information so as to provide guidance for region segmentation. As the core unit of the parallel decoders, the PEAR block, named after its pear-like shape, includes a d9 × 9, a 1 × 1 with an expansion factor of 2, a 1 × 1 with a reduction factor of 1/4 and an identity shortcut, as shown in Fig. 5a. The PEAR block can be formulated as

$$ {\varvec{P}} = \left[ {{\text{Conv}}^{d9 \times 9} \left( {{\text{Conv}}^{1 \times 1} \left( {{\text{Conv}}^{1 \times 1} \left( {\varvec{F}} \right)} \right)} \right);{\varvec{F}}} \right], $$
(5)

where P denotes the output feature map of the PEAR block. The design motivation of PEAR is similar to that of OLIVE, as depicted in Fig. 6. First, we modify the classic inverted bottleneck [42] by replacing the original d3 × 3 kernel with a d9 × 9 kernel for a larger receptive field and higher accuracy. Then, we move the d9 × 9 kernel to the low-dimension side to save parameters. Finally, we retain only one ReLU, located in the high-dimension space between the two 1 × 1 kernels, to reduce information loss. In terms of parameters, using the larger d9 × 9 (Fig. 6a → 6b) causes an 18.3% rise in the parameters of the PEAR block, and then moving d9 × 9 to the low-dimension side (Fig. 6b → 6c) generates a 10.3% drop. In short, the PEAR block has a larger kernel size, leaner parameters and fewer nonlinear activations. It is worth mentioning that our modification naturally results in two back-to-back 1 × 1 kernels, a Transformer-style structure [43] also used in ConvNeXt [39]. Such back-to-back 1 × 1 kernels, whose channels expand first and then reduce, can boost cross-channel information interaction and help to learn complex representations via full connection and nonlinear activation.
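A minimal sketch of the PEAR block is given below, reusing the dwconv and pwconv helpers from the backbone sketch above. The interpretation of the reduction factor (applied to the expanded width, yielding half the input channels before the concatenation) is an assumption; the actual widths follow Table 1.

```python
class PEAR(nn.Module):
    # Eq. (5): P = [Conv^{d9x9}(Conv^{1x1}(Conv^{1x1}(F))); F]
    # 1x1 expansion (factor 2) with the single retained ReLU, 1x1 reduction
    # (factor 1/4, assumed relative to the expanded width), a d9x9 in the
    # low-dimension space, and a concatenation-style shortcut with the input.
    def __init__(self, c):
        super().__init__()
        hidden = 2 * c
        slim = hidden // 4
        self.body = nn.Sequential(
            pwconv(c, hidden, act=True),
            pwconv(hidden, slim, act=False),
            dwconv(slim, 9, act=False),
        )

    def forward(self, x):
        # output has c + c // 2 channels; the fusion module described below adjusts them
        return torch.cat([self.body(x), x], dim=1)
```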

Fig. 5

Basic blocks of the parallel decoders. a The PEAR block of the parallel decoders. b The detection head for yielding the final or intermediate prediction result. c The RrSA block exploiting reverse-region complementation priors to enrich the residual details around the boundary. d The ESA block exploiting edge guidance priors to emphasize boundary features

Fig. 6

Design idea of the PEAR block. a The original inverted bottleneck [40]. In (b), d3 × 3 is replaced with d9 × 9 for larger receptive field. In (c), d9 × 9 is moved to the low-dimension side for saving parameters and only one ReLU located in high-dimension space between two 1 × 1 kernels is retained to reduce information loss

Between PEAR blocks in the main decoder, a simple fusion module (see Fig. 2) containing a concatenation and a 1 × 1 with a reduction factor of 1/4 is used to fuse and adjust the channels. There is also a 2 × 2 nearest-neighbor interpolation for up-sampling between stages. As the parallel decoders go from deep back to shallow, the resolution is stage-wise restored from I/16 to I and the channel dimension is stage-wise reduced from high to 1. As shown in Fig. 5b, we use a detection head at the output end of each decoder to produce the final prediction result (region-oriented or edge-oriented) with the same resolution as the original image. The detection head is formed by two d9 × 9, two 1 × 1 with reduction factors of 1/2 and 1/N, respectively (N is the number of input channels), and one Sigmoid activation. It can be formulated as

$$ {\varvec{H}} = \sigma \left( {{\text{Conv}}^{d9 \times 9} \left( {{\text{Conv}}^{1 \times 1} \left( {{\text{Conv}}^{d9 \times 9} \left( {{\text{Conv}}^{1 \times 1} \left( {\varvec{F}} \right)} \right)} \right)} \right)} \right), $$
(6)

where σ denotes the Sigmoid function and H is the output of the detection head. It should be noted that each convolution herein is followed by BN and ReLU unless otherwise indicated.
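A sketch of the detection head (Eq. 6), again reusing the helpers above; the final output is taken as a single-channel prediction map, and the absence of a ReLU immediately before the Sigmoid is an assumption.

```python
class Head(nn.Module):
    # Eq. (6): H = sigma(Conv^{d9x9}(Conv^{1x1}(Conv^{d9x9}(Conv^{1x1}(F))))),
    # with the two 1x1 layers reducing the channels by 1/2 and then to one map.
    def __init__(self, n):
        super().__init__()
        self.body = nn.Sequential(
            pwconv(n, n // 2),               # reduction factor 1/2
            dwconv(n // 2, 9),
            pwconv(n // 2, 1, act=False),    # reduce to one channel (factor 1/N)
            dwconv(1, 9, act=False),         # assumption: no ReLU before Sigmoid
        )

    def forward(self, x):
        return torch.sigmoid(self.body(x))
```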

Reverse-region complementation via RrSA

An encoder stacked with many convolutions is prone to make the network focus on regions with high response values and thereby miss residual details, especially around the boundary [28]. The consequent rough boundary could undermine the effectiveness of a blur detection model in complex scenes. Evidently, the key to solving this problem is to sufficiently keep and exploit residual information around the boundary. The blurry region and the sharp region are mutually disjoint yet closely complementary, since together they compose the complete image without any intersection. In view of this, we hope that residual details ignored by one region can be learned by the other via this latent complementary relationship. Therefore, we reify a reverse-region spatial attention mechanism and propose the RrSA block to hierarchically mine the complementary affinities between blurry and sharp regions in a spatially attentional manner. Thus, the network can enrich residual information to refine the boundary.

As depicted in Fig. 5c, the feature map PM ∈ ℝw×h×0.5c from a PEAR block of the main decoder is first squeezed into two single-channel maps by global average pooling (GAP) and global maximum pooling (GMP), respectively, where w × h is the size of the feature map and c is the number of channels. Then, the two aggregated feature maps are concatenated and passed through the sequential d9 × 9, 1 × 1 and Sigmoid activation to yield a spatial attention map ΩM ∈ ℝw×h×1. According to the complementary relationship between blurry and sharp regions, we can also obtain its reverse attention map 1 − ΩM ∈ ℝw×h×1. Finally, the encoder feature map O ∈ ℝw×h×c produced by the OLIVE block is re-weighted simultaneously by ΩM and 1 − ΩM to generate the output feature map FM ∈ ℝw×h×c, which is defined as follows:

$$ {\varvec{\Omega}}^{M} = \sigma \left( {\text{Conv}^{d9 \times 9} \left( {\text{Conv}^{1 \times 1} [\text{GAP}(\varvec{P}^{M} ); \text{GMP}(\varvec{P}^{M} )]} \right)} \right), $$
(7)
$$ {\varvec{F}}^{M} = {\text{Conv}}^{1 \times 1} [{\varvec{O}} \otimes {\varvec{\Omega }}^{M} ;{\varvec{O}} \otimes ({\mathbf{1}} - {\varvec{\Omega }}^{M} )], $$
(8)

where ⊗ denotes element-wise multiplication. By erasing the current spatial attention ΩM, the reverse attention 1 − ΩM enables the network to pay new attention to the previously ignored parts, which helps to rediscover the missing residual details quickly and effectively. The simultaneous use of the spatial attention and its reverse attention makes full use of the complementary priors between blurry and sharp regions, so that the network keeps ample attention on both blur-concentrated parts and residual parts around the boundary. This helps to refine the blurry–sharp boundary and further improve the segmentation accuracy.
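The RrSA block can be sketched as follows (Eqs. 7–8). Although the pooling operations are named GAP and GMP above, producing a w × h attention map requires pooling along the channel dimension, which is how they are implemented here; this interpretation, and the single-channel width after the 1 × 1, are assumptions.

```python
class RrSA(nn.Module):
    # Eq. (7): Omega^M = sigma(Conv^{d9x9}(Conv^{1x1}([pool_avg(P^M); pool_max(P^M)])))
    # Eq. (8): F^M = Conv^{1x1}([O * Omega^M ; O * (1 - Omega^M)])
    def __init__(self, c):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2, 1, 1, bias=False),             # 1x1 on the two pooled maps
            nn.Conv2d(1, 1, 9, padding=4, bias=False),  # 9x9 on the single map
            nn.Sigmoid(),
        )
        self.fuse = pwconv(2 * c, c)                    # fuse the two re-weighted copies

    def forward(self, p_m, o):
        pooled = torch.cat([p_m.mean(dim=1, keepdim=True),
                            p_m.max(dim=1, keepdim=True).values], dim=1)
        omega = self.attn(pooled)
        return self.fuse(torch.cat([o * omega, o * (1 - omega)], dim=1))
```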

Edge guidance via ESA

To further emphasize the boundary features, we design the ESA block to exploit edge-concerned information from the auxiliary decoder to guide the blurry region segmentation in the main decoder. As delineated in Fig. 5d, the ESA block is also based on a spatial attention mechanism. First, GAP and GMP are applied to squeeze the feature map PA ∈ ℝw×h×0.5c derived from a PEAR block of the auxiliary decoder into two individual channel maps. Then, the two squeezed feature maps are concatenated and sequentially fed through the d9 × 9, 1 × 1, and Sigmoid activation to yield a spatial attention map ΩA ∈ ℝw×h×1. Subsequently, ΩA is used to spatially re-weight the feature map derived from a PEAR block of the main decoder, allocating higher weights to the features related to the boundary. In short, the ESA block is computed as

$$ \varvec{F}^{A} = \varvec{P}^{M} \otimes \sigma \left( {\text{Conv}^{d9 \times 9} \left( {\text{Conv}^{1 \times 1} [\text{GAP}( \varvec{P}^{A} );\text{GMP}( \varvec{P}^{A} )]} \right)} \right). $$
(9)

Overall, these ESA blocks across the main and auxiliary decoders exploit edge-concerned cues to highlight the boundary features of region-aware feature maps in attentional and hierarchical manners. Therefore, the design of ESA is beneficial to boundary refinement, resulting in more accurate region segmentation.
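Under the same pooling assumptions as for RrSA, the ESA block (Eq. 9) can be sketched as a lighter variant in which the attention map derived from the edge-aware feature P^A re-weights the region-aware feature P^M directly:

```python
class ESA(nn.Module):
    # Eq. (9): F^A = P^M * sigma(Conv^{d9x9}(Conv^{1x1}([pool_avg(P^A); pool_max(P^A)])))
    def __init__(self):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2, 1, 1, bias=False),
            nn.Conv2d(1, 1, 9, padding=4, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, p_m, p_a):
        pooled = torch.cat([p_a.mean(dim=1, keepdim=True),
                            p_a.max(dim=1, keepdim=True).values], dim=1)
        return p_m * self.attn(pooled)
```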

Hierarchical joint loss

Aiming to supervise the HER-Net as well as mitigate the scale ambiguity, we design a hierarchical joint loss \(\mathcal{L},\) as visualized in Fig. 2. Unlike most encoder-decoder networks which only compute the loss at the final output end, we jointly consider losses at all stages of dual decoders to balance the prediction biases between different scales and decoders. The \(\mathcal{L}\) consists of both the main loss \(\mathcal{L}_{\text{main}}\) derived from the region-aware decoder and the auxiliary loss \(\mathcal{L}_{\text{aux}}\) derived from the edge-aware decoder.

Concretely, the \(\mathcal{L}_{\text{main}}\) uses the binary cross entropy function defined as follows:

$$ \mathcal{L}_{\text{main}} = \sum\limits_{k = 0}^{4} \mathcal{L}_{k}^{m}, $$
(10)

where \(\mathcal{L}_{k}^{m}\) is the side-out loss of the stage k in the main decoder, and k = 0, 1, 2, 3, 4 denote the final output end and four intermediate output heads, respectively. \(\mathcal{L}_{k}^{m}\) is expressed as

$$ \begin{aligned} \mathcal{L}_{k}^{m} (R_{k} ,G_{k}^{m} |\psi ) & = - \sum\limits_{{i \in Y^{ + } }} {\log [P(x_{i} = 1|\psi )]} \\ &\quad - \sum\limits_{{i \in Y^{ - } }} {\log [P(x_{i} = 0|\psi )]} ,\end{aligned} $$
(11)

where Rk and Gmk are the prediction result of the main decoder and its ground truth at stage k, respectively; ψ denotes the model parameters; Y+ and Y− are the sets of pixels marked positive (1, blurry) and negative (0, sharp) in the ground truth, respectively; and xi is the predicted value (1 or 0) of pixel i.
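A minimal sketch of the hierarchical main loss (Eqs. 10–11); resizing the ground truth to match each side output by nearest-neighbor interpolation is an assumption.

```python
import torch.nn.functional as F


def main_loss(region_preds, gt):
    # region_preds: list of five sigmoid maps (final output + four side outputs)
    # gt: binary ground-truth region map, shape (B, 1, H, W)
    loss = 0.0
    for pred in region_preds:                                      # k = 0, ..., 4
        gt_k = F.interpolate(gt, size=pred.shape[-2:], mode='nearest')
        loss = loss + F.binary_cross_entropy(pred, gt_k, reduction='sum')
    return loss
```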

Considering the small proportion of boundary pixels in an image, we combine the focal loss and the dice loss in the auxiliary decoder to construct an auxiliary loss \(\mathcal{L}_{\text{aux}}\) defined as follows:

$$ \mathcal{L}_{\text{aux}} = \sum\limits_{k = 0}^{4} {\mathcal{L}_{k}^{a} } , $$
(12)

where \(\mathcal{L}_{k}^{a}\) is the side-out loss of the stage k. The \(\mathcal{L}_{k}^{a}\) is expressed as

$$ \mathcal{L}_{k}^{a} = \beta \mathcal{L}_{\text{focal}} - \log \mathcal{L}_{\text{dice}} , $$
(13)
$$ \begin{aligned} &\mathcal{L}_{\text{focal}} (E_{k} ,G_{k}^{a} |\psi ) = - \sum\limits_{{i \in Y^{ + } }} {P(x_{i} = 0|\psi )^{2} \log [P(x_{i} = 1|\psi )]} \\ &\quad - \sum\limits_{{i \in Y^{ - } }} {P(x_{i} = 1|\psi )^{2} \log [P(x_{i} = 0|\psi )]} ,\end{aligned} $$
(14)
$$ \mathcal{L}_{\text{dice}} (E_{k} ,G_{k}^{a} |\psi ) = 1 - \frac{{2\left| {E_{k} \cap G_{k}^{a} } \right| + \varepsilon }}{{\left| {E_{k} } \right| + \left| {G_{k}^{a} } \right| + \varepsilon }}, $$
(15)

where Ek and Gak are the prediction result of the auxiliary decoder and its ground truth (derived by a Canny operator from the ground truth of the main decoder) at stage k, respectively; β is set to 1e − 3 to balance the magnitudes of the focal loss and the dice loss; ε is a small constant that avoids division by zero.

Finally, the total hierarchical joint loss \(\mathcal{L}\) of HER-Net can be summarized as follows:

$$ \mathcal{L} = \mathcal{L}_{\text{main}} + \lambda \mathcal{L}_{\text{aux}}, $$
(16)

where λ is a weighting factor to balance between \(\mathcal{L}_{\text{main}}\) and \(\mathcal{L}_{\text{aux}}\), which is set to 0.1 by experiments.
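Continuing the loss sketch above, the auxiliary and total losses (Eqs. 12–16) can be written as follows. Equation (13) combines the terms as β·L_focal − log L_dice; for a numerically well-behaved sketch we use the plain additive combination β·L_focal + L_dice, which is a simplification and an assumption.

```python
def focal_term(pred, gt, eps=1e-7):
    # Eq. (14): focal loss with a squared modulating factor
    pos = -(1 - pred).pow(2) * torch.log(pred.clamp(min=eps)) * gt
    neg = -pred.pow(2) * torch.log((1 - pred).clamp(min=eps)) * (1 - gt)
    return (pos + neg).sum()


def dice_term(pred, gt, eps=1e-7):
    # Eq. (15): soft dice loss over the predicted edge map
    inter = (pred * gt).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + gt.sum() + eps)


def aux_loss(edge_preds, edge_gt, beta=1e-3):
    # Eqs. (12)-(13): summed over the final output and four side outputs;
    # additive combination of focal and dice terms is an assumption (see above)
    loss = 0.0
    for pred in edge_preds:
        gt_k = F.interpolate(edge_gt, size=pred.shape[-2:], mode='nearest')
        loss = loss + beta * focal_term(pred, gt_k) + dice_term(pred, gt_k)
    return loss


def total_loss(region_preds, edge_preds, region_gt, edge_gt, lam=0.1):
    # Eq. (16): L = L_main + lambda * L_aux, with lambda = 0.1
    return main_loss(region_preds, region_gt) + lam * aux_loss(edge_preds, edge_gt)
```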

Experiments

Datasets

We evaluate the proposed approach on three publicly available and well-annotated datasets: CUHK [32], DUT [16] and CTCUG [21]. All three datasets are oriented to partial image blur. More specifically, the CUHK dataset consists of 296 motion-blurred images and 704 defocused images. The motion-blurred part is termed CUHKM, where 250 samples are used for training and the rest for testing. The defocused part is termed CUHKD, where 604 samples are used for training and the rest for testing. The DUT dataset includes 1100 challenging defocused images, where 600 samples are used for training and the rest for testing. The CTCUG dataset contains 150 images with different defocus distributions, all of which are used only for testing. The CUHK and DUT datasets have a similar blur distribution in which the foreground is usually sharp while the background is usually blurry. By contrast, the CTCUG dataset has an inverse blur distribution in which the foreground is mostly blurry while the background is sharp. Besides, the DUT and CTCUG datasets contain many challenging cases such as in-focus but low-contrast areas and half-blurry half-sharp objects. In this paper, in order to train HER-Net, a Canny operator is used to obtain the edge-oriented ground truth from the region-oriented ground truth, as shown in Fig. 7.

Fig. 7

Two ground truths used in this paper, where (c) is obtained by a Canny operator from (b)

Evaluation criteria

Precision (P), recall (R), F-measure (Fβ) and mean absolute error (MAE) are used to evaluate the segmentation accuracy of the model. Their mathematical definitions are listed in Table 2. P measures the ability to recognize only blurry pixels and R measures the ability to identify all blurry pixels. Fβ is a comprehensive weighted indicator considering both P and R. Here, we use β2 = 0.3, which agrees with the setup in [16,17,18,19,20,21,22]. MAE evaluates the difference between the binarized prediction result B and the ground truth G.

Table 2 Definitions of evaluation indicators
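The following sketch computes the four indicators for a single prediction map, matching the definitions referenced in Table 2; the binarization threshold of 0.5 is an assumption (adaptive thresholds are also used in the literature).

```python
import numpy as np


def evaluate(pred, gt, beta2=0.3, thresh=0.5, eps=1e-8):
    # pred: predicted blur map in [0, 1]; gt: binary ground truth (1 = blurry)
    b = (pred >= thresh).astype(np.float64)   # binarized prediction B
    g = gt.astype(np.float64)
    tp = (b * g).sum()
    precision = tp / (b.sum() + eps)
    recall = tp / (g.sum() + eps)
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    mae = np.abs(b - g).mean()
    return precision, recall, f_beta, mae
```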

Implementation details

HER-Net is implemented in the PyTorch framework and training is accelerated on an Nvidia GeForce RTX 3090 GPU. Stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005 is used for optimization. The learning rate is initialized to 1e − 4 and decreased by 1e − 6 every iteration. The batch size is set to 16. We resize images from the three datasets to 320 × 320 pixels by bicubic interpolation. Moreover, we augment existing samples by flipping, rotation and noise injection. An early stopping strategy is adopted to end the training in advance when the validation error has not decreased for 20 iterations. The training process needs approximately 300 epochs.
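This training setup translates directly into PyTorch; in the sketch below, `HERNet` and `train_loader` are hypothetical placeholders for the full model and data pipeline, `total_loss` refers to the loss sketch above, and the literal per-iteration learning-rate decrement is reproduced as stated.

```python
model = HERNet().cuda()                       # hypothetical top-level module
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=5e-4)

for epoch in range(300):                      # early stopping omitted for brevity
    for images, region_gt, edge_gt in train_loader:   # assumed 320x320 batches of 16
        region_preds, edge_preds = model(images.cuda())
        loss = total_loss(region_preds, edge_preds,
                          region_gt.cuda(), edge_gt.cuda(), lam=0.1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        for group in optimizer.param_groups:  # decrease the LR by 1e-6 per iteration
            group['lr'] = max(group['lr'] - 1e-6, 0.0)
```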

Comparison with state-of-the-art methods

In this section, we evaluate the proposed method (HER-Net) by intensive comparisons with 11 state-of-the-art methods, including eight deep learning-based approaches and three traditional methods. The eight deep learning-based methods are the recurrent Deep feature Fusion and refinement Network (DeFusionNet) [21], the Bi-directional Residual Refining Network (BR2Net) [22], the Encoder-Feature Ensemble Network (EFENet) [18], the Cross-Ensemble Network (CENet) [44], the multi-stream Bottom–Top–Bottom network with Cascaded Residual Learning (BTBCRL) [17], the multi-stream Bottom–Top–Bottom network with Fusion and Recurrent Reconstruction (BTBFRR) [16], the Defocus Map Estimation Network using domain adaptation (DMENet) [41], and multi-scale Deep and Handcrafted features for Defocus Estimation (DHDE) [13]. The three traditional methods are the Spectral and Spatial approach (SS) [33], Local Binary Patterns (LBP) [10], and High-frequency multi-scale Fusion and Sort Transform (HiFST) [35].

First, the accuracy (in terms of P, R, Fβ and MAE) and efficiency (in terms of parameters and FLOPs) of HER-Net are reported in Table 3. It shows that HER-Net achieves an average Fβ of 0.929 and an average MAE of 0.052 over the CUHK, DUT and CTCUG datasets with merely 8.34M parameters and 8.55G FLOPs. Several cases are shown in Fig. 8. Although HER-Net is capable of handling both defocused and motion-blurred scenes, the comparisons herein are limited to defocus blur, considering that most of the aforementioned methods are designed only for defocus blur. For the sake of fair comparison, we use the results reported in the literature [17, 18, 21, 22, 44] or run the codes released by the authors [10, 13, 33, 35, 41] on the same workstation with the same testing samples, without any modification.

Table 3 Average results of HER-Net across three datasets
Fig. 8

Some cases including defocused scenes from CUHKD and motion-blurred scenes from CUHKM

Qualitative comparisons are shown in Figs. 9, 10 and 11 for the CUHKD, DUT and CTCUG datasets, respectively. Obviously, recent deep learning-based methods perform generally better than traditional handcrafted feature-based methods. Figure 9 shows some common cases from the CUHKD dataset. It can be observed that our HER-Net is superior to the other methods and provides more accurate segmentation results with finer regional boundaries. Figure 10 shows some cases containing low-contrast in-focus objects or deceptive background clutters, which are considered two intractable problems in blur detection tasks. Low-contrast in-focus objects, such as the V, VII rows of Fig. 9 and the I, II, III, IV rows of Fig. 10, are prone to be misjudged as out-of-focus parts due to their smooth and texture-less characteristics. Background clutters, such as the V, VI, VII rows of Fig. 10, are easily mistaken for in-focus foregrounds due to their deceptive pseudo-edges. Even so, our HER-Net gives the most accurate predictions with the finest details among all competitive methods. Figure 11 shows some very challenging cases from the CTCUG dataset. Besides low-contrast sharp objects (the I, II, III rows of Fig. 11) and deceptive background clutters (the IV, V rows of Fig. 11), there are also some inversely distributed blurs (the VI, VII, VIII rows of Fig. 11), in which the foreground is blurry while the background is sharp. Evidently, the proposed HER-Net still outperforms the other competitive methods. Especially on inversely distributed blurs, we can see that most methods are trapped in misjudgement while our method is still able to detect the blurry region accurately. This superiority in dealing with inversely distributed blurs is largely attributed to the exploitation of complementary affinities between blurry and sharp regions, which leads the model to pay coequal attention to both in-focus and out-of-focus pixels around the boundary.

Fig. 9

Qualitative comparisons on the CUHKD dataset among different methods including HiFST [35], LBP [10], SS [33], DHDE [14], BTBFRR [17], DMENet [44], BTBCRL [18], EFENet [19], DeFusionNet [22] and our HER-Net

Fig. 10

Qualitative comparisons on the DUT dataset among different methods including HiFST [35], LBP [10], SS [33], DHDE [14], DMENet [44], BTBFRR [17], BTBCRL [18], CENet [29], EFENet [19], DeFusionNet [22] and our HER-Net

Fig. 11

Qualitative comparisons on the CTCUG dataset among different methods including HiFST [35], LBP [10], SS [33], DHDE [14], DMENet [44], BR2Net [23], BTBFRR [17], DeFusionNet [22] and our HER-Net


Quantitative comparisons are reported in Table 4. It is obvious that our HER-Net achieves better performance in terms of Fβ and MAE on all three datasets compared to the other state-of-the-art methods, which indicates the superiority of our method. Compared to the second-best methods, our method improves the Fβ by 2.63%, 0.32%, and 4.74% and reduces the MAE by 23.73%, 9.26%, and 27.96% on the CUHKD, DUT and CTCUG datasets, respectively. It is worth mentioning that such significant improvement is achieved with only 8.34M parameters and 8.55G FLOPs. Since the coding languages and running environments vary widely among different methods, a direct runtime comparison (ours is 38.6 ms per image) would not be meaningful. Thus, we compare the model complexity in terms of the number of parameters between different deep learning-based methods and ours, as reported in Table 5. Obviously, our method has the most lightweight architecture among the competitive methods. Compared to the second most lightweight model, HER-Net achieves a 72% reduction in parameters.

Table 4 Quantitative comparisons among different methods in terms of Fβ and MAE on CUHKD, DUT and CTCUG datasets
Table 5 Comparison on the number of parameters among different models

Ablation studies

In this section, we conduct ablation experiments on the modular HER-Net to verify the effectiveness of the proposed blocks. The results of the ablation studies are summarized in Table 6.

Table 6 Ablation studies for OLIVE, PEAR, RrSA and ESA based on Fβ and MAE in CUHK, DUT and CTCUG datasets

Effectiveness of OLIVE

As depicted in Figs. 3a and 4, the OLIVE block is modified from the classic inverted bottleneck [42]. Therefore, to explore the role of the OLIVE block, the model with OLIVE (i.e., HER-Net) and the model without OLIVE (using classic inverted bottlenecks instead, denoted by w/o OLIVE) are tested separately. Table 6 (rows 2, 7, 12) shows that the absence of OLIVE blocks leads to decreases in Fβ of 1.62%, 1.96% and 1.55% and increases in MAE of 14.77%, 12.12% and 10.62% on the CUHK, DUT and CTCUG datasets, respectively, which proves the effectiveness of OLIVE blocks.

Effectiveness of PEAR

Similarly, the PEAR block is also modified from the classic inverted bottleneck. We compare two models: one using PEAR blocks (i.e., HER-Net) and the other using classic inverted bottlenecks (named w/o PEAR). Table 6 (rows 3, 8, 13) shows that the Fβ decreases by 0.43%, 0.97% and 0.55% and the MAE increases by 7.41%, 8.42% and 6.48% on the CUHK, DUT and CTCUG datasets, respectively, if all PEAR blocks are discarded. This demonstrates the effectiveness of PEAR blocks.

Effectiveness of RrSA

The RrSA block is designed to exploit complementary affinities between blurry and sharp regions. To investigate its contribution, we compare the model with RrSA (i.e., HER-Net) and the model with a conventional skip connection (named w/o RrSA). Table 6 (rows 4, 9, 14) shows that, when the RrSA blocks are ablated, the Fβ drops by 2.06%, 2.62%, and 2.01% and the MAE rises by 22.68%, 23.08%, and 18.67% on the CUHK, DUT and CTCUG datasets, respectively. These results demonstrate the effectiveness of RrSA blocks and confirm the vital role of region complementation in blur detection.

Effectiveness of ESA

The ESA block is designed to capture edge-concerned cues from the auxiliary decoder to guide the region segmentation in the main decoder. A comparison between the model with ESA (i.e., HER-Net) and the model with a conventional skip connection (named w/o ESA) is conducted. Table 6 (rows 5, 10, 15) shows that the absence of ESA blocks causes the Fβ to drop by 1.73%, 1.52%, and 1.67% and the MAE to rise by 16.67%, 9.38%, and 13.68% on the CUHK, DUT and CTCUG datasets, respectively. This proves the importance of ESA blocks.

Effectiveness of large kernel

To explore the effect of kernel size, we conduct experiments on six HER-Net variants with kernel sizes from 3 × 3 to 13 × 13. The results are shown in Fig. 12. Clearly, a larger kernel size noticeably increases the model size in terms of both parameters and FLOPs. It can also be observed that the accuracy rises as the kernel size increases from 3 × 3 to 9 × 9, because the larger receptive field produced by a larger kernel is believed to facilitate capturing contextual information. However, the accuracy begins to fall when the kernel size exceeds 9 × 9. For 320 × 320 input images, the resolutions of the feature maps in the four stages of HER-Net are 160 × 160, 80 × 80, 40 × 40, and 20 × 20. In this circumstance, an overlarge kernel such as 11 × 11 or bigger causes convolutional filters to overlook many local details, which undermines the accuracy. In this paper, the HER-Net with the d9 × 9 kernel achieves the best performance of Fβ = 0.929 and MAE = 0.052 over the three datasets using 8.34M parameters and 8.55G FLOPs.

Fig. 12

Fβ and MAE of the HER-Net variants with different kernel sizes. Each filled circle represents a HER-Net variant. In (a), the circle size represents the model parameters; in (b), the circle size represents the model FLOPs

About the optimizer

Besides SGD with momentum (SGDM), we also test two other optimizers, ADAM [29] and DiffGrad [30]. In experiments over the three datasets, the SGDM optimizer achieves the highest Fβ of 0.929 with a convergence time of 300 epochs. The DiffGrad optimizer attains an accuracy (Fβ = 0.923) on par with SGDM but a slightly slower convergence (370 epochs). In contrast, the ADAM optimizer achieves an Fβ of 0.879 and a convergence time of 410 epochs, which are clearly inferior to SGDM and DiffGrad. Therefore, we finally select the SGDM optimizer for the investigated blur detection task. Considering that various optimizers perform very differently depending on specific task scenarios [12, 18, 45], we will pay more attention to the adaptability of the optimizer to task scenarios in future research.

Conclusion

In this paper, we build a deep learning-based blur detection model named the Hierarchical Edge-guided Region-complemented Network (HER-Net). To solve the overweight model problem, two novel inverted bottleneck blocks (olive-shaped and pear-shaped) based on large kernels (9 × 9) and depth-wise convolutions are designed to serve as the core units of HER-Net. To solve the rough boundary problem, two morphological priors (region-concerned and edge-concerned) are incorporated into HER-Net in hierarchical and attentional manners. Concretely, a reverse-region spatial attention is engineered to exploit the regional complementation information and an edge spatial attention is engineered to utilize the edge-concerned cues for guidance. With these designs, we construct a concise yet effective one-encoder-parallel-decoders model. Experimental results on benchmarks demonstrate the effectiveness of our method with an average Fβ of 0.929 and an average MAE of 0.052 on the three datasets. Note that such state-of-the-art performance is achieved with only 8.34M parameters and 8.55G FLOPs, far fewer than those of competitive methods. The superior accuracy–complexity trade-off of HER-Net could inspire further studies for real-world blur detection applications.