1 Introduction

Image downscaling is an essential and fundamental operation to reduce the spatial complexity of data for various applications. In particular, image downscaling has become a necessary process to reduce the computational overhead of memory-intensive applications, such as training a deep neural network on large images. For instance, conventional content-independent image downscaling methods, such as bilinear interpolation [1] and Lanczos [2], are widely used in a number of studies to fit large historical document images into constrained GPU memory to train deep convolutional neural networks (DCNNs) for semantic segmentation tasks [3,4,5,6]. Notably, segmentation models often require more memory than classification models because, on top of their classification backbone, they need a cascaded upscaling process to preserve high fidelity in the final segmentation results [7]. Because effective resizing techniques play a crucial role in segmentation tasks, this work focuses on evaluating the effectiveness of downscaling methods specifically for segmentation.

Numerous image downscaling methods have been proposed in the literature and can be classified into two groups depending on their approaches. The first group is based on a content-independent strategy [1, 2] that considers only a local spatial relationship to estimate the intensities of pixels in the downscaled image. Typically, such methods involve two steps. Using a given sampling factor, a set of pixel locations in the original image is first identified that corresponds to the pixels in the downscaled image. Then, in order to estimate the intensity of each pixel in the downscaled image, neighboring pixels near the identified locations are uniformly sampled using floor or ceiling operations (these samples are the so-called anchors), and their intensities are subsequently interpolated. Although this approach is advantageous for its speed, it does not retain perceptual details well, stemming from the fact that it pays equal attention to the regions of interest and the background. Consequently, sharpness and fidelity are compromised, and these are important properties to preserve in image downscaling for various applications.

To address the issue, more advanced methods have been proposed and use a content-aware strategy [8, 9]. In general, methods in this strategy reframe the downscaling problem as an optimization problem depending on their own objectives, such as maximizing entropy or maximizing structural similarity (SSIM). These methods have been successful in retaining semantic contents, but they usually distort spatial structures severely, negatively impacting some applications, particularly semantic segmentation tasks.

In this paper, we present a novel image downscaling approach that combines the strengths of both the content-independent strategy and the content-aware strategy. That is, our approach focuses on retaining both the spatial relationship between instances (e.g., characters) and perceptually important features, especially edges. The underlying framework of our approach is similar to the content-independent strategy in that we sample a number of points within a local window and interpolate them to estimate the intensity in the downscaled image, by which spatial distortion can be minimized. However, we adaptively change the locations and intensities of the sampled points using our adaptive relocation process (ARP) and adaptive boosting process (ABP), respectively, to embed content-dependent information in the downscaled image. In particular, ARP identifies a number of valuable pixels in a given local window using a mapping function measuring local intensity gradient, in which we leverage local, spatial characteristics. ABP adjusts the intensities of the newly identified pixels to better capture the content-dependent visual cues by reflecting local texture and intensity-change dynamic caused by the relocation, i.e., putting more weight on the pixels of the target texture (e.g., texts) with a higher dynamic. That is, the resultant interpolated value—which is the estimation of the intensity in the downscaled image—from such adjusted pixels is likely to embed content-dependent information while avoiding severe local spatial distortion, resulting in retaining perceptually important features more effectively.

To evaluate the effectiveness of our approach in retaining perceptually informative details in terms of edges and spatial structure, we compare semantic segmentation performance between DCNNs trained on images downscaled by the proposed approach and on images downscaled by one of the widely used conventional approaches, namely, Lanczos. Importantly, we investigate three different scenarios using the proposed downscaling method—stand-alone, image-pyramid, and augmentation—to train a segmentation model to demonstrate the robustness of our method. We utilize images from three different publicly available historical newspaper image collections to train the models, where the collections cover different quantities (i.e., training data points) and visual difficulties. The trained models are evaluated based on four quality measurement metrics: Matthews correlation coefficient (MCC), mean intersection over union (mIoU), precision, and recall.

This paper makes the following main contributions:

  • We propose a novel adaptive downscaling approach and show that the resultant downscaled images can be effectively used in various ways (e.g., stand-alone, image pyramid, and augmentation) to train a DCNN on a semantic segmentation task.

  • We show that relocating anchors near, instead of directly on, the boundary for interpolation plays an important role in encoding perceptually informative features (e.g., crisp boundary and high contrast) that are useful for the semantic segmentation task.

  • We find that using a second-order derivative edge detector (e.g., Laplacian of Gaussian) is superior to non-second-order methods for capturing such points near the boundary due to the nature of their signal responses.

2 Related work

In this section, we first review image downscaling methods, and discuss their limitations in terms of retaining perceptual details. Then, we review representative DCNN models that have been recently proposed for semantic segmentation on document images and discuss the limitations of using conventional downscaling methods to deal with large images.

2.1 Downscaling methods

As described above, image downscaling methods can be classified into two groups based on their approaches: (1) content-independent and (2) content-aware.

The content-independent image downscaling approach mainly focuses on reconstructing an original image in a lower resolution without losing details in a “blanket” manner. In other words, the strategy of this approach does not incorporate content-related information, such as salient objects (e.g., character or person) during the downscaling process to the extent that a content-aware strategy does. To this end, content-independent methods typically involve two steps. First, given a sampling factor, a set of pixel locations are resampled to correspond to the pixels in the downscaled image. Then, for each resampled pixel, a pixel value is interpolated based on its neighboring pixels. The choice of interpolation technique has been evolving, for instance, from linear (e.g., bilinear) to nonlinear functions (e.g., bicubic or Lanczos) [1, 2], where the functions are designed to work as a low-pass filter. The methods using this strategy are advantageous for their speed and effectiveness in suppressing aliasing artifacts. However, because of the low-pass filter that penalizes high-frequency regions, and their naïve sampling process that pays equal attention to the contents of interest and background (i.e., content-independent), they do not retain perceptual details and lose sharpness and salient objects, which are important properties to preserve in image downscaling for various applications.
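To make the two-step process concrete, the following minimal sketch (our own illustration, not code from any cited work) downscales a grayscale NumPy array by mapping each output pixel back to a source coordinate and bilinearly interpolating its four nearest anchors; production resizers (bilinear, bicubic, Lanczos) additionally apply a low-pass filter before sampling to suppress aliasing.

```python
import numpy as np

def bilinear_downscale(img, out_h, out_w):
    """Minimal content-independent downscaler for a 2-D grayscale array."""
    in_h, in_w = img.shape
    out = np.empty((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            # Step 1: locate the point in the original image that corresponds
            # to output pixel (i, j).
            y = np.clip((i + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
            x = np.clip((j + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            wy, wx = y - y0, x - x0
            # Step 2: interpolate the four uniformly sampled anchors.
            top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
            bot = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
            out[i, j] = (1 - wy) * top + wy * bot
    return out.astype(img.dtype)
```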

To address the issue of losing perceptual details, such as key visual cues, more advanced methods have been proposed, under the second group, the content-aware resizing approach [8,9,10]. The strategy of this approach mainly focuses on preserving visual details that are known to be valuable in human perception. In other words, the strategy values and highlights salient objects and edges as much as possible in that such features are known to be important features for Human Visual System [11,12,13]. Avidan and Shamir [8] proposed a content-aware image resizing method that can preserve contents in an image by removing unnoticeable pixels that blend with their surroundings. The method finds a connected path of pixels on a single image from either top to bottom or left to right, called seam, based on the energy value, and removes the seam with the lowest energy until the image reaches the expected size. However, in order to find an optimal seam, a dynamic programming process is needed, requiring significant time to resize even a moderately sized image [14]. More importantly, their method is reported to fail on images having dense layouts, such as newspapers, that have substantial amounts of content. In addition, [9] proposed an image downscaling method without using a filter by reframing it as an optimization problem where the objective is to directly optimize the downscaled image against its original image using the structural similarity index (SSIM). However, its optimization process ignores the spatial distribution of pixels within the local window, and it often degrades the image quality in terms of noise and jagged edges [15]. Moreover, when applied to large document images for semantic segmentation, its optimization process on a large image can be time-prohibitive.

In a related field, there are adaptive pooling-based approaches that share a common goal of retaining perceptual detail throughout the resizing process in DCNNs [16, 17]. However, these approaches predominantly involve feature pooling within a network model, tightly integrated with the model’s end-to-end training. Employing such an adaptive pooling technique therefore requires using it in conjunction with the network model, which raises the same concern that our investigation addresses: the system encounters out-of-memory issues during training unless the input images are downscaled beforehand. For this reason, we do not explore such approaches in this study.

2.2 DCNNs for document image semantic segmentation

With the successful achievement of deep learning on natural scene semantic image segmentation, many efforts have been devoted to applying DCNNs to the document image domain.

As a relatively simple architecture, [5] proposed a fully convolutional neural network (FCN) for historical document segmentation that consists of an encoder and decoder structure based on convolution and deconvolution layers, and the model is evaluated on large historical document images. In order to both reduce the computation time and fit within memory, these large images are aggressively downscaled by a factor of about 8 (from \(2200\times 3400\) to \(260\times 390\)) using a typical content-independent downscaling method. Although such aggressive downscaling using a fixed downscaling operation, such as uniform sampling, might only lightly distort images with relatively simple layout structures and sparse content, it becomes problematic when images have complex layout structures and dense content. For instance, in downscaled images, most of the layout details are inadvertently lost prior to feeding them into a DCNN to learn, making it challenging to obtain sharp segmentation results.

Several works present DCNN models with more advanced architectures, in terms of learning capacity, for document semantic segmentation tasks. In particular, they utilize the U-net architecture extended from the FCN by adding skip connections between each level of the decoder and encoder so as to provide fine-grained details in the prediction. Lee et al. [4] introduce a module into a U-net architecture that can extract co-occurrence features, which is expected to be effective in recognizing the regularity in text line structures and tables, thereby boosting the performance of the segmentation. Yang et al. [6] also adopt a U-net architecture but introduce a multi-modal approach, in which, besides the original image, a textual knowledge modality retrieved by an OCR engine is additionally fed to the network for more effective training performance. The limitation of these two approaches, similar to the one mentioned in the previous paragraph, is that the training images are aggressively downscaled (down to \(512\times 512\) in Lee et al.’s and to less than \(384\times 384\) in Yang et al.’s) using a fixed downscaling operation, and it is hard to expect complex layout details to be retained in the downscaled images.

In response, Oliveira et al. [3] propose a more generic model based on a U-net architecture that is more suitable to be imported for a wide range of semantic segmentation tasks. Although Oliveira et al. [3] adopt patch-wise training rather than resizing the large images for fitting them into the memory, it is reported that patch-wise training over the full image either lowers accuracy due to the class imbalance issue arising from the random patching [18] or is computationally intensive arising from overlapping patches [19].

3 Methodology

This section describes our image downscaling approach and method, which combines the strengths of both content-independent and content-aware approaches. The underlying premise is that we follow the overall sampling and interpolation process to avoid severe structural distortion as in the content-independent approach. However, instead of directly interpolating the anchors to estimate pixel values for the downscaled image, we adaptively select the anchor points using an adaptive relocation process (ARP) and then adaptively adjust the intensity values of the new anchor points using an adaptive boosting process (ABP). On the one hand, ARP allows us to find new anchor locations within a local search space, e.g., where pixels have higher local gradients compared to the original anchor points. The newly found anchors provide “value-guided” interpolations, where gradient-driven intensities capture perceptual cues (e.g., edges) and where locally constrained locations prevent structural distortion during interpolation. On the other hand, ABP allows us to adjust the intensities of the new anchors by exploiting the regional texture of each new anchor and the intensity-change dynamic from its original anchor. Such adjustment suppresses the inadvertent intensity change that might have been driven by faulty perceptual cues (e.g., noise) in ARP while boosting the counterparts. Indeed, we observe that this “boosting” better retains perceptually important features as desired by the task or application.

Our adaptive sampling-based downscaling method is designed as a five-step process. First, we precompute a reference map and a probability density function (PDF) (Line 1 and Module 2), which will be used in both the ARP and ABP processes, respectively, as a reference of visual cues from two different aspects: intensity-gradient (e.g., edge) and regional texture (e.g., noise). We discuss the details of the reference map and PDF in Sect. 3.1. Second, based on the expected downscale size and also the scaling factor (\(f=1\), by default) (Lines 2-4), we find anchor points in the original image that correspond to the pixels in the downscaled image (Lines 5–8). Third, we adaptively relocate the anchors whose local gradients are relatively high within a local window based on the reference map, as part of the ARP (Line 9 of Algorithm 1 and Module 3). Fourth, we adjust the intensities of the new anchor points based on the PDF, as part of the ABP (Line 10 of Algorithm 1 and Module 4). Lastly, we interpolate those new anchor points whose locations and intensities are adaptively adjusted to compute a single value, then assign that value to the corresponding pixel in the downscaled image (Line 11). We repeat the process from the second to the fifth step until all the sampled values are mapped into the downscaled image. The overall process is shown in Algorithm 1, and the detail of each step is elaborated in the following subsections.
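The following high-level sketch outlines how the five steps compose. It is our illustrative rendering of Algorithm 1 rather than the published pseudocode, and it relies on the helper functions sketched in the remaining subsections (log_reference_map, fit_texture_gmm, relocate_anchor, boost_intensity, and idw_interpolate), whose names and default parameters are our own.

```python
import numpy as np

def adaptive_downscale(img, out_h, out_w, win=2, patch_size=8, beta=2.0):
    """Illustrative end-to-end sketch of the five-step process for a grayscale image."""
    in_h, in_w = img.shape
    ref_map = log_reference_map(img)          # Step 1: reference map (Sect. 3.1)
    gmm, text_idx = fit_texture_gmm(img)      # Step 1: texture PDF (Sect. 3.1)
    used = np.zeros(img.shape, dtype=bool)    # flags for already-selected pixels
    out = np.empty((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            # Step 2: mapping point and its four surrounding anchors.
            y = np.clip((i + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
            x = np.clip((j + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            anchors = [(y0, x0), (y0, x1), (y1, x0), (y1, x1)]
            # Step 3: ARP relocates each anchor (Sect. 3.2).
            new_anchors = [relocate_anchor(ref_map, used, a, win) for a in anchors]
            # Step 4: ABP adjusts the relocated anchors' intensities (Sect. 3.3).
            boosted = [boost_intensity(img, a, n, gmm, text_idx, patch_size)
                       for a, n in zip(anchors, new_anchors)]
            # Step 5: IDW interpolation assigns the downscaled pixel (Sect. 3.4).
            out[i, j] = idw_interpolate((y, x), new_anchors, boosted, beta)
    return out.astype(img.dtype)
```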

Algorithm 1 (pseudocode)
Module 2 (pseudocode)
Fig. 1 Poor visual quality in historical newspapers (e.g., broken line-separators, inconsistent stroke width, etc.). Note that the quality becomes even worse after downscaling using a content-independent approach (right)

3.1 Precomputation: creating reference map and probability density function

Recall that one of the key strategies of our method is to adaptively sample pixel points to retain perceptually important features. The question then is what pixels should be considered important to sample? In historical document images, printed components (e.g., characters) are often of poor visual quality in that characters or line separators are disconnected and their strokes are inconsistent (as shown in Fig. 1(left)), especially when compared to the components presented in born-digital document images. Visual quality grows even worse in downscaled images generated by a content-independent downscaling method, such as a uniform sampling-based method, due to its inability to embed visually important context, as shown in Fig. 1(right). Eventually, a DCNN trained on such downscaled images is likely to underperform for the segmentation task. Given this context, in our work, we consider pixels right on or near components as visually important pixels, as sampling more such pixels during downscaling reduces the occurrences of disconnectedness between components.

In order to quantify the visual importance of pixels, we consider two aspects: intensity-gradient and regional texture. On the one hand, the intensity-gradient allows us to identify high-frequency signals where edge-like features tend to appear. Specifically, we compute the intensity-gradient over the entire image, a so-called reference map, which is utilized in our ARP. On the other hand, the regional texture allows us to verify whether the identified high-frequency signals are driven by noise-like artifacts, such as bleed-through. Specifically, we compute the distribution of the textures found in a set of random patches of the image being processed and derive the distribution’s probability density function (PDF), which is used in our ABP. Module 2 shows the acquisition of a reference map (Lines 1.a and 1.b) and PDF (Lines 2–8), and the detail of each aspect is discussed in the following.

Intensity-gradient: Reference Map. To quantify the visual importance of pixels from the intensity-gradient aspect, we need a mapping function that assigns higher weights to the pixels showing high-frequency—i.e., near edges—such that they can be identified as new anchor points in ARP. In our case, because of the particular application associated with our task, we focus on functions that can measure the change in the intensity of the pixels (e.g., color or brightness) over local areas, and thus on edge detectors and entropy that are representative designs for such purpose.

To investigate the robustness of our method depending on the choice of the mapping function, we selected a handful of representative edge detection methods that demonstrate distinct approaches, as shown in Table 1. It is worth noting that while the underlying philosophy of GLCM’s entropy and contrast is to measure local texture based on statistics, they can also be utilized as edge detectors [20]. In this context, they can be considered similar to the first-order approach, as they indirectly measure the degree of intensity change within a region. Moreover, we selected these measures in our work because their designs align well with our definition of what is considered important to focus on for sampling. They allow us to prioritize pixels in a region near disconnected characters or line separators where the intensity randomness and variation should be high.

When considering the time-complexity of computing a reference map, the convolution-based mapping functions (i.e., Scharr and LoG) take O(hw), while the GLCM-based mapping functions take \(O(hwk^2)\), where h and w are the height and width of an input image, and k is the size of the kernel. To be more precise, convolution-based functions traverse the \(h\times w\) space, with the convolution operation using a fixed \(k \times k\) kernel treated as constant time, O(1). Therefore, the overall complexity becomes O(hw). On the other hand, GLCM-based functions also iterate over the \(h\times w\) space, but building a co-occurrence matrix with a \(k\times k\) kernel takes \(O(k^2)\), which in turn gives \(O(hwk^2)\).
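As a concrete example of the reference map computation, the sketch below uses SciPy's Laplacian of Gaussian and takes its magnitude, so that strong responses appear just inside the foreground and background on either side of an edge; the choice of sigma is an assumption on our part, and a Scharr-based map could be obtained analogously (e.g., with skimage.filters.scharr).

```python
import numpy as np
from scipy import ndimage

def log_reference_map(img, sigma=1.0):
    """Reference map from the magnitude of the Laplacian of Gaussian (LoG).

    LoG crosses zero exactly on an edge, so its magnitude peaks just inside the
    foreground/background regions next to the edge, which is where ARP should
    relocate anchors. The sigma value here is illustrative.
    """
    log = ndimage.gaussian_laplace(img.astype(np.float64), sigma=sigma)
    return np.abs(log)
```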

Table 1 Representative mapping functions
Fig. 2 PDFs of textures found in the randomly selected patches modeled by GMM into three distributions: background texture, potential noise textures, and confident text textures. Two examples are shown. Correspondingly, representative patches are also shown

Regional texture: PDF. To quantify the probability of a pixel identified by a reference map (i.e., a new anchor point) being either noise or not, we use the probability density function (PDF) of the regional texture of a set of random patches. As mentioned earlier, we precompute the distribution of the texture of a set of random patches and estimate the PDF of the targeted texture (e.g., the texture of confident text). From such a precomputed PDF, we can gauge the new anchors sampled in ARP and, during the intensity adjustment in ABP, assign higher weights to the higher-quality new anchors than to the lower-quality ones.

The acquisition of the PDF takes the following steps: (1) randomly select a number of patches (e.g., 10,000 patches are selected in our experiment) in the image being processed and measure the texture of each patch (in our application, local contrast is used) (Lines 2–6 in Module 2) and (2) model three PDFs based on the distribution of the obtained texture values using a Gaussian Mixture Model [21] (Line 7 in Module 2), as shown in Fig. 2. Note that the computation of the PDF is designed for document images in grayscale, where the number of components is defined as three based on the assumption that typically three different types of local contrast exist in document images: background, potential noise, and confident text [22].

In terms of the time-complexity of computing the PDF, it takes O(ncdI), where n is the number of patches, c is the number of Gaussian components, d is the dimensionality of the data, and I is the number of iterations required for convergence. The complexity is determined through the aforementioned two steps. First, the random patch selection and texture computation require O(n) time. Second, optimizing the GMM takes O(ncdI) [23]. As there is no loop between the first and second steps, the final complexity is the sum of each step, \(O(n+ncdI)\), which simplifies to O(ncdI). In our experiments, we empirically found that collecting 10,000 patches (i.e., \(n=10000\)), computing the texture of each patch in the form of a single numeric value (i.e., \(d=1\)), and optimizing the GMM over the patch textures (i.e., \(c=3, d=1, I\le 100\)) takes approximately 0.5 s on average. Depending on the application, the PDF computation may grow proportionally to the number of patches and the dimensionality of the data (i.e., the texture representation).
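A possible implementation of this precomputation with scikit-learn is sketched below; the patch size and the rule for picking the "confident text" component (the one with the largest mean contrast) are assumptions of ours, while the 10,000 patches, the standard-deviation contrast measure, and the three GMM components follow the description above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_texture_gmm(img, n_patches=10000, patch_size=8, seed=0):
    """Fit a 3-component GMM to the local contrast of random grayscale patches."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    ys = rng.integers(0, h - patch_size, size=n_patches)
    xs = rng.integers(0, w - patch_size, size=n_patches)
    # Local contrast of each patch, measured as the standard deviation.
    contrasts = np.array([
        img[y:y + patch_size, x:x + patch_size].std() for y, x in zip(ys, xs)
    ], dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=3, random_state=seed).fit(contrasts)
    # Assumption: the highest-mean component corresponds to confident text.
    text_idx = int(np.argmax(gmm.means_.ravel()))
    return gmm, text_idx
```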

As depicted in the figure, the PDF of confident text textures (i.e., high-contrast), which is our targeted texture, is distinctive from the others (i.e., background and potential noise textures) in both noisy and clean images in terms of its mean (Line 8 in Module 2). That is, using such a PDF, we can compute the probability of the given new anchor’s regional texture being of confident text, which becomes the scaling factor for adjusting the intensity during ABP.

Fig. 3 Visualized ABP of two representative cases: boosted with a large scaling factor (top on the left) and boosted with a small scaling factor (bottom on the left). Left image: Original; Right image: LoG of the original image. First, both anchors (blue dots on the left) are relocated (red dots on the left) based on their nearby LoG (blue boxes on the right). Then, we use each relocated anchor’s texture (each red box on the left) in the original image to estimate whether the visual cues driven by LoG are reliable (i.e., noise-like or actual edge-like features). In the given example, the contrast inside the bottom red box in the original image is more likely to be of a noise-like artifact compared to that of the top red box based on the PDF of confident text textures, and thus a lower scaling factor is assigned to the corresponding intensity-change dynamic (i.e., the intensity change from the old anchor (blue dot) to its new anchor (red dot)) (colour figure online)

3.2 Adaptive relocation process

In ARP, we identify new anchors whose coordinates are adaptively relocated from the static points (i.e., anchor) based on the reference map of the original image. The objective of this adaptive relocation is to achieve the following two results. First, recall that as discussed earlier in Sect. 3.1, pixels on or nearby components are more likely to contain perceptually important information than the other pixels. These pixels are also the pixels with high local gradient values (i.e., high LoG values in this case) and would be desirable to sample. Second, to avoid severe structural distortion, we seek out the new pixel location to sample only within the local window corresponding to each of the anchors. This allows us to leverage the advantages of both content-independent and content-aware strategies (efficiency and task-specific effectiveness, respectively).

Module 3 details the adaptive relocation process. Similar to a uniform sampling-based downscaling method (e.g., bilinear interpolation), we find the coordinates of the points in the original image that are mapped onto each pixel of the downscaled image (Lines 5–8 of Algorithm 1). Note that for each mapping point in the original image, if we simply use the nearest neighboring samples (i.e., static points) for the interpolation, it becomes a naïve bilinear interpolation, which fails to embed content-dependent visual context into the downscaled image. Instead, in our method, in order to incorporate perceptually important context, we sample the points in an adaptive fashion by selecting the highest pixel value in the reference map within the region surrounding the mapping point (Lines 5–6 of Module 3). During the sampling process, to prevent selecting the same pixels repeatedly within the region near the given anchors, we flag each selected pixel (Line 7 of Module 3). This allows the adaptively selected anchors to better represent the corresponding region with regard to variability. In summary, the anchor point is effectively relocated to a new location in a content-dependent fashion.
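A minimal sketch of ARP for a single anchor is given below; the window radius is an assumption (the paper does not fix it), and the shared boolean array realizes the flagging of already-selected pixels described in Line 7 of Module 3.

```python
import numpy as np

def relocate_anchor(ref_map, used, anchor, win=2):
    """ARP for one anchor: move it to the unflagged pixel with the highest
    reference-map response inside a (2*win+1) x (2*win+1) window."""
    h, w = ref_map.shape
    y, x = anchor
    y0, y1 = max(y - win, 0), min(y + win + 1, h)
    x0, x1 = max(x - win, 0), min(x + win + 1, w)
    window = np.where(used[y0:y1, x0:x1], -np.inf, ref_map[y0:y1, x0:x1])
    dy, dx = np.unravel_index(np.argmax(window), window.shape)
    if not np.isfinite(window[dy, dx]):   # every pixel in the window already used
        return y, x
    new_y, new_x = y0 + dy, x0 + dx
    used[new_y, new_x] = True             # flag so nearby anchors pick other pixels
    return new_y, new_x
```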

Module 3 (pseudocode)

3.3 Adaptive boosting process

In ABP, the intensities of the new-found anchors are adjusted to better incorporate content-dependent visual cues in the subsequent interpolation step. The rationale behind such intensity-adjustment, so-called boosting, is twofold: (1) to better highlight visual cues and (2) to improve the sensitivity of our method to noise-like artifacts.

Fig. 4 Overview of the proposed adaptive image downscaling method. Note that the locations and intensities of the anchors to be interpolated (\(o_1,o_2,o_3,o_4\)) are adaptively adjusted (\(n^\prime _1,n^\prime _2,n^\prime _3,n^\prime _4\)) by ARP and ABP, respectively. In ARP, a new anchor is identified for each old anchor based on visual cues (e.g., high signal driven by local gradients) to find pixels highly relevant to the visual cue. In ABP, the intensity of each new anchor is adjusted based on the amount of information change (e.g., intensity-change dynamic) and its correspondence to noise (e.g., texture similarity) to better incorporate visual cues. Consequently, each pixel in a downscaled image estimated by the interpolation is expected to incorporate strong visual cues while accounting for noise

First, we reflect on the intensity-change dynamic, i.e., the intensity change from the old anchor to its new anchor. That is, we give less weight to a low dynamic (e.g., low entropy; background) and more weight to a high dynamic (e.g., high entropy; edge), since the latter implies that there is a more informative visual cue. In other words, if we adjust the intensity of the new anchor proportionally to, and in the direction of, the intensity change, the content-dependent visual cue indicated by such an anchor can have a greater impact on the interpolation.

Second, recall that the new anchors identified by ARP are driven by strong visual cues (e.g., edges). We focus on the fact that such cues can also be driven by noise-like artifacts (e.g., bleed-through), from which anchors can be falsely relocated, causing a negative impact on the subsequent interpolation step. Thus, to help us counterbalance the first rationale above, we give less weight to the intensity-change dynamic of the relocated anchor that is more likely to have noise-like texture and give more weight to the counterparts.

Given that, the boosting function is defined as:

$$\begin{aligned} z&= \min \bigl (255,\, \max \bigl (0,\, z_n - (z_o - z_n)\cdot \alpha \bigr )\bigr ) \end{aligned}$$
(1)
$$\begin{aligned} \alpha&= \int ^{texture_n}_{-\infty }PDF_{texture_{target}} \in (0,1) \end{aligned}$$
(2)

where \(z_o\) and \(z_n\) are the intensities of the original anchor and new anchor, respectively, \(\alpha \) is a scaling factor, \(texture_n\) is a quantified regional texture of the new anchor, and \(PDF_{texture_{target}}\) is the PDF of the targeted regional texture.

The term \((z_o-z_n)\) computes the intensity-change dynamic, determining the amount of boosting, and the term is scaled by \(\alpha \) to mitigate the production of noise-like artifacts. The resultant scaled amount of boosting is subtracted from the intensity of the new anchor so that its intensity is adjusted in the direction of the change. As the upper and lower bounds of the final adjusted value, we set 255 and 0, which are typical bounds for intensity.

The scaling factor ranges from 0 to 1, and its actual value is based on an additional inspection that estimates whether the regional texture of the corresponding new anchor shows a high probability of being a targeted texture. It is closer to 1 if the regional texture of the new anchor is highly likely to be confident text; otherwise, it is closer to 0. The concept of ABP is visualized in Fig. 3.

Module 4 details the adaptive boosting process. The process starts with identifying the locations of the old anchors and their new anchors (Line 3) and the corresponding intensities (Line 4). Then the local texture is computed for each new anchor (Line 5); for instance, we measure contrast using the standard deviation. The computed texture is examined to estimate whether it is the targeted texture based on the precomputed probability density function (PDF), and a higher scaling factor is assigned for a higher probability (Line 6). Lastly, we adjust the intensity of the adaptively relocated new anchor by plugging the computed values into the boosting function (Lines 7–8).
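A sketch of ABP for a single relocated anchor follows; it reads Eq. (2) as the cumulative distribution of the confident-text component of the precomputed GMM evaluated at the anchor's local contrast, which is our interpretation, and the patch size is again an assumption.

```python
import numpy as np
from scipy.stats import norm

def boost_intensity(img, old_anchor, new_anchor, gmm, text_idx, patch_size=8):
    """ABP for one relocated anchor, following Eqs. (1) and (2)."""
    z_o = float(img[old_anchor])
    z_n = float(img[new_anchor])
    # Regional texture of the new anchor: std. dev. of the surrounding patch.
    y, x = new_anchor
    half = patch_size // 2
    region = img[max(y - half, 0):y + half + 1, max(x - half, 0):x + half + 1]
    texture_n = float(region.std())
    # Eq. (2): alpha = CDF of the confident-text texture component at texture_n.
    mu = float(gmm.means_[text_idx, 0])
    sd = float(np.sqrt(gmm.covariances_[text_idx].ravel()[0]))
    alpha = norm.cdf(texture_n, loc=mu, scale=sd)
    # Eq. (1): boost in the direction of the intensity change, clipped to [0, 255].
    return float(np.clip(z_n - (z_o - z_n) * alpha, 0, 255))
```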

The end-to-end process of the proposed image downscaling strategy is illustrated in Fig. 4.

Fig. 5 Downscaled images using Lanczos (top) and the proposed method (bottom), from 7135\(\times \)4896 to 1280\(\times \)760, which is the maximum resolution that fits in the 16 GB of GPU memory. Note that downscaled images by the proposed method retain better perceptual cues (e.g., high contrast and edges)

Module 4 (pseudocode)

3.4 Computing sampled value via interpolation

Once we adaptively relocate the points and adaptively amplify their intensities for a region of the anchor, the last step is to interpolate them to estimate a value representing that region, which will then be mapped into the corresponding pixel in the downscaled image. Bilinear interpolation is a popular method for estimating an unknown value from known values based on the assumption that closer sample points are more related than further ones. However, it is most suitable when sample points are aligned in a rectangular fashion, which is not always the case in our images due to the adaptive relocation process. Thus, instead, we use inverse distance weighting (IDW) interpolation, which is defined as:

$$\begin{aligned} {\hat{z}}&=\frac{\sum ^n_iw_iz_i}{\sum ^n_iw_i}\nonumber \\ w_i&=\Vert x-x_i\Vert ^{-\beta } \end{aligned}$$
(3)

where \(z_i\) is the intensity value of the \(i^{th}\) sample point, x and \(x_i\) are the coordinates of an unknown and the \(i^{th}\) sample point, respectively, \(\Vert \cdot \Vert \) corresponds to the Euclidean distance, and \(\beta \ge 0\) is inverse distance power that determines the degree to which the nearest sample points are preferred over more distant points. Note that IDW interpolation shares the same assumption as bilinear interpolation for its estimation, but the sample points are not required to form a rectangle [24].
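The IDW estimate of Eq. (3) can be written directly as below; the default power beta=2 is a common choice and an assumption on our part, and the fallback handles the degenerate case where the query point coincides with a sample.

```python
import numpy as np

def idw_interpolate(point, coords, values, beta=2.0, eps=1e-12):
    """Inverse distance weighting (Eq. 3) at `point` from sample points `coords`."""
    point = np.asarray(point, dtype=np.float64)
    coords = np.asarray(coords, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    d = np.linalg.norm(coords - point, axis=1)
    if np.any(d < eps):                 # query coincides with a sample point
        return float(values[np.argmin(d)])
    w = d ** (-beta)
    return float(np.sum(w * values) / np.sum(w))
```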

Examples of the final resulting downscaled images generated by our method and a content-independent-based method are shown in Fig. 5. Importantly, perceptually important context is retained effectively in the downscaled image generated by our method compared to the one by the conventional method, as demonstrated by the higher contrast and superior line separator highlighting.

The time-complexity analysis of our downscaling method is presented as follows. In Sect. 3.1, we discussed the precomputation phase, which takes \(O(hw)+O(ncdI)\) due to its underlying operations. The main algorithm involves traversing the \(h\times w\) space, performing consecutive operations of ARP, ABP, and interpolation. Each of these operations has a time-complexity of O(C), where C represents the number of anchor points. Combining all the steps, the final time-complexity of our downscaling method is determined to be \(O(hw)+O(ncdI)+O(hwC)\). This analysis reflects the computational complexity of our approach in achieving downscaled results for the given input data, considering the various stages involved and the impact of anchor points on the overall performance.

4 Experiment

To evaluate the effectiveness of our approach in retaining perceptually informative details, we compare semantic segmentation performance between DCNNs trained on the images downscaled by our proposed approach and by Lanczos [2], which is one of the representative content-independent strategies. We evaluate our approach in this way because the semantic segmentation task is one of the applications that relies heavily on both perceptual details and spatial structures for inference [25]. In particular, it is worth noting that we were unable to compare our method with a purely content-aware strategy, such as seam-carving [8], because it severely destroys the spatial structure, as shown in Fig. 6. The original authors acknowledge this specific limitation [8], noting that the method is susceptible to failure when applied to condensed images containing a substantial number of objects, such as the letters in our application. This limitation is in line with the findings of a separate study by [26], which argues that unconstrained seam-carving applied to handwritten manuscripts, without incorporating any prior knowledge about text layout, may inadvertently remove seams that pass through gaps between multiple lines of text. In our scenario, these limitations would result in distortion of the overall structure. Consequently, generating the corresponding ground-truth becomes non-trivial, thereby undermining the training process.

In the following subsections, we discuss the training scenarios, the datasets, the segmentation model and training setting, and evaluation metrics that are used for the performance comparison.

Fig. 6 Exemplary BBZ images that are downscaled by the content-aware method, seam-carving (from \(4293\times 6372\) to \(896\times 1280\)): (Left) Original image, (Middle) Width-first seam-carving, (Right) Height-first seam-carving. Note that due to the severe spatial distortion, it is non-trivial to construct the corresponding ground-truth for training

4.1 Training scenarios

To investigate whether and how well the images downscaled by the proposed approach retain informative features for training, in addition to training a model with a set of simply downscaled images (i.e., the stand-alone format), we train a model with two additional scenarios. Specifically, we use a set of (1) image-pyramid and (2) augmented images produced by our method. The two additional training scenarios are employed in our experiment based on the following reasons.

First, given that image-pyramid is an effective representation for scale invariant processing (e.g., object detection, segmentation, etc.) [27], if our method is capable of retaining useful features effectively in various scales, the segmentation model trained on the resultant image-pyramid should produce improved performance.

Similarly, augmentation is an effective method to increase the variance in training sets, from which the generalizability of a model can be improved [28]. Therefore, if our method is capable of retaining useful features that are distinct from those retained by the content-independent method, the segmentation model trained on the augmented dataset should produce improved performance.

In the stand-alone format, a model is trained with only a set of single images downscaled by either Lanczos or our approach. In the image-pyramid format, a model is trained with a set of stacked images where each image is downscaled using either the Lanczos method or our approach with different factors (i.e., by 2, 4, 8, etc.). In the augmentation format, a model is trained with a set of images downscaled by our approach in addition to the Lanczos method.

In accordance with the downscaled images, we also downscale the corresponding ground-truth segmentation masks. Rather than attempting to downscale each mask using the proposed method (for instance, by tracking the sampled pixels and interpolating the corresponding pixels in the mask) to achieve precise alignment with the original image, we opted for a simpler content-independent method, Lanczos. This choice is simple yet effective because the proposed method is designed to generate a downscaled image with minimal spatial distortion, so the preserved layout structure mirrors the structure in the Lanczos-downscaled mask.
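For clarity, the sketch below shows how the three training sets could be assembled from the same originals; the function names and the pyramid factors f=1, 2, 4 (those reported in Sect. 5.2) are illustrative, and the ground-truth masks are downscaled with Lanczos in every scenario as stated above.

```python
def build_training_sets(originals, lanczos_downscale, adaptive_downscale, target_hw):
    """Assemble the stand-alone, image-pyramid, and augmentation training sets."""
    h, w = target_hw
    # Stand-alone: each image downscaled once by our adaptive method.
    stand_alone = [adaptive_downscale(img, h, w) for img in originals]
    # Image-pyramid: a stack per image, downscaled by factors 1, 2, and 4.
    image_pyramid = [
        [adaptive_downscale(img, h // f, w // f) for f in (1, 2, 4)]
        for img in originals
    ]
    # Augmentation: the Lanczos-downscaled set extended with our downscaled set.
    augmentation = [lanczos_downscale(img, h, w) for img in originals] + stand_alone
    return stand_alone, image_pyramid, augmentation
```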

4.2 Datasets

In our experiments, we use three publicly available datasets: Berliner Börsen–Zeitung (BBZ), Europeana Newspapers Project (ENP), and Bibliothèque nationale du Luxembourg (BnL), containing a total of 104, 528, and 1,220 digitized historical newspaper images, respectively. Importantly, the datasets cover several languages (e.g., English, German, French, etc.), time periods (e.g., 17th–20th centuries), color representations (e.g., binary, grayscale, and color), and structural and textual features (e.g., multi-column, mixed font, running titles, etc.). In Table 2, we summarize and compare the quality-relevant (e.g., noise, size, width-to-height ratio, and page layout) and quantity-relevant characteristics that are likely to impact the segmentation performance, among the various characteristics of each dataset. All datasets have ground-truth files, annotated by domain experts, where BBZ and ENP contain region outlines and types (e.g., text, image, table, etc.) and BnL contains only text. For more detail, we refer to [29, 30] and [31].

Table 2 Characteristics of the datasets used in our experiments

4.3 Segmentation model and training setting

We use one of the state-of-the-art DCNN segmentation models that is specifically designed as a generic model for the historical document image domain [3]. In particular, we train this model for two different segmentation tasks named blk and sep, where blk separates background, text regions, and table regions, and sep separates straight or slightly curved separator lines between text regions and inside tables [29]. Note that both blk and sep are important tasks in the area of document layout analysis. Moreover, sep is considered to be a more challenging task than blk due to two factors: (1) a model is susceptible to false positives for look-alike components, such as table lines or drawings, and (2) it inherits the class imbalance issue, in which most of the pixels are background. As an exception, we perform only the blk task for BnL due to its limited ground-truth information, as discussed above in Sect. 4.2.

Regarding the training setting, we use the Adam optimizer, which has been shown to be largely successful in reducing the labor of tuning the learning rate [32], where the base value for each task is empirically searched from [1e−5, 1e−4, 1e−3]. To prevent over-fitting, we terminate training using early stopping. Additionally, we perform a fivefold cross-validation and use the average over the folds as the final performance. For the batch size, we use a size of 1 considering the given GPU memory constraint, which was not sufficient to load more than one large image. Note that each training image is downsized such that the total number of pixels does not exceed 720,000 while keeping the original width-to-height ratio. As an effort to ease skewness, which is an inherent image-level issue with digitized historical newspapers, we use a mild rotation (±0.2\(^{\circ }\)) as a data augmentation technique.
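As a worked example of the resizing constraint, the helper below computes the largest size that keeps the aspect ratio while staying under the 720,000-pixel budget; the function itself is our illustration, not part of the training code.

```python
import math

def budget_resize_shape(h, w, max_pixels=720_000):
    """Largest (height, width) within the pixel budget at the original aspect ratio."""
    scale = math.sqrt(max_pixels / (h * w))
    if scale >= 1.0:                      # already within the budget
        return h, w
    return max(1, math.floor(h * scale)), max(1, math.floor(w * scale))

# Example: a 7135 x 4896 page (as in Fig. 5) maps to 1024 x 702, about 719k pixels.
print(budget_resize_shape(7135, 4896))   # (1024, 702)
```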

All the experiments were performed on Ubuntu 16.04.06 LTS with a single Tesla V100 GPU with 16 GB of GPU memory.

Fig. 7 Comparison of adaptively relocated anchors identified by second-order (top) and non-second-order (bottom) methods. The graphs overlaid on the original image visualize the signal responses of LoG and GLCM (entropy). Dots represent the relocated anchors based on the high signal responses. Note that the high signal responses found by LoG tend to reside inside either foreground or background pixels compared to those of GLCM. As a result, the interpolated values in the resultant downscaled image by LoG better highlight the contrast

4.4 Metrics

It is worth noting that there are various measures when it comes to evaluating semantic segmentation tasks, and it is unclear which of the measures are best suited as indicators for overall segmentation quality [29]. Given this situation, we evaluate our method based on four widely used segmentation quality metrics. First, we use Matthews correlation coefficient (MCC) to evaluate the overall segmentation quality of a DCNN, which has been found to be less likely to mislead than F1 and pixel accuracy, and well-suited for experiments on highly unbalanced label classes (e.g., mostly either background or text over table or separators) [33].

$$\begin{aligned} \text {MCC} = \frac{\text {TP} \times \text {TN} - \text {FP} \times \text {FN}}{\sqrt{(\text {TP}+\text {FP})(\text {TP}+\text {FN})(\text {TN}+\text {FP})(\text {TN}+\text {FN})}} \end{aligned}$$

Second, we use mean intersection over union (mIoU), i.e., the intersection over union averaged over classes, to give a holistic view of the model’s performance. We use this measure, as opposed to accuracy, because it considers false-positive predictions.

$$\begin{aligned} mIoU = \frac{\text {TP}}{\text {TP}+\text {FP}+\text {FN}} \end{aligned}$$

For the third and fourth metrics, we use precision and recall to further evaluate the performance of the models.

$$\begin{aligned} \text {Precision}&=\frac{\text {TP}}{\text {TP}+\text {FP}}\\ \text {Recall}&=\frac{\text {TP}}{\text {TP}+\text {FN}} \end{aligned}$$
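For reference, the four metrics can be computed from the per-class confusion counts as in the sketch below; mIoU is then the mean of the per-class IoU values, and the zero-denominator guards are our own additions.

```python
import math

def segmentation_metrics(tp, fp, tn, fn):
    """MCC, IoU, precision, and recall from per-class pixel counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return mcc, iou, precision, recall
```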

5 Results and discussion

In this section, we report the experimental results and discuss our observations. Given that our experiment involves various factors that include segmentation tasks, training scenarios, and datasets, to better comprehend the effectiveness of our downscaling method, our discussion is structured as follows. First, in Sect. 5.1, we evaluate our downscaling method for two types of segmentation tasks, blk and sep, to explore its effectiveness in retaining visually informative features for segmentation. Then, in Sect. 5.2, we evaluate different scenarios of using our method to train a segmentation model, from which we investigate the robustness of the features retained by our method. In Sect. 5.3, we further evaluate our method on different datasets to discuss its generalizability on a wide range of document image domains.

5.1 Performance in segmentation tasks

In this section, we evaluate the segmentation performance of DCNN models that are trained on the downscaled images generated by (1) the content-independent approach using Lanczos (i.e., baseline) and (2) our method of using four different mapping functions as a reference map. Our focus in this investigation is to explore whether the downscaled images produced by our method effectively retain visually important features for two different types of segmentation tasks, blk and sep. The downscaled images are in the stand-alone format during training.

Fig. 8 A visual comparison of the quality of segmentation in the blk and sep tasks on BBZ images

For the blk task, as shown in Table 3, the results demonstrate that the model trained on the images downscaled by our method, with the use of any mapping function, achieves better performance than the baseline in all metrics. Such results imply that our proposed method is flexible in the selection of mapping functions for computing the reference map. That is, depending on the application, it is possible to customize the mapping function to better reflect what to consider as a visually important cue.

Similar results are observed for the sep task in that our method using any mapping function improves the overall segmentation performance compared to the baseline in all metrics, as shown in Table 4.

It is worth noting that, for each task, using LoG as the mapping function achieves slightly better performance than the other mapping functions. The performance differences likely stem from the nature of LoG, a second-order derivative filter, whose high signal responses (i.e., visually important cues) are found inside either the foreground or the background pixels near the edges. In contrast, responses from non-second-order methods, such as the entropy of GLCM, are found right on the edges, as shown in Fig. 7. That is, the anchors relocated by LoG can better highlight the contrast, whereas the anchors relocated by GLCM are bound to be blurry.

In addition, we investigate whether using our method yields different performance gains depending on the type of segmentation task. Comparing Tables 3 and 4, we observe that our method is more effective on the sep task than on the blk task in improving the overall segmentation quality; that is, the average performance gain driven by our method against the baseline on the sep task (i.e., Table 4) is higher than that on the blk task (i.e., Table 3) in every metric. Additionally, Fig. 8 shows the actual segmentation results of both the blk and sep tasks, providing a visual representation of the performance improvement achieved by our method in terms of real-world use cases. From the figure, we can visually see that the overall performance gains in the blk task are minimal. Indeed, Tables 3 and 4 support this observation. For example, in Table 3, there was only a 2.6% improvement in MCC by A.+LoG, which is smaller than the 7.2% improvement in MCC by A.+LoG achieved in the sep task, as shown in Table 4. Such limited improvement in blk may motivate further investigation in the future on the conceptual potential of our content-aware downscaling method in recognition of text-like components (e.g., letters or numbers). However, the more significant accuracy improvement achieved in the sep task gives us more confidence in applying our method to retain line-relevant visual features (e.g., straight edges) for a segmentation model.

Overall, from this investigation, using our content-aware downscaling method is shown to be more effective than the content-independent downscaling method, Lanczos, in obtaining visually informative features, especially for the sep task.

Table 3 Segmentation performance of DCNNs on the blk task where the models are trained with images downscaled by baseline and ours
Table 4 Segmentation performance of DCNNs on the sep task where the models are trained with images downscaled by baseline and ours

5.2 Performance with different training scenarios

As described in Sect. 4.1, in addition to the typical training scenario of feeding a set of downscaled images (i.e., stand-alone images), we investigate the following two additional training scenarios: (1) feeding a set of image-pyramids and (2) feeding augmented images, both generated using our downscaling method. From this investigation, we explore the viability of our method as a tool for multi-scale operation and augmentation, respectively. Specifically, for the image-pyramid, we feed a stack of images in which each image is downscaled with a different scaling factor (i.e., \(f=1\), \(f=2\), and \(f=4\)) using either Lanczos (baseline) or our method. For the augmentation, we increase the size of the training dataset by including the above set of downscaled images. Thus, the two training scenarios differ in how the downscaled images are fed into the model: one captures multi-scale features (as in image-pyramid), and one increases variance in the dataset (as in augmentation). Also, note that in this investigation, we use only LoG since it outperforms the other mapping functions, as reported in the previous section.

As shown in Table 5, training a model with a set of image-pyramid generated by our method achieves better performance than the baseline on all datasets in all metrics in both tasks. Such a result implies that our downscaling method is better at retaining visual cues across multiple scales (i.e., image-pyramid) than Lanczos.

Table 5 Segmentation performance of DCNNs on the blk and sep task where the models are trained with images downscaled by baseline and ours using image-pyramid training scenario

As shown in Table 6, training a model with the augmented dataset, which is produced by adding images generated by our method, also shows better performance than the baseline on all datasets in all metrics in both tasks. Importantly, among the two additional training scenarios, the augmentation scenario shows the overall best performance in terms of MCC, mIoU, and precision. Such a result is reasonable in that the models trained with a set of either stand-alone or image-pyramid images are limited to generalizing to either relatively poor quality (i.e., stand-alone) or enhanced quality (i.e., image-pyramid) images, whereas the model trained with the augmented dataset (i.e., Lanczos+ours) becomes capable of generalizing to both poor and enhanced quality images.

Table 6 Segmentation performance of DCNNs on the blk and sep task where the models are trained with images downscaled by baseline and ours using the augmentation training scenario

The results from this investigation show the viability of our method as used for different training scenarios, i.e., image-pyramid and augmentation. This suggests that our method is capable of retaining visual features that are (1) useful for multi-scale operations (i.e., image-pyramid) and (2) distinctive (i.e., augmentation) from the features that can be obtained by the content-independent method.

5.3 Performance in different datasets

In this section, we further investigate the performance gain driven by our method on the three different datasets to assess its robustness.

As shown in Tables 3, 4, 5, and 6, we observe that the overall performance gains on the different datasets are similar. For instance, as shown in Table 3, the MCC gains driven by our method using LoG are +2.47, +2.3, and +1.45 on the BBZ, ENP, and BnL datasets, respectively. Also, as shown in Table 4, the MCC gains on the BBZ and ENP datasets are +5.82 and +6.95, respectively. Similar results are observed in Tables 5 and 6.

As discussed earlier in Sect. 4.2, ENP has (1) a distinctive quality (i.e., it is more challenging in terms of noise, dimension, and page layout) compared to the other two datasets, and BnL has (2) a distinctive quantity (i.e., 1,220 images vs. 104 and 528). That our method achieves consistent performance gains on all of them signifies its generic usability on a wide range of document image domains.

6 Conclusion

In this paper, we proposed a novel adaptive image downscaling approach that combines the strengths of content-independent and content-aware strategies. The proposed approach aims to address the limitation of conventional image downscaling methods—i.e., losing perceptual details—by focusing on retaining both (i) the spatial relationship between instances (e.g., characters) using a local window and (ii) perceptually important features (e.g., edges) using visual cue-driven location and intensity adjustment. In this way, our method can be beneficial for training a segmentation model on large images.

Through several comprehensive experiments, we showed the proposed method’s robustness with respect to the following three characteristics: types of segmentation tasks, training scenarios, and datasets. The overall findings from such experiments can be summarized as follows. First, our approach is effective in retaining visually informative features without severe spatial distortion; hence, it can be used in applications where both visual cues and spatial structure are equally important, such as a semantic segmentation task. Second, for image downscaling, using LoG is deemed superior to entropy in measuring the local intensity gradient due to its nature as a second-order derivative (i.e., high signal responses reside inside either foreground or background pixels, near, instead of right on, the edges). Such a finding may provide useful insight into designing an advanced segmentation model architecture or training strategy, for instance, designing a model to pay more attention to inputs near, instead of right on, the edges. Third, the visual features retained by our method are versatile in terms of their utilization in multi-scale operation and augmentation approaches. This implies that our adaptive downscaling approach can supplement image representation to improve a multi-modality training framework.

For future work, we will extend the application of our approach to other document image domains with more diversely structured layouts, such as magazines. Also, we will investigate the application of our approach to various types of document processing tasks, such as classification. Besides, we will examine additional mapping functions that are not limited to the intensity-gradient aspect, so that we can expand the applicability of our approach to settings that require a different assumption about the visually important pixels. Additionally, we will explore whether our adaptive relocation and boosting concept can be directly embedded into the convolutional neural network architecture itself, such as a layer, by which the kernels in the network can learn the importance of input from the previous layer in a content-adaptive fashion.