Inv3D: a high-resolution 3D invoice dataset for template-guided single-image document unwarping

Numerous business workflows involve printed forms, such as invoices or receipts, which are often manually digitized so that the data can be persistently searched and stored. As hardware scanners are costly and inflexible, smartphones are increasingly used for digitization. Here, processing algorithms need to deal with prevailing environmental factors, such as shadows or crumples. Current state-of-the-art approaches learn supervised image dewarping models based on pairs of raw images and rectification meshes. The available results show promising predictive accuracies for dewarping, but the remaining errors still lead to sub-optimal information retrieval. In this paper, we explore the potential of improving dewarping models using additional, structured information in the form of invoice templates. We provide two core contributions: (1) a novel dataset, referred to as Inv3D, comprising synthetic and real-world high-resolution invoice images with structural templates, rectification meshes, and a multiplicity of per-pixel supervision signals and (2) a novel image dewarping algorithm, which extends the state-of-the-art approach GeoTr to leverage structural templates using attention. Our extensive evaluation includes an implementation of DewarpNet and shows that exploiting structured templates can improve the performance for image dewarping. We report superior performance for the proposed algorithm on our new benchmark for all metrics, including an improved local distortion of 26.1 %. We made our new dataset and all code publicly available at https://felixhertlein.github.io/inv3d.

Introduction

Dedicated scanning hardware can be used to automatically extract information. This, however, creates additional costs and reduces the flexibility of the given solution.
In order to overcome the hardware restriction, current state-of-the-art approaches attempt to analyze document images taken with smartphones. Prominent examples are DewarpNet [6] and GeoTr [9], which learn to dewarp images in a supervised fashion, using the available dewarping meshes as ground truth. While these approaches generate promising results, they are still not sufficiently robust when dealing with environmental factors, such as light incidence, shadows, occlusions, crumpled or folded paper, and perspective transformations.
One potential remedy is the use of additional structured information in the form of templates, which capture the general structure of the documents and can thus guide unwarping. While it might be tedious to define initial templates, the added value is significant due to the considerable increase in dewarping precision and robustness.
In this paper, we follow exactly this path and propose a novel labeled invoice dataset with additional structural information to assist image dewarping. More specifically, we present Inv3D, a large, high-resolution invoice dataset comprising both synthetic data generated from carefully designed templates and challenging real-world data. Inv3D consists of 25,000 samples, each composed of four flatbed invoice image layers, two ground-truth annotations, the 3D warped document, nine supervision signal maps, and the backward transformation map (see Fig. 1). We then propose a novel supervised dewarping approach, referred to as GeoTrTemplate, which exploits the novel structured information by extending the recent GeoTr [9] algorithm. We encode both the warped image and our template image using a convolutional neural network and combine the feature representations. The subsequent transformer encoder-decoder module learns the attention between all pairs of features, which enables the model to combine the warped image with our structural template information. For our extensive evaluation, we trained DewarpNet [6] without refinement network and GeoTr [9] in a unified framework and evaluated both approaches on the Doc3D dataset as well as on Inv3D, making use of established metrics such as MS-SSIM, LD, ED, and CER. In addition, we introduce the newer perceptual metric LPIPS [50] as a benchmark metric for document dewarping. To the best of our knowledge, this is the first dataset and approach to make use of structured template information for available invoices to foster research on robust image dewarping systems.
Our contributions are threefold:
• We present a novel high-resolution dataset with template information, 3D renderings, a multiplicity of supervision signal maps, and backward transforms to enable designated learning of structural features for image dewarping.
• We propose a novel image dewarping algorithm, which improves the state-of-the-art by a considerable margin through leveraging additional template information.
• We provide an extensive empirical evaluation of the novel dataset and model. In addition, we publish our code and data in their completeness to be utilized as a benchmark system for future research.
The paper is structured as follows: First, we review related work; then, we explain the creation process of Inv3D and present our own approach before reporting our evaluation. We conclude with Sect. 6.

Related work
For the extraction of textual information from images, several factors strongly influence the performance of off-the-shelf solutions such as Tesseract [38]. Images from flatbed scanners have constant illumination, little noise, and no deformations. Since these factors are important for the accuracy of OCR, there has been research to reconstruct those conditions from images of documents that were captured in the wild.
Datasets. There are several datasets available that focus on document rectification, which are summarized in Table 1. An early real dataset comprising 102 images of bent documents was presented for evaluation in a page dewarping contest [35]. A large synthetic dataset focusing on bends only, generated using the cylindrical surface model of Cao et al. [2], was presented by Garai et al. [12]. A step toward more complex deformations, and thus toward higher applicability for real-world applications, was presented by Ma et al. [29]. They created a large synthetic dataset comprising bends and folds. Their deformations were generated in 2D and thus generalize poorly to 3D structures. This dataset was extended by RectiNet [1]. The current state-of-the-art dataset is Doc3D [6]. It is a large document unwarping dataset comprising captured 3D meshes and a collection of rendered document images. It adds more realism by including crumpled documents. CREASE [31] follows the pipeline of Das et al. [6] to render a high-resolution dataset with additional supervisory signals, which, however, is not publicly available. Inv3D represents the first publicly accessible dataset with high resolution and complex 3D structures. Similar to CREASE, we provide additional 3D annotations such as per-pixel angle, curvature, and text maps. In addition, Inv3D is the first template-based dataset that enables research on structured documents in the warped domain with the support of flatbed template information while comprising the same challenging aspects as Doc3D. The exploitation of known flatbed templates at inference time might ease the problem of image dewarping and thus improve the potential for real-world applications of this technology.
Approaches. The correction of illumination and noise in document images has been considered in [7,9,21,33,41]. There is also work on improving text extraction from noisy images [16]. The task of document rectification, i.e., generating a flatbed version of the captured document, is at the core of our work and thus reviewed in more detail. While there has been work using 3D sensors for document unwarping [4,23,40,47], we are interested in recovering a flatbed version of documents from a single image only. Features in 2D have played an important role for this task. Popular examples include the usage of baselines of horizontal text [14,17,18,26,36] and vertical stroke boundaries [27,28]. These feature-driven approaches are frequently combined with the assumption of a cylindrical surface model [2] or a developable surface [22]. Tian and Narasimhan [39] relaxed the assumption on the surface model by allowing the reconstruction of a broader class of 3D shapes. A CNN-based approach for texts in English and Bangla was presented by Garai et al. [12].
One drawback of the aforementioned works is that they focus on bent surfaces. Folds and especially crumples, which make the geometry of the surface significantly more complex, are neglected despite their common occurrence in document images. The first work using deep learning for distortion estimation, and not only feature extraction, from a single captured image is DocUNet [29]. The usage of this deep learning pipeline significantly sped up the unwarping process compared to previous methods; however, only 2D deformations were used during training. RectiNet [1] uses a gated and bifurcated stacked U-Net module and slightly improves upon the original DocUNet model. DewarpNet [6] improves upon DocUNet by incorporating 3D information. A two-stage system with two sub-networks, one for unwarping and one for texture mapping, is presented. Xie et al. [43] introduced a deep learning-based method that uses displacement flow estimation. Here, Doc3D is used as a dataset. CREASE [31] introduces a context-aware end-to-end dewarping pipeline. They acknowledge the importance of predicting the orientation of the text and add a text angle prediction as previously seen in Scene Text Recognition tasks. Feng et al. [9] introduce a transformer-based model to capture the global context of the document via self-attention. Xie et al. [44] propose a model which learns sparse control points that are used to interpolate the backward mapping. Das et al. [8] decouple local and global unwarping by dividing the image into patches prior to the rectification process and stitching them together afterward. DocScanner [10] and Marior [49] tackle the unwarping problem iteratively, refining a single rectification estimate in multiple steps. Xue et al. [46] unwarp the documents in two steps using a coarse and a refinement transformer. In contrast to previous work, the authors focus on high-frequency signals for training. Jiang et al. [15] formulate the task as a constrained optimization problem.
They segment background/foreground, detect text lines, and use these constraints to search for an unwarping map. The PaperEdge system by Ma et al. [30] pre-unwarps the documents based on the outline of the paper sheet and subsequently learns a texture-based deformation to obtain the final result. Lastly, the authors of [11] propose DocGeoNet, a geometrically constrained representation of the document. The approach learns an intermediate 3D representation of a given document before inferring the backward map. While the state-of-the-art shows promising results, increasing both the quality and reliability of current approaches remains a very active area of research. We chose to base our approach on the geometric unwarping model DocTr [9], which shows highly competitive unwarping performance. DocTr is a particularly good choice since its integrated attention mechanism allows the model to meaningfully integrate the additional structural information in the form of templates.

Dataset
Since, to the best of our knowledge, there is no large-scale, high-resolution dataset for document dewarping publicly available which includes visual templates, we fill this gap by creating a novel dataset called Inv3D. The dataset creation pipeline consists of four stages: resource preparation, invoice rendering, invoice warping, and finally auxiliary map generation. Note that our dataset Inv3D focuses on a single type of document only, namely invoices. Since we provide our generation pipeline, one can easily create their own dataset using other types of documents. We conclude this section by introducing a new real-world evaluation dataset called Inv3DReal.

Resource preparation
In order to create convincingly realistic invoices, we collected publicly available invoice templates for entrepreneurs in common text processing formats. By converting them to HTML documents, we are capable of manipulating given formats and contents through simple text modifications while preserving their overall layout. We replaced all exemplary content provided by the input document with machine-readable tags such as {{ seller.company.name }}. Thus, our system is capable of automatically inserting the correct content at the correct position within the invoice Web page as intended by the invoice template creators. In total, we collected and prepared 100 different invoices which form the basis for the subsequent stages.
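For illustration, the tag-substitution step can be sketched in a few lines of Python. This is a minimal sketch: `render_template` and the regex-based path resolution are our own illustration, not the actual pipeline code, which operates on full HTML documents.

```python
import re

def render_template(html: str, content: dict) -> str:
    """Replace {{ dotted.path }} tags in an HTML template with values
    from a nested content dictionary."""
    def resolve(match):
        value = content
        for key in match.group(1).strip().split("."):
            value = value[key]  # walk the dotted path, e.g. seller.company.name
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", resolve, html)

html = "<p>Sold by {{ seller.company.name }}</p>"
content = {"seller": {"company": {"name": "ACME GmbH"}}}
print(render_template(html, content))  # <p>Sold by ACME GmbH</p>
```

Because the tags are plain text, the same mechanism works for any field in the template without touching the layout markup.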

Pipeline: invoice
The first step in creating a dataset sample is creating a realistic invoice instance. Starting from an invoice Web template, we perform random content generation and apply random appearance changes before rendering the invoice instance files.
Random content generation. We randomly create fake sales orders and personas that resemble real invoices as closely as possible using existing libraries 1 and the E-Commerce Kaggle dataset [3]. To achieve a high level of realism, we retain the data coherency during the generation process and fit the data to the layout constraints imposed by the Web page template, i.e., the number of available rows. Additionally, we generate random representations of the data to increase its variance, e.g., different date formats. Since we provide the generated content in a structured manner, together with its text representation and position in the document, our dataset can be used for the task of information extraction.
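The coherency constraint, namely that all derived fields must agree with the same underlying quantities, can be illustrated with a small stand-alone sketch. The field names and value ranges here are assumptions for illustration; the actual pipeline additionally uses the Faker library and the E-Commerce Kaggle dataset.

```python
import random
from datetime import date

DATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%b %d, %Y"]  # varied representations

def fake_order(n_items: int, seed: int = 0) -> dict:
    """Generate a coherent sales order: per-line totals and the grand
    total are derived from the same quantities and unit prices."""
    rng = random.Random(seed)
    items = []
    for i in range(n_items):
        qty = rng.randint(1, 5)
        unit_price = round(rng.uniform(1.0, 500.0), 2)
        items.append({"pos": i + 1, "qty": qty, "unit_price": unit_price,
                      "line_total": round(qty * unit_price, 2)})
    return {
        # The same date could be rendered in any of several formats.
        "date": date(2022, 3, 14).strftime(rng.choice(DATE_FORMATS)),
        "items": items,
        "total": round(sum(it["line_total"] for it in items), 2),
    }

order = fake_order(3)
assert order["total"] == round(sum(it["line_total"] for it in order["items"]), 2)
```

Keeping the totals derived rather than sampled independently is what makes the generated invoices internally consistent.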
Random appearance changes. By applying random modifications to the invoice Web templates, we increase the visual variance and thus reduce the potential for overfitting. We employ color and font substitutions, as well as random font size scaling. Furthermore, we replaced the logo of the given invoice document-if present-with a random logo image from the Large Logo Dataset [34] and altered the document margin.
Rendering. To create a fake invoice sample, the random content is filled into the randomly modified Web templates and rendered as an image in A4 format. Furthermore, we create three auxiliary images using JavaScript and CSS manipulations: the information delta, the template, and the text mask (see Fig. 2). The information delta depicts all randomly generated texts. The template image shows everything except the information delta and hence the overall structure of the given document, including static text. The third auxiliary image contains all text within the invoice document. Additionally, we provide two types of ground-truth information: the true word list and the relevant image areas. The latter describes which information is expected to be at which position within the document; Fig. 2e visualizes the relevant areas. These ground-truth annotations are relevant for tasks other than image dewarping, e.g., information extraction and document understanding.

1 https://faker.readthedocs.io/en/master/.

Fig. 2 A synthetically generated invoice sample. From a to e: full document, information delta, template, text mask, and relevant area visualization

Pipeline: warping
The next step in the dataset generation process is the mapping of flat invoices to deformed sheets of paper in 3D and creating 2D renderings (see Fig. 3a). We project our invoices onto the meshes from Doc3D [6] using Blender. 2 For the environment maps, we used the Laval Indoor HDR dataset [13]. In contrast to the Doc3D dataset, we rendered our samples with a considerably higher resolution in order to better represent the real-world scenario. We chose 1600 × 1600 pixels instead of the 448 × 448 pixels used by Doc3D.

Pipeline: auxiliary
In addition to the previously mentioned ground-truth maps, we create and provide four more maps to facilitate the usage of our dataset and to reduce the need for computation-intensive calculations. The first is a high-resolution backward map (BM), that is, a discrete function to remap relative pixel positions to their original relative position. The backward map is the inverse mapping to the UV map, which specifies the deformation of the original mesh in the warped space. Since the UV map generated by Blender is incomplete in the border region of the texture, we used nearest-neighbor extrapolation to fill the missing pixels in the backward map. The other auxiliary maps are relevant for CREASE [31], namely per-pixel orientation angles (see Fig. 3h), curvature estimations (see Fig. 3i), and text masks (see Fig. 3j).
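To illustrate how a backward map is applied, the following sketch unwarps an image with nearest-neighbor sampling. This is a simplification: the channel order of the map is an assumption, and production code would use bilinear interpolation, e.g., via `cv2.remap` or `torch.nn.functional.grid_sample`.

```python
import numpy as np

def apply_backward_map(warped: np.ndarray, bm: np.ndarray) -> np.ndarray:
    """Unwarp an image with a backward map.

    warped: (h, w, c) input image.
    bm:     (H, W, 2) map of relative (y, x) source coordinates in [0, 1];
            output pixel (i, j) is sampled from warped at bm[i, j].
    """
    h, w = warped.shape[:2]
    ys = np.clip((bm[..., 0] * (h - 1)).round().astype(int), 0, h - 1)
    xs = np.clip((bm[..., 1] * (w - 1)).round().astype(int), 0, w - 1)
    return warped[ys, xs]

# The identity backward map leaves the image unchanged.
img = np.arange(4 * 4 * 3).reshape(4, 4, 3)
yy, xx = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4), indexing="ij")
identity_bm = np.stack([yy, xx], axis=-1)
assert (apply_backward_map(img, identity_bm) == img).all()
```

Because the map stores relative coordinates, the same backward map can be applied to the warped image at any resolution.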

Design decisions
Since we are using more than one external resource (invoice documents, logos, environments, object meshes, and fonts), we need to define the dataset's train, validation, and test splits before creating the dataset. We split all resources according to our split ratios (66.6 % train, 16.7 % validation, 16.7 % test) and assigned each resource partition to the respective split. This way, we prevent the information leakage that would arise from using a resource in more than one split at a time. Note that the fonts were split at the font level instead of the style level. Furthermore, the meshes were split with regard to their generation: most meshes were created by modifying a recorded parent mesh [6], so we split all meshes by their parent mesh to keep a clear separation between splits.
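A leakage-free, group-aware split such as the parent-mesh split described above can be sketched as follows. `grouped_split` and the exact assignment procedure are illustrative assumptions, not our released code.

```python
import random

def grouped_split(items, group_of, ratios=(0.666, 0.167, 0.167), seed=0):
    """Assign whole resource groups (e.g. all meshes sharing a parent mesh)
    to train/val/test so that no group spans two splits."""
    groups = sorted({group_of(x) for x in items})
    random.Random(seed).shuffle(groups)
    n_train = int(ratios[0] * len(groups))
    n_val = int(ratios[1] * len(groups))
    split_of = {}
    for i, g in enumerate(groups):
        split_of[g] = ("train" if i < n_train
                       else "val" if i < n_train + n_val else "test")
    return {name: [x for x in items if split_of[group_of(x)] == name]
            for name in ("train", "val", "test")}

# Meshes m1 and m2 share parentA, so they always land in the same split.
meshes = [("m1", "parentA"), ("m2", "parentA"), ("m3", "parentB"), ("m4", "parentC")]
splits = grouped_split(meshes, group_of=lambda m: m[1])
parents = [{m[1] for m in s} for s in splits.values()]
assert all(a.isdisjoint(b) for a in parents for b in parents if a is not b)
```

Splitting at the group level rather than the sample level is what guarantees that no parent mesh, font family, or template contributes to two splits.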

Real-world dataset
To complete the Inv3D dataset, we created a real-world dataset, referred to as Inv3DReal, to measure the performance of dewarping models under realistic conditions. Inv3DReal consists of 360 pictures displaying printed and altered invoices taken by a smartphone camera under different lighting conditions and backgrounds. We randomly selected 20 samples from the synthetic test dataset as the basis and applied six different deformations (perspective, curled, fewfold, multifold, crumples easy, crumples hard) inspired by Das et al. [6], as well as three different settings (bright, colored, shadow). We provide examples in Fig. 4. The bright setting (Fig. 4h) displays the documents on a gray background with daylight incidence. The second setting (Fig. 4i) displays the document on a white background sheet with RGB lighting. Lastly, we defined the shadow setting (Fig. 4j) as a document in front of a wooden surface with multiple shadows falling onto the document.

Architecture
In this section, we present our novel approach for image dewarping, which leverages structural templates at training and inference time. We extend the transformer-based state-of-the-art model GeoTr introduced by Feng et al. [9] to incorporate the a-priori known structural information represented by the invoice templates. In the following, we refer to our new model as GeoTrTemplate and to its extension as GeoTrTemplateLarge. See Fig. 5 for a schematic. Our model receives the warped image W ∈ R^(h0 × w0 × 3) and the template image T ∈ R^(h1 × w1 × 3). Both inputs are scaled to a fixed resolution of 288 × 288 for GeoTrTemplate and 600 × 600 for GeoTrTemplateLarge before a geometric head H is applied individually to each image. The head H creates deep image representations with 36 × 36 positional features in a 128-dimensional space. For GeoTrTemplate, we employed the geometric head proposed by Feng et al. [9]. We define the geometric head for our large model as a slice of the EfficientNet-B7 noisy student model [45], namely the first four blocks followed by a convolutional layer. The features of the warped image H(W) and the template H(T) are concatenated, forming a combined input representation R ∈ R^(36 × 36 × 256). We then apply the transformer encoder and decoder from Feng et al. [9] and their geometric tail module to upsample the resulting backward map. For details regarding the employed modules, see the original paper. The output is a backward map B ∈ [0, 1]^(288 × 288 × 2). Our loss function is defined as the L1-norm between the output backward map B and the true backward map B̂.
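The data flow described above can be summarized in a structural sketch. The heads, the fusion layer, and the tail below are stand-in layers chosen only to reproduce the stated tensor shapes; they do not replicate the actual GeoTr modules.

```python
import torch
import torch.nn as nn

class GeoTrTemplateSketch(nn.Module):
    """Structural sketch: two geometric heads encode the warped image and
    the template into 36x36x128 feature maps, the maps are concatenated to
    256 channels, and a (here: stand-in) transformer plus geometric tail
    regress the backward map in [0, 1]."""
    def __init__(self):
        super().__init__()
        # Stand-in heads: one stride-8 convolution from 288x288x3 to 36x36x128.
        self.head_w = nn.Conv2d(3, 128, kernel_size=8, stride=8)
        self.head_t = nn.Conv2d(3, 128, kernel_size=8, stride=8)
        self.fuse = nn.Conv2d(256, 128, kernel_size=1)  # transformer stand-in
        self.tail = nn.Sequential(                       # upsample to 288x288x2
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 2, kernel_size=3, padding=1),
            nn.Sigmoid(),                                # backward map in [0, 1]
        )

    def forward(self, warped, template):
        # Concatenate warped-image and template features along the channels.
        r = torch.cat([self.head_w(warped), self.head_t(template)], dim=1)
        return self.tail(self.fuse(r)).permute(0, 2, 3, 1)

model = GeoTrTemplateSketch()
bm = model(torch.rand(1, 3, 288, 288), torch.rand(1, 3, 288, 288))
assert bm.shape == (1, 288, 288, 2)
```

In the real model, the 1x1 fusion convolution is replaced by the transformer encoder-decoder, whose attention spans all pairs of warped-image and template features.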

Metrics
The metrics can be divided into visual and text-based metrics and are explained in the following sections. Each metric measures the similarity between two images: the unwarped image based on the learned backward map B and the flat invoice image. Note that even with the perfect backward map B̂ the metrics do not yield a perfect score, since the backward maps do not correct the lighting influence; thus, the perfectly unwarped image B̂(W) contains shadows and ambient lighting, while the reference image does not.

Visual metrics
We used the visual metrics MS-SSIM [42], LD [48], and LPIPS [50], which we briefly explain in the following. For all visual metrics, we resized the images to a fixed area of 598,400 pixels while retaining the ground-truth aspect ratio, as proposed by Ma et al. [29].

MS-SSIM.
As an established perceptual metric, we employed the multi-scale structural similarity (MS-SSIM) [42]. It measures the perceived change in structural information by calculating statistical properties on multiple image windows at different scales. The metric consists of multiple structural similarity (SSIM) calculations on different scales of the input and reference image in order to become scale-invariant. Similar to the evaluation of Das et al. [6], we convert the source and reference image to grayscale before applying the metric in order to create comparable numbers. The MS-SSIM ranges between 0 and 1, where 1 denotes the optimal score.

LD. The local distortion (LD) as defined by You et al. [48] quantifies the similarity of two images based on the SIFT flow [24]. Input and reference images are converted to dense SIFT feature matrices and subsequently matched pixel-wise to form the SIFT flow. The local distortion is defined as the mean L2-norm of the SIFT flow. We used the implementation and parametrization of Ma et al. [29] and apply it to grayscale images. The LD ranges between 0 and infinity, where 0 is the optimal score.
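Given a precomputed SIFT flow, the LD value itself reduces to a one-liner. This is a sketch only: computing the SIFT flow is the expensive part and is omitted here; `flow` is assumed to be the pixel-wise displacement field between the unwarped and reference images.

```python
import numpy as np

def local_distortion(flow: np.ndarray) -> float:
    """Mean L2-norm of a dense (H, W, 2) displacement field, as in LD."""
    return float(np.linalg.norm(flow, axis=-1).mean())

# A perfectly aligned image pair has zero flow and therefore LD == 0.
assert local_distortion(np.zeros((4, 4, 2))) == 0.0
# A uniform (3, 4)-pixel shift everywhere yields LD == 5.
assert abs(local_distortion(np.full((4, 4, 2), [3.0, 4.0])) - 5.0) < 1e-9
```

Unlike MS-SSIM, LD thus directly penalizes geometric displacement rather than appearance differences.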
LPIPS. In addition to MS-SSIM and LD, we employ the learned perceptual image patch similarity (LPIPS) metric introduced by Zhang et al. [50] to measure the perceived image similarity. The authors show that the LPIPS metric is better suited for measuring perceived image similarity than SSIM. The metric is learned using a large-scale similarity preference dataset. For our evaluation, we used the pre-trained weights provided by the authors based on the AlexNet [19] model. LPIPS ranges between 0 and infinity, where 0 denotes the optimal score.

Text-based metrics
For many use cases such as information extraction, a text-based metric is better suited to evaluate the unwarping method. Using the learned backward map B, we calculate the unwarped image B(W) and perform OCR using the open-source engine Tesseract 4.0.0 [38]. In order to detect the text in images, the input image requires a sufficiently high resolution with respect to the contained text size. Therefore, we scaled the unwarped image and reference image to a size of 3,740,000 pixels while preserving the reference image aspect ratio. To evaluate the recognized text, we use two different metrics, ED and CER, described in the following.
ED. The edit distance (ED) is defined as the number of insertions, deletions, and substitutions required to transform an input text to the corresponding reference text.
CER. We calculate the character error rate (CER) for each reference text. The CER is defined as the Levenshtein distance [20] between input and reference divided by the number of characters in the reference text.
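Both text metrics can be computed with a standard dynamic-programming Levenshtein implementation; a minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(recognized: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(recognized, reference) / len(reference)

assert edit_distance("kitten", "sitting") == 3
assert cer("Invo1ce", "Invoice") == 1 / 7  # one substituted character
```

Note that the ED is an absolute count and therefore grows with the amount of text, whereas the CER is normalized and comparable across documents of different lengths.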

Baseline selection
We compare our results to the baselines DewarpNet (without refinement network) [6] and GeoTr [9]. Document dewarping can be decomposed into two subtasks: geometric dewarping and illumination correction. The former remaps all pixel locations, whereas the latter alters the per-pixel colors to remove shading and environmental light effects. As our model performs purely geometric dewarping, we selected baselines that use geometric dewarping only to make a fair comparison. The usage of an illumination correction model is decoupled from the geometric dewarping and can thus be appended to all baselines as well as to our model. Since the refinement network of DewarpNet [6] and IllTr [9] are illumination correction networks, we argue that omitting these networks is well founded.

Hyperparameters
We trained all models for up to 300 epochs with an early stopping patience of 25 epochs based on the validation mean squared error between B and B̂. All other hyperparameters depend on the model type. We used the parametrization published by the original authors to reproduce their results as closely as possible. For GeoTrTemplate, we used a batch size of 8, the AdamW optimizer [25] with an initial learning rate of 10^-3, and the OneCycleLR scheduler [37] with a maximum learning rate of 10^-3. Note that for the training of GeoTr, GeoTrTemplate, and GeoTrTemplateLarge, we employed gradient clipping to increase the training stability. We clipped the global gradient norm to a value of 1.
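The optimizer setup described above can be sketched as follows; a toy model and illustrative step counts stand in for the actual dewarping model and training loop.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                  # stand-in for the dewarping model
steps_per_epoch, epochs = 10, 3          # illustrative numbers, not the paper's
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-3, epochs=epochs, steps_per_epoch=steps_per_epoch)

for step in range(steps_per_epoch * epochs):
    pred = model(torch.rand(8, 8))
    loss = (pred - torch.rand(8, 2)).abs().mean()   # L1 loss as in the paper
    opt.zero_grad()
    loss.backward()
    # Clip the global gradient norm to 1 for training stability.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
```

OneCycleLR warms the learning rate up to `max_lr` and then anneals it far below the initial value, which is why the scheduler is stepped once per batch rather than once per epoch.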
DewarpNet [6] and GeoTr [9] employ different background augmentation strategies to boost model performance. DewarpNet replaces the warped image background with randomly selected textures from the Describable Textures Dataset [5] during training. GeoTr trains a lightweight semantic segmentation network [32] to remove the background beforehand. To enable a fair comparison of both approaches and our models, we kept the original backgrounds. Furthermore, we augmented all training images using color jitter with a random change of up to 20 % in brightness, contrast, saturation, and hue, respectively. Note that the input image resolution differs between the individual models due to architectural constraints. In particular, DewarpNet receives images with 128 × 128 pixels, whereas GeoTr and GeoTrTemplate use images with a resolution of 288 × 288 pixels and GeoTrTemplateLarge a resolution of 600 × 600 pixels.

Table 2 shows the quantitative results of our approach and the baseline methods evaluated on our new dataset Inv3DReal. We include the identity backward map for reference, thus creating a lower bound for the scores. As expected, all approaches outperform the identity baseline in all metrics by far. Our experiments show that GeoTr is superior to DewarpNet without the refinement network in all metrics and for all training datasets. When comparing the two models with respect to the training dataset used, we conclude that training on the Inv3D dataset slightly improves the evaluation results compared to training on Doc3D. Since Inv3D is more similar to Inv3DReal than Doc3D, this effect is expected. Our models GeoTrTemplate and GeoTrTemplateLarge yield by far the best results in all metrics, especially the visual evaluation metrics. The local distortion improves by 23.4 % and 26.1 %, respectively, in comparison with the runner-up GeoTr trained on Inv3D. These results show the effectiveness of our approach.

Quantitative results
We also trained the baseline models on Doc3D and evaluated on the established DocUNet benchmark [29]. We were able to reproduce the reported numbers approximately. The difference is likely due to differences in the image augmentation methods that were applied, e.g., we omit the random background replacement of DewarpNet in our setting. We chose to apply the exact same image augmentations for all approaches to allow for a fair comparison between them. See Table 3 for a direct comparison of our results with the reported numbers.
Table 3 Comparison of our implementation with the numbers reported by the original papers, trained on Doc3D and evaluated on the DocUNet benchmark [29]. Values in brackets denote standard deviations

When comparing the absolute numbers of the DocUNet benchmark and our Inv3DReal benchmark, we observe that the DocUNet evaluations are closer to the optimum for most metrics, which indicates that our benchmark is harder to solve for the given approaches. Note that the edit distance on DocUNet is higher than on Inv3DReal, which is due to the fact that DocUNet images contain more text by a large margin. Therefore, the OCR engine yields long texts, which leads to a high absolute number of insertions, deletions, and replacements for DocUNet. To better understand the characteristics of each approach, we provide an in-depth evaluation of our model GeoTrTemplate based on Inv3DReal. We average the evaluation data per deformation class and per lighting setting. The results are given in Table 4. According to most metrics, the curled deformation appears to be the easiest task, whereas the heavy crumples represent the hardest class to unwarp. Interestingly, the LD does not agree with the other metrics, as stronger deformations lead to better results with regard to the LD metric. When we compare the three different environment settings, it appears that bright is the easiest, whereas shadow is the hardest according to most metrics. Similar to the deformation evaluation, we observe the inverse order according to the local distortion.
For a quantitative evaluation of DewarpNet [6] and GeoTr [9] on Inv3DReal with regard to the different modifications, please refer to the supplementary material.

Fig. 6 Qualitative evaluation of DewarpNet [6] without refinement network, GeoTr [9], and our model based on selected samples of Inv3DReal

Ablation study
We conducted an ablation study to measure the influence of different types of structural information on the model performance. For our study, we altered the templates in three different ways and trained our model from scratch using the altered template as input. The modifications are as follows:
White Template. The template image is completely white and thus contains no additional information over the warped input image.
Text Only. The template image for this ablation contains all texts visible on the original template image (see Fig. 2c), but no other structures such as lines or images. Note that all texts were converted to black so that they remain visible after removing the background colors. The final template ablation is a black-and-white image.
Structure Only. We removed all textual information from the original template image (see Fig. 2c), such that only the structural information remains.
The evaluation results on the Inv3DReal dataset are given in Table 5. The white template ablation performs the poorest of all ablations. Its absolute metric values are comparable to the GeoTr model trained on Inv3D as given in Table 2. This performance is as expected since both experiments receive the same information as input and have a fairly similar network structure. According to the visual metrics, the structure-only ablation and the text-only ablation each boost the performance of our model by a few points, but the combination of both yields the best overall performance. This finding indicates a correlation between the model performance and the amount of a-priori known information about the target structure. For the text metrics, while there is no significant change of the ED and CER values, the structure-only ablation performs best by a small margin. Overall, we see a stronger improvement for the visual metrics compared to the text-based metrics, which indicates that adding structural information primarily helps to improve the global positioning rather than the fine-grained details.
To investigate the influence of the template on the performance further, we conducted a second ablation test. For this, we trained the GeoTrTemplate model using the Inv3D dataset and selected random templates during inference. The results are given in Table 5. The comparison of choosing a random vs. the correct template shows that correct template selection is crucial for performance. Indeed, falsely selecting a template results in degraded performance compared with not providing additional information (white template).

Conclusion
In this work, we presented Inv3D, a novel high-resolution invoice dataset with both synthetic and real-world data and rich label information. Apart from the rectification mesh for dewarping each individual invoice, we add corresponding template information which can be utilized during training to improve generalizability. Inv3D comprises 25,000 samples based on 100 templates and heterogeneous environmental factors, such as challenging lighting conditions and a multiplicity of document deformations. In addition, we introduced GeoTrTemplate and GeoTrTemplateLarge, two novel models which leverage a-priori available structural information for the task of document image dewarping. We conducted a detailed evaluation study to compare our new models with the state-of-the-art approaches DewarpNet [6] and GeoTr [9]. Our empirical analysis showed that both models outperform the baseline methods significantly; in particular, the GeoTrTemplateLarge model improves the local distortion of GeoTr by 26.1 %. Nevertheless, the absolute values for text detection show that further research is needed in order to solve this task robustly and consistently.
In future work, we plan to investigate better approaches to exploit template information, as available in Inv3D. The iterative refinement approach proposed by Feng et al. [10] for DocScanner might be used to create a correspondence map of interest points in the warped image and the template. We believe that the availability of templates at inference time can play an important role in document dewarping through the visual cues they provide. Another interesting direction is the calculation of unwarping confidence scores based on the matching of the unwarped image and the template. The confidence scores could become important for real-world applications to avoid the acquisition of erroneous information.
Author Contributions Felix Hertlein wrote the main manuscript text and all associated code. Alexander Naumann was involved in major contributions to Fig. 1 and related work. Patrick Philipp was involved in major contributions to the abstract, introduction, and conclusion. All authors reviewed the manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL.

Conflict of interest The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.