Computer Vision Based Automatic Margin Computation Model for Digital Document Images

Margin, in typography, is the space between the text content and the document edges, and it is often essential information for the consumer of a document, digital or physical. In the present age of digital disruption, it is customary to store documents digitally and to retrieve information from them automatically when necessary. Margin is one such piece of non-textual information that is important for some business processes, and the demand for computing margins algorithmically is growing to facilitate RPA. We propose a computer vision-based text localization model that uses classical DIP techniques such as smoothing, thresholding, and morphological transformation to programmatically compute the top, left, right, and bottom margins within a digital document image. The model was evaluated with different noise filters and with structural elements of various shapes and sizes; the bilateral filter and line structural elements were selected for removing the noise that most commonly occurs in scans. The proposed model targets text document images rather than natural scene images; accordingly, the existing benchmark models developed for text localization in natural scene images did not perform with the expected accuracy on this task. The model is validated with 485 document images from a real-time business process of a reputed TI company.
The results show that 91.34% of the document images achieved an IoU value of more than 90%, which is well beyond the accuracy range determined by the company for that specific process.


Introduction
Historically, margins have been used as a method to lay out text within a document. Their use dates back to ancient Egypt, when people used papyrus scrolls to organize writings, and margins were the visual mark indicating the end of one line and the beginning of another. After the invention of the codex, margins were no longer needed to distinguish text blocks. However, instead of becoming antiquated, the margin took on a new role: it gave the text a visual aesthetic and allowed the reader to write notes and commentary in the blank space. In the twenty-first century it is ubiquitous to organize texts in digital form, and the margin remains and is used for placing signatures, stamps, notes, etc. In some business processes, the margin has become indispensable for these reasons. One such real-time requirement for margins is observed in the TRS, which is the primary motivation of the present research (see Fig. 2).
Recordation of legal instruments in a county recording office that is a public registry is an act of constructive public notice of ownership to the subsequent purchaser, creditor, or mortgagers. The recordation itself does not determine the title but provides a framework for the legal system to do so during any future litigation. Various TI companies provide the TRS to their customers, facilitating the recording. The state statute establishes the rules of recording. The rules differ from county to county. The documents to be recorded must comply with local and state requirements. Some of the standard regulations most of the counties in the USA follow are as follows:
• The documents must be notarized.
• There must be county-specified margins on the top, left, right, and bottom of every document's first and last page.
• If the specified margin is absent, a cover sheet must be attached for a recording stamp.
• Original signatures must be present on all instruments.
This article is part of the topical collection "Cyber Security and Privacy in Communication Networks" guest edited by Rajiv Misra, R K Shyamsunder, Alexiei Dingli, Natalie Denk, Omer Rana, Alexander Pfeiffer, Ashok Patel and Nishtha Kesswani. Abhijit Guha, Debabrata Samanta and Sandeep Singh Sengar have contributed equally to this work.
Before the recording, the service providers verify the rules manually. Human intervention makes the validation time-consuming and error-prone, which in turn increases the cost of recording. TRS providers therefore need an intelligent system that automates the verification.
Numerous studies on text localization and recognition have been conducted in the recent past to read text from various scenes and videos for content-based analysis. Text localization is an essential prerequisite for OCR of a digital text document: finding the text within the document image is the first step for any OCR product before it extracts the characters. Applications that assist visually impaired individuals with the surrounding entities, giving them a certain autonomy, have been around for some time. Reading license plates and automatic navigation are other applications of text localization. A whole range of approaches has been applied in the past decade to detect text areas within a document or a natural scene. To our knowledge, however, no previous study has examined the computation of the margin within a digital image document to automate the recording process.
In the present study, we apply classical image processing techniques to detect the text segments within the document image after correcting the skew (see Fig. 1). DIP operations such as noise removal and image binarization, followed by morphological transformation, are carried out sequentially. Additionally, vertical and horizontal structure detector kernels are introduced to eliminate the prominent vertical or horizontal noise on the edges of the documents that is commonly seen in scanned document images. After detecting all the text areas within the document, the bounding regions are merged into one frame so that all the text lies within the minimum rectangular bounding box. The predicted text region is then compared with the manually drawn GT area, and IoU is calculated from the two bounding boxes to gauge the accuracy of the algorithm. The margin is calculated by subtracting the computed rectangle from the original image. IoU is computed for 485 images using the model, and 91.34% of the computed IoU values exceed 90%. The margins are initially calculated in pixels and converted to inches for consumption by the client applications. The objective is to devise a model that automatically computes the top, left, right, and bottom margins of a given page of a document image by locating the text area (foreground) within the image background without human intervention. Based on the automatically computed margin, the consumer application can verify the county requirement and take subsequent workflow or process flow decisions.
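The merging and subtraction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Box`, `merge_boxes`, `margins_px` and `px_to_inches`, the page size, the text boxes, and the 300 DPI scan resolution are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle: top-left corner (x, y), width w, height h, in pixels."""
    x: int
    y: int
    w: int
    h: int


def merge_boxes(boxes: Iterable[Box]) -> Box:
    """Return the smallest rectangle that encloses every detected text box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one text box is required")
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.w for b in boxes)
    bottom = max(b.y + b.h for b in boxes)
    return Box(left, top, right - left, bottom - top)


def margins_px(text: Box, page_w: int, page_h: int) -> dict:
    """Margins are what remains of the page after subtracting the text frame."""
    return {
        "top": text.y,
        "left": text.x,
        "right": page_w - (text.x + text.w),
        "bottom": page_h - (text.y + text.h),
    }


def px_to_inches(margins: dict, dpi: float) -> dict:
    """Convert pixel margins to inches for a known scan resolution."""
    return {side: value / dpi for side, value in margins.items()}


# Hypothetical letter-size page scanned at an assumed 300 DPI.
page_w, page_h, dpi = 2550, 3300, 300
text_boxes = [Box(300, 450, 1900, 60), Box(300, 600, 1950, 900), Box(310, 2400, 1200, 300)]
frame = merge_boxes(text_boxes)
margins = px_to_inches(margins_px(frame, page_w, page_h), dpi)
print(frame, margins)
```

A consumer application could compare the resulting values in inches against the county-specified minimum margins.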

Motivation
• TRS providers spend a significant amount of time and effort checking the prerequisites of the recording rules. This step depends on human cognition and increases the cost of the service. Research is needed to explore real-time, intelligent automation of this step.
• Several sophisticated machine learning-based techniques have been explored in the past to solve the text localization problem for various real-time applications. Still, little emphasis has been placed on the classical DIP and computer vision techniques, which can be used effectively in scenarios like this.
• Human validation of such recording prerequisites increases the chance of error, the risk, and the processing time, and it is also a classic example of misusing the potential of human cognition. The tasks are monotonous and repetitive and are better performed by a dedicated expert system than by human associates.

Contribution
• The proposed model is validated in a real-time environment with actual production digital documents, and the presented results are intended to support business decisions.
• Although the research has been conducted with digital documents from TRS, the model can be used for margin computation on digital documents from any domain.
• The model produced an IoU score above 90% for more than 91% of the documents, including samples with significant shot noise and edge noise caused by poor scanning.
Text localization is a prerequisite for information extraction, and solution approaches have been categorized as region based, texture based and hybrid [50]. A host of artificial neural network and machine learning based architectures have been proposed and have proved their efficacy over time. ICDAR has been the standard benchmark data set for these experiments. However, such sophisticated methods have proved less effective for the much simpler task of text localization in scanned digital document images for margin detection, and they have turned out to be time consuming and costly. Our proposed method relies on simpler image processing methods, requires no training, and addresses scanned digital documents with a variety of noise.

Paper Organization
The remainder of the paper is divided into five sections. A thorough literature survey of different techniques of text localization and its application is presented in "Background and Related Work". The dataset and the description of the methods used are briefed in "Materials and Methods", followed by the experiment steps and the results obtained in "Experiment". Finally, in "Results and Discussion" and "Conclusion with Future Scope", the insights obtained from the result are discussed, followed by the conclusive remarks and future scope.

Background and Related Work
Text, being one of the most expressive mediums of communication, can be embedded in three main kinds of digital source: scanned documents, random images, and videos [38,43,47]. Detecting and recognizing text from these sources has received increased attention in recent times. Text localization is a sub-problem of detection that focuses on locating the region where the text is present within a given image rather than recognizing its semantics [23-27]. The literature is divided according to the source in which the text is to be localized. Advances in computer vision and pattern recognition have also made it more feasible to address the challenges faced during text detection and localization [23,28]. The literature can be further categorized by technology, such as deep learning-based, statistical and CV-based.

Text Localization in Document Image
Khan et al. proposed a hybrid technique for localizing text elements in both document and scene images. They applied morphological operations to separate the foreground objects, that is, the text, from the background, and adopted an MLP approach to classify text and non-text regions using statistical features. The proposed method achieved 86.38% accuracy for text region isolation [29].
Nikitin et al. proposed a two-step architecture that detects word-level bounding boxes and then identifies text regions using classic computer vision techniques. Their proposed model outperformed the other state-of-the-art techniques of text localization for document images [30].
Nagaoka et al. used the R-CNN object detection method for text localization. They introduced an additional layer, called the merge layer, that enables multiple regions of interest to be handled more effectively than in the standard R-CNN network [31]. Detecting and segmenting text regions within a document image is a primary prerequisite for transcription, and the task is even more difficult when the text is handwritten. Carbonell et al. [44] proposed a technique for full-page transcription after text area segmentation.
Neumann et al. proposed an end-to-end unconstrained text localization and recognition method for text region detection. The method is a deviation from the common assumption of region-based approaches of connected component analysis [49].

Materials and Methods
The proposed model is designed to compute the margin by locating the text area within the document image. Once the text areas are segmented, the margin is calculated by subtracting the text area from the document. Although various state-of-the-art text localization techniques have been adopted in the recent past, we have chosen classical digital image processing techniques for text localization. Because the study deals with digital text document images, the complexity of text localization in natural scenes needed no special attention during the experiments.

Data Set
Thirteen different types of digitized, real-time, recordable, multi-page, legal instruments that a reputed TI company uses for the recording are collected and used as samples for the experiment. The document types and the distribution of the total number of pages considered are as shown in Fig. 3.
Every page of the document is regarded as an independent recordable instrument to improve the study's rigor. However, the real-time recordation considers the first page of every document as a recordable instrument, and the recording office provides stamps only on the first page. Four thousand five hundred sixty-seven individual document pages are annotated with the margins manually. The system predicted margins for every document page are programmatically compared with the manually annotated margins.

Annotation
The bounding box identified by the model, which encloses the text area within the image document, is compared with the manually drawn bounding box. The two are not expected to coincide exactly, but overlap beyond a predetermined threshold can be used to evaluate the model. Every page of the documents was treated as an independent document image, and the bounding box excluding the margin of the page was drawn manually using VIA [1,2], an open source image annotation tool developed by VGG [3]. The annotated bounding box data for every image is saved in tabular format (see Table 1). As described in Fig. 4, (x,y) denotes the top-left pixel coordinate of the drawn rectangle, and h and w denote the height and width of the rectangle enclosing the text, respectively. H and W are the original height and width of the image.
The human-annotated data is compared with the model-annotated data, and IoU is calculated from the GT coordinates and the predicted coordinates. The steps of the model's algorithm are described in the following sections.

Skew Correction
Digital documents are often scanned from physical copies, and the text within a document is frequently not properly aligned. Deskewing the text within the image is an essential preprocessing step for identifying the margin space within a digital textual document image (as shown in Fig. 5). There are predominantly four techniques used for skew correction of text within a document image: Projection Profile, Hough Transform, Nearest Neighbour Clustering, and Fourier Transform, each of which has been explored in a number of variations [4].
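As an illustration of the first of these techniques only, the following sketch estimates skew with a projection profile: foreground pixels are projected onto rows for a range of candidate angles, and the angle giving the sharpest row histogram is selected. The text does not state which technique or parameters the model uses, so the function names, the angle range, the step size, and the synthetic test page are all assumptions.

```python
import math
from collections import Counter


def projection_score(pixels, angle_deg):
    """Sharpness of the row histogram after rotating pixel coordinates by -angle."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rows = Counter(math.floor(y * cos_t - x * sin_t + 0.5) for x, y in pixels)
    # Well-aligned text concentrates pixels in few rows, so squared counts peak.
    return sum(count * count for count in rows.values())


def estimate_skew(pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) whose projection profile is sharpest."""
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: projection_score(pixels, a))


# Synthetic page (an assumption): three one-pixel text baselines tilted by 3 degrees.
slope = math.tan(math.radians(3.0))
pixels = [(x, math.floor(y0 + x * slope + 0.5)) for y0 in (50, 100, 150) for x in range(300)]
skew = estimate_skew(pixels)
print(skew)
```

Rotating the image by the negative of the estimated angle would then align the text lines with the horizontal axis.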

Gray-Scale Conversion
Digital document images, like other digital images, are usually represented in three color channels, namely Red, Green, and Blue, also known as the RGB color space [6]. However, color seldom has much significance in applications of digital image processing or computer vision algorithms. Gray-scale or monochrome representations are frequently used successfully by different descriptors and reduce the complexity and computational effort. Applications concerned with text within digital images attach little importance to the color space, as a single channel retains all the necessary information. In the present study, the interest is to localize the text regions within the document, which can be accomplished with a gray-scale image. In this pre-processing step, an RGB-to-gray-scale conversion function is applied to a color image in ℝ^(n×m×3) to convert it to an ℝ^(n×m) representation [5], in which the pixel values range from 0, representing the strongest intensity, to 255, representing the weakest intensity of the contrast (Fig. 6). Subsequent processing steps are performed on the gray-scale version of the document image.
The three most commonly used color space conversion methods are based on Lightness, Average and Luminosity. Lightness based conversion averages the most prominent and the least prominent colors to represent the gray-scale pixel value.
$$\text{Lightness} = \frac{\max(R, G, B) + \min(R, G, B)}{2} \tag{1}$$
The average method simply averages the RGB pixels of the colored image.
$$\text{Average} = \frac{R + G + B}{3} \tag{2}$$
The luminosity method is a little more sophisticated weighted average method that accounts for the human perception of the color intensity.
$$\text{Luminosity} = w_1 R + w_2 G + w_3 B \tag{3}$$
where $w_1 = 0.21$, $w_2 = 0.72$ and $w_3 = 0.07$. As the human eye is most sensitive to green, the highest weight is assigned to the green channel. Luminosity based gray-scale conversion has been used for converting the digital documents to gray-scale, monochrome images in the present study.
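The three conversions can be sketched in a few lines of plain Python. The luminosity weights are those given in the text; the function names and the sample pixels are illustrative assumptions.

```python
def lightness(r, g, b):
    """Average of the most and least prominent channel."""
    return (max(r, g, b) + min(r, g, b)) / 2


def average(r, g, b):
    """Plain mean of the three channels."""
    return (r + g + b) / 3


def luminosity(r, g, b, weights=(0.21, 0.72, 0.07)):
    """Weighted mean that favors green, using the weights stated in the text."""
    w1, w2, w3 = weights
    return w1 * r + w2 * g + w3 * b


def to_gray(image, method=luminosity):
    """Convert a nested list of (R, G, B) tuples to a nested list of gray values."""
    return [[round(method(*pixel)) for pixel in row] for row in image]


sample = (200, 100, 50)  # hypothetical pixel
print(lightness(*sample), average(*sample), luminosity(*sample))
print(to_gray([[(255, 255, 255), (0, 0, 0)]]))
```

For the same pixel the three methods generally give different gray values, which is why the choice of method affects the subsequent thresholding step.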

Denoising
Noise is an unavoidable occurrence during the capture or transmission phase of a digital image that degrades the image quality and poses hindrances during image processing. Digital text images are no exception to this side-effect [7,8]. Digital text documents are often printed and scanned multiple times before they are stored in the digital archive. A common source of noise is dust particles on the scanner or printer screen, which cause edge or marginal noise within a digital text document (see Fig. 7). This noise makes text localization within a digital document difficult and frequently results in noise areas being falsely detected as text.
Filtering techniques such as mean, median, Gaussian, bilateral, and Wiener filtering are standard and frequently used for noise removal and smoothing before subsequent processing. In the present study, bilateral filtering effectively removes the shot noise and edge noise caused by scanning the document (as shown in Fig. 8).
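To show why a bilateral filter suits document images, the following naive sketch weights each neighbor by both its spatial distance and its intensity difference from the center pixel, so mild noise is smoothed while the sharp boundary between dark and light regions survives. The window radius, the two sigma values, and the tiny test image are assumptions; a production system would more likely call an optimized library routine.

```python
import math


def bilateral_filter(image, radius=1, sigma_space=1.0, sigma_range=25.0):
    """Naive bilateral filter on a 2-D list of gray values (illustrative, not optimized)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            center = image[y][x]
            total = weight_sum = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        value = image[ny][nx]
                        # Spatial closeness times intensity similarity.
                        weight = math.exp(
                            -(dx * dx + dy * dy) / (2 * sigma_space ** 2)
                            - (value - center) ** 2 / (2 * sigma_range ** 2)
                        )
                        total += weight * value
                        weight_sum += weight
            out[y][x] = total / weight_sum
    return out


# Light background (250) next to a dark region (10), with one mildly noisy pixel.
image = [[250, 250, 250, 10, 10, 10] for _ in range(5)]
image[2][1] = 240
smoothed = bilateral_filter(image)
print(round(smoothed[2][1], 2), round(smoothed[0][2], 2), round(smoothed[0][3], 2))
```

The noisy pixel is pulled toward its neighbors, while pixels on either side of the dark/light boundary keep their values because neighbors across the boundary receive almost zero weight.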

Adaptive Thresholding
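Document images in the recording process consist primarily of two segments: the background, which is of lighter intensity, and the foreground text, which is of darker intensity [15,16]. Segmenting the image into these two groups of pixels is sufficient to separate the text area from its background. As the text regions within the gray-scale image can have various gray intensity levels, data-driven adaptive thresholding (Otsu's method) has been adopted for the binarization task [9,12-14].
Otsu's binarization method identifies and returns a single intensity threshold $t$ for a given image so that the image is represented in two classes, namely foreground (the pixels representing textual elements) and background (the pixels representing the empty section of the canvas). The optimal threshold is determined through an exhaustive search that minimizes the within-class variance
$$\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t),$$
where $\omega_0(t)$ and $\omega_1(t)$ are the probabilities of the two classes separated by $t$, and $\sigma_0^2(t)$ and $\sigma_1^2(t)$ are the variances of these classes. Minimizing the within-class variance is equivalent to maximizing the between-class variance
$$\sigma_b^2(t) = \omega_0(t)\,\omega_1(t)\,\left[\mu_0(t) - \mu_1(t)\right]^2.$$
With $p(i)$ the normalized histogram over $L$ intensity levels, the class probabilities are $\omega_0(t) = \sum_{i=0}^{t} p(i)$ and $\omega_1(t) = \sum_{i=t+1}^{L-1} p(i)$. The class means are represented as
$$\mu_0(t) = \frac{\sum_{i=0}^{t} i\,p(i)}{\omega_0(t)}, \qquad \mu_1(t) = \frac{\sum_{i=t+1}^{L-1} i\,p(i)}{\omega_1(t)}.$$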
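The exhaustive search can be sketched as follows, maximizing the between-class variance, which is the product of the two class probabilities and the squared difference of the class means. The function name, the convention that pixels at or below the threshold form the dark class, and the sample intensities are assumptions made for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with intensity <= t form one class, pixels > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    count0 = 0  # number of pixels in the class at or below t
    sum0 = 0    # sum of intensities in that class
    for t in range(levels):
        count0 += hist[t]
        sum0 += t * hist[t]
        count1 = total - count0
        if count0 == 0 or count1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / count0
        mu1 = (total_sum - sum0) / count1
        var_between = (count0 / total) * (count1 / total) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Hypothetical page: dark text pixels around 20-30, light background around 200-220.
pixels = [20] * 50 + [30] * 50 + [200] * 100 + [220] * 100
t = otsu_threshold(pixels)
foreground = [p <= t for p in pixels]  # dark pixels are treated as text
print(t, sum(foreground))
```

Any threshold between the two intensity clusters separates them equally well; this sketch returns the lowest such value.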

SN Computer Science
After binarization, the image is fit for a series of morphological transformation operations that remove other horizontal and vertical structured edge noise arising from the scanning (Fig. 6) and localize the foreground text features.

Morphological Transformation
MT is a series of non-linear mathematical operations on the morphology or shape of features in a digital image [19,20]. These operations take the relative ordering of the neighboring pixel intensities into account and do not depend on the absolute pixel intensity; hence, the functions are best suited for a binary image. However, there are mathematical variations of MT available for gray-scale images. Although MT is usually considered to remove the imperfections occurring due to the binarization of an image, the present study utilizes the transformation not to improve the feature (foreground text) prominence but to approximate the foreground discovery.
A small image template called the SE or kernel is slid over the original binary image to probe for the presence or absence of certain shapes or structures within the image [17]. At each position, the kernel is said to fit the image if every pixel of the SE with intensity 1 coincides with a pixel of intensity 1 in the original image, and it is said to hit the image if at least one pixel of the kernel set to 1 coincides with such a pixel in the image.
Let I be the binary image in Euclidean space E and let K be the kernel.

Erosion
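The erosion of I by the kernel K produces an output image with 1's at the positions of the kernel origin at which K fits I [19]. It is denoted by
$$I \ominus K = \{ z \in E \mid K_z \subseteq I \},$$
where $K_z$ is the translation of K by the vector z, i.e.
$$K_z = \{ k + z \mid k \in K \}, \quad \forall z \in E.$$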
Erosion reduces the boundary of regions of the white pixels or the foreground pixels (pixels representing the text in the digital document in the present case). The gaps within the regions holding the foreground pixels are enlarged [18,21].
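
Dilation
The dilation of I by the kernel K produces an output image with 1's at the positions of the kernel origin at which K hits I. It is denoted by
$$I \oplus K = \bigcup_{k \in K} I_k,$$
where $I_k$ is the translation of I by k; for a symmetric kernel this is exactly the set of positions at which K hits I. Dilation increases the boundaries of the regions by adding pixels to the foreground [19]. It improves or enhances the features.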

Opening
Opening is a compound morphological operation represented by erosion of I by K, followed by dilation of the resultant image by K [19], denoted as
$$I \circ K = (I \ominus K) \oplus K.$$
This operation opens up the gap between two objects connected by a thin layer of pixels. The pixels that survive the erosion are restored to their original size by the dilation.

Closing
Closing is the reverse compound operation of opening: it applies erosion to the image obtained by dilating I by K [19], represented as
$$I \bullet K = (I \oplus K) \ominus K.$$
It helps close small gaps or holes within a region of a binary image [22].
The primary objective of applying MT in the experiment is to locate the text region within a document image with high accuracy, so that non-text regions present among the foreground pixels due to noise are excluded as far as possible. After applying the combination of the aforesaid MT operations, we were able to detect and remove most of the noise with precision, as shown in Fig. 9.
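The four operations can be sketched directly from the fit and hit definitions given earlier. This is a minimal pure-Python illustration, not the model's implementation: the 3×3 square kernel, the treatment of pixels outside the image as background, and the two small test images are assumptions.

```python
def _offsets(kernel):
    """Offsets of the kernel's 1-pixels relative to its center (the origin)."""
    kh, kw = len(kernel), len(kernel[0])
    return [(r - kh // 2, c - kw // 2) for r in range(kh) for c in range(kw) if kernel[r][c]]


def _probe(image, kernel, test):
    h, w = len(image), len(image[0])
    offs = _offsets(kernel)

    def at(y, x):  # pixels outside the image count as background
        return image[y][x] if 0 <= y < h and 0 <= x < w else 0

    return [[int(test(at(y + dy, x + dx) for dy, dx in offs)) for x in range(w)]
            for y in range(h)]


def erode(image, kernel):
    """1 where the kernel fits: every kernel pixel lies on foreground."""
    return _probe(image, kernel, all)


def dilate(image, kernel):
    """1 where the kernel hits: at least one kernel pixel lies on foreground."""
    return _probe(image, kernel, any)


def opening(image, kernel):
    return dilate(erode(image, kernel), kernel)


def closing(image, kernel):
    return erode(dilate(image, kernel), kernel)


square = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

# A 3x3 block of "text" plus one isolated noise pixel at row 3, column 5.
noisy = [[0] * 7 for _ in range(7)]
for y in range(1, 4):
    for x in range(1, 4):
        noisy[y][x] = 1
noisy[3][5] = 1
opened = opening(noisy, square)

# The same block with a one-pixel hole in the middle.
holed = [[0] * 7 for _ in range(7)]
for y in range(1, 4):
    for x in range(1, 4):
        holed[y][x] = 1
holed[2][2] = 0
closed = closing(holed, square)
print(opened[3][5], closed[2][2])
```

Opening removes the isolated pixel while restoring the block, and closing fills the hole, which matches the behavior described for the two compound operations.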

Experiment
AMCM is validated using 485 document image pages collected from a real-time TRS provided by a reputed TI company [46,48]. Every page of a document is considered an independent image candidate for margin computation. Every image is manually annotated using the VIA annotator: a single rectangle is drawn to localize all the text regions within the document image. The document images are then passed through AMCM, and the predicted rectangle is compared with the GT using the Intersection over Union (IoU) method.
Four different de-noising filters, namely the Gaussian filter, Mean filter, Median filter and Bilateral filter, are evaluated by comparing the average IoU obtained (Table 2). The Bilateral filter obtained a significantly higher IoU than the other filters.
Horizontal and vertical line kernels of different sizes relative to the height (h) and width (w) of the image are considered, and the IoU values obtained are compared. Table 3 shows that the line kernel of size 0.05 times h and w achieved the best IoU.
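One way to picture the effect of line kernels sized relative to the image is the sketch below. Opening a binary image with a 1×k horizontal line kernel keeps exactly the pixels that belong to horizontal foreground runs of length at least k, so the sketch detects such runs directly and removes them. Whether the model subtracts detected lines in this way, and the mapping of the 0.05 ratio to w for the horizontal kernel and h for the vertical kernel, are assumptions; the function names and test image are hypothetical.

```python
def remove_long_runs(line, min_len):
    """Zero out every run of consecutive 1s whose length is at least min_len."""
    out = list(line)
    start = None
    for i, value in enumerate(list(line) + [0]):  # trailing 0 closes a final run
        if value and start is None:
            start = i
        elif not value and start is not None:
            if i - start >= min_len:
                out[start:i] = [0] * (i - start)
            start = None
    return out


def remove_line_noise(image, ratio=0.05):
    """Drop pixels lying on long horizontal or vertical structures."""
    h, w = len(image), len(image[0])
    k_h = max(1, round(ratio * w))  # assumed horizontal kernel length
    k_v = max(1, round(ratio * h))  # assumed vertical kernel length
    rows = [remove_long_runs(row, k_h) for row in image]
    cols = [remove_long_runs([image[y][x] for y in range(h)], k_v) for x in range(w)]
    # Keep a pixel only if it is part of neither a long row run nor a long column run.
    return [[rows[y][x] & cols[x][y] for x in range(w)] for y in range(h)]


# Hypothetical 60x100 binary page: a scan line down the left edge, another along
# the bottom, and a small 2x2 "text" blob that must survive.
page = [[0] * 100 for _ in range(60)]
for y in range(60):
    page[y][2] = 1
for x in range(100):
    page[58][x] = 1
for y in (20, 21):
    for x in (10, 11):
        page[y][x] = 1
cleaned = remove_line_noise(page)
print(sum(map(sum, page)), sum(map(sum, cleaned)))
```

Real text strokes can exceed short kernel lengths, which is presumably why several kernel sizes were compared before settling on one.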
The proposed classical DIP approach for margin detection is compared with the state-of-the-art object detection models YOLOv4 (You Only Look Once version 4) and Mask R-CNN (mask region-based convolutional neural network), using the same data set as in the experiment described above. The 485 document images were split into training and testing sets of 80 and 20 percent, respectively. Additionally, we used state-of-the-art OCR technologies, namely Google Vision, Azure Cognitive OCR, and Tesseract, for text localization and calculated the margins from the word boundaries retrieved from the OCR output. For all the methods, we calculated the average IoU and compared it with the average IoU of the proposed model. The comparison is shown in Table 4.

Evaluation Metric
IoU is a popular evaluation metric used in OD to measure the accuracy of the localization of the detected object. As long as a GT bounding box (drawn by a human) is available to compare with a machine-predicted bounding box, IoU can be used to measure the accuracy of the machine prediction (Fig. 10 depicts the IoU calculation). It is extremely unlikely that the predicted bounding box and the GT bounding box will overlap exactly, pixel by pixel; however, the higher the overlap, the better the prediction. IoU is also known as the Jaccard similarity coefficient, a statistic that measures the similarity between two finite sample sets $S_1$ and $S_2$ using the set operation
$$J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}.$$
By design, $0 \le J(S_1, S_2) \le 1$. The GT bounding box is represented by $S_1$ and the predicted bounding box by $S_2$ (as depicted in Fig. 11). There is no absolute value that determines whether the accuracy is good or bad; it depends on the specific OD problem. In the present study, the aim is to detect the minimum area covering the foreground text region, and a threshold of 0.9, which corresponds to a 90% overlapping region, is considered an acceptably accurate prediction by the algorithm.
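For two axis-aligned rectangles, the coefficient reduces to a ratio of areas. The sketch below assumes boxes stored as (x, y, w, h) with (x, y) the top-left corner, as in the annotation format described earlier; the function name and the two sample boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union else 0.0


ground_truth = (300, 450, 1950, 2250)  # hypothetical manual annotation
predicted = (310, 460, 1930, 2240)     # hypothetical model output
score = iou(ground_truth, predicted)
print(round(score, 4), score >= 0.9)
```

A prediction that is a few pixels off on each side of a full-page text frame still scores well above the 0.9 threshold, whereas a prediction that swallows a wide strip of edge noise would not.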
The most common metrics used in object detection are AP and mAP. These metrics combine measures of object classification and object localization. In the present study, the interest is only in localizing the object (the text region) and not in classifying it. IoU is therefore calculated for every image to measure the overall performance of the model (see Fig. 10). As an IoU value of exactly 1 is quite unlikely, an IoU of 0.9 is taken as the threshold for a near-perfect prediction, and a 10% error window is acceptable in the real-time scenario.
Fig. 9 Scanned images with significant shot and edge noise in the first column. The second column shows the noise status after smoothing and binarization and the third column shows the impact of morphological operations. The noise becomes insignificant and the text area becomes prominent for localization

Results and Discussion
The IoU distribution is captured for 485 test image documents as shown in Fig. 12. The maximum and minimum IoU scores attained by the algorithm are 0.99700 and 0.04384, respectively. The median IoU score is 0.96842 (see Table 5).
IoU values are rounded off to two decimal places to get a cumulative distribution pattern (Fig. 13). Out of 485 document images, we obtained an IoU of 0.99 for 56, 0.98 for 110 and 0.97 for 100 documents, which means that the system obtained an IoU ≥ 0.97 for 54.84% of the observations. Empirical observation shows that predictions with IoU ≥ 0.9 are extremely accurate. Taking 0.9 as the threshold, Table 6 shows that 91.34% of the documents obtained an IoU beyond the threshold.
If the threshold is relaxed to 0.8, which by the IoU definition still represents a fairly high prediction accuracy, 97.93% of the documents fall within the positive prediction class. Based on the business need, the threshold can be moved up or down to decide the proportion of documents to be passed on for manual review.
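The trade-off between the threshold and the manual review workload can be made concrete with a short sketch. The first part reproduces the share of documents with IoU ≥ 0.97 from the counts reported above; the second part splits a batch by threshold. The function name and the sample scores are hypothetical and are not taken from the study's data.

```python
# Share of documents with IoU >= 0.97, from the counts reported in the text.
counts = {0.99: 56, 0.98: 110, 0.97: 100}
total_documents = 485
share_high = 100 * sum(counts.values()) / total_documents
print(f"{share_high:.2f}%")


def review_split(scores, threshold):
    """Return (share auto-accepted, share sent to manual review) in percent."""
    accepted = sum(1 for s in scores if s >= threshold)
    share = 100 * accepted / len(scores)
    return share, 100 - share


# Hypothetical batch of IoU scores.
scores = [0.99, 0.98, 0.97, 0.95, 0.93, 0.91, 0.88, 0.85, 0.79, 0.40]
for threshold in (0.9, 0.8):
    accepted, review = review_split(scores, threshold)
    print(threshold, accepted, review)
```

Lowering the threshold reduces the number of pages routed to a human reviewer at the cost of accepting looser margin estimates.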
We see from the results in Table 4 that the classical DIP technique outperforms all the state-of-the-art techniques in detecting the text boundary and calculating the margin. Accuracy, however, is not its only advantage. YOLOv4 and Mask R-CNN need heavy training and tagging effort to achieve respectable accuracy. The variation among the image documents is large, and it is difficult to generalize a margin detection solution from a small training sample. The OCR techniques, on the other hand, need a two-step operation: first obtain the OCR output, and then calculate the margin from the localized token coordinates. This makes the solution heavily dependent on another solution and adds cost. The only open-source OCR solution validated here produced an IoU of 0.7322, which is considerably lower than that of the proposed method.

Conclusion with Future Scope
Although advanced state-of-the-art AI and ML technologies are widely utilised in CV applications, the suggested model, which uses fundamental computer vision techniques, has been empirically shown to detect margins inside a digital image with high accuracy. For 91.34% of 485 randomly selected digital documents from a real-time TRS process of a reputed TI company, which contained significant variations of noise, our model achieved an IoU of more than 90%. The result is encouraging for the adoption of the model in any business process, across domains, that requires margin detection. The model uses the classical approach of color space conversion, noise reduction, image binarization, image inversion and morphological operations with three different kernels to accurately remove non-text foreground pixels and to detect and localize the textual foreground pixels within a single bounding box. Finally, the margin computation is a simple subtraction of the predicted bounding box containing the text region from the image height and width.
Future research could benefit from looking at digital documents from other industries, such as healthcare, tourism, and other BFSI. Although the model has been successfully tested on documents with prominent edge noise with significant vertical and horizontal structures, as well as shot noise caused by dirt in the scanner, documents with other types of noise, particularly noise overlapping the document's margin, must still be validated for accuracy. Furthermore, the study was conducted with documents that contained English language content, and the model must be tested with documents in other languages. network and computer vision, that can be further applied to the same data set and compared with the proposed accuracy.