
1 Introduction

Thanks to the convenience and portability of smartphones, their growing processing power, and the improving quality of the pictures they take, these phones have been able to partly replace scanners in document imaging. At the same time, because of the different capabilities of smartphones and scanners, there are problems and challenges along the way of turning phones into scanners. Moreover, scanners are slow, costly, and not portable, whereas smartphones have become very accessible.

There are two kinds of documents: paper documents are easier to carry, read, and share, while digital documents are easier to search, index, and store. One of the benefits of document imaging is the ability to convert a paper document into a digital document; by continuing the digitization process and applying optical character recognition techniques, the text and contents of the document can be easily edited and searched. It is also possible to store the digital document in external memory and easily share or transfer it between people, taking up far less space than a paper document occupies in the physical environment.

There are several challenges in digitizing a document with a smartphone, some of which are mentioned here: non-uniform lighting; shadows on the document, which may be cast by the hand, the phone, or other objects; variety in the materials, colors, and features of the document; variation in the background of the document and its contents; 3D distortion; blurring; background complexity (such as lined pages, checkerboard patterns, etc.); low document contrast or poor phone camera quality; a document that cannot be distinguished from its background (because of identical color, lighting, etc.); document complexity, for example folds; photographing multi-page documents such as books and identity cards; part of the document lying outside the picture; and part of the document being covered by other objects. An ideal method should be robust to these challenges. It should also be able to run on a smartphone in a reasonable amount of time.

In general, in the field of document digitization, some researchers have worked on resolving or helping to improve image quality and reducing the problems mentioned in the previous paragraph, while others have presented algorithms that can still locate the document in the image even when the picture is taken carelessly by the user. A third category of research both improves image quality, guiding the user to capture the best possible picture of the document, and provides the algorithm needed to find the document in the image; this is a combination of the previous approaches.

We propose a method that uses deep convolutional neural networks for semantic segmentation of pictures taken by smartphones. Our method outperforms the state of the art in the literature on 3D-distorted pictures and can run in real time on a smartphone. Additionally, it differs from previous methods in that it can be customized to be robust to additional problems simply by training on more representative data.

2 Literature Review

2.1 Document Localization Datasets

To localize documents in photographs taken by smartphones, we need a real-world dataset collected from ordinary users. There are four different datasets for the task of document photographs taken by smartphones. Three of these datasets contain images that are identical or very similar to one another. The fourth dataset collected more images than the others and is also closer to real-world photographs with various challenges.

The available dataset was used for the qualitative assessment of photographs of documents captured with smartphones [1]. The dataset of Kumar et al. comprises 29 different documents photographed from different angles and with blurring; in total, 375 images were obtained. The dataset presented in [2] uses three common types of paper and covers various kinds of distortion or damage, such as blurring, shaking, different lighting conditions, combinations of distortions in one image, photographs with one or more distortions at the same time, and the use of several kinds of smartphones, which makes this dataset more reliable. The dataset presented in [3] covers some aspects of the scene, such as the lighting conditions; a simple background was used, and a robotic arm was used to take the photographs in order to eliminate camera shake. With the same idea, [4] presented a video dataset with five categories ranging from simple to complex, all with the same content and background; it consists of videos of 20 frames each, and images are extracted from these frames. Different smartphones were used to capture device-induced degradation, and different documents were used as well. A total of 4,260 different images of 30 documents were taken.

Paper [5] presents a Mobile Identity Document Video dataset (MIDV-500) consisting of 500 videos of 50 distinct identity document types with ground truth, enabling research on a wide variety of document processing problems. The paper describes the characteristics of the dataset and gives evaluation results for existing techniques for face detection, text line recognition, and data extraction from document fields. Because the sensitivity of identity documents, which contain personal data, is a critical concern, all images of source documents used in MIDV-500 are either in the public domain or released under public copyright licenses. In paper [6], a new document dataset is presented that is closer to real-world photographs taken by users. The data is categorized into simple, middle, and complex detection tasks. It includes nearly all the challenges and contains various document sizes, types, and backgrounds. It compares the results of document localization methods with well-known methods and mobile applications.

2.2 Document Localization Methods

Due to these challenges, it is not possible to digitize documents using smartphones without preprocessing or post-processing and expect good results in all situations. That is why algorithms have been proposed to improve the results. The effect of imaging algorithms on the result can be divided into three categories: (1) reducing challenges before capturing, (2) fixing problems while taking the picture, and (3) solving challenges after capturing. One of the earliest methods of document localization was based on modeling the background for segmentation. The background was modeled by taking a photograph of the background without the document, and the difference between the two images was used to determine where the document was located. This approach had the obvious disadvantage that the camera had to be kept stationary and two photographs had to be taken [7]. In general, the algorithms used to find the document in the image can be divided into three categories: (1) use of additional hardware, (2) reliance on image processing techniques, and (3) use of machine learning methods. This problem has grown with the spread of smartphones from 2002 to 2021 and can still be improved.

2.2.1 Additional Hardware

In article [8], the authors provide guidance for the user to capture images with fewer challenges, based on quality features. As a result, the photograph requires much less preprocessing to localize the document. This technique was not very user-friendly because of its constraints and the resulting slowdown of digitization. Article [9] used this technique for localization. Following preprocessing, further algorithms are required to complete the localization task. These algorithms can be divided into categories: (1) use of additional hardware, (2) machine vision techniques, and (3) application of deep learning algorithms. A scanning application is presented in [10] that includes real-time page recognition, quality assessment, and automatic detection of a covered page [11] while scanning books; a portable device for positioning the smartphone during scanning is also presented. Another paper that used additional hardware introduces the scale-invariant feature transform (SIFT) into a paper detection system [12]. The hardware of this paper detection system consists of a digital signal processor and a complex programmable logic device; the equipment can acquire and process images, and the system's software uses the SIFT technique to detect the papers. Compared to the conventional approach, this algorithm handles the detection process better. In paper [13], paper detection requires a sheet of paper with certain patterns printed on it; it takes computer vision technology one step closer to being used in the field.

2.2.2 Machine Vision Techniques

The algorithm of [14] operates by locating candidate line segments from horizontal scan lines. Detected line segments are extended or merged with neighboring scan-line text segments to produce larger text blocks, which are subsequently filtered and refined. Paper [15] presents a text spotting system for video frames with complex backgrounds. The morphological method proposed in [16] is insensitive to noise, skew, and text orientation, and is therefore free of the artifacts caused by both fixed/optimal global thresholding and fixed-size block-based local thresholding. [17] proposes a morphology-based method for extracting key contrast characteristics as cues for searching for suitable license plates. The contrast feature is robust to lighting changes and invariant to several transformations such as scaling, translation, and skewing. Paper [18] applies edge detection and uses a low threshold to filter out non-text edges. Then, a local threshold is selected both to retain low-contrast text and to simplify the complex background of high-contrast text. Following that, text-region enhancement operators are proposed to emphasize regions with high edge strength or density.

[19] describes a step-by-step method for locating candidate regions from the input image using gradient information, identifying the plate region among the candidates, and refining the region's border by matching a plate template. In the paper on extracting text from video frames [20], the corner points of the selected video frames are detected; after deleting some isolated corners, the remaining corners are merged to form candidate text regions. In [21], target frames are selected at fixed time intervals from shots detected by a scene-change detection approach. A color histogram is used to perform segmentation by clustering colors around the color peaks of each selection.

The approach of [22] locates candidate regions directly in the DCT compressed domain by using the intensity variation information encoded in the DCT domain. Paper [23] uses a clean background in the pictures to locate the regions of interest (ROI). [24] proposes a linear-time line segment detector that gives reliable results with a limited number of false detections and requires no parameter tuning; the method is evaluated and compared against state-of-the-art techniques on a large set of natural images. Using Geodesic Object Proposals [26], a technique for detecting candidate documents in a given image is described in [25]. The input images were downsampled to aid the extraction of structures/features of interest, to reduce noise, and to improve runtime speed and accuracy. The results indicated that using Geodesic Object Proposals for the document object identification task is promising. Also, the operators of [27] are related to the max-tree and min-tree representations of documents in images. In paper [28], a simple-to-write algorithm is proposed to compute the tree of shapes; when the data quantization is low, it works for nD images and has quasi-linear complexity.

The methodology of [29] is based on projection profiles combined with a connected component labeling process. Signal cross-correlation is also used to verify the detected noisy text regions. Several distinct steps are used for this task in [30]: a preprocessing stage using a low-pass Wiener filter, a rough estimation of foreground regions, a background surface computation by interpolating neighboring background intensities, thresholding by combining the computed background surface with the original image, and finally a post-processing step to improve the quality of the text regions and preserve stroke connectivity. To remove the skew effect from digitized documents, [31] proposed that every horizontal text line intersects a predefined set of vertical lines at non-horizontal positions; just by using the pixels on such vertical lines, a correlation matrix is created and the document's skew angle is calculated with great precision. For the task of whiteboard documents, [32] developed a robust feature-based method to automatically stitch multiple overlapping images. The approach proposed in [33] is based on the combinatorial construction of candidate quadrangles from a set of line segments, together with projective document reconstruction under a known focal length. For line detection, the Fast Hough Transform [34] is applied, and a 1D version of the edge detector is presented with the algorithm. Three localization algorithms are given in article [35]; all of them employ feature points, and two of them additionally examine near-horizontal and near-vertical lines in the image. The method proposed in [36] is a highly accurate document localization approach for detecting the document's four corner points in natural settings. The four corners are roughly estimated in the first step using a deep neural network-based Joint Corner Detector (JCD) with an attention mechanism, which uses a selective attention map to roughly locate the document.

2.2.3 Machine Learning Methods

Paper [37] presents a CNN-based method that accurately localizes documents in real time and models the localization problem as a key point detection problem. The four corners of the documents are jointly estimated by a deep convolutional neural network. In paper [38], the type of the document is first detected and the images are classified; then, knowing the document type, a matched localization method is applied to the document, which facilitates data extraction. Furthermore, another method presented a new use of U-Net for document localization in pictures taken by smartphones [39].

3 Methodology

We model the problem of document localization as key point detection. The method needs ground truth in the form of a mask separating the document part from the non-document part. We represent the document with white (255) and the non-document part with black (0).
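As a concrete illustration of such a mask, the following sketch (our reconstruction, not the authors' released code) builds a binary ground-truth mask from four annotated corner points of a document using OpenCV; the function name and corner coordinates are hypothetical.

```python
import numpy as np
import cv2

def make_document_mask(image_height: int, image_width: int,
                       corners: np.ndarray) -> np.ndarray:
    """Build a ground-truth mask: document pixels = 255, background = 0.

    `corners` is a (4, 2) array of (x, y) corner points of the document.
    """
    mask = np.zeros((image_height, image_width), dtype=np.uint8)
    cv2.fillPoly(mask, [corners.astype(np.int32)], color=255)
    return mask

# Hypothetical example: a document quadrilateral inside a 480 x 640 image
corners = np.array([[50, 80], [590, 60], [600, 420], [40, 440]])
mask = make_document_mask(480, 640, corners)
```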

3.1 Dataset Preparation

We use the datasets [1,2,3,4,5] for training and validation and make the image sizes uniform by zero-padding them to the maximum image height and width found among the pictures. We use dataset [6] as the test dataset for evaluating and comparing the proposed method with previous methods and mobile applications.
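A minimal sketch of this padding step is given below, under the assumption that all images have been loaded as NumPy arrays; the function and variable names are ours, not from the original code.

```python
import numpy as np

def pad_to_size(image: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
    """Zero-pad an H x W x C image at the bottom and right to the target size."""
    h, w = image.shape[:2]
    return np.pad(image, ((0, target_h - h), (0, target_w - w), (0, 0)),
                  mode="constant")

def pad_dataset(images):
    """Pad every image to the maximum height/width in the collection."""
    max_h = max(im.shape[0] for im in images)
    max_w = max(im.shape[1] for im in images)
    return [pad_to_size(im, max_h, max_w) for im in images]
```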

3.2 Using Deep Neural Networks

For the task of document localization in pictures taken by smartphones, we used the DeepLabv3 [40] method and fine-tuning to retrain the last few layers of the DeepLab neural network. This network benefits from deconvolution. In this work, we have treated locating the document in the pictures as a semantic segmentation task, since convolutional neural networks have shown excellent performance in semantic image segmentation.

We exploit the flexibility of DeepLabv3 to reduce computational complexity by using various neural network backbones in the semantic segmentation component, such as MobileNet [41]. DeepLabv3 with MobileNetV2 has 2.11M parameters in total. We use MobileNetV2 as a feature extractor in a simplified version of DeepLabv3 to enable on-device semantic segmentation. The resulting model achieves performance comparable to using MobileNetV1 as a feature extractor (Fig. 1).

Fig. 1. DeepLabv3 using the MobileNetV2 neural network architecture
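A minimal PyTorch sketch of this architecture choice is shown below. It is our illustrative reconstruction, not the paper's code: torchvision does not ship a ready-made DeepLabv3 model with a MobileNetV2 backbone, so the sketch combines the MobileNetV2 feature extractor with torchvision's DeepLabHead (ASPP plus classifier).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2
from torchvision.models.segmentation.deeplabv3 import DeepLabHead

class DeepLabV3MobileNetV2(nn.Module):
    """Illustrative DeepLabv3-style segmenter with a MobileNetV2 backbone."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # ImageNet-pretrained features; final feature map has 1280 channels
        self.backbone = mobilenet_v2(weights="IMAGENET1K_V1").features
        self.head = DeepLabHead(1280, num_classes)  # ASPP + classifier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[-2:]
        feats = self.backbone(x)
        logits = self.head(feats)
        # upsample logits back to the input resolution
        return F.interpolate(logits, size=size, mode="bilinear",
                             align_corners=False)

model = DeepLabV3MobileNetV2(num_classes=2)  # document vs. background
```

Taking the per-pixel argmax over the two output channels yields the binary document mask.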

The DeepLab approach has three main components. First, we emphasize atrous convolution [42], or convolution with upsampled filters, as a useful technique in dense prediction problems. Within deep convolutional neural networks, atrous convolution allows us to explicitly control the resolution at which feature responses are computed. It also enables us to effectively enlarge the field of view of filters to include more context without increasing the number of parameters or the computation time.
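For clarity, the following toy snippet (ours, not from the paper) contrasts a standard 3x3 convolution with an atrous one in PyTorch: the dilated filter covers a wider context with the same number of parameters and the same output resolution.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)

conv = nn.Conv2d(256, 256, kernel_size=3, padding=1)                 # 3x3 field of view
atrous = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)   # 5x5 field of view

# Same parameter count, same spatial resolution, larger field of view
assert conv(x).shape == atrous(x).shape == x.shape
```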

Second, we use an atrous spatial pyramid pooling (ASPP) method for segmenting objects at different scales with high accuracy. ASPP probes an input convolutional feature layer with filters at multiple sampling rates and effective fields of view, capturing objects and image context at several scales. Third, we combine approaches from DCNNs and probabilistic graphical models to enhance object boundary localization. In CNNs, the frequently used combination of max-pooling and downsampling produces invariance but at the cost of localization accuracy; this is addressed by integrating the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which has been shown to increase localization accuracy both qualitatively and quantitatively.

In semantic segmentation, there are both large and small instances that need to be segmented. If convolution kernels of the same size are used everywhere, the receptive field may not be large enough and the segmentation accuracy for large objects may decrease. Atrous convolution was created in response to this problem: the dilation rate is adjusted to modify the convolution kernel's receptive field. On the other hand, the effect of atrous convolution in a single-branch convolutional network is not beneficial, and if we continue to use smaller atrous convolutions to recover the information of small objects, a large redundancy will result. ASPP uses dilation rates of different sizes to capture information at different scales in the network decoder. Each scale is an independent branch; the branches are merged at the end of the network, and a convolution layer then produces the output used to predict the labels. This approach successfully eliminates the gathering of unnecessary information in the encoder, allowing the encoder to focus only on the object correlations.
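As a sketch of this idea, torchvision's ASPP module can be instantiated directly; the channel sizes and dilation rates below are assumptions for illustration, not the exact configuration used in the paper.

```python
import torch
from torchvision.models.segmentation.deeplabv3 import ASPP

# Parallel atrous convolutions at several dilation rates plus image-level
# pooling, concatenated and fused by a 1x1 convolution.
aspp = ASPP(in_channels=1280, atrous_rates=[6, 12, 18], out_channels=256)

feats = torch.randn(1, 1280, 32, 32)   # e.g. a MobileNetV2 feature map
out = aspp(feats)
print(out.shape)                       # torch.Size([1, 256, 32, 32])
```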

The training stage needs a different ground truth from that provided in paper [6], so we produce a masked ground truth (Fig. 2) in which the document part and the non-document part are differentiated with white and black: the document is white (255) and the non-document part is black (0). After freezing the intended layers, the final network was updated and implemented under Ubuntu Linux 16.04 LTS. A STRIX-GTX1080-O8G graphics card and a Core i7-6900K processor with 32 GB of RAM were used for training and testing the network.

Fig. 2. Image sample with masked ground truth
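The sketch below illustrates such fine-tuning with frozen layers, reusing the DeepLabV3MobileNetV2 class from the earlier sketch; the optimizer, learning rate, and the synthetic batch are placeholders, not the paper's actual training configuration.

```python
import torch

model = DeepLabV3MobileNetV2(num_classes=2)   # defined in the earlier sketch
for p in model.backbone.parameters():          # freeze the pretrained backbone
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training step on a synthetic batch; in practice, iterate
# over the padded training images and their masked ground truth
images = torch.randn(2, 3, 512, 512)
masks = torch.randint(0, 2, (2, 512, 512))     # 0 = background, 1 = document
loss = criterion(model(images), masks)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```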

4 Experiments and Results

4.1 Evaluation Protocol

The IoU metric described in [43] has been used for evaluation. First, the perspective effect is removed from the ground truth (G) and the prediction (S) with the help of the image size. We call the corrected regions (G′) and (S′), respectively, so that the IoU, or Jaccard index, is:

$$\mathrm{IoU} = \frac{\mathrm{area}(G' \cap S')}{\mathrm{area}(G' \cup S')}$$
(1)

The final result is the average of the IoU values over all images.
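A minimal sketch of this metric on binary masks (after the perspective-correction step, which is not shown here) could look as follows; the function name is ours.

```python
import numpy as np

def jaccard_index(gt_mask: np.ndarray, pred_mask: np.ndarray) -> float:
    """IoU between two binary document masks (nonzero = document)."""
    gt, pred = gt_mask.astype(bool), pred_mask.astype(bool)
    union = np.logical_or(gt, pred).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(gt, pred).sum()) / float(union)

# Final score: average IoU over all test images
# mean_iou = np.mean([jaccard_index(g, p) for g, p in zip(gt_masks, preds)])
```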

4.2 Results

The result is compared with all well-known methods, algorithms, and mobile applications that can solve the document localization task in images. The other methods' results are taken from [6] (Fig. 3).

Fig. 3. The result of the proposed method on dataset [6]

Table 1 presents the final results in the different categories, and Fig. 4 shows the results in comparison with the previous methods. We run the model on the test dataset and compare our results to the previously published results on the same dataset [6]. While this test is sufficient to evaluate the method's accuracy on the competition dataset, it has drawbacks. First, it cannot show how well our method generalizes to unknown content, because several samples of that content were used for training. Second, it is not able to provide information about how well our framework generalizes to unseen documents with similar content.

We cross-validate our method by removing each content type from the training set and then testing on the removed content, in order to uncover weaknesses and measure generalization to unseen content. The final result is compared with all well-known methods, algorithms, and mobile applications that can solve the document localization task in images; the other methods' results are compared based on [6] (Fig. 3). Our method successfully generalizes to unseen documents, as the results in Fig. 4 demonstrate.

This is not unreasonable, given that the low resolution of the input image prevents our model from relying on features of the document's layout or content. We also conclude from the results that our method generalizes well to unseen simple content. It is important to mention that the method is designed to be efficient on a device with limited resources, such as a midrange smartphone, without using cloud or server-based resources. In the current implementation, the frames for the four corners are processed in sequence. The images could instead be processed at once by running a batch of four images through the model, which should result in a substantial increase in efficiency.

Table 1. The result of the proposed method on dataset [6]
Fig. 4. The result of the proposed method compared with previous methods

5 Conclusion

In this paper, we presented a new application of DeepLabv3 with MobileNetV2 for document localization in pictures taken by smartphones. The final result is the best among the methods for this task. We used all the reliable datasets for this task, and the primary dataset for evaluation is the newly collected dataset with a diverse range of document localization challenges. Finally, we provide an application built with the Kivy framework.

We also describe some practical techniques that can be implemented using software tools such as Python, PyTorch, TensorFlow, and OpenCV. In addition, we present a new approach for locating documents in pictures: the localization problem is modeled as a key point detection problem. We show that this approach generalizes well to new and unseen documents using a deep convolutional network.