Extracting text from scanned Arabic books: a large-scale benchmark dataset and a fine-tuned Faster-R-CNN model

Abstract

Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on a fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and an F1 score of 99.1% for text versus non-text block classification on the 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.


Notes

  1. https://github.com/wdqin/BE-Arabic-9K.

  2. https://github.com/wdqin/BE-Arabic-9K.

  3. https://github.com/wdqin/BE-Arabic-9K.

  4. https://github.com/wdqin/BE-Arabic-9K.

  5. https://www.mturk.com/mturk/welcome.

  6. https://www.crowdflower.com/ (re-branded as 'Figure Eight' starting 2018).

References

  1. Abdelaziz, I., Abdou, S.: Altecondb: a large-vocabulary arabic online handwriting recognition database. arXiv:1412.7626 (2014)

  2. Dobais, M.A.A, Alrasheed, F.A.G., Latif, G., Alzubaidi, L.: Adoptive thresholding and geometric features based physical layout analysis of scanned arabic books. In: 2018 IEEE 2nd international workshop on arabic and derived script analysis and recognition (ASAR), pp. 171–176. IEEE (2018)

  3. Albadi, N., Kurdi, M., Mishra, S.: Are they our brothers? Analysis and detection of religious hate speech in the arabic twittersphere. In: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 69–76 (2018)

  4. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv:2004.10934 (2020)

  5. Almutairi, A., Almashan, M.: Instance segmentation of newspaper elements using mask R-CNN. In: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 1371–1375. IEEE (2019)

  6. Alshameri, A., Abdou, S., Mostafa, K.: A combined algorithm for layout analysis of Arabic document images and text lines extraction. Int. J. Comput. Appl. 49(23), 30–37 (2012)

  7. ALTEC dataset. http://www.altec-center.org/conference/?page_id=87

  8. Amazon Mechanical Turk. https://www.mturk.com/mturk/welcome

  9. The ASAR Physical Layout Analysis Challenge at the 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition, London, U.K., March 2018. https://asar.ieee.tn/competition/

  10. 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition, London, U.K., March 2018

  11. Asi, A., Cohen, R., Kedem, K., El-Sana, J., Dinstein, I.: A coarse-to-fine approach for layout analysis of ancient manuscripts. In: 14th International Conference on Frontiers in Handwriting Recognition, pp. 140–145 (2014)

  12. Barakat, B., Droby, A., Kassis, M., El-Sana, J.: Text line segmentation for challenging handwritten document images using fully convolutional network. In: 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 374–379 (2018)

  13. Barakat, B.K., El-Sana, J.: Binarization free layout analysis for arabic historical documents using fully convolutional networks. In: 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), pp. 151–155. IEEE (2018)

  14. Belaïd, A., Ouwayed, N.: Segmentation of ancient Arabic documents. In: Märgner, V., El Abed, H. (eds.) Guide to OCR for Arabic Scripts, pp. 103–122. Springer, London (2012)

  15. Boussellaa, W., Zahour, A., Taconet, B., Alimi, A., Benabdelhafid, A.: PRAAD: preprocessing and analysis tool for Arabic ancient documents. In: 9th International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 1058–1062 (2007)

  16. Bukhari, S.S., Azawi, A., Ali, M.I., Shafait, F., Breuel, T.M.: Document image segmentation using discriminative learning over connected components. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, Boston, pp. 183–190 (2010)

  17. Bukhari, S.S., Breuel, T.M., Asi, A., El Sana, J.: Layout analysis for arabic historical document images using machine learning. In: 2012 International Conference on Frontiers in Handwriting Recognition, pp. 639–644. IEEE (2012)

  18. Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.: Albumentations: fast and flexible image augmentations. Information 11(2), 125 (2020)

  19. Chen, K., Liu, C.L., Seuret, M., Liwicki, M., Hennebert, J., Ingold, R.: Page segmentation for historical document images based on superpixel classification with unsupervised feature learning. In: 12th IAPR workshop on document analysis systems (DAS), pp. 299–304 (2016)

  20. Chen, K., Seuret, M., Liwicki, M., Hennebert, J., Ingold, R.: Page segmentation of historical document images with convolutional autoencoders. In: 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1011–1015 (2015)

  21. Cotterell, R., Callison-Burch, C.: A multi-dialect, multi-genre corpus of informal written arabic. In: LREC, pp. 241–245 (2014)

  22. Cotterell, R., Callison-Burch, C.: A multi-dialect, multi-genre corpus of informal written Arabic. In: LREC, pp. 241–245 (2014)

  23. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. Comput. Vis. Pattern Recognit. CVPR 2009, 248–255 (2009)

  24. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei, L.F.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

  25. Abed, H.E., Märgner, V., Kherallah, M., Alimi, A.M.: ICDAR 2009 online arabic handwriting recognition competition. In: 2009 10th International Conference on Document Analysis and Recognition, pp. 1388–1392. IEEE (2009)

  26. El-Mawass, N., Alaboodi, S.: Detecting arabic spammers and content polluters on twitter. In: Sixth International Conference on Digital Information Processing and Communications (ICDIPC), pp. 53–58 (2016)

  27. Elanwar, R., Betke, M.: The ASAR 2018 competition on physical layout analysis of scanned arabic books (PLA-SAB 2018). In: 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), pp. 177–182. IEEE (2018)

  28. Elanwar, R., Qin, W., Betke, M.: Making scanned arabic documents machine accessible using an ensemble of SVM classifiers. Int. J. Doc. Anal. Recognit. (IJDAR) 21(1–2), 59–75 (2018)

  29. Farra, N., McKeown, K., Habash, N.: Annotating targets of opinions in Arabic using crowdsourcing. In: Second workshop on Arabic natural language processing, pp. 89–98 (2015)

  30. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)

  31. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)

  32. Hadjar, K., Ingold, R.: Arabic newspaper page segmentation. In: 7th International Conference on Document Analysis and Recognition, pp. 895–899 (2003)

  33. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)

  34. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  35. Hesham, A.M., Rashwan, M.A.A., Barhamtoshy, H.M.A., Abdou, S.M., Badr, A.A., Farag, I.: Arabic document layout analysis. Pattern Anal. Appl. 20(4), 1275–1287 (2017)

  36. Kassis, M., El-Sana, J.: Scribble based interactive page layout segmentation using gabor filter. In: 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 13–18 (2016)

  37. Ibn Khedher, M., Jmila, H., El-Yacoubi, M.A.: Automatic processing of historical arabic documents: a comprehensive survey. Pattern Recognit. 100, 107144 (2020)

  38. Lawson, N., Eustice, K., Perkowitz, M., Yetisgen-Yildiz, M.: Annotating large email datasets for named entity recognition with Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 71–79 (2010)

  39. LabelMe tool. http://labelme.csail.mit.edu/Release3.0/

  40. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)

  41. Mahmoud, S.A., Ahmad, I., Khatib, W.G.A., Alshayeb, M., Parvez, M.T., Märgner, V., Fink, G.A.: KHATT: an open arabic offline handwritten text database. Pattern Recognit. 47(3), 1096–1112 (2014)

  42. Mahmoud, S.A., Luqman, H., Al-Helali, B.M., BinMakhashen, G., Parvez, M.T.: Online-khatt: an open-vocabulary database for arabic online-text processing. Open Cybern. Syst. J. 12(1), 42–59 (2018)

  43. Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., Zhou, M.: DocBank: a benchmark dataset for document layout analysis. arXiv:2006.01038 (2020)

  44. Neche, C., Belaid, A., Kacem-Echi, A.: Arabic handwritten documents segmentation into text-lines and words using deep learning. In: International Conference on Document Analysis and Recognition Workshops (ICDARW), pp. 19–24 (2019)

  45. Nikolaou, N., Makridis, M., Gatos, B., Stamatopoulos, N., Papamarkos, N.: Segmentation of historical machine-printed documents using adaptive run length smoothing and skeleton segmentation paths. Image Vis. Comput. 28(4), 590–604 (2010)

  46. Pastor-Pellicer, J., Afzal, M.Z., Liwicki, M., Castro-Bleda, M.J.: Complete system for text line extraction using convolutional neural networks and watershed transform. In: 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 30–35 (2016)

  47. Pechwitz, M., Maddouri, S.S., Märgner, V., Ellouze, N., Amiri, H.: IFN/ENIT-database of handwritten arabic words. In: Proceedings of CIFED, volume 2, pp. 127–136. Citeseer (2002)

  48. Pletschacher, S., Antonacopoulos, A.: The PAGE (page analysis and ground-truth elements) format framework. In: 20th International Conference on Pattern Recognition (ICPR), pp. 257–260 (2010)

  49. PyTorch system of libraries and tools for machine learning. https://pytorch.org/ (2020)

  50. Rashtchian, C., Young, P., Hodosh, M., Hockenmaier, J.: Collecting image annotations using Amazon's Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 139–147 (2010)

  51. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp. 91–99 (2015)

  52. Saad, R.S.M., Elanwar, R., Abdel Kader, N.S., Mashali, S., Betke, M.: ASAR 2018 layout analysis challenge: using random forests to analyze scanned Arabic books. In: 2nd IEEE International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR 2018), London, March 2018, p. 6 (2018)

  53. Saad, R.S.M., Elanwar, R.I., Abdel Kader, N.S., Mashali, S., Betke, M.: BCE-Arabic-v1 dataset: towards interpreting arabic document images for people with visual impairments. In: Proceedings of the 9th ACM International Conference on Pervasive Technologies Related to Assistive Environments, pp. 1–8 (2016)

  54. Shafait, F., Keysers, D., Breuel, T.M.: Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 30(6), 941–954 (2008)

  55. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)

  56. Slimane, F., Ingold, R., Kanoun, S., Alimi, A.M., Hennebert, J.: A new arabic printed text image database and evaluation protocols. In: 2009 10th International Conference on Document Analysis and Recognition, pp. 946–950. IEEE (2009)

  57. Strassel, S.: Linguistic resources for arabic handwriting recognition. In: Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt (2009)

  58. Studer, L., Alberti, M., Pondenkandath, V., Goktepe, P., Kolonko, T., Fischer, A., Liwicki, M., Ingold, R.: A comprehensive study of imagenet pre-training for historical document image analysis. In: 15th International Conference on Document Analysis and Recognition (ICDAR), pp. 720–725 (2019)

  59. Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30(11), 1958–1970 (2008)

  60. Wei, H., Seuret, M., Chen, K., Fischer, A., Liwicki, M., Ingold, R.: Selecting autoencoder features for layout analysis of historical documents. In: ACM 3rd International Workshop on Historical Document Imaging and Processing, pp. 55–62 (2015)

  61. Wick, C., Puppe, F.: Fully convolutional neural networks for page segmentation of historical document images. In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 287–292. IEEE (2018)

  62. Wray, S., Mubarak, H., Ali, A.: Best practices for crowdsourcing dialectal arabic speech transcription. In: ANLP Workshop, p. 99 (2015)

  63. Wray, S., Mubarak, H., Ali, A.: Best practices for crowdsourcing dialectal arabic speech transcription. In: Proceedings of the Second Workshop on Arabic Natural Language Processing, pp. 99–107 (2015)

  64. Zaidan, O.F., Callison-Burch, C.: The Arabic online commentary dataset: an annotated dataset of informal Arabic with high dialectal content. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers, 2:37–41 (2011)

  65. Zaidan, O.F., Callison-Burch, C.: Arabic dialect identification. Comput. Linguist. 40(1), 171–202 (2014)

  66. Zaidan, O.F., Callison-Burch, C.: The arabic online commentary dataset: an annotated dataset of informal arabic with high dialectal content. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies: short papers-volume 2, pp. 37–41. Association for Computational Linguistics (2011)

  67. Zaidan, O.F., Callison-Burch, C.: Arabic dialect identification. Comput. Linguist. 40(1), 171–202 (2014)

  68. Zhong, X., Tang, J., Jimeno Yepes, A.: PubLayNet: largest dataset ever for document layout analysis. In: 15th International Conference on Document Analysis and Recognition (ICDAR) (2019)

Acknowledgements

The authors would like to thank the library staff at the Mugar Library at Boston University, the Rotch Library at MIT, and the Widener and Fine Arts libraries at Harvard University for facilitating the collection process of our dataset BE-Arabic-9K. The authors thank the National Science Foundation (1838193) and the Hariri Institute for Computing at Boston University for partial support of this work.

Author information

Corresponding author

Correspondence to Randa Elanwar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Document image datasets

Collection and annotation

Large annotated datasets are a crucial resource for supervised machine learning. Researchers often spend a considerable amount of time annotating their self-collected datasets because they cannot find a publicly available dataset that matches their research needs. As a result, they may end up conducting their research on datasets of limited size.

Unless a research dataset requires special characteristics, the data collection phase is not as challenging or as expensive as data annotation. For example, the internet archive has billions of unlabeled images that could be downloaded using web search crawlers, an approach that has been followed before to construct large public computer vision datasets like TinyImage [59] and ImageNet [23]. The annotation phase is what controls the research outcome. Annotations should match the research question, meaning that a single image can have multiple annotations at several levels of detail. Accordingly, the annotation process has always been expert-based and problem-oriented, and therefore expensive and time-consuming.

Dataset collection for document analysis and recognition is among the most challenging tasks compared to other research areas. One might expect transcripts to be the only annotation needed for document images. However, depending on the research problem, the required annotations might include much more information, such as:

  1. Segmentation information: locating the text position inside the document image (i.e., bounding box coordinates).

  2. Logical labeling: identifying the logical function of the text (e.g., title, caption, footnote).

  3. Reading order: the order in which the text is read (especially in multi-column layouts).

  4. Geometrical labeling: classifying the type of each non-text element (e.g., image, chart, map, logo, math formula).

  5. Descriptions: the alternative text for image elements, and cell functions and relations for table elements.
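
The annotation types listed above can be pictured as fields of a single per-region record. The following is a minimal sketch in Python, not the PAGE schema or the format used for the dataset; the class names, field names, file name, and label strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Region:
    # Segmentation information: bounding box (x_min, y_min, x_max, y_max) in pixels.
    bbox: Tuple[int, int, int, int]
    # Physical type of the region: "text" or "non-text".
    kind: str
    # Logical labeling for text regions, e.g. "title", "caption", "footnote".
    logical_label: Optional[str] = None
    # Geometrical labeling for non-text regions, e.g. "image", "chart", "map".
    geometric_label: Optional[str] = None
    # 0-based position in the reading order; None when it does not apply.
    reading_order: Optional[int] = None
    # Transcript for text regions, or alternative text for image regions.
    description: Optional[str] = None


@dataclass
class PageAnnotation:
    image_file: str
    regions: List[Region] = field(default_factory=list)

    def text_in_reading_order(self) -> List[Region]:
        """Return the text regions sorted by their reading-order index."""
        text = [r for r in self.regions
                if r.kind == "text" and r.reading_order is not None]
        return sorted(text, key=lambda r: r.reading_order)


# Hypothetical page with a title, a paragraph, and an image.
page = PageAnnotation("page_001.png", [
    Region((50, 400, 550, 900), "text", logical_label="paragraph", reading_order=1),
    Region((50, 40, 550, 100), "text", logical_label="title", reading_order=0),
    Region((100, 120, 500, 380), "non-text", geometric_label="image",
           description="hypothetical alternative text"),
])
ordered_labels = [r.logical_label for r in page.text_in_reading_order()]
```

Keeping all annotation levels on one record makes it easy to see why a single image can carry several annotations at once: a task that only needs segmentation reads `bbox` and `kind`, while a layout analysis task also reads the labels and the reading order.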

Logical labeling of a document's text elements is one of the tasks that depend most on human intelligence, and it sometimes becomes controversial, especially with unfamiliar layouts or in the absence of appropriate text formatting (e.g., font size and emphasis).

Our previous work, BCE-Arabic-v1 [53], was the first attempt to provide a labeled dataset for page segmentation and layout analysis. We investigated the importance of having physically analyzed documents (i.e., segmenting regions and identifying their type as text or non-text), and showed that this is no trivial task and that it significantly improves the OCR results compared to feeding raw images to the system.

Our study highlighted the behind-the-scenes efforts of sample annotation and of selecting the appropriate metadata set and labeling tools to prepare a dataset of document images. We discussed the tools used by researchers and compared them to identify the most suitable one for annotating an Arabic document dataset (the Aletheia tool).

The document image annotation standards were finally set after studying the most common labels and the metadata hierarchy needed to represent document content in many research areas. The PAGE format (created by the Aletheia tool) ended up being the most comprehensive annotation scheme for such a representation.

Crowdsourcing for dataset annotation

Crowdsourcing has recently been used to construct relatively large image, audio, and video research datasets through annotation tasks such as segmentation and labeling.

It has proved to be very fast and cheap compared to the expert-based method. However, it has not yet been commonly used in all research areas.

Crowdsourcing has been used not only for annotating computer vision datasets but also for natural language processing (NLP) datasets. Datasets for named entity recognition [38] and image transcription [50] were annotated using crowdsourcing.

Text corpora and social media tweets and comments have all been annotated through crowdsourcing with transcripts, dialects, sentiment, and people's opinions or orientations. The literature on crowdsourcing for the annotation of Arabic datasets lies entirely in this area [3, 21, 26, 29, 62, 64, 65].

However, as far as we know, ours is the first attempt to use crowdsourcing to annotate a dataset of scanned documents for Arabic document image analysis and recognition research.

Amazon Mechanical Turk (MTurk)

Researchers have used the crowdsourcing services offered by artificial intelligence companies at a very small profit, such as Amazon Mechanical Turk (MTurk; Note 5) and CrowdFlower (CF; Note 6), to collect information or to quickly accomplish small, tedious tasks. Amazon MTurk might be the first and most popular crowdsourcing platform used by researchers.

An MTurk job/HIT. The requester divides the entire task into a large number of small jobs (also called human intelligence tasks, or HITs) that can be done in parallel. The tasks are usually short data entry or information extraction jobs, for example answering questions that involve identifying and/or segmenting an object, or selecting an appropriate label.

A set of instructions on how the task should be performed, with examples of possible instances and common errors, is shown to the workers once they accept the job. The requester also specifies the maximum time allowed for completing the job and the monetary reward for it.

Requesters can select workers based on specific qualifications related to their tasks, and the same HIT can be assigned to multiple workers for quality assurance.

Once the jobs are posted to MTurk, workers accept a job, perform it according to the instructions, and submit it before the specified deadline. Workers might choose long-duration jobs with high rewards or a number of short-duration jobs with smaller rewards.

After the jobs are submitted, the requesters have a deadline to review them and decide whether to pay each worker according to the job quality. Some requesters also choose to offer bonuses beyond the basic reward to high-quality workers.

Rewards can be as low as two cents or as high as tens of dollars, depending on the job difficulty. The rewards do not represent the entire cost of the task, as the budget also includes the service provider's profit (a percentage of the rewards and bonuses).
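
The budget arithmetic described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `fee_rate` value of 20%, and the example numbers are hypothetical placeholders, not figures from this study or from any platform's price list.

```python
def task_budget_cents(reward_cents, num_hits, workers_per_hit=1,
                      bonus_cents=0, fee_rate=0.2):
    """Total requester cost in cents.

    Worker payments are the per-HIT reward times the number of HITs times
    the number of workers assigned to each HIT, plus any bonuses. The
    service provider's share is modeled as a percentage (fee_rate, an
    assumed placeholder) of those rewards and bonuses.
    """
    payments = reward_cents * num_hits * workers_per_hit + bonus_cents
    return payments * (1.0 + fee_rate)


# Hypothetical example: 1000 HITs at 5 cents each, every HIT assigned to
# 3 workers, 500 cents of bonuses in total, and an assumed 20% fee.
total_cents = task_budget_cents(5, 1000, workers_per_hit=3, bonus_cents=500)
```

Because the fee is charged on rewards and bonuses together, assigning each HIT to several workers for quality assurance multiplies both the worker payments and the provider's share.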

The process is completed by quality assurance procedures and tests that detect spammers, ensure agreement between the workers performing the same HIT, evaluate the annotation accuracy, and analyze errors.
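
The text does not say which agreement measure is used between workers. As one possible sketch, assuming bounding-box annotations and categorical labels, agreement on a region could be checked with intersection over union (IoU) and agreement on a label with a majority vote. Both functions, the `min_share` threshold, and the example values below are illustrative assumptions.

```python
from collections import Counter


def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def majority_label(labels, min_share=0.5):
    """Return the most common label if more than min_share of the workers chose it.

    Returns None when no label reaches that share, which flags the HIT
    for expert review instead of forcing a decision.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_share else None


# Two workers drew overlapping boxes around the same region.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))

# Three workers labeled the same region; two of them agree.
agreed = majority_label(["text", "text", "non-text"])

# Two workers disagree, so there is no majority.
undecided = majority_label(["title", "caption"])
```

A requester could accept a HIT automatically when the IoU between workers exceeds a chosen threshold and a majority label exists, and route the remaining cases to manual error analysis.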

About this article

Cite this article

Elanwar, R., Qin, W., Betke, M. et al. Extracting text from scanned Arabic books: a large-scale benchmark dataset and a fine-tuned Faster-R-CNN model. IJDAR 24, 349–362 (2021). https://doi.org/10.1007/s10032-021-00382-4
