The image and ground truth dataset of Mongolian movable-type newspapers for text recognition

  • Original Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

OCR approaches have advanced considerably in recent years thanks to the resurgence of deep learning. However, to the best of our knowledge, there is little work on Mongolian movable-type document recognition. One major hurdle is the lack of a domain-specific, well-labeled dataset for training robust models. This paper aims to create the first Mongolian movable-type text-image dataset for OCR research. We collected 771 paragraph-level pages segmented from 34 newspapers published between 1947 and 1952. For each page, word- and line-level text transcriptions and boundary annotations are recorded. In total, the dataset contains 86,578 word occurrences and 9711 text-line images, with a vocabulary of 7964 words. The dataset was built from scratch through image collection, text transcription, text-image alignment and manual correction. Moreover, an official train/test partition is defined, on which typical text segmentation and recognition experiments are conducted to establish strong baselines. The dataset is available for research, and we encourage researchers to develop and test new methods using it.
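To make the described annotation structure concrete, the sketch below shows one way page records with line-level transcriptions and boundaries might be loaded. The JSON layout, file names and field names are assumptions made purely for illustration; they are not the released dataset's actual format.

    # A minimal sketch of reading line-level annotations, assuming a hypothetical
    # JSON layout with one record per page; the released dataset's actual file
    # format and field names may differ.
    import json
    from dataclasses import dataclass
    from typing import List, Tuple


    @dataclass
    class LineAnnotation:
        text: str                          # line-level transcription (Unicode Mongolian)
        bbox: Tuple[int, int, int, int]    # boundary as (x, y, width, height)


    def load_page(path: str) -> List[LineAnnotation]:
        """Parse one page record into a list of text-line annotations."""
        with open(path, encoding="utf-8") as f:
            page = json.load(f)
        return [
            LineAnnotation(text=line["text"], bbox=tuple(line["bbox"]))
            for line in page["lines"]
        ]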


Notes

  1. http://www.iam.unibe.ch/fki/databases/iam-historical-document-database.

  2. https://www.primaresearch.org/datasets/ENP.

  3. https://www.digitization.edu.

  4. Latin: oNgerebel, Unicodes: U+1825 U+1829 U+182D U+1821 U+1837 U+1821 U+182A U+1821 U+182F, meaning: walk over, past.

  5. Latin: tologelegqid, Unicodes: U+1832 U+1825 U+182F U+1825 U+182D U+1821 U+182F U+1821 U+182D U+1834 U+1822 U+1833, meaning: Representatives.

  6. Latin: undusuten, Unicodes: U+1826 U+1828 U+1833 U+1826 U+1830 U+1826 U+1832 U+1821 U+1828, meaning: nation, race.

  7. Latin: uiledburilel, Unicodes: U+1826 U+1822 U+182F U+1821 U+1833 U+182A U+1826 U+1837 U+1822 U+182F U+1821 U+182F, meaning: Industry.

  8. Latin: beyeleguluged, Unicodes: U+182A U+1821 U+1836 U+1821 U+182F U+1821 U+182D U+1826 U+182F U+1826 U+182D U+1821 U+1833, meaning: finished.

  9. Latin: burilduhun, Unicodes: U+182A U+1826 U+1837 U+1822 U+182F U+1833 U+1826 U+182C U+1826 U+1828, meaning: component.

  10. Latin: bvlbasvrajv, Unicodes: U+182A U+1824 U+182F U+182A U+1820 U+1830 U+1824 U+1837 U+1820 U+1835 U+1824, meaning: training, exercising.

  11. Latin: yabvgvlvgsan, Unicodes: U+1836 U+1820 U+182A U+1824 U+182D U+1824 U+182F U+1824 U+182D U+1830 U+1820 U+1828, meaning: let somebody go.

  12. Latin: yarilqahv, Unicodes: U+1836 U+1820 U+1837 U+1822 U+182F U+1834 U+1820 U+182C U+1824, meaning: talk.

  13. Latin: tegsidhen, Unicodes: U+1832 U+1821 U+182D U+1830 U+1822 U+1833 U+182C U+1821 U+1828, meaning: average.
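The "U+XXXX" code-point lists in the notes above all fall in the Unicode Mongolian block (U+1800–U+18AF). The small helper below, an illustrative sketch rather than part of the dataset tooling, shows how such a list decodes into an actual Mongolian-script string, using footnote 4 as the example.

    # Decode a space-separated "U+XXXX" code-point list into a Unicode string.
    def codepoints_to_text(codepoints: str) -> str:
        return "".join(chr(int(cp.removeprefix("U+"), 16)) for cp in codepoints.split())


    # Footnote 4: Latin "oNgerebel", meaning "walk over, past".
    word = codepoints_to_text("U+1825 U+1829 U+182D U+1821 U+1837 U+1821 U+182A U+1821 U+182F")
    print(word, len(word))  # prints the 9-letter Mongolian word and its length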


Author information

Corresponding author

Correspondence to Feilong Bao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Lu, M., Bao, F., Zhang, H. et al. The image and ground truth dataset of Mongolian movable-type newspapers for text recognition. IJDAR (2023). https://doi.org/10.1007/s10032-023-00450-x
