Learning to detect, localize and recognize many text objects in document images from few examples

Abstract

The current trend in object detection and localization is to learn predictions with high-capacity deep neural networks trained on very large amounts of annotated data and requiring substantial processing power. In this work, we specifically target the detection of text in document images and propose a new neural model which directly predicts object coordinates. The particularity of our contribution lies in the local computation of predictions, with a new form of local parameter sharing which keeps the overall number of trainable parameters low. Key components of the model are spatial 2D-LSTM recurrent layers, which convey contextual information between the regions of the image. We show that this model is more powerful than the state of the art in applications where training data are not as abundant as in the classical configuration of natural images and ImageNet/Pascal-VOC tasks. The proposed model also facilitates the detection of many objects in a single image and can deal with inputs of variable sizes without resizing. To enhance the localization precision of the coordinate regressor, we limit the amount of information produced by the local model components and propose two different regression strategies: (i) separately predict the lower-left and upper-right corners of each object bounding box, followed by combinatorial pairing; (ii) predict only the left side of each object and estimate the right position jointly with text recognition. These strategies lead to good full-page text recognition results on heterogeneous documents. Experiments have been performed on a document analysis task, the localization of text lines in the Maurdor dataset.
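The combinatorial pairing of strategy (i) can be cast as a linear assignment problem, solvable with the Hungarian algorithm of Munkres [29]. The sketch below is a minimal, hypothetical illustration, not the paper's actual method: it assumes axis-aligned boxes with y growing upward and uses a simple corner-to-corner distance cost of our own design (the function name `pair_corners` and the `INFEASIBLE` penalty are assumptions); the abstract does not specify the cost used.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

INFEASIBLE = 1e6  # penalty for geometrically impossible corner pairs

def pair_corners(lower_left, upper_right):
    """Pair each predicted lower-left corner with an upper-right corner.

    The cost of a pair is the corner-to-corner distance, so compact boxes
    are preferred; pairs whose upper-right corner is not above and to the
    right of the lower-left one receive a large penalty.
    """
    ll = np.asarray(lower_left, dtype=float)   # shape (N, 2), as (x, y)
    ur = np.asarray(upper_right, dtype=float)  # shape (N, 2), as (x, y)
    dx = ur[None, :, 0] - ll[:, None, 0]       # candidate box widths
    dy = ur[None, :, 1] - ll[:, None, 1]       # candidate box heights
    cost = np.hypot(dx, dy)                    # (N, N) pairing costs
    cost[(dx <= 0) | (dy <= 0)] = INFEASIBLE   # reject degenerate boxes
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))

# Two text lines: the cheapest feasible pairing crosses the input order.
pairs = pair_corners([(0, 0), (10, 0)], [(12, 2), (2, 2)])
print(pairs)  # [(0, 1), (1, 0)]
```

Casting the pairing as a global assignment rather than greedy nearest-corner matching guarantees a one-to-one pairing that minimizes the total cost, which matters when corners of neighboring text lines are close together.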

Notes

  1. The purpose of this figure is to show the strategy these models use to translate geometry and resolution into features. In particular, we do not show the actual numbers of layers and units. For SSD [22], we do not show how this model handles multiple scales.

  2. http://pjreddie.com/darknet/yolo.

References

  1. Behnke, S.: Face localization and tracking in the neural abstraction pyramid. Neural Comput. Appl. 14(2), 97–103 (2005)

  2. Bell, S., Zitnick, L., Bala, K., Girshick, R.: Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas (2016)

  3. Bluche, T.: Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. In: Advances in Neural Information Processing Systems, Barcelona (2016)

  4. Bluche, T., Moysset, B., Kermorvant, C.: Automatic line segmentation and ground-truth alignment of handwritten documents. In: International Conference on Frontiers in Handwriting Recognition, Crete (2014)

  5. Brunessaux, S., Giroux, P., Grilheres, B., Manta, M., Bodin, M., Choukri, K., Galibert, O., Kahn, J.: The Maurdor project: improving automatic processing of digital documents. In: Document Analysis Systems, Tours (2014)

  6. Chen, K., Seuret, M., Hennebert, J., Ingold, R.: Convolutional neural networks for page segmentation of historical document images. In: 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, vol. 1, pp. 965–970. IEEE (2017)

  7. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems, Barcelona, pp. 379–387 (2016)

  8. Delakis, M., Garcia, C.: Text detection with convolutional neural networks. In: International Conference on Computer Vision Theory and Applications, Madeira, pp. 290–294 (2008)

  9. Doetsch, P., Zeyer, A., Voigtlaender, P., Kulikov, I., Schlüter, R., Ney, H.: RETURNN: the RWTH extensible training framework for universal recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, pp. 5345–5349. IEEE (2017)

  10. Erhan, D., Szegedy, C., Toshev, A., Anguelov, D.: Scalable object detection using deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, Columbus (2014)

  11. Eskenazi, S., Gomez-Krämer, P., Ogier, J.M.: A comprehensive survey of mostly textual document segmentation algorithms since 2008. Pattern Recognit. 64, 1–14 (2017)

  12. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)

  13. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)

  14. Girshick, R.: Fast R-CNN. In: International Conference on Computer Vision, Santiago (2015)

  15. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, Columbus (2014)

  16. Graves, A., Fernandez, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: International Conference on Machine Learning, Pittsburgh (2006)

  17. Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. In: Advances in Neural Information Processing Systems, Vancouver (2008)

  18. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas (2016)

  19. Iandola, F., Han, S., Moskewicz, M., Ashraf, K.: SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. In: OpenReview submission to ICLR 2017, Toulon (2016)

  20. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis. 116(1), 1–20 (2016)

  21. Likforman-Sulem, L., Zahour, A., Taconet, B.: Text line segmentation of historical documents: a survey. Int. J. Doc. Anal. Recognit. 9(2–4), 123–138 (2007)

  22. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.: SSD: single shot multibox detector. In: European Conference on Computer Vision, Amsterdam (2016)

  23. Liu, Y., Jin, L.: Deep matching prior network: toward tighter multi-oriented text detection. In: IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, vol. 2, p. 8 (2017)

  24. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, Boston (2015)

  25. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)

  26. Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., Xue, X.: Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. (2018)

  27. Messelodi, S., Modena, C.M.: Automatic identification and skew estimation of text lines in real scene images. Pattern Recognit. 32(5), 791–810 (1999)

  28. Mordan, T., Thome, N., Cord, M., Henaff, G.: Deformable part-based fully convolutional network for object detection. In: British Machine Vision Conference (BMVC), London (2017)

  29. Munkres, J.: Algorithms for the assignment and transportation problems. J. Soc. Ind. Appl. Math. 5(1), 32–38 (1957)

  30. Nicolaou, A., Gatos, B.: Handwritten text line segmentation by shredding text into its lines. In: International Conference on Document Analysis and Recognition, Barcelona (2009)

  31. Pham, V., Bluche, T., Kermorvant, C., Louradour, J.: Dropout improves recurrent neural networks for handwriting recognition. In: International Conference on Frontiers in Handwriting Recognition, Crete (2014)

  32. Pinheiro, P., Collobert, R.: From image-level to pixel-level labeling with convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition, Boston (2015)

  33. Pinheiro, P., Lin, T., Collobert, R., Dollár, P.: Learning to refine object segments. In: European Conference on Computer Vision, Amsterdam (2016)

  34. Pletschacher, S., Clausner, C., Antonacopoulos, A.: Europeana Newspapers OCR workflow evaluation. In: Workshop on Historical Document Imaging and Processing, Nancy (2015)

  35. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas (2016)

  36. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, Montreal (2015)

  37. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y

  38. Ryu, J., Koo, H.I., Cho, N.I.: Language-independent text-line extraction algorithm for handwritten documents. IEEE Signal Process. Lett. 21(9), 1115–1119 (2014)

  39. Shi, Z., Setlur, S., Govindaraju, V.: A steerable directional local profile technique for extraction of handwritten Arabic text lines. In: International Conference on Document Analysis and Recognition, Barcelona (2009)

  40. Stafylakis, T., Papavassiliou, V., Katsouros, V., Carayannis, G.: Robust text-line and word segmentation for handwritten document images. In: IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, pp. 3393–3396. IEEE (2008)

  41. Stewart, R., Ermon, S.: Label-free supervision of neural networks with physics and domain knowledge. In: AAAI, San Francisco, pp. 2576–2582 (2017)

  42. Szegedy, C., Reed, S., Erhan, D., Anguelov, D.: Scalable, high-quality object detection. arXiv:1412.1441 (2015)

  43. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: Advances in Neural Information Processing Systems, Barcelona (2016)

  44. Wolf, C., Jolion, J.M.: Object count/area graphs for the evaluation of object detection and segmentation algorithms. Int. J. Doc. Anal. Recognit. 8(4), 280–296 (2006)

  45. Byeon, W., Breuel, T., Raue, F., Liwicki, M.: Scene labeling with LSTM recurrent neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, Boston (2015)

  46. Yin, F., Liu, C.L.: Handwritten Chinese text line segmentation by clustering with distance metric learning. Pattern Recognit. 42(12), 3146–3157 (2009)

  47. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: International Conference on Learning Representations, San Juan (2016)

  48. Zhang, Z., Zhang, C., Shen, W., Yao, C., Liu, W., Bai, X.: Multi-oriented text detection with fully convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas (2016)

Author information


Corresponding author

Correspondence to Bastien Moysset.

Cite this article

Moysset, B., Kermorvant, C. & Wolf, C. Learning to detect, localize and recognize many text objects in document images from few examples. IJDAR 21, 161–175 (2018). https://doi.org/10.1007/s10032-018-0305-2

Keywords

  • Text line detection
  • Neural network
  • Recurrent
  • Regression
  • Local
  • Document analysis