SynthText3D: synthesizing scene text images from 3D virtual worlds

Abstract

With the development of deep neural networks, the demand for large amounts of annotated training data has become a performance bottleneck in many fields of research and application. Image synthesis can generate annotated images automatically and at no labeling cost, and it has attracted increasing attention recently. In this paper, we propose to synthesize scene text images from 3D virtual worlds, which provide precise descriptions of scenes, editable illumination and visibility, and realistic physics. Unlike previous methods, which paste rendered text onto static 2D images, our method renders the 3D virtual scene and the text instances as a whole. In this way, real-world variations, including complex perspective transformations, varied illumination, and occlusions, can be realized in our synthesized scene text images. Moreover, the same text instances can be captured from various viewpoints by randomly moving and rotating the virtual camera, which acts as the human eye. Experiments on standard scene text detection benchmarks using the generated synthetic data demonstrate the effectiveness and superiority of the proposed method.
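The viewpoint idea in the abstract can be illustrated with a small geometric sketch. This is not the authors' rendering pipeline, which the abstract does not detail; it only assumes a simple pinhole camera and shows how the four 3D corners of one planar text instance map to different 2D quadrilateral annotations when the camera is randomly moved and rotated. All names (`rotation_yaw_pitch`, `project_point`, `random_views`, `TEXT_QUAD`) and all numeric ranges are hypothetical choices made for illustration.

```python
import math
import random


def rotation_yaw_pitch(yaw, pitch):
    """Return R = R_x(pitch) * R_y(yaw) as a 3x3 list; angles in radians."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    r_y = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    r_x = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    return [[sum(r_x[i][k] * r_y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def project_point(point, cam_pos, rot, focal=500.0, cx=320.0, cy=240.0):
    """Project a world point to pixel coordinates with a pinhole camera.

    `rot` maps world-frame offsets into the camera frame; the camera looks
    along +z. Returns None when the point is not in front of the camera.
    """
    d = [point[i] - cam_pos[i] for i in range(3)]
    x, y, z = (sum(rot[i][k] * d[k] for k in range(3)) for i in range(3))
    if z <= 0:
        return None
    return (focal * x / z + cx, focal * y / z + cy)


def random_views(quad, n_views, seed=0):
    """Sample camera poses and return the projected quad for each pose."""
    rng = random.Random(seed)
    views = []
    while len(views) < n_views:
        # Assumed ranges: small translations, yaw within 30 and pitch within 15 degrees.
        cam_pos = (rng.uniform(-1, 1), rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))
        rot = rotation_yaw_pitch(math.radians(rng.uniform(-30, 30)),
                                 math.radians(rng.uniform(-15, 15)))
        pts = [project_point(p, cam_pos, rot) for p in quad]
        if all(p is not None for p in pts):
            views.append(pts)
    return views


# A hypothetical text plane, 2 units wide and 0.5 units tall, 4 units ahead.
TEXT_QUAD = [(-1.0, -0.25, 4.0), (1.0, -0.25, 4.0), (1.0, 0.25, 4.0), (-1.0, 0.25, 4.0)]

if __name__ == "__main__":
    for quad_2d in random_views(TEXT_QUAD, n_views=3):
        print([(round(u, 1), round(v, 1)) for u, v in quad_2d])
```

Because the annotation is derived from the same 3D corners in every view, each sampled pose yields a consistent label for the same text instance. A full 3D engine would additionally handle illumination and occlusion, which this sketch ignores.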

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61733007). Xiang Bai was supported by the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team (Grant No. 2017QYTD08).

Author information

Correspondence to Xiang Bai.

About this article

Cite this article

Liao, M., Song, B., Long, S. et al. SynthText3D: synthesizing scene text images from 3D virtual worlds. Sci. China Inf. Sci. 63, 120105 (2020). https://doi.org/10.1007/s11432-019-2737-0

Keywords

  • optical character recognition (OCR)
  • synthetic data
  • scene text detection
  • 3D
  • deep learning