Abstract
With the development of deep neural networks, the demand for large amounts of annotated training data has become a performance bottleneck in many fields of research and application. Image synthesis can generate annotated images automatically and at no cost, and has therefore gained increasing attention in recent years. In this paper, we propose to synthesize scene text images from 3D virtual worlds, which provide precise scene descriptions, editable illumination and visibility, and realistic physics. Unlike previous methods, which paste rendered text onto static 2D images, our method renders the 3D virtual scene and the text instances as a whole. In this way, real-world variations, including complex perspective transformations, various illuminations, and occlusions, can be realized in the synthesized scene text images. Moreover, the same text instances can be captured from various viewpoints by randomly moving and rotating the virtual camera, which acts like a human observer. Experiments on standard scene text detection benchmarks using the generated synthetic data demonstrate the effectiveness and superiority of the proposed method.
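As a rough illustration of the camera-sampling idea described in the abstract, the sketch below randomly moves and rotates a virtual camera inside a game-engine scene and saves a rendered image per pose, using a UnrealCV-style Python client. This is a minimal sketch under assumptions: the command strings, pose ranges, and output naming are illustrative and are not the authors' released pipeline.

```python
# Hedged sketch: capture the same 3D scene (with embedded text instances)
# from several random viewpoints via a UnrealCV-style client.
# Pose ranges and file naming are illustrative assumptions.
import random
from unrealcv import client  # assumes the unrealcv Python package and a running server


def capture_random_viewpoints(n_views, out_prefix="view"):
    """Randomly move and rotate camera 0, saving one lit render per pose."""
    client.connect()
    if not client.isconnected():
        raise RuntimeError("UnrealCV server not reachable")
    for i in range(n_views):
        # Sample a camera position and orientation around an assumed region of interest.
        x, y, z = (random.uniform(-200.0, 200.0),
                   random.uniform(-200.0, 200.0),
                   random.uniform(100.0, 300.0))
        pitch, yaw, roll = random.uniform(-20.0, 20.0), random.uniform(0.0, 360.0), 0.0
        client.request(f"vset /camera/0/location {x} {y} {z}")
        client.request(f"vset /camera/0/rotation {pitch} {yaw} {roll}")
        # Render the scene: text and background are rasterized together,
        # so perspective, lighting, and occlusion are consistent in each image.
        client.request(f"vget /camera/0/lit {out_prefix}_{i}.png")
    client.disconnect()


if __name__ == "__main__":
    capture_random_viewpoints(5)
```

In practice, the sampled poses would be constrained to keep the text regions visible, and the corresponding annotations would be derived from the engine's known 3D geometry rather than estimated from the rendered image.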
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61733007). Xiang BAI was supported by the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team (Grant No. 2017QYTD08).
Cite this article
Liao, M., Song, B., Long, S. et al. SynthText3D: synthesizing scene text images from 3D virtual worlds. Sci. China Inf. Sci. 63, 120105 (2020). https://doi.org/10.1007/s11432-019-2737-0
Keywords
- optical character recognition (OCR)
- synthetic data
- scene text detection
- 3D
- deep learning