Reading Text in the Wild with Convolutional Neural Networks

Jaderberg, Max; Simonyan, Karen; Vedaldi, Andrea; Zisserman, Andrew

doi:10.1007/s11263-015-0823-z

Published: 07 May 2015

International Journal of Computer Vision, volume 116, pages 1–20 (2016)

Abstract

In this work we present an end-to-end system for text spotting (localising and recognising text in natural scene images) and text-based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character-classifier-based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human-labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.
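
The stages named in the abstract (proposal generation for recall, filtering for precision, whole-word recognition and ranking) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function `spot_text`, the `Detection` record, and every stub below (`edge_like`, `detector_like`, `keep_wide`, `recognise_stub`) are hypothetical names with made-up boxes and scores, standing in for the real proposal generators, filter and convolutional network.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); an assumed convention


@dataclass(frozen=True)
class Detection:
    box: Box
    word: str
    score: float


def spot_text(
    image,
    proposers: Iterable[Callable[[object], List[Box]]],
    keep: Callable[[object, Box], bool],
    recognise: Callable[[object, Box], Tuple[str, float]],
) -> List[Detection]:
    """Hypothetical outline: propose, filter, recognise whole words, rank."""
    # 1. Pool proposals from complementary generators (aiming at high recall),
    #    dropping exact duplicates while preserving order.
    seen = set()
    proposals: List[Box] = []
    for propose in proposers:
        for box in propose(image):
            if box not in seen:
                seen.add(box)
                proposals.append(box)

    # 2. Cheap filtering stage to discard unlikely boxes (improving precision).
    candidates = [box for box in proposals if keep(image, box)]

    # 3. Recognise the word in each surviving region as a whole,
    #    then rank detections by recognition score.
    detections = []
    for box in candidates:
        word, score = recognise(image, box)
        detections.append(Detection(box, word, score))
    return sorted(detections, key=lambda d: d.score, reverse=True)


# Toy usage with stand-in components; no real image or network is involved.
image = None
edge_like = lambda img: [(0, 0, 40, 12), (50, 5, 30, 10)]
detector_like = lambda img: [(50, 5, 30, 10), (5, 60, 80, 20)]
keep_wide = lambda img, box: box[2] >= 40  # stand-in for a learned filter
lookup = {(0, 0, 40, 12): ("exit", 0.7), (5, 60, 80, 20): ("bakery", 0.9)}
recognise_stub = lambda img, box: lookup[box]  # stand-in for the word CNN

results = spot_text(image, [edge_like, detector_like], keep_wide, recognise_stub)
```

Passing the components in as callables keeps the three stages independent in this sketch, so any one of them could be swapped out without touching the others.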

Acknowledgments

This work was supported by the EPSRC and ERC Grant VisRec No. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research. We thank the BBC and in particular Rob Cooper for access to data and video processing resources.

Author information

Authors and Affiliations

Department of Engineering Science, University of Oxford, Oxford, UK
Max Jaderberg, Karen Simonyan, Andrea Vedaldi & Andrew Zisserman

Corresponding author

Correspondence to Max Jaderberg.

Additional information

Communicated by Cordelia Schmid.

About this article

Cite this article

Jaderberg, M., Simonyan, K., Vedaldi, A. et al. Reading Text in the Wild with Convolutional Neural Networks. Int J Comput Vis 116, 1–20 (2016). https://doi.org/10.1007/s11263-015-0823-z

Received: 01 December 2014
Accepted: 31 March 2015
Published: 07 May 2015
Issue Date: January 2016
DOI: https://doi.org/10.1007/s11263-015-0823-z
