Reading Text in the Wild with Convolutional Neural Networks

International Journal of Computer Vision

Abstract

In this work we present an end-to-end system for text spotting—localising and recognising text in natural scene images—and text based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character classifier based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.
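
The stages named in the abstract (pooling proposals from complementary generators, filtering them, then recognising and ranking whole-word regions) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: `spot_text`, `Detection`, the callable arguments, and the `keep_threshold` value are all hypothetical names and choices introduced here, and the actual proposal generators, filter, and recognition networks are described in the full article.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); a hypothetical box format


@dataclass
class Detection:
    box: Box
    word: str
    score: float


def spot_text(
    image: Any,
    proposers: Sequence[Callable[[Any], Iterable[Box]]],
    text_score: Callable[[Any, Box], float],
    recognise: Callable[[Any, Box], Tuple[str, float]],
    keep_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Detection]:
    """Toy text-spotting pipeline: propose, filter, recognise, rank."""
    # 1. High recall: pool boxes from every proposal generator,
    #    dropping exact duplicates while preserving order.
    proposals: List[Box] = []
    seen = set()
    for propose in proposers:
        for box in propose(image):
            if box not in seen:
                seen.add(box)
                proposals.append(box)

    # 2. Precision: keep only boxes the fast filter scores as likely text.
    kept = [b for b in proposals if text_score(image, b) >= keep_threshold]

    # 3. Recognition: read each surviving region as a whole word.
    detections = []
    for box in kept:
        word, score = recognise(image, box)
        detections.append(Detection(box, word, score))

    # 4. Ranking: most confident recognitions first.
    detections.sort(key=lambda d: d.score, reverse=True)
    return detections
```

Passing the generators, filter, and recogniser as plain callables keeps the sketch independent of any particular model; each stage can be swapped for a stub when testing the control flow.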

Acknowledgments

This work was supported by the EPSRC and ERC Grant VisRec No. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research. We thank the BBC and in particular Rob Cooper for access to data and video processing resources.

Author information

Corresponding author

Correspondence to Max Jaderberg.

Additional information

Communicated by Cordelia Schmid.

About this article

Cite this article

Jaderberg, M., Simonyan, K., Vedaldi, A. et al. Reading Text in the Wild with Convolutional Neural Networks. Int J Comput Vis 116, 1–20 (2016). https://doi.org/10.1007/s11263-015-0823-z

  • DOI: https://doi.org/10.1007/s11263-015-0823-z
