
International Journal of Computer Vision, Volume 116, Issue 1, pp 1–20

Reading Text in the Wild with Convolutional Neural Networks

  • Max Jaderberg
  • Karen Simonyan
  • Andrea Vedaldi
  • Andrew Zisserman

Abstract

In this work we present an end-to-end system for text spotting (localising and recognising text in natural scene images) and text-based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character-classifier-based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human-labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.
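
The stages named in the abstract (proposal generation for recall, filtering for precision, whole-word recognition, ranking) can be read as a simple composition of functions. The following is only a minimal sketch of that control flow, not the authors' implementation: the names `spot_text`, `Detection`, `keep_score`, `recognise` and the threshold value are all hypothetical, and the proposal generators, filter and recogniser are passed in as plain callables rather than real models.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); an assumed convention


@dataclass
class Detection:
    box: Box
    word: str
    score: float


def spot_text(
    image: Any,
    proposers: Iterable[Callable[[Any], Iterable[Box]]],
    keep_score: Callable[[Any, Box], float],
    recognise: Callable[[Any, Box], Tuple[str, float]],
    keep_threshold: float = 0.5,  # hypothetical value, for illustration only
) -> List[Detection]:
    # 1. Recall: take the union of boxes from complementary proposal generators.
    proposals: List[Box] = []
    seen = set()
    for propose in proposers:
        for box in propose(image):
            if box not in seen:
                seen.add(box)
                proposals.append(box)

    # 2. Precision: a cheap filter discards proposals unlikely to contain text.
    kept = [box for box in proposals if keep_score(image, box) >= keep_threshold]

    # 3. Recognition: read the whole proposal region as one word.
    detections = [Detection(box, *recognise(image, box)) for box in kept]

    # 4. Ranking: order detections by recognition score, best first.
    return sorted(detections, key=lambda d: d.score, reverse=True)
```

In this sketch each stage can be swapped independently, for example by supplying two different proposal functions in `proposers` and a trained classifier as `keep_score`.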

Keywords

Text spotting, Text recognition, Text detection, Deep learning, Convolutional neural networks, Synthetic data, Text retrieval

Notes

Acknowledgments

This work was supported by the EPSRC and ERC Grant VisRec No. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research. We thank the BBC and in particular Rob Cooper for access to data and video processing resources.

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Max Jaderberg (1)
  • Karen Simonyan (1)
  • Andrea Vedaldi (1)
  • Andrew Zisserman (1)
  1. Department of Engineering Science, University of Oxford, Oxford, UK
