International Journal of Computer Vision, Volume 113, Issue 3, pp 193–207

Label Embedding: A Frugal Baseline for Text Recognition

  • Jose A. Rodriguez-Serrano
  • Albert Gordo
  • Florent Perronnin


The standard approach to recognizing text in images consists of first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by enforcing that matching label-image pairs be closer than non-matching pairs. This method presents several advantages: it does not require ad-hoc or costly pre-/post-processing operations; it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case); it allows for the recognition of never-seen-before words (zero-shot recognition); and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results that are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple-to-compute baseline for text recognition.
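The retrieval view described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the label embedding is a plain L2-normalized character histogram (`embed_label`), the image descriptors are synthetic stand-ins produced by a random linear map `A` rather than Fisher vectors, and the common space is learned with a simple unregularized stochastic update on a structured hinge loss rather than a full Structured SVM solver. Because the label embeddings have unit norm, picking the label with the highest dot product is equivalent to a nearest neighbor search in Euclidean distance.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def embed_label(word):
    # Hypothetical label embedding: L2-normalized character histogram.
    v = np.zeros(len(ALPHABET))
    for ch in word:
        v[ALPHABET.index(ch)] += 1.0
    return v / np.linalg.norm(v)

def train(X, y_idx, Phi, epochs=200, lr=0.1, margin=1.0):
    # X: (n, D) image descriptors; Phi: (K, E) embeddings of the K lexicon words.
    # W maps an image descriptor into the label-embedding space.
    W = np.zeros((X.shape[1], Phi.shape[1]))
    for _ in range(epochs):
        updated = False
        for x, y in zip(X, y_idx):
            scores = Phi @ (W.T @ x)              # compatibility with every label
            violation = margin + scores - scores[y]
            violation[y] = -np.inf                # never pick the true label
            y_bad = int(np.argmax(violation))     # most violating wrong label
            if violation[y_bad] > 0:
                # Pull the image toward its label and away from the wrong one.
                W += lr * np.outer(x, Phi[y] - Phi[y_bad])
                updated = True
        if not updated:                           # all margins satisfied
            break
    return W

def recognize(x, W, lexicon, Phi):
    # Recognition as retrieval: return the best-matching label in the lexicon.
    return lexicon[int(np.argmax(Phi @ (W.T @ x)))]

lexicon = ["cat", "dog", "car", "bus"]
Phi = np.stack([embed_label(w) for w in lexicon])

rng = np.random.default_rng(0)
A = rng.normal(size=(64, len(ALPHABET)))  # stand-in for descriptor extraction
X = Phi @ A.T                             # one synthetic "image" per word

W = train(X, np.arange(len(lexicon)), Phi)
predictions = [recognize(x, W, lexicon, Phi) for x in X]
```

At test time only `recognize` is needed: the cost is one matrix-vector product followed by a search over the lexicon, which is the sense in which recognition reduces to nearest neighbor search. Adding a word to the lexicon only requires appending its embedding as a new row of `Phi`; no retraining of `W` is involved in that step.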


  • Label embedding
  • Scene text recognition
  • Structured learning



This work was partially funded by the French ANR project FIRE-ID.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Jose A. Rodriguez-Serrano (1)
  • Albert Gordo (2)
  • Florent Perronnin (2)
  1. Machine Learning for Services Area, Xerox Research Centre Europe, Meylan, France
  2. Computer Vision Group, Xerox Research Centre Europe, Meylan, France