Label Embedding: A Frugal Baseline for Text Recognition

Rodriguez-Serrano, Jose A.; Gordo, Albert; Perronnin, Florent

doi:10.1007/s11263-014-0793-6

Label Embedding: A Frugal Baseline for Text Recognition

Published: 23 December 2014

Volume 113, pages 193–207, (2015)
Cite this article

International Journal of Computer Vision Aims and scope Submit manuscript

Jose A. Rodriguez-Serrano¹,
Albert Gordo² &
Florent Perronnin²

1341 Accesses
67 Citations
9 Altmetric
Explore all metrics

Abstract

The standard approach to recognizing text in images consists in first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by enforcing matching label-image pairs to be closer than non-matching pairs. This method presents several advantages: it does not require ad-hoc or costly pre-/post-processing operations, it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case), it allows for the recognition of never-seen-before words (zero-shot recognition) and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results which are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple to compute baseline for text recognition.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Notes

An alternative upper-bound is the slack-rescaled hinge loss \(\max _{y \in \mathcal {Y}} \Delta (y_n,y) (1 - F(x_n,y_n;w) + F(x_n,y;w))\). Note that in the 0/1 loss case, both are equivalent. See (Nowozin and Lampert (2011), p.120) for more details.
Marginalization can be done “early”, by constructing a string representation that includes all possible symbols in that position (weighted by the size of the symbols’ alphabet), or “late”, by explicitly generating a new set of queries that match the query with the wildcard and averaging the similarities of those queries with the image. This is equivalent to generating the new set of queries, averaging them, and then computing the similarity between that average query and the image. The subtle differences between “early” and “late” marginalization are only due to the way the string representation is normalized. We focus on late marginalization since it obtained slightly better results than early marginalization.

References

Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2013). Handwritten word spotting with corrected attributes. In ICCV.
Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12), 2552–2566.
Bai, B., Weston, J., Grangier, D., Collobert, R., Chapelle, O., & Weinberger, K. (2009). Supervised semantic indexing. In CIKM.
Bazzi, I., Schwartz, R., & Makhoul, J. (1999). An omnifont open-vocabulary ocr system for english and arabic. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(6), 495–504.
Article Google Scholar
Bishop, C. (1995) Training with noise is equivalent to Tikhonov regularization. Neural Computation.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
MATH Google Scholar
Bissacco, A., Cummins, M., Netzer, Y., & Neven, H. (2013) Photoocr: Reading text in uncontrolled conditions. In ICCV.
Brakensiek, A., & Rigoll, G. (2004). Handwritten address recognition using hidden markov models. Reading and Learning (pp. 103–122). Berlin: Springer.
Google Scholar
Brakensiek, A., Rottland, J., Kosmala, A., & Rigoll, G. (2000). Off-line handwriting recognition using various hybrid modeling techniques and character n-grams. In ICFHR.
Breuel, T. M. (2001). Segmentation of handprinted letter strings using a dynamic programming algorithm. In ICDAR.
Bunke, H., Roth, M., & Schukat-Talamazzini, E. G. (1995). Off-line cursive handwriting recognition using hidden Markov models. Pattern Recognition, 28(9), 1399–1413.
Article Google Scholar
Cash, G. L., & Hatamian, M. (1987). Optical character recognition by the method of moments. Computer Vision, Graphics, and Image Processing, 39(3), 291–310.
Article Google Scholar
Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: an evaluation of recent feature encoding methods. In BMVC.
Chen, M. Y., Kundu, A., & Zhou, J. (1994). Off-line handwritten word recognition using a hidden Markov model type stochastic network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 481–496. doi:10.1109/34.291449.
Article Google Scholar
Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004) Visual categorization with bags of keypoints. In ECCV SLCV workshop.
Dutta, S., Sankaran, N., Sankar, K. P., & Jawahar, C. V. (2012). Robust recognition of degraded documents using character n-grams. In DAS.
El-Yacoubi, A., Sabourin, R., Suen, C. Y., & Gilloux, M. (1999). An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8), 752–760.
Article Google Scholar
Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv.
Jain, R. & Jawahar, C. (2010). Towards more effective distance functions for word image matching. In DAS (pp. 363–370). ACM.
Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1704–1716.
Article Google Scholar
Joachims, T. (2002). Optimizing search engines using clickthrough data. In SIGKDD.
Kedem, D., Tyree, S., Sha, F., Lanckriet, G. R., & Weinberger, K. Q. (2012). Non-linear metric learning. In NIPS.
Knerr, S., Augustin, E., Baret, O., & Price, D. (1998). Hidden Markov model based word recognition and its application to legal amount reading on French checks. Computer Vision and Image Understanding, 70(3), 404–419.
Article Google Scholar
Koerich, A. L., Sabourin, R., & Suen, C. Y. (2003). Large vocabulary off-line handwriting recognition: A survey. Pattern Analysis and Applications, 6(2), 97–121.
Article MathSciNet Google Scholar
Larochelle, H., Erhan, D., & Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
LeCun, Y., Bottou, L., Orr, G., & Muller, K. (1998). Efficient backprop. In G. Orr & K. Muller (Eds.), Neural networks: Tricks of the trade. New York: Springer.
Google Scholar
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. J. Mach. Learn. Res., 2, 419–444.
MATH Google Scholar
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Article Google Scholar
Madhvanath, S., & Govindaraju, V. (2001). The role of holistic paradigms in handwritten word recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 149–164.
Article Google Scholar
Marti, U. V., & Bunke, H. (2001). Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. International Journal of Pattern Recognition and Artificial Intelligence, 15, 65–90.
Article Google Scholar
Mensink, T., Verbeek, J., Perronnin, F., & Csurka, G. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In NIPS.
Mishra, A., Alahari, K., & Jawahar, C. V. (2012). Scene text recognition using higher order language priors. In BMVC.
Mishra, A., Alahari, K., & Jawahar, C. V. (2012). Top-down and bottom-up cues for scene text recognition. In CVPR.
Mohamed, M. A., & Gader, P. D. (1996). Handwritten word recognition using segmentation-free hidden Markov modeling and segmentation-based dynamic programming techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(5), 548–554. doi:10.1109/34.494644.
Article Google Scholar
Mori, S., Nishida, H., & Yamada, H. (1999). Optical character recognition. New York: Wiley.
Google Scholar
Nagy, G. (2000). Twenty years of document image analysis in pami. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 38–62.
Article Google Scholar
Neumann, L., & Matas, J. (2012). Real-time scene text localization and recognition. In CVPR.
Novikova, T., Barinova, O., Kohli, P., & Lempitsky, V. (2012). Large-lexicon attribute-consistent text recognition in natural images. In ECCV.
Nowozin, S., & Lampert, C. (2011). Structured learning and prediction in computer vision. Foundations and trends in computer graphics and vision.
Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.
Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. (2010). Large-scale image retrieval with compressed Fisher vectors. In CVPR.
Perronnin, F., Sánchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In CVPR.
Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In ECCV.
Rath, T. M., & Manmatha, R. (2003). Word image matching using dynamic time warping. In CVPR.
Rodríguez-Serrano, J. A., & Perronnin, F. (2012). A model-based sequence similarity with application to handwritten word spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2108–2120.
Article Google Scholar
Rodriguez-Serrano, J. A., & Perronnin, F. (2013). Label embedding for text recognition. In BMVC.
Rodríguez-Serrano, J. A., Sandhawalia, H., Bala, R., Perronnin, F., & Saunders, C. (2012). Data-driven vehicle identification by image matching. In ECCV Workshop on Computer Vision for Vehicle Technology.
Sankar, K., Manmatha, R., Jawahar, C. V., & Manmatha, R. (2010). Nearest neighbor based collection ocr. In DAS.
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Non-linear component analysis as a kernel eigenvalue problem. In Neural Computation.
Senior, A. W., & Robinson, A. J. (1998). An off-line cursive handwriting recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 309–321. doi:10.1109/34.667887.
Article Google Scholar
Vinciarelli, A., Bengio, S., & Bunke, H. (2004). Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 709–720.
Article Google Scholar
Wang, K., Babenko, B., & Belongie, S. (2011). End-to-end scene text recognition. In ICCV.
Wang, K., & Belongie, S. (2010). Word spotting in the wild. In ECCV.
Weston, J., Bengio, S., & Usunier, N. (2010). Learning to rank with joint word-image embeddings. ECML: Large scale image annotation.
Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In NIPS.
Yao, C., Bai, X., Shi, B., & Liu, W. (2014). Strokelets: A learned multi-scale representation for scene text recognition. In CVPR.
Zimmermann, M., Chappelier, J. C., & Bunke, H. (2006). Offline grammar-based recognition of handwritten sentences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), 818–821.
Article Google Scholar

Download references

Acknowledgments

This work was partially funded by the French ANR project FIRE-ID.

Author information

Authors and Affiliations

Machine Learning for Services Area, Xerox Research Centre Europe, Meylan, France
Jose A. Rodriguez-Serrano
Computer Vision Group, Xerox Research Centre Europe, Meylan, France
Albert Gordo & Florent Perronnin

Authors

Jose A. Rodriguez-Serrano
View author publications
You can also search for this author in PubMed Google Scholar
Albert Gordo
View author publications
You can also search for this author in PubMed Google Scholar
Florent Perronnin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jose A. Rodriguez-Serrano.

Additional information

Communicated by Tilo Burghardt, Majid Mirmehdi, Walterio Mayol-Cuevas and Dima Damen.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Rodriguez-Serrano, J.A., Gordo, A. & Perronnin, F. Label Embedding: A Frugal Baseline for Text Recognition. Int J Comput Vis 113, 193–207 (2015). https://doi.org/10.1007/s11263-014-0793-6

Download citation

Received: 17 May 2014
Accepted: 03 December 2014
Published: 23 December 2014
Issue Date: July 2015
DOI: https://doi.org/10.1007/s11263-014-0793-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Label Embedding: A Frugal Baseline for Text Recognition

Abstract

Access this article

Similar content being viewed by others

Microsoft COCO: Common Objects in Context

ImageNet Large Scale Visual Recognition Challenge

Learning to Prompt for Vision-Language Models

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Label Embedding: A Frugal Baseline for Text Recognition

Abstract

Access this article

Similar content being viewed by others

Microsoft COCO: Common Objects in Context

ImageNet Large Scale Visual Recognition Challenge

Learning to Prompt for Vision-Language Models

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation