Abstract
The standard approach to recognizing text in images consists of first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by enforcing matching label-image pairs to be closer than non-matching pairs. This method presents several advantages: it does not require ad-hoc or costly pre-/post-processing operations; it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case); it allows for the recognition of never-seen-before words (zero-shot recognition); and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results that are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple-to-compute baseline for text recognition.
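The retrieval view described above can be illustrated with a minimal sketch. It assumes a bilinear compatibility score between an image descriptor and a label embedding. The character-histogram label embedding, the matrix `W`, and all function names are hypothetical stand-ins chosen for illustration, not the embedding or the learned parameters used in the paper.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def embed_label(word):
    """Hypothetical label embedding: L2-normalized character histogram."""
    v = np.zeros(len(ALPHABET))
    for ch in word.lower():
        if ch in CHAR_INDEX:            # characters outside the alphabet are ignored
            v[CHAR_INDEX[ch]] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def recognize(image_descriptor, lexicon, W):
    """Return the lexicon word whose embedding scores highest against the image."""
    projected = W.T @ image_descriptor  # map the image into the label space
    label_matrix = np.stack([embed_label(w) for w in lexicon])
    scores = label_matrix @ projected   # one dot product per candidate word
    return lexicon[int(np.argmax(scores))]
```

Because the label embeddings in this sketch have unit norm, ranking candidates by dot product gives the same order as ranking them by Euclidean distance to the projected image, so the search is a nearest neighbor search over the lexicon.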
Notes
An alternative upper bound is the slack-rescaled hinge loss \(\max _{y \in \mathcal {Y}} \Delta (y_n,y) (1 - F(x_n,y_n;w) + F(x_n,y;w))\). Note that in the 0/1 loss case, the two bounds are equivalent. See Nowozin and Lampert (2011, p. 120) for more details.
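The equivalence in the 0/1 case can be checked term by term. The short derivation below assumes that the other bound referred to in this note is the usual margin-rescaled hinge loss; that form is an assumption here, since the corresponding equation is not reproduced in this section.

```latex
\begin{align*}
\ell_{\mathrm{margin}} &= \max_{y \in \mathcal{Y}} \; \Delta(y_n, y) - F(x_n, y_n; w) + F(x_n, y; w), \\
\ell_{\mathrm{slack}}  &= \max_{y \in \mathcal{Y}} \; \Delta(y_n, y)\,\bigl(1 - F(x_n, y_n; w) + F(x_n, y; w)\bigr).
\end{align*}
% 0/1 loss: \Delta(y_n, y) = 0 if y = y_n, and 1 otherwise.
% Case y = y_n:     both arguments of the max equal 0.
% Case y \neq y_n:  both arguments equal 1 - F(x_n, y_n; w) + F(x_n, y; w).
% The arguments agree for every y, so the two maxima coincide.
```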
Marginalization can be done “early”, by constructing a string representation that includes all possible symbols in that position (weighted by the size of the symbols’ alphabet), or “late”, by explicitly generating a new set of queries that match the query with the wildcard and averaging the similarities of those queries with the image. Late marginalization is equivalent to generating the new set of queries, averaging them, and then computing the similarity between that average query and the image. The subtle differences between “early” and “late” marginalization are due only to the way the string representation is normalized. We focus on late marginalization since it obtained slightly better results than early marginalization.
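The equivalence stated in this note follows from the linearity of the similarity. Here is a minimal numerical sketch, assuming the similarity is a dot product between an image vector and a query embedding. The toy one-hot-per-position `embed` function, the small alphabet, and the `?` wildcard symbol are hypothetical and used only for illustration.

```python
import numpy as np

ALPHABET = "ABC123"

def embed(word):
    """Hypothetical embedding: one one-hot block per character position."""
    v = np.zeros(len(word) * len(ALPHABET))
    for pos, ch in enumerate(word):
        v[pos * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return v

def expand_wildcard(query, wildcard="?"):
    """Generate every query obtained by filling the first wildcard with a symbol."""
    i = query.index(wildcard)
    return [query[:i] + s + query[i + 1:] for s in ALPHABET]

rng = np.random.default_rng(0)
image = rng.normal(size=3 * len(ALPHABET))  # stand-in for a projected image

queries = expand_wildcard("A?1")
embeddings = np.stack([embed(q) for q in queries])

# Late marginalization: average the similarities of the expanded queries.
avg_of_similarities = np.mean(embeddings @ image)
# Equivalent form: average the query embeddings, then compute one similarity.
similarity_of_avg = embeddings.mean(axis=0) @ image
```

The two quantities agree up to floating-point error as long as the averaged embedding is not re-normalized, which is consistent with the remark that early and late marginalization differ only through normalization.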
References
Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2013). Handwritten word spotting with corrected attributes. In ICCV.
Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12), 2552–2566.
Bai, B., Weston, J., Grangier, D., Collobert, R., Chapelle, O., & Weinberger, K. (2009). Supervised semantic indexing. In CIKM.
Bazzi, I., Schwartz, R., & Makhoul, J. (1999). An omnifont open-vocabulary OCR system for English and Arabic. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(6), 495–504.
Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
Bissacco, A., Cummins, M., Netzer, Y., & Neven, H. (2013). PhotoOCR: Reading text in uncontrolled conditions. In ICCV.
Brakensiek, A., & Rigoll, G. (2004). Handwritten address recognition using hidden Markov models. Reading and Learning (pp. 103–122). Berlin: Springer.
Brakensiek, A., Rottland, J., Kosmala, A., & Rigoll, G. (2000). Off-line handwriting recognition using various hybrid modeling techniques and character n-grams. In ICFHR.
Breuel, T. M. (2001). Segmentation of handprinted letter strings using a dynamic programming algorithm. In ICDAR.
Bunke, H., Roth, M., & Schukat-Talamazzini, E. G. (1995). Off-line cursive handwriting recognition using hidden Markov models. Pattern Recognition, 28(9), 1399–1413.
Cash, G. L., & Hatamian, M. (1987). Optical character recognition by the method of moments. Computer Vision, Graphics, and Image Processing, 39(3), 291–310.
Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: an evaluation of recent feature encoding methods. In BMVC.
Chen, M. Y., Kundu, A., & Zhou, J. (1994). Off-line handwritten word recognition using a hidden Markov model type stochastic network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 481–496. doi:10.1109/34.291449.
Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV SLCV workshop.
Dutta, S., Sankaran, N., Sankar, K. P., & Jawahar, C. V. (2012). Robust recognition of degraded documents using character n-grams. In DAS.
El-Yacoubi, A., Sabourin, R., Suen, C. Y., & Gilloux, M. (1999). An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8), 752–760.
Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv.
Jain, R., & Jawahar, C. (2010). Towards more effective distance functions for word image matching. In DAS (pp. 363–370). ACM.
Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1704–1716.
Joachims, T. (2002). Optimizing search engines using clickthrough data. In SIGKDD.
Kedem, D., Tyree, S., Sha, F., Lanckriet, G. R., & Weinberger, K. Q. (2012). Non-linear metric learning. In NIPS.
Knerr, S., Augustin, E., Baret, O., & Price, D. (1998). Hidden Markov model based word recognition and its application to legal amount reading on French checks. Computer Vision and Image Understanding, 70(3), 404–419.
Koerich, A. L., Sabourin, R., & Suen, C. Y. (2003). Large vocabulary off-line handwriting recognition: A survey. Pattern Analysis and Applications, 6(2), 97–121.
Larochelle, H., Erhan, D., & Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
LeCun, Y., Bottou, L., Orr, G., & Muller, K. (1998). Efficient backprop. In G. Orr & K. Muller (Eds.), Neural networks: Tricks of the trade. New York: Springer.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419–444.
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Madhvanath, S., & Govindaraju, V. (2001). The role of holistic paradigms in handwritten word recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 149–164.
Marti, U. V., & Bunke, H. (2001). Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. International Journal of Pattern Recognition and Artificial Intelligence, 15, 65–90.
Mensink, T., Verbeek, J., Perronnin, F., & Csurka, G. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In NIPS.
Mishra, A., Alahari, K., & Jawahar, C. V. (2012). Scene text recognition using higher order language priors. In BMVC.
Mishra, A., Alahari, K., & Jawahar, C. V. (2012). Top-down and bottom-up cues for scene text recognition. In CVPR.
Mohamed, M. A., & Gader, P. D. (1996). Handwritten word recognition using segmentation-free hidden Markov modeling and segmentation-based dynamic programming techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(5), 548–554. doi:10.1109/34.494644.
Mori, S., Nishida, H., & Yamada, H. (1999). Optical character recognition. New York: Wiley.
Nagy, G. (2000). Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 38–62.
Neumann, L., & Matas, J. (2012). Real-time scene text localization and recognition. In CVPR.
Novikova, T., Barinova, O., Kohli, P., & Lempitsky, V. (2012). Large-lexicon attribute-consistent text recognition in natural images. In ECCV.
Nowozin, S., & Lampert, C. (2011). Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision.
Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.
Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. (2010). Large-scale image retrieval with compressed Fisher vectors. In CVPR.
Perronnin, F., Sánchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In CVPR.
Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In ECCV.
Rath, T. M., & Manmatha, R. (2003). Word image matching using dynamic time warping. In CVPR.
Rodríguez-Serrano, J. A., & Perronnin, F. (2012). A model-based sequence similarity with application to handwritten word spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2108–2120.
Rodríguez-Serrano, J. A., & Perronnin, F. (2013). Label embedding for text recognition. In BMVC.
Rodríguez-Serrano, J. A., Sandhawalia, H., Bala, R., Perronnin, F., & Saunders, C. (2012). Data-driven vehicle identification by image matching. In ECCV Workshop on Computer Vision for Vehicle Technology.
Sankar, K., Jawahar, C. V., & Manmatha, R. (2010). Nearest neighbor based collection OCR. In DAS.
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Non-linear component analysis as a kernel eigenvalue problem. Neural Computation.
Senior, A. W., & Robinson, A. J. (1998). An off-line cursive handwriting recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 309–321. doi:10.1109/34.667887.
Vinciarelli, A., Bengio, S., & Bunke, H. (2004). Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 709–720.
Wang, K., Babenko, B., & Belongie, S. (2011). End-to-end scene text recognition. In ICCV.
Wang, K., & Belongie, S. (2010). Word spotting in the wild. In ECCV.
Weston, J., Bengio, S., & Usunier, N. (2010). Large scale image annotation: Learning to rank with joint word-image embeddings. In ECML.
Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In NIPS.
Yao, C., Bai, X., Shi, B., & Liu, W. (2014). Strokelets: A learned multi-scale representation for scene text recognition. In CVPR.
Zimmermann, M., Chappelier, J. C., & Bunke, H. (2006). Offline grammar-based recognition of handwritten sentences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), 818–821.
Acknowledgments
This work was partially funded by the French ANR project FIRE-ID.
Additional information
Communicated by Tilo Burghardt, Majid Mirmehdi, Walterio Mayol-Cuevas and Dima Damen.
About this article
Cite this article
Rodriguez-Serrano, J.A., Gordo, A. & Perronnin, F. Label Embedding: A Frugal Baseline for Text Recognition. Int J Comput Vis 113, 193–207 (2015). https://doi.org/10.1007/s11263-014-0793-6
DOI: https://doi.org/10.1007/s11263-014-0793-6