
SoftCTC—semi-supervised learning for text recognition using soft pseudo-labels

  • Special Issue Paper
  • International Journal on Document Analysis and Recognition (IJDAR)

Abstract

This paper explores semi-supervised training for sequence tasks, such as optical character recognition or automatic speech recognition. We propose a novel loss function—SoftCTC—an extension of CTC that allows considering multiple transcription variants at the same time. This makes it possible to omit the confidence-based filtering step that is otherwise a crucial component of pseudo-labeling approaches to semi-supervised learning. We demonstrate the effectiveness of our method on a challenging handwriting recognition task and conclude that SoftCTC matches the performance of a finely tuned filtering-based pipeline. We also evaluate SoftCTC in terms of computational efficiency, concluding that it is significantly more efficient than a naïve CTC-based approach to training on multiple transcription variants, and we make our GPU implementation public.

Notes

  1. In our training setup, we consider on average ca. \(10^{85}\) possible transcription variants for each text line; for comparison, the respected Eddington number estimates the number of protons in the observable universe at about \(10^{80}\).

  2. https://github.com/DCGM/SoftCTC.

  3. In the HMM literature, the issue is avoided by not including \(\mathbf{q}_t\) in \(\boldsymbol{\beta }_t\), at the cost of breaking the symmetry of the recursive formulations of \(\boldsymbol{\alpha }_t\) and \(\boldsymbol{\beta }_t\) and of their initial values \(\boldsymbol{\alpha }_1\) and \(\boldsymbol{\beta }_T\) [35] (a sketch of the standard recursions follows these notes).

  4. More precisely, as the loss is the log-likelihood of the transcription given the input, weighting the variants requires exponentiating the losses, computing the weighted sum, and then taking the logarithm of the result, i.e., performing the weighting in the probability domain (a code sketch follows these notes).

  5. This is not a necessary condition, but any acyclic automaton can be expressed as a triangular matrix by ordering its states topologically, which makes it easy to check for possible cycles (a code sketch follows these notes).

  6. They could even be encoded as completely disjoint chains, but we prefer this presentation as it relates more directly to the encoding of confusion networks.

  7. The prefix search does not take all possible paths corresponding to each transcription into account, so the actual posterior probabilities are higher. In practice, the amount of probability mass left out is negligible.

  8. The full implementation is publicly available in the pero-ocr GitHub repository at https://github.com/DCGM/pero-ocr.

  9. The whole process is implemented in the SoftCTC GitHub repository, specifically in https://github.com/DCGM/SoftCTC/blob/main/soft_ctc/models/connections.py.

  10. This is very similar to \(\epsilon \)-closures in the determinization of finite state automata (a code sketch follows these notes).

  11. The exception is a lattice representing a line that may be entirely empty. This can, however, be easily addressed post hoc by constructing a direct connection from the first blank state to the last one.

  12. https://www.ucl.ac.uk/bentham-project.

  13. From the PyTorch torchvision package, torchvision.models.vgg16.

  14. To compare the results as fairly as possible, we took the seed model directly from AT-ST [8].

  15. torch.nn.CTCLoss.
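
The following sketches expand on some of the notes above. First, for note 3: the standard forward–backward recursions in generic HMM notation (transition probabilities \(a_{ij}\), emission probabilities \(b_j(o_t)\), initial distribution \(\pi \)), as given by Rabiner [35]; this is a reference sketch, not the paper's vector notation. The asymmetry the note mentions is visible here: \(\alpha _t\) includes the emission at its own time step while \(\beta _t\) does not, and the initializations differ accordingly.

    \begin{align*}
      \alpha_1(i) &= \pi_i \, b_i(o_1), &
      \alpha_{t+1}(j) &= \Bigl[\sum_i \alpha_t(i) \, a_{ij}\Bigr] b_j(o_{t+1}), \\
      \beta_T(i) &= 1, &
      \beta_t(i) &= \sum_j a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j).
    \end{align*}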
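
For note 4: a minimal PyTorch sketch of the naïve probability-domain weighting described there (the baseline being contrasted, not SoftCTC itself); the function name and the form of the weights are our own illustration. Since torch.nn.CTCLoss returns the negative log-likelihood of a transcription, the weighted sum of variant likelihoods can be computed stably with logsumexp.

    import torch

    ctc = torch.nn.CTCLoss(reduction="none", zero_infinity=True)

    def weighted_variant_nll(log_probs, input_lengths, variants, weights):
        # log_probs: (T, N, C) log-softmax outputs of the network.
        # variants: list of (targets, target_lengths), one pair per variant.
        # weights: (V,) variant weights summing to one.
        # CTCLoss yields -log p(variant | input) per sample, shape (N,).
        nll = torch.stack([ctc(log_probs, t, input_lengths, tl)
                           for t, tl in variants])  # (V, N)
        # -log sum_v w_v p_v = -logsumexp_v(log w_v + log p_v)
        return -torch.logsumexp(weights.log().unsqueeze(1) - nll, dim=0)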
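
For note 5: a small NumPy sketch of the acyclicity check (our illustration, not the paper's code). If a topological ordering of the states exists, the graph is acyclic, and permuting the adjacency matrix by that ordering makes it strictly triangular.

    import numpy as np

    def topological_order(adj):
        # adj[i, j] is True iff there is an edge i -> j.
        in_degree = adj.sum(axis=0).astype(int)
        ready = [i for i in range(len(adj)) if in_degree[i] == 0]
        order = []
        while ready:
            i = ready.pop()
            order.append(i)
            for j in np.flatnonzero(adj[i]):
                in_degree[j] -= 1
                if in_degree[j] == 0:
                    ready.append(j)
        # A complete ordering exists iff the graph has no cycle.
        return order if len(order) == len(adj) else None

    # Tiny lattice: 0 -> {1, 2} -> 3.
    adj = np.zeros((4, 4), dtype=bool)
    adj[0, 1] = adj[0, 2] = adj[1, 3] = adj[2, 3] = True
    order = topological_order(adj)
    # Rows and columns permuted by the ordering are strictly upper triangular.
    assert order is not None and not np.tril(adj[np.ix_(order, order)]).any()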
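
Finally, for note 10: the \(\epsilon \)-closure of a state, as used when determinizing finite state automata, in a few lines of Python (purely illustrative). The closure collects every state reachable through \(\epsilon \)-transitions alone.

    def epsilon_closure(state, eps_edges):
        # eps_edges maps a state to the set of states reachable from it
        # by a single epsilon transition.
        closure, stack = {state}, [state]
        while stack:
            for nxt in eps_edges.get(stack.pop(), ()):
                if nxt not in closure:
                    closure.add(nxt)
                    stack.append(nxt)
        return closure

    # 0 -e-> 1 -e-> 2: the closure of state 0 is {0, 1, 2}.
    assert epsilon_closure(0, {0: {1}, 1: {2}}) == {0, 1, 2}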

References

  1. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2017)

  2. Graves, A., Fernández, S., Gomez, F., et al.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pp. 369–376. New York, NY, USA (2006)

  3. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to Sequence Learning with Neural Networks. In: Ghahramani, Z., Welling, M., Cortes, C., et al. (eds.) Advances in Neural Information Processing Systems, vol. 27. Curran Associates Inc (2014)

  4. Radford, A., Kim, J.W., Xu, T., et al.: Robust speech recognition via large-scale weak supervision (2022)

  5. Brown, T., Mann, B., Ryder, N., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol 33. Curran Associates Inc., pp 1877–1901 (2020)

  6. Rombach, R., Blattmann, A., Lorenz, D., et al.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10684–10695 (2022)

  7. Ramesh, A., Dhariwal, P., Nichol, A., et al.: Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125 [cs] (2022)

  8. Kišš, M., Beneš, K., Hradiš, M.: AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) Document Analysis and Recognition-ICDAR 2021, pp. 463–477. Springer International Publishing (2021)

  9. Arazo, E., Ortego, D., Albert, P., et al.: Pseudo-labeling and confirmation bias in deep semi-supervised learning. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2020)

  10. Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. arXiv:1412.4864 [cs, stat] (2014)

  11. Sajjadi, M., Javanmardi, M., Tasdizen, T.: Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Advances in Neural Information Processing Systems, vol 29. Curran Associates, Inc (2016)

  12. Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems, vol 30. Curran Associates, Inc (2017)

  13. Berthelot, D., Carlini, N., Goodfellow, I., et al.: MixMatch: A holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems, vol 32. Curran Associates, Inc (2019)

  14. Kurakin, A., Raffel, C., Berthelot, D., et al.: ReMixMatch: semi-supervised learning with distribution matching and augmentation anchoring. In: ICLR (2020)

  15. Xie, Q., Dai, Z., Hovy, E., et al.: Unsupervised Data Augmentation for Consistency Training. In: Larochelle, H., Ranzato, M., Hadsell, R., et al. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6256–6268. Curran Associates, Inc (2020)

  16. Englesson, E., Azizpour, H.: Generalized Jensen-Shannon divergence loss for learning with noisy labels. In: Advances in Neural Information Processing Systems, vol. 34, pp. 30284–30297. Curran Associates, Inc. (2021)

  17. Zheng, C., Li, H., Rhee, S., et al.: Pushing the performance limit of scene text recognizer without human annotation. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 14096–14105 (2022)

  18. Aberdam, A., Ganz, R., Mazor, S., et al.: Multimodal semi-supervised learning for text recognition. arXiv:2205.03873 [cs] (2022)

  19. Lee, D.H.: Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: ICML 2013 Workshop: Challenges in Representation Learning (WREPL) (2013)

  20. Xie, Q., Luong, M.T., Hovy, E., et al.: Self-training with noisy student improves ImageNet classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  21. Pham, H., Dai, Z., Xie, Q., et al.: Meta pseudo labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11557–11568 (2021)

  22. Nagai, A.: Recognizing Japanese historical cursive with pseudo-labeling-aided CRNN as an application of semi-supervised learning to sequence labeling. In: 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 97–102 (2020)

  23. Stuner, B., Chatelain, C., Paquet, T.: Self-training of BLSTM with lexicon verification for handwriting recognition. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pp. 633–638 (2017)

  24. Leifert, G., Labahn, R., Sánchez, J.A.: Two semi-supervised training approaches for automated text recognition. In: 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 145–150 (2020)

  25. Das, D., Jawahar, C.V.: Adapting OCR with Limited Supervision. In: Bai, X., Karatzas, D., Lopresti, D. (eds.) Document Analysis Systems, pp. 30–44. Springer International Publishing, Cham (2020)

  26. Sohn, K., Berthelot, D., Carlini, N., et al.: FixMatch: simplifying semi-supervised learning with consistency and confidence. In: Advances in Neural Information Processing Systems, vol 33. Curran Associates, Inc., pp. 596–608 (2020)

  27. Weninger, F., Mana, F., Gemello, R., et al.: Semi-supervised learning with data augmentation for end-to-end ASR. Proc. Interspeech 2020, 2802–2806 (2020)

  28. Wolf, F., Fink, G.A.: Self-training of handwritten word recognition for synthetic-to-real adaptation. In: Proceedings of the International Conference on Pattern Recognition (2022)

  29. Zhang, H., Cisse, M., Dauphin, Y.N., et al.: Mixup: beyond empirical risk minimization. In: International Conference on Learning Representations (2018)

  30. Frinken, V., Bunke, H.: Evaluating retraining rules for semi-supervised learning in neural network based cursive word recognition. In: 2009 10th International Conference on Document Analysis and Recognition, pp. 31–35 (2009)

  31. Constum, T., Kempf, N., Paquet, T., et al.: Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early 20th Century Paris Census. In: Uchida, S., Barney, E., Eglin, V. (eds.) Document Analysis Systems, pp. 143–157. Springer International Publishing (2022)

  32. Gao, Y., Chen, Y., Wang, J., et al.: Semi-supervised scene text recognition. IEEE Trans. Image Process. 30, 3005–3016 (2021)

  33. Soltau, H., Liao, H., Sak, H.: Neural speech recognizer: acoustic-to-word LSTM model for large vocabulary speech recognition. Proc. Interspeech 2017, 3707–3711 (2017)

  34. Watanabe, S., Hori, T., Kim, S., et al.: Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE J. Select. Topics Signal Process. 11(8), 1240–1253 (2017)

  35. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)

  36. Bangalore, B., Bordel, G., Riccardi, G.: Computing consensus translation from multiple machine translation systems. In: IEEE Workshop on Automatic Speech Recognition and Understanding, 2001 (ASRU '01), pp. 351–354 (2001)

  37. Rosti, A.V., Ayan, N.F., Xiang, B., et al.: Combining outputs from multiple machine translation systems. In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. Association for Computational Linguistics, pp. 228–235 (2007)

  38. Fiscus, J.: A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER). In: 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 347–354 (1997)

  39. Mangu, L., Brill, E., Stolcke, A.: Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Comput. Speech Lang. 14(4), 373–400 (2000)

  40. Sanchez, J.A., Romero, V., Toselli, A.H., et al.: ICDAR2017 Competition on handwritten text recognition on the READ dataset. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pp. 1383–1388. IEEE, Kyoto (2017)

  41. Sánchez, J.A., Romero, V., Toselli, A.H., et al.: ICFHR2014 competition on handwritten text recognition on transcriptorium datasets (HTRtS). In: 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 785–790 (2014)

  42. Serrano, N., Castro, F., Juan, A.: The RODRIGO database. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). European Language Resources Association (ELRA) (2010)

  43. Sánchez, J.A., Romero, V., Toselli, A.H., et al.: ICFHR2016 competition on handwritten text recognition on the READ dataset. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 630–635 (2016)

  44. Kohút, J., Hradiš, M.: Finetuning is a surprisingly effective domain adaptation baseline in handwriting recognition. arXiv:2302.06308 [cs] (2023)

  45. Kohút, J., Hradiš, M., Kišš, M.: Towards writing style adaptation in handwriting recognition. arXiv:2302.06318 [cs] (2023)

Author information

Contributions

MK, MH, and KB wrote the paper and performed the experiments. PB and MK contributed to the development of the loss function.

Corresponding author

Correspondence to Martin Kišš.

Ethics declarations

Conflict of interest

The authors have no competing interests, financial or otherwise.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kišš, M., Hradiš, M., Beneš, K. et al. SoftCTC—semi-supervised learning for text recognition using soft pseudo-labels. IJDAR (2023). https://doi.org/10.1007/s10032-023-00452-9
