Abstract
Attention-based encoder-decoder (AED) models are increasingly used for handwritten mathematical expression recognition (HMER). Given the recent success of the Transformer in computer vision and the variety of attempts to combine Transformers with convolutional neural networks (CNNs), in this paper we study three ways of leveraging Transformer and CNN designs to improve AED-based HMER models: 1) the Tandem way, which feeds CNN-extracted features to a Transformer encoder to capture global dependencies; 2) the Parallel way, which adds a Transformer encoder branch that takes raw image patches as input and concatenates its output with the CNN's to form the final feature; and 3) the Mixing way, which replaces the convolution layers of the CNN's last stage with multi-head self-attention (MHSA). We compare these three methods on the CROHME benchmark. On CROHME 2016 and 2019, the Tandem way attains ExpRates of 54.85% and 58.56%, respectively; the Parallel way attains 55.63% and 57.39%; and the Mixing way achieves 53.93% and 55.64%. These results indicate that the Parallel and Tandem ways outperform the Mixing way and differ little from each other.
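The three combination schemes differ only in where the self-attention sits relative to the CNN backbone. The following minimal NumPy sketch illustrates this at the tensor-shape level; it is not the authors' implementation, and all dimensions, the single-head attention stand-in for MHSA, and the random feature maps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, d):
    """Toy single-head self-attention (stand-in for a MHSA/Transformer layer)."""
    Wq, Wk, Wv = (rng.standard_normal((x.shape[-1], d)) * 0.1 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)                           # (seq, seq)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True)) # softmax over keys
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                            # (seq, d)

# Stand-ins: a CNN feature map flattened to (H*W, C), and raw image patches.
cnn_feats = rng.standard_normal((64, 256))   # e.g. an 8x8 map with 256 channels
patches   = rng.standard_normal((64, 196))   # e.g. 64 flattened 14x14 patches

# 1) Tandem: the Transformer encoder runs on top of the CNN's output features.
tandem = self_attention(cnn_feats, 256)                                   # (64, 256)

# 2) Parallel: a Transformer branch on raw patches, concatenated with CNN output.
parallel = np.concatenate([cnn_feats, self_attention(patches, 256)], -1)  # (64, 512)

# 3) Mixing: the same attention op, but replacing the last conv stage, so it
#    acts on intermediate features *inside* the backbone rather than after it.
mixing = self_attention(cnn_feats, 256)                                   # (64, 256)

print(tandem.shape, parallel.shape, mixing.shape)
```

In Tandem and Mixing the attention consumes CNN features (the difference is whether it augments or replaces the last convolutional stage), while in Parallel it forms an independent branch over raw patches whose output is fused with the CNN's by concatenation, doubling the channel count.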
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Z., Zhang, Y. (2022). Combining CNN and Transformer as Encoder to Improve End-to-End Handwritten Mathematical Expression Recognition Accuracy. In: Porwal, U., Fornés, A., Shafait, F. (eds) Frontiers in Handwriting Recognition. ICFHR 2022. Lecture Notes in Computer Science, vol 13639. Springer, Cham. https://doi.org/10.1007/978-3-031-21648-0_13
Print ISBN: 978-3-031-21647-3
Online ISBN: 978-3-031-21648-0