Skip to main content

Document Image Classification with Vision Transformers

  • 125 Accesses

Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST,volume 436)

Abstract

Document image classification has received huge interest in business automation processes. Therefore, document image classification plays an important role in the document image processing (DIP) systems. And it is necessary to develop an effective framework for this task. Many methods have been proposed for the classification of document images in literature. In this paper we propose an efficient document image classification task that uses vision transformers (ViTs) and benefits from visual information of the document. Transformers are models developed for natural language processing tasks. Due to its high performances, their structures have been modified and they have started to be applied on different problems. ViT is one of these models. ViTs have demonstrated imposing performance in computer vision tasks compared with baselines. Since, scans the image and models the relation between the image patches using multi-head self-attention Experiments are conducted on a real-world dataset. Despite the limited size of training data available, our method achieves acceptable performance while performing document image classification.

Keywords

  • Document image classification
  • Deep learning
  • Transformers
  • Vision transformers

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-031-01984-5_6
  • Chapter length: 14 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   64.99
Price excludes VAT (USA)
  • ISBN: 978-3-031-01984-5
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   84.99
Price excludes VAT (USA)
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.

References

  1. Kowsari, K., Brown, D.E., Heidarysafa, M., Meimandi, K.J., Gerber, M.S., Barnes, L.E.: HDLTex: hierarchical deep learning for text classification. In: 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364–371. IEEE, Cancun, Mexico (2017).

    Google Scholar 

  2. Liu, L., Wang, Z., Qiu, T., Chen, Q., Lu, Y., Suen, C.Y.: Document image classification: progress over two decades. Neurocomputing 453, 223–240 (2021)

    CrossRef  Google Scholar 

  3. Gallo, I., Noce, L., Zamberletti, A., Calefeti, A.: Deep neural networks for page stream segmentation and classification. In: 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1–6. IEEE, Gold Coast, Australia (2016)

    Google Scholar 

  4. Sevim, S., İlhan Omurca, S., Ekinci, E.: Improving accuracy of document image classification through soft voting ensemble. In: 3rd International Conference on Artificial Intelligence and Applied Mathematics in Engineering (ICAIAME 2021), pp. 1–14. Springer, Cham (2021). https://doi.org/10.1186/s13059-022-02636-8

  5. Jain, R., Wigington, C.: Multimodal document image classification. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 71–77. Sydney, Australia (2019)

    Google Scholar 

  6. Augereau, O., Journet, N., Vialard, A., Domenger, J.P.: Improving classification of an industrial document image database by combining visual and textual features. In: 2014 11th IAPR International Workshop on Document Analysis Systems, pp. 314–318. IEEE, Tours, France (2014)

    Google Scholar 

  7. Kowsari, K., Meimandi, K.J., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D.: Text classification algorithms: a survey. Information 10(150), 1–68 (2019)

    Google Scholar 

  8. Srinivasulu, K.: Health-related tweets classification: a survey. In: Gunjan, V.K., Zurada, J.M. (eds.) International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications Advances in Intelligent Systems and Computing, vol. 1245, pp. 259–268, Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-7234-0

  9. Nguyen, Q.D., Le, D.A., Phan, N.M., Zelinka, I.: OCR error correction using correction patterns and self-organizing migrating algorithm. Pattern Anal. Appl. 24, 701–721 (2021)

    CrossRef  Google Scholar 

  10. Kumar, J., Ye, P., Doermann, D.: Learning document structure for retrieval and classification. In: 21st International Conference on Pattern Recognition (ICPR), pp. 1558–1561. IEEE (2012).

    Google Scholar 

  11. Kang, L., Kumar, J., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for document image classification. In: 2014 22nd International Conference on Pattern Recognition, pp. 3168–3172. IEEE, Tsukuba, Japan (2014)

    Google Scholar 

  12. Bakkali, S., Ming, Z., Coustaty, M., Rusinol, M.: Cross-modal deep networks for document image classification. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 2556–2560. IEEE (2020)

    Google Scholar 

  13. Hatamizadeh, A., et al.: UNETR: Transformers for 3D medical image segmentation. CoRR abs/2103.10504 (2021). http://arxiv.org/abs/2103.10504

  14. Liu, Y., Sangineto, E., Bi, W., Sebe, N., Lepri, B., Nadai, M.: Efficient training of visual transformers with small datasets. Adv. Neural. Inf. Process. Syst. 34, 1–13 (2021)

    Google Scholar 

  15. Mandivarapu, J.K., Bunch, E., You, Q., Fung, G.: Efficient document image classification using region-based graph neural network. CoRR abs/2106.13802 (2021). http://arxiv.org/abs/2106.13802

  16. Baumann, S., et al.: Message extraction from printed documents—a complete solution—. In: Fourth International Conference on Document Analysis and Recognition, pp. 1055–1059. IEEE, Ulm, Germany (1997)

    Google Scholar 

  17. Eken, S., Menhour, H., Köksal, K.: DoCA: a content-based automatic classification system over digital documents. IEEE Access 7, 97996–98004 (2019)

    CrossRef  Google Scholar 

  18. Şahin, S. et al.: Dijital dokümanların anahtar kelime tabanlı doğrulanması. In: 6. Ulusal Yüksek Başarımlı Hesaplama Konferansı. Ankara, Turkey (2020)

    Google Scholar 

  19. Kumar, J., Ye, P., Doermann, D.: Structural similarity for document image classification and retrieval. Pattern Recogn. Lett. 43, 119–126 (2014)

    CrossRef  Google Scholar 

  20. Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. CoRR abs/1502.07058 (2015). http://arxiv.org/abs/1502.07058

  21. Afzal, M.Z., et al.: Deepdocclassifier: document classification with deep convolutional neural network. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1111–1115. Tunis, Tunisia (2015)

    Google Scholar 

  22. Roy, S., Das, A., Bhattacharya, U.: Generalized stacking of layerwise-trained deep convolutional neural networks for document image classification. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 1273–1278 (2016)

    Google Scholar 

  23. Csurka, G.: Document image classification, with a specific view on applications of patent images. CoRR abs/1601.03295 (2016). http://arxiv.org/abs/1601.03295

  24. Csurka, G., Larlus, D., Gordo, A., Almaz´an, J.: What is the right way to represent document images?. CoRR abs/1603.01076 (2016). http://arxiv.org/abs/1603.01076

  25. Yaman, D., Eyiokur, F.I., Ekenel, H.K.: Comparison of convolutional neural network models for document image classification. In: 2017 25th Signal Processing and Communications Applications Conference (SIU), pp. 1–4. IEEE, Antalya, Turkey (2017)

    Google Scholar 

  26. Zavalishin, S., Bout, A., Kurilin, I., Rychagov, M.: Document image classification on the basis of layout information. Electr. Imaging 2017, 78–86 (2017)

    CrossRef  Google Scholar 

  27. Tensmeyer, C., Martinez, T.R.: Analysis of convolutional neural networks for document image classification. CoRR abs/1708.03273 (2017). http://arxiv.org/abs/1708.03273

  28. Kölsch, A., Afzal, M.Z., Ebbecke, M., Liwicki, M.: Real-time document image classification using deep CNN and extreme learning machines. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pp. 1318–1323. Kyoto, Japan (2017)

    Google Scholar 

  29. Afzal, M.Z., Kölsch, A., Ahmed, S., Liwicki, M.: Cutting the error by half: Investigation of very deep CNN and advanced training strategies for document image classification. CoRR abs/1704.03557 (2017). http://arxiv.org/abs/1704.03557

  30. Das, A., Roy, S., Bhattacharya, U.: Document image classification with intra-domain transfer learning and stacked generalization of deep convolutional neural networks. CoRR abs/1801.09321 (2018). http://arxiv.org/abs/1801.09321

  31. Hassanpour, M., Malek, H.: Document image classification using squeezenet convolutional neural network. In: 2019 5th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS), pp. 1–4. IEEE, Shahrood, Iran (2019)

    Google Scholar 

  32. Mohsenzadegan, K., et al.: A convolutional neural network model for robust classification of document-images under real-world hard conditions. In: Developments of Artificial Intelligence Technologies in Computation and Robotics: Proceedings of the 14th International FLINS Conference (FLINS), pp. 1023– 1030. World Scientific (2020)

    Google Scholar 

  33. Jadli, A., Hain, M., Hasbaoui, A.: An improved document image classification using deep transfer learning and feature reduction. Int. Adv. Trends Comput. Sci. Eng. 10, 549–557 (2021)

    CrossRef  Google Scholar 

  34. Jadli, A., Hain, M., Jaize, A.: A novel approach to data augmentation for document image classification using deep convolutional generative adversarial networks. In: Motahhir, S., Bossoufi, B. (eds.) Digital Technologies and Applications, ICDTA 2021, LNNS, vol. 211, pp. 135–144. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73882-2

  35. Noce, L., Gallo, I., Zamberletti, A., Calefati, A.: Embedded textual content for document image classification with convolutional neural networks. In: Proceedings of the 2016 ACM Symposium on Document Engineering, pp. 165–173. ACM, Vienna, Austria (2016)

    Google Scholar 

  36. Audebert, N., Herold, C., Slimani, K., Vidal, C.: Multimodal deep networks for text and image-based document classification. CoRR abs/1907.06370 (2019). http://arxiv.org/abs/1907.06370

  37. Jain, R., Wigington, C.: Multimodal document image classification. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 71–77. IEEE, Sydney, Australia (2019)

    Google Scholar 

  38. Ferrando, J., et al.: Improving accuracy and speeding up document image classification through parallel systems. In: Krzhizhanovskaya, V., et al. (eds.) Computational Science – ICCS 2020. ICCS 2020. LNCS, vol. 12138, pp. 387–400. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50417-5_29

  39. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: Layoutlm: pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1192–1200. ACM (2020)

    Google Scholar 

  40. Bakkali, S., Ming, Z., Coustaty, M., Rusinol, M.: Visual and textual deep feature fusion for document image classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 562–563. IEEE (2020)

    Google Scholar 

  41. Bakkali, S., Ming, Z., Coustaty, M., Rusinol, M.: EAML: ensemble self-attention based mutual learning network for document image classification. Int. J. Doc. Anal. Recogn. (IJDAR) 24, 1–18 (2021)

    Google Scholar 

  42. Mandivarapu, J.K., Bunch, E., You, Q., Fung, G.: Efficient document image classification using region-based graph neural network. CoRR abs/2106.13802 (2021). https://arxiv.org/abs/2106.13802

  43. Xiong, Y., Dai, Z., Liu, Y., Ding, X.: Document image classification method based on graph convolutional network. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A. N. (eds.) Neural Information Processing. ICONIP 2021, LNCS, vol. 13108, pp. 317–329. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-92185-9_26

  44. Siddiqui, S.A., Dengel, A., Ahmed, S.: Analyzing the potential of zero-shot recognition for document image classification. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) Document Analysis and Recognition – ICDAR. LNCS, vol. 12824, pp. 293–304. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86337-1_20

    CrossRef  Google Scholar 

  45. Sellami, A., Tabbone, S.: EDNets: deep feature learning for document image classification based on multi-view encoder-decoder neural networks. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) Document Analysis and Recognition – ICDAR. LNCS, vol. 12824, pp. 318–332. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86337-1_22

    CrossRef  Google Scholar 

  46. Mandivarapu, J.K., Bunch, E., Fung, G.: Domain agnostic few-shot learning for document intelligence. CoRR abs/2111.00007 (2021). https://arxiv.org/abs/2111.00007

  47. Vaswani, A., et al.: Attention is all you need. CoRR abs/1706.03762 (2017). https://arxiv.org/abs/1706.03762

  48. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018). https://arxiv.org/abs/1810.04805

  49. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. CoRR abs/2010.11929 (2020). https://arxiv.org/abs/2010.11929

Download references

Acknowledgments

This work has been supported by the Kocaeli University Scientific Research and Development Support Program (BAP) in Turkey under project number FBA-2020-2152.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Semih Sevim .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2022 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Verify currency and authenticity via CrossMark

Cite this paper

Sevim, S., Omurca, S.İ., Ekinci, E. (2022). Document Image Classification with Vision Transformers. In: Seyman, M.N. (eds) Electrical and Computer Engineering. ICECENG 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 436. Springer, Cham. https://doi.org/10.1007/978-3-031-01984-5_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-01984-5_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-01983-8

  • Online ISBN: 978-3-031-01984-5

  • eBook Packages: Computer ScienceComputer Science (R0)