Abstract
The Brazilian Supreme Court receives tens of thousands of cases each semester. Court employees spend thousands of hours on the initial analysis and classification of those cases, which takes effort away from later, more complex stages of the case-management workflow. In this paper, we explore multimodal classification of documents from Brazil's Supreme Court. We train and evaluate our methods on a novel multimodal dataset of 6510 lawsuits (339,478 pages) with manual annotation assigning each page to one of six classes. Each lawsuit is an ordered sequence of pages, each stored both as an image and as text extracted through optical character recognition. We first train two unimodal classifiers: a ResNet pre-trained on ImageNet is fine-tuned on the images, and a convolutional network with filters of multiple kernel sizes is trained from scratch on the document texts. We use them as extractors of visual and textual features, which are then combined through our proposed fusion module. The fusion module can handle missing textual or visual input by substituting learned embeddings for the missing data. Moreover, we experiment with bidirectional long short-term memory (biLSTM) networks and linear-chain conditional random fields to model the sequential nature of the pages. The multimodal approaches outperform both the textual and the visual classifiers, especially when leveraging the sequential nature of the pages.
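To illustrate the idea of handling missing modalities with learned placeholder embeddings, the following is a minimal sketch assuming the fusion simply concatenates the two feature vectors; the dimensions, the `fuse` helper, and the random stand-in values are illustrative, not the paper's actual architecture, in which the placeholder embeddings are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
VIS_DIM, TXT_DIM = 4, 3  # toy feature sizes, for illustration only

# Placeholder embeddings substituted when a modality is absent.
# In the paper these are learned jointly with the model; here they
# are fixed random vectors so the example is self-contained.
missing_visual = rng.normal(size=VIS_DIM)
missing_textual = rng.normal(size=TXT_DIM)

def fuse(visual, textual):
    """Concatenate visual and textual features, falling back to the
    placeholder embedding when a modality is missing (None)."""
    v = visual if visual is not None else missing_visual
    t = textual if textual is not None else missing_textual
    return np.concatenate([v, t])

page_full = fuse(rng.normal(size=VIS_DIM), rng.normal(size=TXT_DIM))
page_no_text = fuse(rng.normal(size=VIS_DIM), None)  # e.g., OCR produced no text

# Every page yields a fused vector of the same shape, so downstream
# sequence models (biLSTM, CRF) receive a uniform input regardless of
# which modalities were available.
assert page_full.shape == page_no_text.shape == (VIS_DIM + TXT_DIM,)
```

The benefit of learned placeholders over zero vectors is that the model can learn a representation for "this modality is absent" rather than treating absence as an all-zero signal.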
Data availability
Data used in this work is available at http://ailab.unb.br/victor/lrec2020/.
Code availability
Code used in this work is available at https://github.com/peluz/victor-visual-text.
Notes
To the best of our knowledge.
Funding
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. TdC received support from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), grant PQ 314154/2018-3. We acknowledge the support of “Projeto de Pesquisa & Desenvolvimento de aprendizado de máquina (machine learning) sobre dados judiciais das repercussões gerais do Supremo Tribunal Federal - STF”. We are also grateful for the support from Fundação de Apoio à Pesquisa do Distrito Federal (FAPDF, project KnEDLe, convênio 07/2019) and Fundação de Empreendimentos Científicos e Tecnológicos (Finatec). TdC is currently on a leave of absence from the University of Brasilia and works at Vicon Motion Systems, Oxford Metrics Group.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Luz de Araujo, P.H., de Almeida, A.P.G.S., Ataides Braz, F. et al. Sequence-aware multimodal page classification of Brazilian legal documents. IJDAR 26, 33–49 (2023). https://doi.org/10.1007/s10032-022-00406-7