Exploratory Study of Data Sampling Methods for Imbalanced Legal Text Classification

Freire, Daniela L.; de Almeida, Alex M. G.; de S. Dias, Márcio; Rivolli, Adriano; Pereira, Fabíola S. F.; de Godoi, Giliard A.; de Carvalho, Andre C. P. L. F.

doi:10.1007/978-3-031-40725-3_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14001)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems

Abstract

This article investigates the application of machine learning algorithms in the legal domain, focusing on text classification tasks and the challenges posed by imbalanced class distributions. Given the very high number of ongoing legal cases in Brazil, integrating machine learning tools into the workflow of courts has the potential to make the justice system faster and more efficient. However, the imbalanced nature of legal datasets presents a significant hurdle for traditional machine learning algorithms, which tend to prioritize the majority class and disregard minority classes. To mitigate this problem, researchers have developed imbalance learning techniques that either modify the supervised learning algorithm or rebalance the class distribution of the dataset to improve predictive performance. Data sampling techniques, such as oversampling and undersampling, play a crucial role in balancing class distributions and enabling the training of accurate machine learning models. In this study, a real dataset comprising lawsuits from the Court of Justice of São Paulo, in the state of São Paulo, Brazil, is used to evaluate the effects of different imbalance learning techniques, including oversampling, undersampling, and combined methods, on predictive performance for a binary classification task. The experimental results provide valuable insights into the comparative performance of these techniques and their applicability in the legal domain.
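
To make the two sampling families named in the abstract concrete, the following is a minimal sketch of random oversampling and random undersampling for a binary text classification task. It is not the authors' pipeline: the abstract does not say which specific samplers, features, or classifiers were used, so the function names, the toy corpus, and the choice of plain random resampling are all assumptions made for illustration.

```python
import random
from collections import Counter


def _group_by_label(texts, labels):
    """Collect documents into one list per class label."""
    groups = {}
    for text, label in zip(texts, labels):
        groups.setdefault(label, []).append(text)
    return groups


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen documents until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = max(len(items) for items in groups.values())
    out_texts, out_labels = [], []
    for label, items in groups.items():
        # Draw with replacement only the number of extra copies that is needed.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


def random_undersample(texts, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = min(len(items) for items in groups.values())
    out_texts, out_labels = [], []
    for label, items in groups.items():
        # Sample without replacement so no document is repeated.
        out_texts.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Hypothetical toy corpus: 8 majority-class (0) and 2 minority-class (1) documents.
texts = [f"routine filing {i}" for i in range(8)] + ["precedent case a", "precedent case b"]
labels = [0] * 8 + [1] * 2

over_texts, over_labels = random_oversample(texts, labels)
under_texts, under_labels = random_undersample(texts, labels)

over_counts = Counter(over_labels)    # 8 documents per class
under_counts = Counter(under_labels)  # 2 documents per class
print(over_counts)
print(under_counts)
```

In practice, resampling of this kind is usually applied to the training split only, so that the evaluation data keeps the original class distribution. Combined methods, which the abstract also mentions, typically chain an oversampling step with a cleaning or undersampling step.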

Acknowledgment

The authors would like to express their sincere gratitude to the São Paulo Justice Court, Brazil, for their valuable financial and intellectual support in conducting this research. The support provided by all the individuals involved in this collaboration, with their insightful discussions, guidance, and expertise, contributed to this study’s completion.

Author information

Authors and Affiliations

University of Sao Paulo, Sao Paulo, Brazil
Daniela L. Freire & Andre C. P. L. F. de Carvalho
Ourinhos College of Technology, Sao Paulo, Brazil
Alex M. G. de Almeida
Federal University of Catalão, Catalão, Brazil
Márcio de S. Dias
Federal Technological University of Paraná, Curitiba, Brazil
Adriano Rivolli & Giliard A. de Godoi
Federal University of Uberlândia, Uberlândia, Brazil
Fabíola S. F. Pereira

Corresponding author

Correspondence to Daniela L. Freire.

Editor information

Editors and Affiliations

University of Deusto, Bilbao, Spain
Pablo García Bringas
University of Leon, León, Spain
Hilde Pérez García
University of La Rioja, Logroño, La Rioja, Spain
Francisco Javier Martínez de Pisón
Pablo de Olavide University, Seville, Spain
Francisco Martínez Álvarez
Pablo de Olavide University, Seville, Spain
Alicia Troncoso Lora
University of Burgos, Burgos, Spain
Álvaro Herrero
University of A Coruña, Ferrol - Coruña, Spain
José Luis Calvo Rolle
University of A Coruña, Ferrol - Coruña, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
Emilio Corchado

About this paper

Cite this paper

Freire, D.L. et al. (2023). Exploratory Study of Data Sampling Methods for Imbalanced Legal Text Classification. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2023. Lecture Notes in Computer Science, vol 14001. Springer, Cham. https://doi.org/10.1007/978-3-031-40725-3_10

DOI: https://doi.org/10.1007/978-3-031-40725-3_10
Published: 29 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40724-6
Online ISBN: 978-3-031-40725-3
eBook Packages: Computer Science, Computer Science (R0)
