Abstract
This article investigates the application of machine learning algorithms in the legal domain, focusing on text classification tasks and addressing the challenges posed by imbalanced class distributions. Given the very high number of ongoing legal cases in Brazil, the integration of machine learning tools in the workflow of courts has the potential to enhance justice efficiency and speed. However, the imbalanced nature of legal datasets presents a significant hurdle for traditional machine learning algorithms, which tend to prioritize the majority class and disregard minority classes. To mitigate this problem, researchers have developed imbalance learning techniques that either modify supervised learning or improve the dataset class distribution to improve predictive performance. Data sampling techniques, such as oversampling and undersampling, play a crucial role in balancing class distributions and enabling the training of accurate machine learning models. In this study, a real dataset comprising lawsuits from the Court of Justice of São Paulo, in the state of São Paulo, Brazil, is used to evaluate the effects of different imbalance learning techniques, including oversampling, undersampling, and combined methods, in predictive performance for a binary classification task. The experimental results provided valuable insights into the comparative performance of these techniques and their applicability in the legal domain.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Abdi, L., Hashemi, S.: To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans. Knowl. Data Eng. 28(1), 238–251 (2015)
Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6(1), 20–29 (2004)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Coelho, G.M.C., et al.: Text classification in the Brazilian legal domain. In: International Conference on Enterprise Information Systems, pp. 355–363 (2022)
Feng, W., et al.: Dynamic synthetic minority over-sampling technique-based rotation forest for the classification of imbalanced hyperspectral data. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 12(7), 2159–2169 (2019)
Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F.: Learning from Imbalanced Data Sets, vol. 10. Springer, Heidelberg (2018). https://doi.org/10.1007/978-3-319-98074-4
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., Bing, G.: Learning from class-imbalanced data: review of methods and applications. Expert Syst. Appl. 73, 220–239 (2017)
Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005). https://doi.org/10.1007/11538059_91
He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: IEEE International Joint Conference on Neural Network, pp. 1322–1328 (2008)
Ivan, T.: Two modifications of CNN. IEEE Trans. Syst. Man Commun. 6, 769–772 (1976)
Jo, W., Kim, D.: OBGAN: minority oversampling near borderline with generative adversarial networks. Expert Syst. Appl. 197, 116694 (2022)
de Justiça Secretaria de Jurisprudência, S.T.: Precedentes qualificados (2023)
Ma, Y., He, H.: Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley, Hoboken (2013)
Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
de Justiça Departamento de Pesquisas Judiciárias, C.N.: Justiça em números 2022. Justiça em números 2022 (2022)
Sáez, J.A., Luengo, J., Stefanowski, J., Herrera, F.: Smote-ipf: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci. 291, 184–203 (2015)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)
Yen, S.J., Lee, Y.S.: Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In: Huang, D.S., Li, K., Irwin, G.W. (eds.) Intelligent Control and Automation. Lecture Notes in Control and Information Sciences, vol. 344, pp. 731–740. Springer, Cham (2006). https://doi.org/10.1007/978-3-540-37256-1_89
Acknowledgment
The authors would like to express their sincere gratitude to the São Paulo Justice Court, Brazil, for their valuable financial and intellectual support in conducting this research. The support provided by all the individuals involved in this collaboration, with their insightful discussions, guidance, and expertise, contributed to this study’s completion.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Freire, D.L. et al. (2023). Exploratory Study of Data Sampling Methods for Imbalanced Legal Text Classification. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2023. Lecture Notes in Computer Science(), vol 14001. Springer, Cham. https://doi.org/10.1007/978-3-031-40725-3_10
Download citation
DOI: https://doi.org/10.1007/978-3-031-40725-3_10
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40724-6
Online ISBN: 978-3-031-40725-3
eBook Packages: Computer ScienceComputer Science (R0)