Exploratory Study of Data Sampling Methods for Imbalanced Legal Text Classification

  • Conference paper
Hybrid Artificial Intelligent Systems (HAIS 2023)

Abstract

This article investigates the application of machine learning algorithms in the legal domain, focusing on text classification tasks and the challenges posed by imbalanced class distributions. Given the very high number of ongoing legal cases in Brazil, integrating machine learning tools into court workflows has the potential to make the justice system faster and more efficient. However, the imbalanced nature of legal datasets is a significant hurdle for traditional machine learning algorithms, which tend to favor the majority class and neglect minority classes. To mitigate this problem, researchers have developed imbalance learning techniques that either modify the supervised learning algorithm or rebalance the class distribution of the dataset to improve predictive performance. Data sampling techniques, such as oversampling and undersampling, play a crucial role in balancing class distributions and enabling the training of accurate machine learning models. In this study, a real dataset of lawsuits from the Court of Justice of São Paulo, in the state of São Paulo, Brazil, is used to evaluate the effects of different imbalance learning techniques, including oversampling, undersampling, and combined methods, on predictive performance in a binary classification task. The experimental results provide insights into the comparative performance of these techniques and their applicability in the legal domain.
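
As an illustration of the two basic sampling ideas named in the abstract, the following is a minimal sketch of random oversampling and random undersampling. It is not the authors' implementation, and the abstract does not say which specific methods or libraries were evaluated. The function name `random_resample`, its parameters, and the toy documents are hypothetical and chosen only for this example.

```python
import random
from collections import Counter


def random_resample(texts, labels, strategy="over", seed=0):
    """Balance a labeled dataset by random over- or undersampling.

    Hypothetical helper for illustration only:
    - "over": duplicate randomly chosen examples of smaller classes
      until every class matches the largest class.
    - "under": keep a random subset of larger classes so that every
      class matches the smallest class.
    """
    if strategy not in ("over", "under"):
        raise ValueError("strategy must be 'over' or 'under'")

    rng = random.Random(seed)

    # Group the documents by class label.
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if strategy == "over" else min(sizes)

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        if len(items) < target:
            # Oversampling: keep all originals, add duplicates drawn with replacement.
            items = items + rng.choices(items, k=target - len(items))
        elif len(items) > target:
            # Undersampling: keep a random subset drawn without replacement.
            items = rng.sample(items, target)
        out_texts.extend(items)
        out_labels.extend([label] * len(items))
    return out_texts, out_labels


# Toy imbalanced binary dataset: 8 majority documents, 2 minority documents.
texts = [f"doc{i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

_, over_labels = random_resample(texts, labels, strategy="over")
_, under_labels = random_resample(texts, labels, strategy="under")
print(Counter(over_labels), Counter(under_labels))
```

In practice, resampling of this kind is normally applied to the training split only, so that the evaluation data keeps the original class distribution.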

Acknowledgment

The authors would like to express their sincere gratitude to the São Paulo Justice Court, Brazil, for their valuable financial and intellectual support in conducting this research. The support provided by all the individuals involved in this collaboration, with their insightful discussions, guidance, and expertise, contributed to this study’s completion.

Author information

Corresponding author

Correspondence to Daniela L. Freire.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Freire, D.L., et al. (2023). Exploratory Study of Data Sampling Methods for Imbalanced Legal Text Classification. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2023. Lecture Notes in Computer Science, vol 14001. Springer, Cham. https://doi.org/10.1007/978-3-031-40725-3_10

  • DOI: https://doi.org/10.1007/978-3-031-40725-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40724-6

  • Online ISBN: 978-3-031-40725-3

  • eBook Packages: Computer Science (R0)
