Exploiting EuroVoc’s Hierarchical Structure for Classifying Legal Documents

  • Erwin Filtz
  • Sabrina Kirrane
  • Axel Polleres
  • Gerhard Wohlgenannt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11877)

Abstract

Multi-label document classification is a challenging problem because of the potentially huge number of classes. Furthermore, real-world datasets often exhibit a strongly varying number of labels per document, and a power-law distribution of those class labels. Multi-label classification of legal documents is additionally complicated by long document texts and domain-specific use of language. In this paper we use different approaches to compare the performance of text classification algorithms on existing datasets and corpora of legal documents, and contrast the results of our experiments with results on general-purpose multi-label text classification datasets. Moreover, for the EUR-Lex legal datasets, we show that exploiting the hierarchy of the EuroVoc thesaurus helps to improve classification performance by reducing the number of potential classes while retaining the informative value of the classification itself.
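One way to picture the label reduction described above is to replace each fine-grained label with its ancestor at a chosen level of the thesaurus. The following is a minimal sketch under stated assumptions, not the authors' implementation: the toy hierarchy, the label names, and the helper functions `path_from_root` and `roll_up` are all hypothetical, and each concept is assumed to have a single broader concept for simplicity.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical toy hierarchy (NOT real EuroVoc data): child -> parent.
# None marks a top-level node. Assumes one parent per concept.
PARENT: Dict[str, Optional[str]] = {
    "agriculture": None,
    "farming systems": "agriculture",
    "organic farming": "farming systems",
    "trade": None,
    "tariff policy": "trade",
    "customs duties": "tariff policy",
}


def path_from_root(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of labels from the top-level ancestor down to `label`."""
    path = [label]
    # Walk upwards until a node without a parent is reached.
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    path.reverse()
    return path


def roll_up(
    labels: Iterable[str],
    parent: Dict[str, Optional[str]],
    level: int,
) -> Set[str]:
    """Replace each label by its ancestor at depth `level` (0 = top level).

    Labels that sit above the requested depth are kept unchanged.
    """
    reduced: Set[str] = set()
    for label in labels:
        path = path_from_root(label, parent)
        reduced.add(path[min(level, len(path) - 1)])
    return reduced


# Three fine-grained labels assigned to one hypothetical document.
doc_labels = ["organic farming", "customs duties", "tariff policy"]
top_level = roll_up(doc_labels, PARENT, level=0)
second_level = roll_up(doc_labels, PARENT, level=1)
```

In this sketch, rolling up merges "customs duties" and "tariff policy" into a shared ancestor, so the classifier faces fewer candidate classes while each remaining label still names a meaningful topic area.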

Keywords

Document classification, EuroVoc, Eur-Lex, Legal domain, Word embeddings, Deep learning

Notes

Acknowledgment

The research leading to this work was partly funded by the Federal Ministry of Digital and Economic Affairs of the Republic of Austria and the Jubilaeumsfonds der Stadt Wien. Gerhard Wohlgenannt’s work was supported by the Government of the Russian Federation (Grant 074-U01) through the ITMO Fellowship and Professorship Program.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Vienna University of Economics and Business, Vienna, Austria
  2. Complexity Science Hub, Vienna, Austria
  3. ITMO University, St. Petersburg, Russia