Exploiting EuroVoc’s Hierarchical Structure for Classifying Legal Documents

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11877)


Multi-label document classification is a challenging problem because of the potentially huge number of classes. Furthermore, real-world datasets often exhibit a strongly varying number of labels per document, and a power-law distribution of those class labels. Multi-label classification of legal documents is additionally complicated by long document texts and domain-specific use of language. In this paper we use different approaches to compare the performance of text classification algorithms on existing datasets and corpora of legal documents, and contrast the results of our experiments with results on general-purpose multi-label text classification datasets. Moreover, for the EUR-Lex legal datasets, we show that exploiting the hierarchy of the EuroVoc thesaurus helps to improve classification performance by reducing the number of potential classes while retaining the informative value of the classification itself.
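The idea of reducing the label space through the thesaurus hierarchy can be illustrated with a minimal sketch. It is not the procedure used in the paper, only one plausible way to roll specific labels up to a broader ancestor. The toy hierarchy, the term names, and the helpers `path_to_root` and `roll_up` are all hypothetical, and the sketch assumes each term has exactly one parent, which a real thesaurus need not satisfy.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical toy thesaurus: child -> parent (None marks a top-level term).
# These names are illustrative only and are not real EuroVoc descriptors.
PARENT: Dict[str, Optional[str]] = {
    "law": None,
    "civil law": "law",
    "contract law": "civil law",
    "criminal law": "law",
    "agriculture": None,
    "farming systems": "agriculture",
    "organic farming": "farming systems",
}


def path_to_root(label: str) -> List[str]:
    """Return the chain [top-level term, ..., label]."""
    path = [label]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path[::-1]


def roll_up(labels: Iterable[str], depth: int) -> Set[str]:
    """Replace each label by its ancestor at `depth` (0 = top level).

    Labels that already sit at or above `depth` are kept unchanged.
    """
    rolled = set()
    for label in labels:
        path = path_to_root(label)
        rolled.add(path[min(depth, len(path) - 1)])
    return rolled


doc_labels = ["contract law", "criminal law", "organic farming"]
print(sorted(roll_up(doc_labels, depth=0)))  # ['agriculture', 'law']
print(sorted(roll_up(doc_labels, depth=1)))  # ['civil law', 'criminal law', 'farming systems']
```

Rolling up to a shallower depth merges sibling labels into one class, so the classifier faces fewer and better-populated classes, while each remaining class still names a meaningful topic area.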


Document classification, EuroVoc, EUR-Lex, Legal domain, Word embeddings, Deep learning



The research leading to this work was partly funded by the Federal Ministry of Digital and Economic Affairs of the Republic of Austria and the Jubiläumsfonds der Stadt Wien. Gerhard Wohlgenannt’s work was supported by the Government of the Russian Federation (Grant 074-U01) through the ITMO Fellowship and Professorship Program.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Vienna University of Economics and Business, Vienna, Austria
  2. Complexity Science Hub, Vienna, Austria
  3. ITMO University, St. Petersburg, Russia
