Exploiting EuroVoc’s Hierarchical Structure for Classifying Legal Documents
Abstract
Multi-label document classification is a challenging problem because of the potentially huge number of classes. Furthermore, real-world datasets often exhibit a strongly varying number of labels per document, and a power-law distribution of those class labels. Multi-label classification of legal documents is additionally complicated by long document texts and domain-specific use of language. In this paper we use different approaches to compare the performance of text classification algorithms on existing datasets and corpora of legal documents, and contrast the results of our experiments with results on general-purpose multi-label text classification datasets. Moreover, for the EUR-Lex legal datasets, we show that exploiting the hierarchy of the EuroVoc thesaurus helps to improve classification performance by reducing the number of potential classes while retaining the informative value of the classification itself.
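The idea of reducing the label space through the thesaurus hierarchy can be illustrated with a minimal sketch. It is not the implementation used in the paper: the toy hierarchy, the label names, and the helpers `path_to_root` and `lift_labels` are hypothetical, and the sketch assumes, for simplicity, that every label has at most one parent. Each fine-grained label is replaced by its ancestor at a chosen depth, so several specific labels collapse into one broader class.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical toy hierarchy (child -> parent); None marks a top-level node.
# These are illustrative names, not actual EuroVoc descriptors.
PARENT: Dict[str, Optional[str]] = {
    "agriculture": None,
    "farming systems": "agriculture",
    "organic farming": "farming systems",
    "trade": None,
    "tariff policy": "trade",
    "customs duties": "tariff policy",
}


def path_to_root(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of labels from the top-level node down to `label`."""
    path = [label]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]  # top-level node first


def lift_labels(
    labels: Iterable[str],
    parent: Dict[str, Optional[str]],
    depth: int,
) -> Set[str]:
    """Map each label to its ancestor at `depth` (0 = top level).

    Labels that already sit at or above `depth` are kept unchanged.
    """
    lifted = set()
    for label in labels:
        path = path_to_root(label, parent)
        lifted.add(path[min(depth, len(path) - 1)])
    return lifted


if __name__ == "__main__":
    doc_labels = {"organic farming", "customs duties"}
    for depth in (0, 1, 2):
        print(depth, sorted(lift_labels(doc_labels, PARENT, depth)))
```

Choosing a smaller `depth` yields fewer, broader classes; a larger `depth` keeps more of the original granularity.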
Keywords
Document classification, EuroVoc, Eur-Lex, Legal domain, Word embeddings, Deep learning
Notes
Acknowledgment
The research leading to this work was partly funded by the Federal Ministry of Digital and Economic Affairs of the Republic of Austria and the Jubiläumsfonds der Stadt Wien. Gerhard Wohlgenannt’s work was supported by the Government of the Russian Federation (Grant 074-U01) through the ITMO Fellowship and Professorship Program.