Artificial Intelligence and Law, Volume 27, Issue 2, pp 199–225

Unsupervised and supervised text similarity systems for automated identification of national implementing measures of European directives

  • Rohan Nanda
  • Giovanni Siragusa
  • Luigi Di Caro
  • Guido Boella
  • Lorenzo Grossio
  • Marco Gerbaudo
  • Francesco Costamagna


The automated identification of national implementing measures (NIMs) of European directives by text similarity techniques has shown promising preliminary results. Previous work proposed unsupervised lexical and semantic similarity techniques based on vector space models, latent semantic analysis and topic models. However, these techniques were evaluated only on a small multilingual corpus of directives and NIMs. In this paper, we use word and paragraph embedding models, learned by shallow neural networks from a multilingual legal corpus of European directives and national legislation (from Ireland, Luxembourg and Italy), to develop unsupervised semantic similarity systems that identify transpositions. We evaluate these models and compare their results with the previous unsupervised methods on a multilingual test corpus of 43 directives and their corresponding NIMs. We also develop supervised machine learning models to identify transpositions and compare their performance across different feature sets.
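To make the idea of embedding-based transposition detection concrete, the following is a minimal sketch and not the authors' implementation. It assumes mean-pooled word vectors compared by cosine similarity, with a similarity threshold deciding whether a national provision counts as a candidate match for a directive article. The toy vectors, the function names (`embed`, `cosine`, `best_match`) and the threshold value are all hypothetical; in practice the vectors would come from an embedding model trained on a legal corpus.

```python
import math

# Toy 3-dimensional word vectors; purely illustrative, not trained on any legal corpus.
TOY_VECTORS = {
    "waste": [0.9, 0.1, 0.0],
    "disposal": [0.8, 0.2, 0.1],
    "recycling": [0.7, 0.3, 0.0],
    "tax": [0.0, 0.9, 0.2],
    "income": [0.1, 0.8, 0.3],
}


def embed(text, vectors=TOY_VECTORS):
    """Average the vectors of known words; return None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(directive_article, nim_provisions, threshold=0.8):
    """Return (index, score) of the most similar provision, or None below threshold."""
    query = embed(directive_article)
    if query is None:
        return None
    best = None
    for i, provision in enumerate(nim_provisions):
        vec = embed(provision)
        if vec is None:
            continue
        score = cosine(query, vec)
        if best is None or score > best[1]:
            best = (i, score)
    # The threshold is an assumed hyperparameter, not a value from the paper.
    return best if best and best[1] >= threshold else None
```

Calling `best_match("waste disposal", ["income tax", "recycling waste"])` selects the second provision, since its averaged vector points in nearly the same direction as the query. A supervised variant could feed such similarity scores to a classifier as features instead of applying a fixed threshold.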


Text similarity, Transposition, Machine learning



The research presented in this paper was conducted as part of Ph.D. research at the University of Turin, within the Erasmus Mundus Joint International Doctoral (Ph.D.) programme in Law, Science and Technology. This work has been partially supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant agreement no. 690974 for the project “MIREL: MIning and REasoning with Legal texts”.



Copyright information

© Springer Nature B.V. 2018

Authors and Affiliations

  1. Department of Computer Science, University of Turin, Turin, Italy
  2. Department of Law, University of Turin, Turin, Italy
