Unsupervised and supervised text similarity systems for automated identification of national implementing measures of European directives


The automated identification of national implementing measures (NIMs) of European directives by text similarity techniques has shown promising preliminary results. Previous work proposed and applied unsupervised lexical and semantic similarity techniques based on vector space models, latent semantic analysis and topic models. However, these techniques were evaluated only on a small multilingual corpus of directives and NIMs. In this paper, we use word and paragraph embedding models, learned by shallow neural networks from a multilingual legal corpus of European directives and national legislation (from Ireland, Luxembourg and Italy), to develop unsupervised semantic similarity systems that identify transpositions. We evaluate these models and compare their results with the previous unsupervised methods on a multilingual test corpus of 43 directives and their corresponding NIMs. We also develop supervised machine learning models to identify transpositions and compare their performance across different feature sets.

  4. The output vector is computed by multiplying the embedding vector by the hidden layer.
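The note above can be illustrated with a minimal sketch. It assumes a word2vec-style shallow network and reads "the hidden layer" as the hidden-to-output weight matrix; that reading is an assumption, since the surrounding text is not available here. The names `forward`, `EMBEDDINGS` and `OUTPUT_WEIGHTS` and all numeric values are hypothetical toy choices, not taken from the paper.

```python
import math

# Toy model: vocabulary of 3 words, embedding dimension 2 (hypothetical values).
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # one row per input word
OUTPUT_WEIGHTS = [[0.5, 0.1], [0.2, 0.9], [2.0, 0.0]]  # one row per output word


def forward(word_index, embeddings, output_weights):
    """Return a probability for every vocabulary word given one input word."""
    # Looking up the embedding row gives the hidden-layer activation.
    hidden = embeddings[word_index]
    # Output vector: dot product of the embedding with each output weight row.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    # Softmax turns the raw scores into probabilities (shifted for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = forward(0, EMBEDDINGS, OUTPUT_WEIGHTS)
```

In this toy setup the probabilities sum to one and the highest one goes to the output word whose weight row aligns best with the input embedding. Trained models learn both matrices from a corpus rather than fixing them by hand.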




The research presented in this paper was conducted as Ph.D. research at the University of Turin, within the Erasmus Mundus Joint International Doctoral (Ph.D.) programme in Law, Science and Technology. This work has been partially supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant agreement no. 690974 for the project “MIREL: MIning and REasoning with Legal texts”.

Author information

Corresponding author

Correspondence to Rohan Nanda.

Cite this article

Nanda, R., Siragusa, G., Di Caro, L. et al. Unsupervised and supervised text similarity systems for automated identification of national implementing measures of European directives. Artif Intell Law 27, 199–225 (2019).
