1 Introduction

Cultural and linguistic diversity is a guiding principle of the European Union. In the context of an ever-expanding Union, the concept of multilingualism stands out as one of the most prominent symbols of European historical, political, social and cultural diversity.Footnote 1 From a legal perspective, the EU commitment to multilingualism is significant as a guarantee of legal certainty, egalitarianism, clarity, transparency and democratic accountability. Accordingly, for instance, EU legislation is generally published in all official languages.Footnote 2

While multilingualism is undoubtedly a prized part of European cultural heritage and has important benefits, it also comes with costs and challenges. To illustrate, the existence of multiple official versions of every legal act, all of which are equally authentic,Footnote 3 inevitably creates interpretative difficulties (Pozzo 2012; Whittaker 2000). In this regard, the Court of Justice has repeatedly found that the wording used in one language version of a Union provision cannot serve as the only basis for its reading, and has pointed to the role of cross-language comparisons as well as of teleological and systematic methods of interpretation.Footnote 4 From a more practical perspective, the EU commitment to multilingualism requires the employment of numerous translators, working in 24 official languages. Nevertheless, the increasing sophistication of the IT applications that translators rely upon significantly facilitates their work. This includes machine translation software, whose potential is widely recognized (European Parliament 2017).

To respond to the needs of diverse populations, applications used in multicultural societies have to support the simultaneous recognition of mixed-language input and the generation of output in the appropriate language. In the last decade, the performance of spoken language understanding systems has markedly improved, including speech recognition, dialog systems, speech summarization, and text and speech translation. In the last year, large language models such as ChatGPT or GPT-4 have taken chatbots and natural language generation and understanding into a new dimension (Wei et al. 2022). Building powerful multilingual tools can have a dramatic impact on civil society, providing valuable instruments to support consumers and, more generally, citizens. However, a crucial obstacle to the widespread development of multilingual technologies in the legal domain is the lack of data resources and the language-expertise bottleneck. The need to foster the creation of new multilingual approaches, algorithms, data sets and resources is not new, as pointed out in a 2017 study commissioned by the European Parliament’s Science and Technology Options Assessment Committee (Rivera Pastor et al. 2017). The study highlighted the importance for the EU of initiating a new, large-scale European Language Technology programme, called the Human Language Project. The initiative was envisioned as a long-term European collaborative programme between research, innovation, industry, academia, administrations and citizens, with the goal of achieving the next scientific breakthroughs in the automatic processing and generation of written and spoken natural language.

Finally, the mainstreaming of e-commerce has created the need to serve consumers in a variety of different languages. The EU has traditionally refrained from regulating the language aspects of consumer transactions, leaving this matter to national authorities (Loos 2017).Footnote 5 On the one hand, language requirements (e.g., regarding standard terms) give rise to additional costs for cross-border traders, especially small enterprises. On the other hand, they might be justified for consumer protection reasons, such as supporting consumers’ decision-making. The availability of information in different languages also facilitates market monitoring, since consumer protection authorities and non-governmental organisations in Europe tend to operate in their respective languages. Ultimately, as part of its recent wave of platform regulation, the EU introduced language requirements targeting selected segments of digital consumer markets. Specifically, an obligation to publish terms and conditions in different language versions has been imposed on leading online platforms with broad customer bases.Footnote 6

In view of the above, effective consumer protection technologies must be capable of dealing with multilingual landscapes.

In this work, we focus on the automated detection of unfair clauses in Terms of Service (ToS), which are increasingly available in multiple languages. In particular, we investigate whether it is necessary to build novel corpora and train independent models for each and every language, or whether it is possible to rely on methods that automatically translate documents or transfer annotations onto the corresponding versions of the same documents in a target language. The latter problem, that of projecting (i.e., transferring) tags or labels across documents in different languages via sentence similarity and alignment, has recently received growing attention in the NLP community (Eger et al. 2018; Rocha et al. 2018; Galassi et al. 2020).

This paper builds upon and significantly extends the results presented by Drazewski et al. (2021) in the context of the CLAUDETTE project.Footnote 7 In particular, the contributions of this study are the following: (1) we present an extension of the multilingual parallel corpus, which now consists of 50 contracts annotated in English, Italian, German and Polish; (2) we describe an extensive experimental comparison across machine learning predictors trained either on original or projected annotations, and either on original or translated documents, to assess whether the projection or translation procedures can substitute the time-consuming task of document annotation performed by domain experts; (3) we conduct an in-depth error analysis for each experimental scenario; (4) we make both the novel corpus and our code freely available to the community for research purposes. Our findings indicate that a system trained on English documents, relying on automatic translation of query documents at prediction time, shows no performance degradation with respect to creating a novel annotated corpus for each target language.

The paper is organized as follows. Section 2 provides an overview of related work. Section 3 describes the multilingual corpus we created and the similarities and discrepancies in the analysed ToS. Section 4 introduces the adopted methodology, while the experimental setting and the results are reported in Sect. 5. The discussion of the results is presented in Sect. 6. Finally, Sect. 7 concludes.

2 Related work

The challenge of multilingualism in the analysis of legal documents has recently gained considerable attention, both in the artificial intelligence and law community and in natural language processing.

As a major example, the work by Chalkidis et al. (2021) proposes MULTI-EURLEX, a multilingual dataset for topic classification of legal documents composed of 23k EU laws, together with their official translations into 23 languages. The paper addresses the task of cross-lingual transfer with pre-trained language models, studying many different training settings. The authors observe that multilingual models obtain scores that are comparable, albeit inferior, to those obtained by monolingual models when tested on the same language they are fine-tuned on. In the zero-shot setting, when models are fine-tuned and tested on different languages, they observe a drastic decline in performance, which can be partially mitigated by applying various adaptation strategies. Recently, the growing interest in multilingual legal language models has led to the creation of two collaborative benchmarks that include several legal tasks: LegalBench (Guha et al. 2023) and LEXTREME (Niklaus et al. 2023).

Galassi et al. (2020) study different methods of annotation projection from English to German ToS, finding that the best result is obtained using a combination of neural embeddings and Dynamic Time Warping (Sakoe 1971). In the same direction, Drazewski et al. (2021) propose a novel corpus of 25 ToS in four languages—English, German, Italian, and Polish—which was used as a starting point for the dataset developed in this paper. The dataset of Drazewski et al. (2021) was also included in LEXTREME (Niklaus et al. 2023).

Similarly, Isbister et al. (2021) study whether it is more desirable to create novel monolingual language models, especially for low-resource languages, or to instead rely on machine translation and use models already available for English. This is one of the scenarios that we consider in our experiments. They conduct a case study on Scandinavian languages (focusing on sentiment analysis, not on the legal domain) and find that in most cases the use of machine translation leads to better results.

Other recent approaches have proposed alternatives to annotation projection to enable the application of machine learning to low-resource languages. One example is direct transfer (Zhang et al. 2016), which employs features shared across languages (e.g., multilingual embeddings) so that the trained model can be directly used on the test data, without the need to produce parallel corpora or project annotations. Yet, this approach has shown slightly worse performance than projection (Eger et al. 2018). In other cases, weak supervision has been proposed (Cotterell and Heigold 2017; Kim et al. 2017) to exploit the few labeled documents available in the target language. Another recent idea is that of learning an alignment between word embeddings in different languages (Xu et al. 2018; Lample et al. 2018), so that such a mapping function can be exploited to transfer features from one language to another.

3 The source corpus

Our starting point is the multilingual parallel corpus produced by Drazewski et al. (2021), consisting of 25 Terms of Service annotated in English, Italian, German and Polish. These languages are spoken in large EU countries as well as in different regions, and they were selected based on the availability of mother-tongue legal experts for the annotation task. The existing annotations identify nine different categories of clause unfairness: (1) establishing jurisdiction for disputes in a country different from the consumer’s residence (<j>); (2) choice of a foreign law governing the contract (<law>); (3) limitation of liability (<ltd>); (4) the provider’s right to unilaterally terminate the contract/access to the service (<ter>); (5) the provider’s right to unilaterally modify the contract/the service (<ch>); (6) requiring a consumer to undertake arbitration before court proceedings can commence (<a>); (7) the provider retaining the right to unilaterally remove consumer content from the service, including in-app purchases (<cr>); (8) having a consumer accept the agreement simply by using the service, not only without reading it, but even without having to click on “I agree/I accept” (<use>); (9) stating that the scope of consent granted to the ToS also extends to the privacy policy, which forms part of the “General Agreement” (<pinc>). In the annotations, a numeric value was appended to each XML tag to indicate the degree of unfairness, with a value of 1 meaning clearly fair, 2 potentially unfair, and 3 clearly unfair.

We doubled the size of the dataset, which now includes 50 annotated ToS for each language, i.e., 200 ToS in total. This corpus size is quite common in the AI & Law domain (see, e.g., the LexGLUE corpus (Chalkidis et al. 2020)).Footnote 8 The new documents were retrieved from the pre-existing CLAUDETTE corpus, covering 142 English ToS (Lippi et al. 2019; Ruggeri et al. 2021; Jabłonowska et al. 2021). Such terms mainly concern popular digital services provided to consumers, including leading online platforms (such as search engines and social media). The predominant drafting language of these ToS is English, with differing availability of corresponding ToS in other languages. The annotation was performed by the same experts who annotated the CLAUDETTE corpus.Footnote 9

To carry out the present study, the final 50 ToSFootnote 10 were selected on the basis of three main criteria: (i) their availability in the four selected languages; (ii) the possibility of identifying a correspondence between the different versions, given their publication dates; and (iii) the similarity of their structure (e.g., number of clauses, sections, etc.). To illustrate, while German versions were identified for 88 out of the 142 ToS contained in the pre-existing CLAUDETTE training corpus, Italian and Polish versions were found for 79 and 55 of these 142 ToS, respectively. Out of the 55 ToS available in all four languages, we selected those with the most closely corresponding versions based on criteria (ii) and (iii) above. Perfect correspondence across the four languages, however, could not be achieved for all 50 ToS. As further discussed in Appendix A, some relevant discrepancies persist. Table 1 shows some statistics on the corpus, i.e., the number of annotated clauses for each tag, across the four different languages. The corpus is made available for reproducibility and research purposes, together with the code to reproduce our computational results.Footnote 11

Table 1 Corpus statistics: we report the number of annotated clauses for each tag, across the four different languages

As a further level of analysis, we also studied the similarities and discrepancies between the different versions of the same document across the four languages. Our analysis revealed a strong similarity between the English ToS and the other language versions where the latter are translations of the former. We infer from the wording of the ToS that in at least 68 out of 150 cases, the German, Italian, and Polish documents were indeed translations of the original English version, as detailed in Table 7 in Appendix A. In 40 out of 50 documents we identified clauses referring to the language of the terms, explicitly stating that: (i) in case of conflicts between translated versions and the English version, the latter shall prevail; or (ii) it is possible to access the contract in different languages and, whenever a given language is not available, the provider will default to the English version. Further details and examples are reported in Appendix A.

Furthermore, we identified six sources of discrepancies across language versions: (i) asymmetric length of documents; (ii) sentence structure and segmentation; (iii) missing/extra clauses; (iv) country-specific clauses; (v) translation inaccuracy; (vi) legal concepts and terminology. As a general remark, it is important to note that deviations from the English source ToS are uneven across languages, and are largest in German ToS. This may suggest that such ToS are drafted by human agents, who pay more attention to the national legal context and its specific terminology than automated translators do. For example, note the markedly different take on the matter of privacy and data protection in these clauses of Spotify, where the German drafters refrained from packaging data protection consent with the agreement to the ToS:

<pinc2>Your agreement with us includes these Terms and Conditions of Use (“Terms”) and our Privacy Policy. </pinc2> <pinc2>(The Terms, Privacy Policy, and any additional terms that you agree to, as discussed in the Entire Agreement section, are referred to together as the “Agreements”.)</pinc2> (line 37)

Ihre Vereinbarung mit uns schließt diese Geschäfts- und Nutzungsbedingungen (“Bedingungen”) ein sowie jegliche weitere Vereinbarung, der Sie zustimmen, wie im Abschnitt Vollständiger Vertrag beschrieben (gemeinsam als die “Vereinbarungen” bezeichnet). (line 37) [English translation: Your agreement with us includes these Terms of Business and Use (“Terms”) as well as any further agreement that you accept, as described in the Entire Agreement section (referred to together as the “Agreements”).]

The hypothesis that German-language documents are drafted more carefully seems to be confirmed by the fact that retrieving identical corresponding versions of ToS was most difficult for German. Moreover, as shown in Table 7, the number of documents containing clauses that are specific to the country addressed by the ToS (CSC), and which are missing in other languages, is highest for German versions. Conversely, we observed a lower mismatch in both Polish and Italian ToS, where significant structural differences were found only in limited cases.

4 Methodology

As in the original CLAUDETTE system (Lippi et al. 2019), we consider the binary classification task of detecting potentially unfair clauses in online ToS: the positive class consists of all the sentences that are annotated as unfair or potentially unfair, whereas all the remaining sentences make up the negative class. Any NLP approach can be exploited to address such a task. In this paper, we are not interested in finding the best possible classifier, although we tested a few alternatives, as discussed in Sect. 5.Footnote 12 Our aim is instead to identify the best strategy to extend the classifier to other languages.
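To make the task definition concrete, the following minimal sketch maps annotated sentences to binary labels. The tag scheme and unfairness values follow Sect. 3, while the assumption that each sentence is stored as a string with inline XML-like tags is ours, made purely for illustration.

```python
# A minimal sketch: mapping tagged sentences to the binary task.
# Assumption: each sentence is a string carrying inline tags such as
# <ltd2>...</ltd2>; values 2 (potentially unfair) and 3 (clearly unfair)
# form the positive class, value 1 (clearly fair) the negative one.
import re

TAGS = ("j", "law", "ltd", "ter", "ch", "a", "cr", "use", "pinc")
TAG_RE = re.compile(r"<(%s)([123])>" % "|".join(TAGS))

def binary_label(annotated_sentence: str) -> int:
    """Return 1 if any tag marks the sentence as (potentially) unfair."""
    return int(any(int(value) >= 2
                   for _, value in TAG_RE.findall(annotated_sentence)))

assert binary_label("<ltd2>We are not liable for any damage.</ltd2>") == 1
assert binary_label("<ltd1>We are liable as required by law.</ltd1>") == 0
assert binary_label("You can contact us by email.") == 0
```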

More specifically, given a machine learning system for the detection of potentially unfair clauses trained on English documents (such as CLAUDETTE), the problem we address is that of building the same kind of system for other languages. We focus on other European languages, since the classifier is based on European consumer law, but the methodology we exploit is general.

Given (i) an annotated corpus of N sentences for the English language \(\mathcal {D}_E = \{ (x^E_i, y^E_i) \}_{i=1}^N\), where each sentence \(x^E_i\) is labeled with a binary label \(y^E_i\), and (ii) a machine learning system \(\mathcal {M}_E\) trained on that corpus, the goal is to build a system that can classify sentences in a target language T. We consider the following four alternatives, whose workflow is depicted in Fig. 1.

(1) Novel corpus for target language. In a first scenario, for any given target language T, domain experts are required to annotate a novel corpus of M sentences \(\mathcal {D}_T = \{ (x^T_j, y^T_j) \}_{j=1}^M\), in order to re-train from scratch a machine learning system \(\mathcal {M}_T\) for that specific language. This solution is completely independent of the English version of the system, as it exploits neither \(\mathcal {D}_E\) nor \(\mathcal {M}_E\). The process is illustrated in the top-left corner of Fig. 1.

(2) Annotation projection onto target language. This approach also requires re-training a new system \(\mathcal {M}_T\) for each target language, but in this case the idea is to exploit the annotations of the original English corpus. In particular, this approach requires collecting the same contracts across languages and performing a projection of the annotations from the English version of each document onto the corresponding document in the target language. For this step, we rely on a previous study (Galassi et al. 2020) that employs sentence embeddings and dynamic time warping to match sentences across the same document in different languages. After having obtained the corpus \(\mathcal {D}_T = \{ (x^T_j, \tilde{y}^T_j) \}_{j=1}^M\) in the target language with the projected annotations \(\tilde{y}^T_j\), it is possible to re-train from scratch the machine learning system \(\mathcal {M}_T\) for the target language. The process is illustrated in the top-right corner of Fig. 1.
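To illustrate the projection step, the following is a minimal sketch of label transfer via sentence embeddings and dynamic time warping, in the spirit of Galassi et al. (2020). The cosine cost, the max-based label transfer, and the embedding function (a placeholder for any multilingual sentence encoder) are illustrative assumptions, not the exact configuration of that work.

```python
# A minimal sketch of annotation projection via dynamic time warping (DTW).
# `src_emb` and `tgt_emb` are assumed to be (n_sentences, dim) arrays
# produced by any multilingual sentence encoder.
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between two sets of sentence embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def dtw_align(src_emb: np.ndarray, tgt_emb: np.ndarray) -> list:
    """Return the DTW alignment path between source and target sentences."""
    dist = cosine_dist(src_emb, tgt_emb)
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def project_labels(src_labels, path, n_tgt: int) -> list:
    """Mark a target sentence positive if any aligned source sentence is."""
    tgt_labels = [0] * n_tgt
    for i, j in path:
        tgt_labels[j] = max(tgt_labels[j], src_labels[i])
    return tgt_labels
```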

(3) Training set translation to target language. In this scenario, the original documents of the English corpus are translated from English into the target language T. In so doing, the original annotations \(y^E_i\) can be directly attached to the translated sentences \(\tilde{x}^T_i\), thus obtaining a corpus \(\mathcal {D}_T = \{ (\tilde{x}^T_i, y^E_i) \}_{i=1}^N\) for T. In this way, the machine learning system \(\mathcal {M}_T\) can be re-trained from scratch, without the time-consuming activity of annotating a novel corpus. The process is illustrated in the bottom-left corner of Fig. 1. Automatic machine translation is used ex-ante only, for training corpus creation.

(4) Test set translation to English. This final approach does not require re-training any novel machine learning system \(\mathcal {M}_T\), but simply relies on the translation of test documents into English. The translated test sentences \(\tilde{x}^E_k\) are then classified using the English version \(\mathcal {M}_E\) of the system, and the predictions are associated back to the original sentences in the target language T. The process is illustrated in the bottom-right corner of Fig. 1. Automatic machine translation is used ex-post only, for test queries at prediction time.
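As a concrete illustration of scenarios (3) and (4), the sketch below performs the ex-ante and ex-post translation steps with Opus-MT models served through the Hugging Face transformers library (one of the three translation tools compared in Sect. 5). The functions train_classifier and english_model are placeholders for the pipeline described there, and German is used as the example target language.

```python
# A minimal sketch of scenarios (3) and (4) using Opus-MT via transformers.
# `train_classifier` and `english_model` are placeholders for the SVM
# pipeline of Sect. 5; the experiments also use Google Translate and Joshua.
from transformers import MarianMTModel, MarianTokenizer

def translate_batch(sentences, model_name):
    """Translate a list of sentences with a pre-trained Opus-MT model."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(sentences, return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

# Scenario 3 (ex-ante): translate the English training sentences into the
# target language, keep the original labels, and re-train from scratch.
def scenario_3(en_sentences, en_labels, train_classifier):
    de_sentences = translate_batch(en_sentences, "Helsinki-NLP/opus-mt-en-de")
    return train_classifier(de_sentences, en_labels)

# Scenario 4 (ex-post): keep the English system and translate each query
# document into English at prediction time.
def scenario_4(de_query_sentences, english_model):
    en_queries = translate_batch(de_query_sentences,
                                 "Helsinki-NLP/opus-mt-de-en")
    return english_model.predict(en_queries)
```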

Fig. 1

The four alternatives tested in our approach, exemplified for the German language. From the top left corner, clockwise: (1) re-train from scratch a German version of CLAUDETTE, with an original German corpus; (2) project labels from English to German documents, and re-train a German CLAUDETTE; (3) translate documents from English to German, keep the original English annotations, and re-train a German CLAUDETTE; (4) use the English CLAUDETTE, and translate query documents from German

The first option seems to be the most natural scenario, and the one that should likely give the best performance overall. Nevertheless, we remark that such a statement has yet to be proven, since the same task can have different levels of complexity across languages, due to the nature of the task and to the specific NLP resources that can be exploited (Bender 2011; Mielke et al. 2019). This scenario is also quite costly in terms of the resources needed for the creation of a novel corpus for the target language. The second and third scenarios are approximations of the first one, since they both rely on a novel machine learning system trained for the target language: both cases introduce noise, either at the level of annotations, which are projected from English (second scenario), or in the training documents, which are the result of a translation process (third scenario). Therefore, in both cases we expect performance to be worse than in the first scenario. Finally, the fourth setting needs neither a novel corpus nor a new machine learning system, as it simply relies on machine translation at prediction time. Table 2 summarizes the steps and procedures needed in each of the considered scenarios.

Table 2 Summary of steps needed in the four considered scenarios

5 Evaluation

Our experimental setup is based on having an original training corpus for the English language. Following the methodology described in Sect. 4, our evaluation aims to address the following research questions:

(RQ1):

Does unfair clause detection for a new language perform better if the system is trained on a novel annotated corpus for that language, compared to using the English version of the system and relying on machine translation for the queries?

(RQ2):

Does the answer to RQ1 change with different machine translation systems having different quality levels?

(RQ3):

Does projecting labels from the original English corpus onto the documents in the target language significantly worsen performance with respect to building a corpus for each target language? Projection would make it possible to re-train a different system for each language, but without the need to annotate a novel corpus.

(RQ4):

If the original English training documents are translated into the target language while keeping the original annotations, does system performance degrade with respect to training on original documents in each target language?

We conducted computational experiments on three target languages: German, Italian and Polish. In order to address the research questions, we first needed to pick a machine learning technique to employ across the different scenarios. To this aim, we compared a few classifiers on the original corpus of each language, independently (i.e., the first scenario described in Sect. 4). To keep the computational burden of the experimental evaluation low, and to avoid approaches with some level of randomness (such as the initialization step in neural networks), we considered a very simple and easily reproducible setting. The approach consists of a linear Support Vector Machine (SVM) classifier (Schölkopf et al. 2002) that exploits a set of features describing each sentence: we compared a plain bag-of-words with sentence embeddings computed with ELMo (Peters et al. 2018) or BERT (Devlin et al. 2018). We did not consider fine-tuning the embeddings, since this is a highly time-consuming procedure that would be hard to employ for the whole experimental validation.

Table 3 Preliminary results that aim to select the best set of features (embeddings or plain bag-of-words) to be used in the computational evaluation

We implemented all the classifiers in Python using the scikit-learn library, relying on the standard ELMo and BERT pre-trained models for the computation of embeddings.Footnote 13 In the bag-of-words representation, we used plain unigrams and bigrams, without TF-IDF weighting. As for the translation of documents, we compared three machine translation tools: Google Translate,Footnote 14 Opus-MT,Footnote 15 an open-source neural machine translation toolkit that is part of the Helsinki-NLP suite, and Apache JoshuaFootnote 16 as a representative of older (and thus lower-quality) statistical machine translation tools. For all our computational experiments, we used 5-fold cross-validation at the document level and we report the macro-average over the 5 folds for precision, recall, and \(F_1\). To assess the statistical significance of the results, we conducted a paired t-test on the \(F_1\) scores obtained over the 5 folds.
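A minimal sketch of this evaluation protocol follows, assuming sentences, binary labels, and document identifiers as inputs. Using GroupKFold to keep all sentences of a ToS in the same fold, and the specific zero_division setting, are our assumptions for illustration; hyperparameters are left at their defaults.

```python
# A minimal sketch of the evaluation protocol: linear SVM over unigram and
# bigram counts (no TF-IDF), 5-fold cross-validation at the document level,
# macro-averaged F1 per fold, and a paired t-test between two scenarios.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def cross_validate_f1(sentences, labels, doc_ids):
    """Per-fold macro F1, with all sentences of a document in one fold."""
    labels = np.asarray(labels)
    f1_per_fold = []
    for train_idx, test_idx in GroupKFold(n_splits=5).split(
            sentences, labels, groups=doc_ids):
        clf = make_pipeline(
            CountVectorizer(ngram_range=(1, 2)),  # plain unigrams + bigrams
            LinearSVC())
        clf.fit([sentences[i] for i in train_idx], labels[train_idx])
        preds = clf.predict([sentences[i] for i in test_idx])
        _, _, f1, _ = precision_recall_fscore_support(
            labels[test_idx], preds, average="macro", zero_division=0)
        f1_per_fold.append(f1)
    return np.array(f1_per_fold)

# Paired t-test on the per-fold F1 of two scenarios (e.g., scenario 1 vs. 4):
# stat, p_value = ttest_rel(f1_scenario_1, f1_scenario_4)
```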

Our preliminary results for the choice of the best classifier are reported in Table 3. For this classification task, the bag-of-words representation turns out to be the best set of features. This is not a surprising result, since this kind of classifier also achieved the best performance among several competitors in the original CLAUDETTE system (Lippi et al. 2019). The main reason is that lexical and syntactic information is crucial for the detection of potentially unfair clauses in contracts.

After choosing the SVM with the bag-of-words representation as our reference classifier, we address our four research questions. Tables 4, 5 and 6 report the results obtained on German, Italian and Polish, respectively.

Regarding RQ1 and RQ2, the results across the three languages consistently indicate that the fourth scenario, that of keeping only the English version of the machine learning system and relying on the translation of the queries at test time, does not perform worse than the first scenario, in which novel corpora are used for each language. For the German language with Google Translate, and for both German and Italian with Opus-MT, the improvement in \(F_1\) score for scenario 4 is even statistically significant (p value<0.05). Note that scenario 4 is the best solution only when the translation system has a high quality (i.e., with Google Translate or Opus-MT). With lower-quality translations (i.e., with Apache Joshua) there is instead a significant drop in performance, and scenario 1 turns out to be the best choice according to the paired t-test (p value<0.05 for all three languages).

Concerning RQ3, all the tables clearly indicate that the results obtained by projecting the labels across languages (second scenario) are only slightly worse than those obtained with a novel corpus with original annotations (first scenario). In particular, the difference in terms of \(F_1\) is statistically significant (p value<0.05) only for the German language. For all the languages, the small decrease in performance is mostly due to precision rather than recall. This good performance is due to the very high reliability of the projection algorithm proposed by Galassi et al. (2020). In general, the solution of projecting annotations could be worthwhile in case of limited resources for the development of a novel corpus.

As for RQ4, performance is similar between methods 2 and 3, with no major differences across languages (no p value is smaller than 0.05). Therefore, we can conclude that projecting annotations while keeping the original documents of each language, and translating the training documents while keeping the original annotations, are both valuable alternatives that avoid the creation of novel corpora.

Table 4 Results comparison for the German language
Table 5 Results comparison for the Italian language
Table 6 Results comparison for the Polish language

Both the code and the corpus are freely available for research purposes.Footnote 17

6 Discussion

In the following, we discuss the results obtained under the four methods, providing a quantitative and qualitative analysis, so as to identify the most frequent types of errors.

6.1 Method 1: novel corpus for target language

As for the first method, we observe a very high percentage of false negatives (with respect to the total number of clearly and potentially unfair clauses in each category) concerning arbitration and privacy included clauses, for all the target languages. This is likely due to the lower number of examples of these categories in all the datasets (see Table 1). Conversely, regarding false positives,Footnote 18 the highest number of errors pertains to fair clauses belonging to the categories of limitation of liability, jurisdiction and applicable law: this suggests that the system tends to over-predict these categories of (potentially) unlawful clauses. Note that, for the last two categories, the highest percentage of misclassifications is related to documents containing multiple country clauses (MCC) (see Table 7). From a qualitative perspective, a large group of false positives could be linked to textual indicators and word patterns that are typically symptomatic of unfairness (Lippi et al. 2019). In the target languages, such indicators often appear in different contexts, so that the concerned clauses cannot be classified as (potentially) unfair. This is the case for expressions such as “reserves the (right to)”, “at any time” and “to the maximum extent permitted by law”, which in Polish, German and Italian feature several times among the false positives. The first two expressions are usually linked to termination and content removal clauses, while the latter often concerns liability.

Moreover, some errors seem to be due to variations in the subjects allowed to take certain actions. In particular, this relates to actions performed by users that would be considered unfair if executed by service providers. For example, clauses stating that users can delete uploaded content at any time were found several times among the false positive misclassifications. The same is true for the related provisions on contract termination. Such clauses have not been marked as (potentially) unlawful, since their unfairness only concerns traders’ actions. In some cases the system nonetheless classified them as such, albeit more frequently in Polish than in German and Italian. Detailed examples of errors are reported in Appendix B.1.

6.2 Method 2: annotation projection onto target language

Similarly to what has been observed under the first method, the highest percentage of false negatives concerns arbitration and privacy included clauses, for all the target languages. However, for Italian and Polish documents, a relevant percentage of errors has also been found with regard to termination clauses. We argue this is likely due to noise in the annotation projection for this category, which is one of the largest in the training set. As regards false positives, in line with the observations above, the largest number of errors concerns liability, jurisdiction and applicable law clauses, in particular with regard to documents containing MCC (see Table 7). This is true for all the target languages.

From a qualitative perspective, discrepancies across language versions may significantly affect the projection task and consequently the performance of the system. In particular, we found that erroneous results are often due to the following causes: (i) there is no correspondence between clauses (e.g., some of them are missing in one or more language versions); (ii) sentences are split differently; (iii) there is a mismatch in the ToS structure; and (iv) incorrect translations or specific linguistic choices can be identified.

Lack of correspondence between clauses. Sometimes errors can be linked to the lack of correspondence between clauses in the source and target languages, such as when certain sentences are only present in one language version. This is the case for ToS containing country-specific clauses, which typically, though not exclusively, concern (i) limitation of liability, (ii) applicable law, (iii) jurisdiction, (iv) agreement to the contract, (v) contract modification and (vi) agreement to the processing of personal data. German ToS appear to be particularly affected by this issue. For instance, country-specific terms on liability can be found, among others, in the AmazonFootnote 19 and Western Union ToS.Footnote 20 Similarly, country-specific terms on applicable law can be found, e.g., in the terms of Google Payment,Footnote 21 UberFootnote 22 and Groupon,Footnote 23 and are often accompanied by the corresponding terms on jurisdiction. Detailed examples of false positives and negatives are reported in Appendix B.2.

Different segmentation of sentences. The division of longer passages into shorter sentences has been identified as a recurrent cause of false positives and false negatives. Usually, the meaning of a clause in the source and target languages remains unaltered, but the corresponding information is split differently. In particular, we identified the following cases: (i) a long sentence is split into two or more sentences, all of which need to be annotated (hence, the tag is simply reproduced); (ii) tags are nested in one language version and split in another version; (iii) only some of the sentences in the target language are relevant and should be annotated. Detailed examples are reported in Appendix B.2. A more sophisticated projection methodology could be designed to overcome this issue, for example by allowing the projection of a single tag across multiple consecutive sentences.

Mismatch in the ToS structure. In some cases the mismatch between the ToS structure in the source and target languages can be so great that similarities between individual sentences are particularly difficult to establish, thus causing a relevant number of errors. The ToS included in our corpus usually consist of several distinct terms. While the general ones remain broadly comparable across languages, in some cases the number of discrepancies is significant, in particular with regard to (i) the division of sections and (ii) the content of clauses, which differ depending on the language version. Some clauses are only present in one version, while others are reported in different sections. Detailed examples are reported in Appendix B.2.

Incorrect translation or linguistic choices. Translation errors and inappropriate linguistic choices may cause the incorrect projection of labels, thus affecting the performance of the system. Translation inaccuracies and inappropriate linguistic choices may be due to: (i) cultural differences; (ii) lack of context; or (iii) grammar and syntax errors. Detailed examples are reported in Appendix B.2.

6.3 Method 3: training set translation to target language

Concerning false negatives, the largest percentage of misclassifications concerns arbitration and contract by using clauses in all the target languages, as well as privacy included clauses for the Polish and Italian datasets, and applicable law clauses for German documents. Compared to method 1, the correct classification of privacy included clauses actually slightly improved for German and Italian, while it significantly worsened for Polish.

Similarly to what we observed under method 1, a relevant group of false positives is linked to MCC documents (see Table 7), in relation to law and jurisdiction clauses, as well as to textual indicators and word patterns which are typically symptomatic of unfairness (Lippi et al. 2019). In the target languages, such indicators often appear in different contexts, so that the concerned clauses cannot be classified as (potentially) unfair. As noted above, this is the case of expressions such as “reserves the (right to)”, “no liability for” and “you agree”, which in Polish, German and Italian feature several times among false positives. Moreover, in German recurrent expressions include “in our discretion” and “from time to time”; in Italian “third parties”; in Polish “at any time”.

Misclassifications could also be linked to inaccuracies in the automated translation. While we observe that its quality is generally high, this additional source of errors cannot be ruled out. In particular, the inaccurate translation of domain-specific terminology and the general complexity of the source text seem to be the main causes of such classification errors. An incorrect choice of terms, as well as grammatical and syntax errors, can be hard to avoid in those cases. A further challenge is posed by long or vague formulations in the source text, which are frequently found in terms and conditions. In order to be comprehensible, such sentences may need to be reformulated or specified in the target language, which automated translators can still find difficult to do. Detailed examples are reported in Appendix B.3.

6.4 Method 4: test set translation to English

A high percentage of false negatives concerns privacy included and arbitration clauses for all the target languages. This tendency holds for Google, Joshua and Opus-MT, although the absolute number of errors is notably higher for Joshua. By contrast, in the case of the contract by using category, the number of false negatives is relatively low in the Google scenario, even outperforming the results obtained under the first method. At the same time, this category emerged as problematic when relying on Joshua, where, for instance, the number of misclassifications in Polish and German was more than twice as high.

Clearly, the differences in the total number of errors can be connected to translation accuracy. In line with the results described in Sect. 5, the number of false negatives is significantly higher for Joshua than for Google and Opus-MT in all categories. The results of Opus-MT are generally comparable to, and for some categories even better than, those of Google when it comes to German and Italian. However, Google turned out to be preferable for Polish. The error analysis revealed a variety of translation mistakes, mostly affecting Joshua.

In particular, we identify the following types of errors: (i) incorrect choice of terms when the same word in the target language can have multiple meanings, (ii) grammatical and syntax errors, e.g., when there is no alignment between relevant nouns and pronouns, (iii) incomplete translations, e.g., when the predicate is entirely missing from the translated sentence. Detailed examples of these errors are reported in Appendix B.4.

Conversely, the number of false positives for the analyzed clause categories, i.e., limitation of liability, applicable law and competent jurisdiction, is higher for Google and Opus-MT than it is for Joshua. This tendency holds for all the target languages. Moreover, as far as the last two categories are concerned, a higher percentage of misclassifications is mostly related to MCC documents (see Table 7). This result is in line with what has been noted for all the other methods.

These results suggest that a difficulty exists in establishing a correspondence of meaning between concepts reflecting different legal, social, and cultural contexts. Given the domain-specificity of the legal language, the ability to guarantee a consistent horizontal equivalence strongly depends on the quality of translation. The lexical correspondence of two terms may satisfy neither the semantic correspondence of the concepts they denote, nor the requirements of the different legal systems (Ajani 2007; Tiscornia and Sagri 2012; Pozzo 2016).

7 Conclusions

In this paper we considered the problem of multilingualism in the context of unfair clause detection in online Terms of Service. In particular, we studied the problem both from a legal perspective and from a machine learning point of view. As for the former, we developed and analyzed a large corpus of 200 contracts (50 documents in each of four languages, namely English, German, Italian and Polish), highlighting correspondences and discrepancies between the different versions of the same contract. As for the latter, we compared four different approaches to the problem of developing clause detectors in different languages: (i) building independent corpora and systems; (ii) projecting annotations from a single, reference corpus; (iii) translating training documents while keeping the original annotations; (iv) using a single system for the English language while relying on machine translation at prediction time.

An extensive computational evaluation was performed to show the advantages and disadvantages of the different approaches. In particular, relying on machine translation at prediction time seems to be the best solution, but only if the quality of the translation system is adequate. Projecting annotations or translating training documents are also reasonable options, as they avoid the time- and resource-consuming procedure of building a novel corpus and system for each language of interest, while achieving only slightly worse performance.

In the future, we plan to also employ multilingual embeddings (Feng et al. 2022) to capture relationships and dependencies across different languages, and we aim to study the problem of attaching legal rationales as explanations for the unfairness of a clause, also in a multilingual setting (Ruggeri et al. 2021). Since the uneven distribution of the classes may negatively impact performance, we also want to explore the use of data augmentation (Perçin et al. 2022) to balance the dataset. From a legal perspective, we see a very promising line of research in applying this kind of methodology to other relevant problems in consumer protection, such as in the domain of privacy policies.