Introduction

The technology revolution of the 20th century significantly facilitated the development and growth of corpora and computer tools in the fields of descriptive and applied linguistics. Corpus linguistics has changed the way language is interpreted and has led to numerous empirical studies investigating every aspect of language (McCarthy & O'Keeffe, 2012: 4). For example, corpus pragmatics (Rühlemann, 2019) is one of the important fields contributing to the analysis of legal language and the construction of legal meaning, e.g. the analysis of speech acts (Austin, 1962; Searle, 1969, 1979) such as directives (Cao, 2007; Visconti, 2009; Solum, 2021) and the analysis of the modality of obligation, permission and prohibition (Biel, 2014; El-Farahaty & Elewa, 2020; Palmer, 2001).

It quickly became evident that corpus linguistics contributes to many fields, such as law, that depend heavily on language (Goźdź-Roszkowski, 2021:1515; Solan, 2017:1315). Corpora and corpus-based translation methodologies have likewise become increasingly integral to translation practice and research since the 1990s (see Baker et al., 1993; Baker, 1995, 2019). These empirical methodologies uncover translation norms, universals (Olohan, 2004) and characteristics, and support rigorous analyses of legal concepts and phraseology in original and translated texts. They focus on ‘the interplay of theoretical constructs and hypotheses, variety of data, novel descriptive categories and a rigorous, flexible methodology’ (Kruger, 2004:1), hence facilitating translation training and enhancing translation competencies. More recent research dedicated to the use of corpora in translation and language teaching has been published (see Zaki, 2020, 2021; Zaki et al., 2021, among others).

Researchers at the University of Leeds have been active in creating open-access Arabic corpora and promoting projects in Arabic corpus linguistics (see Alfaifi & Atwell, 2016; Sharoff, 2006). The need for advanced tools and morpho-syntactically annotated Arabic corpora for use in teaching and research is increasing (see section on Arabic Corpora: A Brief Review). Likewise, parallel corpora and specialised legal corpora are rare in spite of the growing need to use them in empirical linguistic investigations of authentic Arabic texts and in language and translation teaching. This paper therefore fills an existing gap in the field of Arabic corpus linguistics and legal translation. Building these corpora and sub-corpora is needed to address possible inconsistencies in the legal drafting and translation of specific linguistic and system-bound areas, and it will pave the way for more research in the field of Arabic legal translation, an area which is under-researched worldwide. Examples of research applications using the current corpora include a comparative analysis of performative verbs in the diachronic parallel corpus of Arabic constitutions. Other corpus-based pragmatic investigations include the analysis of directive speech acts and their translations in the parallel corpus of Arabic constitutions, and comparative diachronic investigations of deontic modals of obligation, permission and prohibition in the same parallel corpus.

This paper aims to discuss the building of a collection of specialised parallel and monolingual legal corpora. In particular, it will discuss the building of diachronic corpora, which include all available constitutions of 22 Arabic countries: Algeria, Bahrain, Comoros, Djibouti, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Oman, Palestine, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, the United Arab Emirates, and Yemen.

The paper answers the following overarching research question: What is the process of building diachronic corpora of Arabic countries' constitutions and what are the challenges of building these corpora? To answer this question, we demonstrate the process of building the specialised corpora for all available constitutions and discuss solutions and suggestions for the challenges encountered throughout this process. The paper discusses: (a) researching and collecting the extensive set of data; (b) pre-processing the data, i.e. cleaning the corpora, aligning them and semi-automatically verifying the articles using LF-aligner; and (c) uploading and organising the corpora on Sketch Engine and testing them. Sketch Engine provides a bespoke corpus management system and offers several useful functionalities for Arabic language data processing (Kilgarriff et al., 2004; Kilgarriff, 2014) (see Method). Below is a list of the corpora we aim to build:

  • The Leeds Parallel Corpus of Arabic Countries' Constitutions (LPCACC)

The final version of this corpus includes the constitutions of 20 Arabic countries and their English translations, from 1922 to 2022; Arabic version (407,633 words) and English version (489,448 words). The LPCACC will include sub-corpora for each country separately but it excludes Comoros and Somalia.

  • The Parallel Corpus of Preambles of Arabic Countries' Constitutions (PCPACC)

This version contains all available preambles of the Arabic countries’ constitutions (32,660 words).

  • Monolingual Corpora of Arabic and English Constitutions (MCAEC)

These corpora contain two separate versions of constitutions in Arabic (788,477 words) and in English (343,582 words).

  • The Leeds Monolingual Corpus of English Countries' Constitutions (LMCECC) (Comparative Corpus)

This corpus consists of the constitutions of eight countries whose official language is English, from 1985 to 2016. It consists of 677,056 words (for more information, see the sections on Method and Corpora).

The aims of building corpora of all available versions of Arabic constitutions and the comparative English corpus are to support (1) interdisciplinary corpus-based socio-cultural investigations, comparative Arabic-English legal translation and advanced comparative linguistic analysis (cf. Biel, 2014); and (2) research-led and blended-learning pedagogical approaches to translation teaching and learning. Building the corpora starts with collecting original and translated texts in their various forms (e.g. .txt files, PDFs and images), then storing, documenting and processing them, aligning them automatically or manually, and uploading them to the Sketch Engine platform (Kilgarriff et al., 2004; Kilgarriff, 2014).

The paper is structured as follows. Section two provides a review of the current literature on Arabic corpora and Arabic legal corpora. Section three presents the detailed methodology used to build the data set and prepare it before uploading and publishing it on Sketch Engine. It will give full details about the data set used in building the corpora and sub-corpora. The challenges encountered in building the data set and the tools used to solve them will also be discussed in this section. Finally, value propositions, as well as future projects, will be highlighted in section four.

Arabic Corpora: A Brief Review

Corpus and computational linguists across the world have been working over the past three decades on building Arabic corpora (Abbas & Smaili, 2005; Al-Sulaiti & Atwell, 2006; Brierley & El-Farahaty, 2019; El-Farahaty & Elewa, 2020; El-Haj & Koulali, 2013; Goweder & De Roeck, 2001), databases (Boudelaa & Marslen-Wilson, 2010; Khwaileh et al., 2018) and online interfaces (Dukes & Atwell, 2012; Sharoff, 2006), and on developing NLP tools (Habash, 2010; Al-Jawfi, 2009, among others). Although Arabic corpora are constantly increasing in number (Alfaifi & Atwell, 2016), Arabic is understudied by corpus-based methodologies relative to its demographic and societal relevance (McEnery et al., 2019:1). Creating Arabic corpora comes with challenges, such as creating NLP algorithms and techniques for Arabic and developing tools specific to the language (Al-Thubaity et al., 2013). Other challenges include the ambiguity and difficulty of Arabic, the use of several dialects of Arabic, each with distinct features, and the shortage of freely accessible databases that can be used in the study and development of Arabic information extraction and processing (Al-Thubaity et al., 2013). Some academics started constructing electronically searchable corpora of Arabic literature not long after such intrinsic problems in Arabic computing started to be overcome.

This section reviews the current literature on Arabic monolingual and parallel corpora, including those built by researchers and scholars at the University of Leeds. We then focus on Arabic legal corpora and conclude by highlighting the research gap that this paper addresses.

Arabic Monolingual and Parallel Corpora

Corpus linguists and researchers have built various Arabic corpora in different fields and genres for pedagogical or research purposes. For example, the Linguistic Data Consortium (LDC), founded in 1992 at the University of Pennsylvania, has produced many corpora in twenty languages, including Arabic. The LDC licenses many of the corpora in its catalogue individually, but users need a membership, or access through a library or university, to be able to use them (Cieri et al., 2022).

One of the first projects on monolingual Arabic corpora, which contain Arabic texts only, is the Al-Hayat newspaper corpus (Goweder & De Roeck, 2001). This accessible Arabic language corpus includes 18.5 million words. After that, other web-based corpora were built using specialised technology to collect linguistic content covering different genres and text types from the internet. For example, ArTenTen (Belinkov et al., 2013) is a general Arabic corpus from the family of corpora known as the TenTen corpora, which covers many languages such as English, Japanese, Russian, and Chinese, among others. The ArTenTen corpus, used as a reference corpus, contains more than 10 billion words in Arabic and is accessible via Sketch Engine.

Researchers have worked over the past decade to create Arabic corpora (Alfaifi & Atwell, 2016; Zeroual & Lakhouaja, 2018). Several Arabic corpora were mainly derived from newspapers and designed primarily for researchers' own projects but could not be accessed online (Al-Thubaity et al., 2013). However, between 2005 and 2013, many Arabic corpora were created and made available on an online platform or for download, which led to a shift in the status of Arabic corpus linguistics. Examples include the International Corpus of Arabic (ICA) (Alansary & Nagi, 2014), the King Saud University Corpus of Classical Arabic (KSUCCA) (Alrabiah et al., 2013) and arabiCorpus (Parkinson, 2012). Significant efforts have been made at the University of Leeds to establish different types of Arabic corpora, including the Quranic Arabic Corpus (Dukes et al., 2013) and the Corpus of Contemporary Arabic (CCA) for teaching Arabic as a foreign language (Al-Sulaiti & Atwell, 2006), in addition to the web interface introduced by Sharoff (2006). It is also important to mention the work done by the Lancaster University Centre for Computer Corpus Research on Language (e.g. El-Haj et al., 2015; McEnery et al., 2019), which has significantly impacted the field. Zaghouani (2017) presented a critical survey of the freely available Arabic corpora, and more recently, Ahmed et al. (2022) found 48 free and accessible Arabic corpora by searching the most popular information technology (IT) resources (Footnote 1).

A significant project in Arabic parallel corpora is the English-Arabic Political Parallel Corpus (EAPPC) by Ahmad et al. (2017). It is specialised, since it focuses on contemporary political issues in Jordan, but it is not available to the public. As Alotaibi (2016) reported, there is a rise in interest among scholars in studying parallel corpora. For instance, the first general parallel corpus of English texts and their Arabic translations was published by the National Council for Culture, Arts and Letters (NCCAL) in Kuwait (Al-Ajmi, 2004). Other examples are the open parallel corpus known as OPUS (Tiedemann, 2012) and the Arabic/English Parallel Corpus (AEPC) by Alotaibi (2016). The AEPC is not available, and the website dedicated to it is not working.

Arabic Legal Corpora

There is a growing interest in building specialised corpora. However, because certain legal documents are private by their very nature or are produced inside institutional frameworks, legal materials can be difficult to access. Several studies survey current legal corpora (Goźdź-Roszkowski, 2021; Vogel et al., 2018), and the Sources of Language and Law platform (SOULL, 2020; Footnote 2) is an open online resource that is consistently updated to offer a wealth of knowledge on current data and corpora of legal language.

International organisations like the UN, the EU and the WTO have been responsible for the creation of publicly accessible legal corpora. Examples include the United Nations Parallel Corpus, which is significant and vital for Arabic legal translation research and pedagogy (Ziemski et al., 2016). The corpus, however, is distributed as separate files, requiring a well-trained user who knows how to parse the files to get them into an accessible, easy-to-use version. The JRC-Acquis Parallel Corpus contains a collection of written legislative texts in 23 official languages of European Union countries (Steinberger et al., 2006), and the Arab-Acquis, which contains over 600,000 words altogether, has been professionally translated from both English and French. In addition to these major corpora, the Digital Corpus of the European Parliament (DCEP) consists of different document types produced between 2001 and 2012 in various subject areas, with the majority of the material taken from the official website of the European Parliament (Hajlaoui et al., 2014).

Salhi (2013) built the English-Arabic Parallel Corpus of United Nations Texts (EAPCOUNT). It comprises 341 English-Arabic bitexts aligned at paragraph level, drawn from data (mostly resolutions and annual reports) covering the period between 1996 and 2009 and collected from different international organisations, mainly the UN and its agencies (e.g. UNICEF, IMF, UNESCO). Unfortunately, in spite of its importance for research and translation training, this source is not available online.

The Corpus of Arabic Legal Documents (CALD) (Müller, 2021), a database developed by the ERC-AdG project Islamic Law Materialised (ILM), contains a tool for studying and comparing Arabic legal documents in Islam from the 7th to the 16th centuries and from various regions of the Muslim world, aiming to facilitate the study of Islamic law from a historical perspective. It is also a handy interface and tool for teaching and researching Arabic/English legal translation.

The Women's Learning Partnership, a body of non-profit, non-governmental organisations, launched a database/Corpus of Laws in 2012. It is a freely available collection of constitutions, civil family laws, gender-based violence-related penal codes, and victim-protection legislation from countries around the globe. It is simply a large dataset of legal documents in this genre, and the files are available in Arabic, English and many other languages. The corpus is not in a parallel (i.e. aligned) format and does not provide online tools for teaching legal language or translation.

From this short review, it is clear that the variety of legal corpora is expanding due to the ongoing addition of new legal genres and legal languages. However, many of the corpora reviewed in this paper are not freely available. Those that are available are not easy to use and require specialist computational tools that may not be accessible to many users (e.g. translation tutors and PGR researchers). This review also revealed that there are no available legal corpora of constitutions. The current paper seeks to fill this research gap and responds to an important need in the field of Arabic/English corpus-based legal translation by building a series of open-access monolingual and parallel corpora of Arabic and English constitutions (see Introduction and Method).

Method

This research adopts an applied methodology to build Arabic legal corpora. This methodology enables researchers to benefit from the procedures and tools used in building the corpora in future projects and research in many fields. In the following, we present the stages of building legal corpora in detail and the types of corpora that we have compiled. Finally, we discuss the main research question of methods, challenges, and solutions.

Corpus Building Procedures

In developing the Arabic constitutions corpora, we collected all available versions of the constitutions from 22 Arabic countries: Algeria, Bahrain, Comoros, Djibouti, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Oman, Palestine, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, the United Arab Emirates, and Yemen. The oldest version of the constitution dates back to 1923 (in Egypt), and the most recent version dates back to 2021 (in Oman).

Corpora building is divided into three main steps: (i) data collection, (ii) pre-processing the set of data, and (iii) corpora alignment. The following subsections discuss these steps in more detail.

(A) Data Collection


We collected the original Arabic versions of the constitutions, along with the English translations of those versions, from Arab governments' official websites as well as websites specialised in publishing state constitutions. We used the Constitute Project (Zachary et al., 2012), which provides a chronological compilation of constitutions from around the world, but many old versions of the Arabic constitutions were unavailable on this site. In addition, some versions were available in Arabic or English only. Therefore, we manually selected the relevant documents from the Constitute Project and downloaded all files separately. We then searched the World Intellectual Property Organisation (WIPO) website (Footnote 3) to find translations of unavailable constitutions.

We compiled at least one parallel constitution for each of the 22 countries, except for Comoros and Djibouti, for which only the Arabic versions were available. We found several different English translations of some constitutions, and these were added to a monolingual corpus in English (see Types of Compiled Corpora). We faced further challenges while collecting the constitutions: some Arab governments' websites did not contain any information about the original constitutions, and some library websites, such as that of the US Library of Congress, placed restrictions on constitution files in PDF format.

(B) Pre-processing


The stage of pre-processing went through three steps: (i) standardisation of file formatting, (ii) cleaning, formatting, and normalising data, and (iii) parallel corpus alignment. To undertake these procedures, we used technical tools that we will show in each of the following steps:

(i) Standardising File Format


Most of the PDF files included texts or images, while the rest of the documents were bundled as text (.txt) files or HTML web pages. In this step, we converted all the bundled files to text (.txt) file format. It was challenging to choose one Optical Character Recognition (OCR) tool to rely on for the Arabic language, since many lack accuracy and efficiency and have encoding issues with Arabic. Since these tools work differently depending on the hidden format of PDF files, we used the following free OCR tools:

  • Sotoor OCR (Footnote 4): an online tool that reliably and accurately recognises Arabic characters, but it is limited to 100 free pages per user;

  • i2OCR (Footnote 5): another web-based tool, which recognises each page individually;

  • Google Translate: we used the OCR tool embedded in the Google Translate mobile application. This application was used only if the previous systems failed to recognise Arabic texts.

In addition to the general challenges faced by Arabic OCR tools, some PDF files contained old images (typewritten old Arabic script). Fig. 1 shows a sample of the 1923 Egyptian constitution, while Fig. 2 shows the text of the first column in Fig. 1 as recognised by the OCR tool Sotoor.

Fig. 1 A snapshot of the 1923 Egyptian Constitution

Fig. 2 The OCR result of the first column in Fig. 1

Furthermore, to make it easier to look for a certain file, we standardised file naming as [publication year]_[country name]_[language], where the language part is [Arabic], [English] or [Ar_En]. For example, "1923_Egypt_Ar_En" refers to the parallel constitution file of Egypt that was published in 1923.
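The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not the script used in the project: the helper names `build_name` and `parse_name` are hypothetical, and we assume that multi-word country names are joined with underscores, which the paper does not specify.

```python
import re
from typing import NamedTuple


class ConstitutionFile(NamedTuple):
    year: int
    country: str
    language: str  # "Arabic", "English" or "Ar_En" (parallel file)


# Language tags taken from the naming scheme described in the text.
LANGUAGE_TAGS = ("Ar_En", "Arabic", "English")

# year _ country _ language; the country part may itself contain
# underscores (an assumption made for multi-word country names).
_NAME_RE = re.compile(
    r"^(?P<year>\d{4})_(?P<country>.+)_(?P<language>Ar_En|Arabic|English)$"
)


def build_name(year: int, country: str, language: str) -> str:
    """Build a file name such as '1923_Egypt_Ar_En'."""
    if language not in LANGUAGE_TAGS:
        raise ValueError(f"unknown language tag: {language!r}")
    return f"{year}_{country.replace(' ', '_')}_{language}"


def parse_name(name: str) -> ConstitutionFile:
    """Split a standardised file name back into its three parts."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a standardised file name: {name!r}")
    return ConstitutionFile(
        int(match["year"]), match["country"], match["language"]
    )
```

With these helpers, `build_name(1923, "Egypt", "Ar_En")` yields the example name given above, and `parse_name` recovers the year, country and language so that files can be sorted or filtered by any of the three.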

(ii) Data Cleaning


This step involves cleaning, formatting, and normalising files (such as removing Arabic diacritics and symbols). This is done by following general procedures to standardise Arabic writing in files. For example, some constitutions used only a number or an abbreviated form such as "art." to refer to articles, and in those cases we changed all the abbreviations to the word "Article" so that all the files are unified into a single format. Likewise, in the Arabic files of some constitutions, such as the Constitution of Morocco, we replaced the word “فصل” with “المادة” (“article”), since it was used there with that meaning.
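Two of the cleaning operations described above, removing Arabic diacritics and unifying "art." to "Article", can be illustrated with the following minimal sketch. It is not the procedure actually used in the project: the function names are hypothetical, the set of characters treated as diacritics is our assumption, and removing the tatweel (elongation) character is likewise an assumed example of a "symbol".

```python
import re

# Arabic short-vowel and related marks (tashkeel), U+064B to U+0652,
# plus the superscript alef U+0670. Assumed set; the paper gives no list.
_DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670]")

# Tatweel (kashida), a purely typographic elongation character.
_TATWEEL_RE = re.compile("\u0640")

# "art." / "Art" directly followed by an article number.
_ART_ABBR_RE = re.compile(r"\bart\.?\s*(?=\d)", re.IGNORECASE)


def strip_diacritics(text: str) -> str:
    """Remove diacritics and tatweel, leaving the base letters intact."""
    return _TATWEEL_RE.sub("", _DIACRITICS_RE.sub("", text))


def normalise_article_labels(text: str) -> str:
    """Rewrite abbreviated English article labels as 'Article N'."""
    return _ART_ABBR_RE.sub("Article ", text)


if __name__ == "__main__":
    print(normalise_article_labels("Art. 5 and art.12"))
```

The lookahead `(?=\d)` restricts the substitution to abbreviations that are followed by a number, so words such as "Article" or "part" are left untouched.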

(iii) Parallel Corpus Alignment


This is the last step in text processing: aligning the Arabic and English files. Manual alignment is a laborious and time-consuming process. We therefore adopted the LF-aligner application (Footnote 6), which aligns the compiled translated files and gives the option to manually review the alignment before it is completed, as shown in Fig. 3. This tool facilitated a semi-automated alignment process for all articles of the collected constitutions. It is worth mentioning that the alignment was done at the level of the entire article and not at the sentence level, to allow users specialised in other disciplines (e.g. politics, law) to use the corpus.

Fig. 3 LF-aligner automatic alignment for the Oman 1996 constitution

Parallel Corpus Compilation

All constitutions were collected, then aligned in Excel sheets, and finally saved as text files. We used Sketch Engine to test and automate the corpora. This platform can be used for processing users’ data, collecting data from the web, and exploring a vast number of open-access corpora in many languages. It offers fundamental features such as Arabic Word Sketches, Arabic Concordance or keyword in context (KWIC), Arabic Thesaurus, Arabic Word Lists and Arabic N-grams, which identify patterns relating to multi-word units (MWU) in Arabic (Kilgarriff et al., 2004; Kilgarriff, 2014).

Types of Compiled Corpora

The following corpora have been built and made available on the Sketch Engine platform:

The Leeds Parallel Corpus of Arabic Countries' Constitutions (LPCACC)

The final version of this corpus includes the constitutions of 20 Arabic countries and their English translations from 1922 to 2022: an Arabic version (407,633 words) and an English version (489,448 words). The LPCACC excludes Comoros and Somalia since there are no parallel versions of these constitutions, and it excludes the preambles, which are compiled in a separate corpus. Figure 4 shows the number of parallel constitutions for each country and indicates that Egypt has the largest number of constitutions in the parallel corpus, with eight parallel files. The United Arab Emirates, Lebanon, Qatar, and Saudi Arabia are each represented in the corpus by one parallel constitution. The other 15 states are represented by an average of three parallel constitutions each.

The Parallel Corpus of Preambles of Arabic Countries' Constitutions (PCPACC)

This version includes all available preambles of the Arab countries’ constitutions (32,660 words): Arabic (12,360 words) and English (20,300 words).

Monolingual Corpora of Arabic and English Constitutions (MCAEC)

Some countries publish constitutions in either Arabic or English only, and in some cases the constitutions are published in French and Arabic with no English translations. Therefore, we created two separate corpora for these versions of constitutions, one in Arabic and another in English (see Fig. 5). The Arabic corpus contains 788,477 words and the English corpus 343,582 words. Figure 5 compares the number of constitutions issued in each language from 1922 to 2022. In the early years, more versions of constitutions were available in Arabic without English translations. However, after 1990, English translations of the constitutions were published on the web, which facilitated access to them, and they were included in the English language corpus (Fig. 5).

Fig. 4 The total number of parallel constitutions in each Arabic country

Fig. 5 Monolingual corpora of Arabic and English constitutions

The Leeds Monolingual Corpus of English Countries' Constitutions (LMCECC) (Comparative Corpus)

This comparative corpus was created from the revised and updated versions of the constitutions of eight countries whose official language is English, namely Australia, Canada, Ireland, New Zealand, Singapore, South Africa, the USA and the UK. The oldest version in this corpus is that of Australia (1985), and the latest revised version is that of the USA (2016). This corpus consists of 677,056 words and 557,086 words (see El-Farahaty & Elewa, 2020; Footnote 7).

Concluding Remarks

In this paper, we have presented the methods, procedures and challenges of building sustainable open access diachronic parallel and monolingual corpora of all available versions of Arabic countries’ constitutions and their preambles as well as a comparative corpus of constitutions of English-speaking countries.

Throughout the process of building the corpora, we faced different challenges. In the data collection phase, it was challenging to find all the versions of the original constitutions for each Arabic country or their translations, since some of the files date back to 1922. This required searching several websites, in addition to searching for the translations of the constitutions manually, because not all of them were available on each country’s government website. Although many of them were available in the Library of Congress, they were in PDF files. File conversion was therefore one major challenge, and it was not enough to rely on a single OCR tool for the Arabic language, since many lack accuracy and efficiency and have encoding issues with Arabic. We used more than one OCR tool to deal with the challenges faced during the data processing phase. For example, some PDF files contained old images (typewritten old Arabic script), for which we used Sotoor OCR.

For the parallel corpus file alignment, it was not possible to fully automate the alignment of the Arabic and English files. Therefore, we used the LF-aligner application (Footnote 8) to undertake a semi-automated alignment process for all articles of the collected constitutions. As some of the Arabic files were distorted, it was important that this application gives the option to manually review the alignment before it is completed. This semi-automation saved the time and effort of manually aligning the parallel corpus, which is a laborious process, although we still used Excel sheets to align some files. Not all of the constitutions have a parallel Arabic-English version, e.g. those of Comoros and Djibouti, for which only the Arabic versions were available. All versions of the constitutions which did not have a parallel version were included as part of a monolingual corpus for each language.

The corpora offer different values to different end users. They offer sustainable pedagogical tools for law students and for translation and interpreting students enrolled in Arabic and English translation UG and MA programmes all over the world. They are important research tools and data sets for postgraduate researchers, academics and legal professionals across the world. The corpora will be used in mapping specialised terminology and phraseology and will be of great value nationally for law drafters, legal professionals and NGOs. The corpora will also be used to undertake empirical investigations of a range of topics using corpus-based tools, e.g. a diachronic corpus-based pragmatic investigation of directives (for more examples of such investigations, see El-Farahaty & Elewa, 2020; Brierley & El-Farahaty, 2019). We aim to scale up the project and extend the building of these corpora to include other MSA and Classical Islamic legal genres and text types.