1 Introduction

Availability and accessibility of substantial parallel corpora are essential for training high-quality statistical and neural machine translation systems (SMT and NMT). However, these resources are often not freely available for many language translation pairs due to difficulties in creating high-quality parallel corpora and financial constraints, as well as professional translation requirements [46]. While large monolingual corpora are readily accessible online, they are typically limited to specific language domains. This is particularly true for low-resource African languages (LRAL), such as Twi [20]. Additionally, phrase-based statistical machine translation (PBSMT) models are widely regarded as the state-of-the-art (SOTA) for language pairs in multi-domain settings where ample parallel corpora exist [14].

A parallel corpus comprises a collection of texts translated into multiple languages, which, when processed in Sketch Engine, become monolingual corpora linked together for multi-language searches. These corpora serve as crucial resources for Natural Language Processing (NLP), notably Machine Translation (MT). Liu et al. [33] note that languages with high-resource datasets benefit from numerous parallel, bilingual, and monolingual corpora, facilitating the development of state-of-the-art MT systems. In contrast, low-resource datasets are common among many African languages. Schwartz [45] highlights that out of the 7,111 living languages globally, Africa alone has about 2,144 distinct languages. In Ghana, the most spoken local language is Akan, which includes Fante, Ahafo, Akyem, and Twi dialects such as Asante and Akuapem. According to Sackitey and Adomako [43], Akan has three main dialects: Asante Twi, Akuapem Twi, and Fante.

The Akan language, part of the Central Tano group of the Niger-Congo language family, is spoken by over nine million people. Other widely spoken languages in Ghana include Ewe (3.8 million speakers), Abron (1.2 million speakers), and Dagbani (1.2 million speakers) (Sasu, 2023). English remains the official language, with nearly ten million speakers. Adebara and Abdul-Mageed [2] point out that among the numerous African languages, only Zulu, Yoruba, Swahili, and Igbo have well-developed digital datasets. However, these still fall short of global standards. Ranathunga et al. [42] observe that few Low-Resource African Languages (LRALs) and High-Resource Languages (HRLs) have bilingual, monolingual, and parallel corpora, which are typically available online as open-source resources.

The development of linguistic resources for the Twi dialect in Ghana remains notably underdeveloped, resulting in most experts relying predominantly on the Akuapem-translated Bible as their primary reference for translations. Alabi et al. [5] highlighted the absence of standard corpora for Twi, with the JW300 corpus being the only exception. However, this reliance on the JW300 corpus or the Bible is ideologically problematic, as it can introduce biases and organizational issues. Beyond the Bible corpus, Twi has a limited range of standard literature, including dictionaries, storybooks, and the Holy Bible. Additionally, some freely available online and offline articles are often curated, annotated, and POS-tagged.

Machine Translation (MT), which involves translating text from one language to another through various MT models, typically requires parallel corpora to train the models. Unfortunately, such datasets are significantly lacking for the Twi dialect. Prior studies have underscored the necessity of creating language datasets and resources for low-resource African languages (LRALs) [37, 54]. Prominent scholars, such as Azunre et al. [7,8,9], Gyasi and Schippe [19], and [39], have attempted to validate the efficacy of parallel corpus-based machine translation (MT) from English to Twi, with a particular emphasis on multi-domain translation capable of converting one specific language into another. Despite their efforts, they were unable to develop a reliable Twi-to-English word search engine that could accurately translate low-resource languages (LRL) like Twi into high-resource languages (HRL) like English.

This study uses parallel corpora designed for natural language processing (NLP) applications, such as MT, to advance computational linguistic research focused on the Twi dialect. The research draws upon the Sketch Engine by Fagbolu et al. [15] and corpus curation and analysis software by Ghaddar and Langlais [16] to build a robust corpus for this study. This study designs and develops a digital Twi-English parallel corpus consisting of approximately 5.2 k sentences (texts) sourced from the parliamentary Hansard, the Bible, medical literature, social media networks, and news sub-corpora. This corpus aims to be searchable, publicly accessible, scalable, and essential for NLP and MT, representing a significant innovation in the field.

This study addresses significant gaps in the resources available for Low-Resource African Languages (LRALs), particularly the Twi language of Ghana, by developing the first open-source Twi-English parallel corpus. Unlike previous efforts by Afram et al. [4], Azunre et al. [7,8,9], and Adjeisah et al. [3], which failed to provide comprehensive NLP and Machine Translation (MT) resources for Twi, this research creates a heterogeneous corpus spanning all genres of Ghanaian Twi-speaking culture and communities. This corpus will serve multiple user groups: researchers will benefit from a single search platform enabling accurate translations and deeper linguistic understanding; language teachers and students will access high-quality materials for improved grammar and vocabulary learning; and language learners will gain from authentic data illustrating grammar construction and word frequency patterns. Additionally, the study offers a robust wordlist, effective N-grams, a Parallel Concordance feature, and a Word Sketch Engine, collectively enhancing the utility and accessibility of Twi language resources.

2 Literature review

2.1 The orthography of Twi dialect

According to Aboagye Da-Costa and Adade-Yeboah [1], approximately 75 local languages are currently spoken in Ghana, of which 34 are well-established Ghanaian languages and 18 are still in use. [56] also noted that, in the language vitality count, 10 are institutional, 15 are stable, 8 are endangered, and 1 is extinct. The Twi alphabet has 22 letters (a, b, d, e, ɛ, f, g, h, i, k, l, m, n, o, ɔ, p, r, s, t, u, w, y), comprising 15 consonants /b, d, f, g, h, k, l, m, n, p, r, s, t, w, y/ and 7 vowels (a, e, i, o, u, ɔ, ɛ). According to Tuffour [52], these vowels appear before syllables with the vowels /i/ and /u/ in the Akuapem and Asante dialects. There is also a tenth vowel /æ/, which is a variant of the vowel /a/ [41].

2.2 The Twi dialect

Akan is believed to be a language that emigrated from one country to Ghana and later became a prominent means of exchange among Ghanaians. Sackitey and Adomako [43] mentioned that Akan is a Kwa language of the Niger-Congo language family and the most widely spoken language in Ghana by about 8.3 million people. Just like any other language in the world with various levels of tones, Akan has two levels of tones, which agrees with the known assertion of Tsiwah et al. [51] that Akan is a tone language with two-level tones (high and low) that has both lexical and grammatical functions (e.g., tense and aspect marking).

As of 2021, Akan was the most spoken local language in Ghana, encompassing Twi dialects such as Fante, Akuapem, Akyem, Ahafo, and Asante [21]. As shown in Fig. 1, about 80% of Ghanaians speak Akan as their first or second language, while 10 million people speak Akan globally. Most of them are in the Ashanti Region, Côte d'Ivoire, Benin, Togo, Jamaica, the United States, Nigeria, etc. [17]. Twi, on the other hand, is one of the four major dialects of Akan and aims to become the principal language of Ghana. It is also a tonal language with high, mid, and low tones, in which changing the tone of a syllable changes the meaning of the word. The dialects and their various regions are shown in Fig. 1.

Fig. 1 Distribution of speakers of Akan, the Kwa language of the Niger-Congo family spoken in Ghana that has the Twi dialect as a sub-dialect, according to the Statista report by Sasu [44]

According to Grindlay et al. [18], approximately 8 million people, or around 58% of Ghana's population, speak Twi, and about 30% of them can be found in Ivory Coast. The Twi dialect and its users are spread across Ghana, but its native speakers are concentrated in the Ashanti region and its surroundings, as shown in Table 1. Akuapem Twi was the first dialect used in translating the Bible from English, and it can be written using a standard script created by the Bureau of Ghana Languages.

Table 1 The regional groupings and Twi dialect communities in Ghana

2.3 Review of related literature

Most researchers adhere to the terminology used by Alotaibi [6] and Islam et al. [23], who define a parallel corpus as a collection of texts in one or more languages, together with their translations into another language or languages, stored in a machine-readable format. This involves the translation of original texts into other languages. For instance, the English sentence "How are you?" is translated into Twi as "Wo ho te sɛn?". Hu [22], Laviosa [30], and Kenny [25] have noted that such a corpus usually consists of original texts and their translations and has also been referred to as a translation corpus. Generally, a parallel corpus is very useful for the automatic construction of lexicons and for contrastive research into translation problems in two or more languages. Parallel corpora can also be used to train MT models, as in the work of Pham et al. [40] on corpus building for under-resourced languages. In the Crubadan project, Afram et al. [4] attempted to construct an Akan (Twi) parallel corpus and obtained close to 547,909 Twi words from 176 documents crawled on the web. The Facebook Translation App, which Facebook uses to crowdsource translators globally to convert Facebook posts and features into their own languages, helps Facebook build the corpora needed to establish translators for those languages and reveals the data difficulties associated with LRLs in Africa. [55] also found that automatic translation systems have recently started using parallel corpora increasingly to supply desired translations.

Chatzikoumi [12], on the other hand, defined MT as a computer program designed to translate text from one language (the source language) to another (the target language) without human help. Maimaiti et al. [35], in their study entitled 'Enriching the transfer learning with pre-trained lexicon embedding for low-resource neural MT,' found that MT has achieved SOTA performance in recent years for a few HRLs. This achievement was possible because of the ready availability of parallel corpora and efficient machine translation models, such as the Transformer architecture and its variants [14]. In contrast to LRLs, which lack detailed research and online datasets, HRLs are highly studied, researched, funded, and used for NLP, especially MT, primarily because datasets and tools such as language corpora, coupled with efficient NMT models, are readily available. Specifically, HRLs make up only 2.5% of the world's living languages, as identified by Leben [31]; only a handful of LRLs like Twi are represented, while OPUS and Sketch Engine contain extensive databases of parallel corpora for the HRLs. Moreover, MT exists to provide a system that can translate a text from the source language into the target language with the same meaning as in the source language.

Additionally, the Twi dialect still lags behind in the availability of online datasets. According to Azunre et al. [7,8,9], there is no single language pair for the numerous Ghanaian languages apart from the JW300, which partly explains this lag. Strassel and Tracey [48] proposed the LORELEI (Low Resource Languages for Emergent Incidents) program, which was later developed by the Defense Advanced Research Projects Agency (DARPA) in conjunction with the Linguistic Data Consortium (LDC) under the research department of the University of Pennsylvania. The program focuses mainly on researching and improving language technology that can overcome overreliance on manual transcriptions and translations or manually annotated corpora. Various research organizations in Africa are also putting in more effort to tackle such linguistic setbacks with LRALs. For instance, Kilgarriff and Kosem [26] built 'MASAKHANE,' a Deep Learning Indaba, which is a research group that constructs machine learning tools for African languages. MASAKHANE has successfully established approximately 45 benchmarks and 38 language pairs for African languages, which are distributed online and spread across the African continent. Christianson et al. [13] also mentioned that at the beginning of the LORELEI program, out of 6600 LRLs of the world, 32 language samples as well as 12 incident languages were picked for research purposes, including Hausa, Zulu, Yoruba, Twi, Somali, Swahili, and Wolof. Even the Typecraft Akan corpus and the Akan/Twi corpus contain almost no data [10]. Again, the JW300 corpus is religiously skewed and ideologically biased because of its overreliance on Bible texts.

3 Research methodology

This study relied on already Twi-translated literature work by Martinus et al. [36] on the Twi literature. It later employed the methodology used by Kovář et al. [28] to translate into Twi in creating our Twi-2-ENG corpus. This helped in the creation of our genuine Twi-English parallel corpus, also known as the Twi-2-ENG parallel corpus, from multiple-domain data sources: online news portals, the Ghanaian Parliamentary Hansard, the Twi-English Bible (NT Bible), social media networks or crowdsourcing, and published Twi literature. We additionally downloaded English news from the national news archives, including digital news hubs such as myjoyonline.com, citionline.com, and peacefmonline.com, and excerpts of the Parliamentary Hansard crawled from the Ghana Parliament's official website. Further sources were the Medical Glossary, the Twi-English New Testament Bible, and other downloaded English documents. All these sources were considered because their contents are drawn from, and span, Ghanaian communities, culture, and social life.

3.1 Corpus formation outlined

A corpus is a collection of naturally occurring language text chosen to characterize a state or variety of a language. It is commonly assumed that parallel corpus data can be acquired from various related websites; this holds for HRLs but not for LRALs. Lowphansiriku et al. [34] agreed that parallel text calls for a modern digital parallel corpus, crawled from readily available public online data with web crawlers. The researchers therefore settled on a two-way data collection approach. The first was self-organized: manually acquired English sentences were converted into Twi by professional translators, with support from standard literature and online media portals, in line with the approach of Kashefi [24]. The second was the automatic crawling of Twi-English parallel sentences across websites. Yet, after months of constant online searching for Twi-English parallel text to build a sufficiently large parallel corpus, the result was not encouraging owing to the underlying dataset issues for LRALs, which experts continue to blame on the lack of research interest in this area; in Ghana, this can also be attributed to the status of English as the lingua franca. Although the Ghana Education Service has directed primary school teachers to use Twi in teaching at first-cycle institutions, it is still an unofficial language in Ghana. Currently, online crawling of parallel data is not feasible, given the limited Twi text and sentences on the web.

The next approach was to construct a Twi text form on the Google platform and share a link with language students in second- and third-cycle institutions, language lovers, and social media friends to crowdsource Twi-English sentence pairs. All the responses received were scrutinized and analyzed in MS Excel and passed through a professional review process to ensure accuracy.

Additionally, the researchers crawled a few free web literature sources and public news archives, such as the Myjoyonline news archives, the heterogeneous Ghanaian Parliamentary Hansards, GhanaWeb news, Adomfmonline news, Peacefmonline news, Citinewsroom, Ghana Broadcasting Corporation, and Daily Graphic, as well as the Twi-English New Testament Bible downloaded in PDF format from the JW website, the Twi Medical Glossary, and the like (see Table 2). We also followed the Universal Declaration of Human Rights on the datasets with the LRLs and the best means of handling such shortfalls. Despite these approaches, the shortage of pre-aligned Twi-English text pairs online persists because almost all the readily available online text is written in English only. The researchers were therefore left with no option but to translate all the English text into Twi and align the texts individually to produce the datasets required for this study. Beautiful Soup, a Python package that parses HTML and XML documents and builds a parse tree for the parsed pages, was employed to mine data from HTML for web scraping, together with the free Web Scraper extension for Google Chrome to crawl the text from the websites.

Table 2 Twi-2-ENG corpus data source and the sentence pairs demonstrating the trusted low-resourced language achievers in Ghana and the associated links

3.1.1 Twi-2-ENG Corpus conceptual framework

Figure 2 depicts how the data crawled from websites were formulated, analyzed, and systematically loaded into the Sketch Engine to generate the Twi-to-English parallel corpus.

Fig. 2 Twi-2-ENG corpus (conceptual construction)

3.2 Texts/sentences translations, alignment, and text crawling arrangements

All the received data went through rigorous verification and checks to avoid misleading information. For sentence alignment, we employed a spreadsheet program to align the Twi sentences with their English counterparts. We used MS Excel to align the Holy Bible at the verse level, which was also cost-effective. Our Twi-2-ENG corpus contains 5,756 quality sentence pairs and approximately 138 k tokens, which were aligned to the web-downloaded documents arranged paragraph by paragraph, because the best corpus is one whose HRL and LRL sentence pairs align correctly.
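The alignment itself was done in a spreadsheet, so the following is only a minimal Python sketch of the same idea: pairing segments that share a key, such as a verse reference or a row identifier. The function name `align_by_key` and the keys `S1` and `S2` are hypothetical and do not come from the study.

```python
def align_by_key(english, twi):
    """Pair English and Twi segments that share the same key.

    `english` and `twi` map a segment key (hypothetical here, e.g. a
    verse reference or spreadsheet row id) to the segment text.
    Returns the aligned pairs and the keys with no Twi counterpart.
    """
    pairs = []
    unmatched = []
    for key, en_text in english.items():
        tw_text = twi.get(key)
        if tw_text is None:
            # No translation yet: flag the segment for the translators.
            unmatched.append(key)
        else:
            pairs.append((key, en_text.strip(), tw_text.strip()))
    return pairs, unmatched


# Illustrative data only; the Twi sentence is the example quoted in Sect. 2.3.
english = {"S1": "How are you?", "S2": "Thank you."}
twi = {"S1": "Wo ho te sɛn?"}
pairs, unmatched = align_by_key(english, twi)
```

Keeping the unmatched keys separate mirrors the manual review step: segments without a Twi counterpart are set aside for translation instead of being silently dropped.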

With the assistance of 16 qualified Twi translators and linguistic professionals, we translated all the monolingual text into Twi and aligned it, ideally at the sentence level. We crowdsourced most of our Twi-English sentence pairs through Google-generated forms, which yielded large volumes running to over 1 k varied sentences. As indicated in Table 2 above, we relied on HTML files to generate all the raw text from different websites such as jw.org, using tools such as a BeautifulSoup script and Spiderling.

3.3 Text duplication and data cleaning

Again, we relied on online sentence and text duplication removal tools such as Trancemyip, Textfixer, Duplicate Cleaner, and Sketch Engine, which has a duplication removal interface that can automatically extract all duplicated texts from the whole corpus. To clean the data, we used the jusText tool to remove boilerplate content such as navigation links, headers, and footers from HTML pages, as well as legal text, advertising, icons, tabular data, and social handles. Stray jargon and double spaces were deleted from the sentence pairs, leaving only the recognized Twi orthographic characters. All unrelated images, hyperlinks, titles, URLs, chats, and tables were also removed from the crawled sentences to retain only clean texts and sentences.
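As a rough illustration of this cleaning step, the sketch below strips URLs, collapses repeated spaces, and drops duplicate or empty sentence pairs. It is a simplified stand-in written with the Python standard library, not the behaviour of jusText, Sketch Engine, or the other tools named above; the function names are hypothetical.

```python
import re


def normalise(text):
    """Remove URLs and collapse runs of whitespace into single spaces."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def deduplicate(pairs):
    """Keep the first occurrence of each (English, Twi) pair.

    Pairs are compared case-insensitively after normalisation, and
    pairs with an empty side are discarded.
    """
    seen = set()
    unique = []
    for en, tw in pairs:
        en, tw = normalise(en), normalise(tw)
        key = (en.casefold(), tw.casefold())
        if en and tw and key not in seen:
            seen.add(key)
            unique.append((en, tw))
    return unique


raw = [
    ("How are  you?", "Wo ho te sɛn?"),
    ("how are you?", "wo ho te sɛn?"),   # duplicate apart from case
    ("See https://example.com", ""),     # no Twi side, so it is dropped
]
clean = deduplicate(raw)
```

Comparing on a case-folded key while storing the first-seen form is one possible design choice; a stricter pipeline might treat case differences as distinct sentences.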

3.4 Lemmatization and tokenization

We further applied lemmatization to reduce complex words to their recognized roots and reveal text and word similarities. The tokenizer in the Sketch Engine was used to split all the sequenced text into smaller units. However, we found that the available universal tokenizer identifies token boundaries only by whitespace characters and ignores language-specific rules. Professional POS tagging of our Twi-2-ENG corpus is left for future work.
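To make the limitation of whitespace-only tokenization concrete, the sketch below contrasts it with a simple regular-expression tokenizer that separates punctuation from words. Neither function reproduces the Sketch Engine tokenizer; both are assumptions used purely for illustration.

```python
import re

# A word is a run of letters/digits; any other non-space character
# (for example "?" or ",") becomes a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays glued to words."""
    return text.split()


def regex_tokens(text):
    """Split words and punctuation into separate tokens."""
    return TOKEN_RE.findall(text)


sentence = "Wo ho te sɛn?"
print(whitespace_tokens(sentence))  # ['Wo', 'ho', 'te', 'sɛn?']
print(regex_tokens(sentence))       # ['Wo', 'ho', 'te', 'sɛn', '?']
```

Python's `\w` is Unicode-aware for `str` patterns, so Twi letters such as ɛ and ɔ are treated as word characters without extra configuration.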

4 Result discussions

To ensure effective results, we utilized a verified, registered online Sketch Engine embedded with analysis tools and corpus curation to create and host our Twi-2-ENG corpus. This methodology aligns with the approach employed by Kilgarriff et al. [27] in their "Corpus Factory for Many Languages" project.

4.1 1st Move: Sketch Engine for Twi-2-ENG corpus (algorithm)

To simplify the process, the algorithm (a finite set of rules or instructions) followed in preparing the Sketch Engine text corpus is detailed in Table 3 below.

Table 3 Sketch Engine algorithm: levels of preparation from data sourcing to corpus verification of the selected words in the text corpus arrangement

4.2 2nd Move: Twi-2-ENG corpus statistic outlook

The statistics of the Twi-2-ENG corpus are presented in Table 4.

Table 4 Twi-2-ENG corpus statistics for the different datasets. LRL stands for Low Resource Language, which is the Twi language of Ghana, while HRL represents the High Resource Language, which is the English language.

Table 4 above shows that the HRL (English) side has a word count of 50,124, a token count of 61,221, and 5,756 sentences, while the LRL (Twi) side has 52,401 words, 65,183 tokens, and 5,756 sentences.

4.3 3rd Move: Twi-2-ENG corpus functions (Navigations)

The word sketch of Onyame (God) is presented in Fig. 3 below. It also affirms the claim by Stewart et al. [47] that a word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocation behaviour.

Fig. 3 Word Sketch Onyame (God). Representation of the Twi-2-ENG corpus search engine processing the word God and producing Onyame in Twi

4.4 The parallel concordance

When parallel corpora are carefully aligned, users can search for words, text types, phrases, documents, tags, or corpus structures in one language and see the results displayed together with the aligned translated segments in the other language (see Fig. 4). A concordance helps to search through translation memory for a particular word or phrase to find how it was translated elsewhere. It also cites Biblical words with references to the instances in which they were applied. Our Twi-2-ENG platform comprises a large number of up-to-date Twi words, phrases, tags, and text documents that support the concordance function in the search engine, bringing together all the valid Twi-translated words and their meanings.
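The idea behind a parallel concordance can be sketched in a few lines: search one side of the aligned pairs and return each hit together with its aligned counterpart. This is a simplified illustration, not Sketch Engine's implementation; the function name and the two sample pairs are assumptions built from words quoted in this paper.

```python
def parallel_concordance(pairs, query, side="tw"):
    """Return every aligned pair whose chosen side contains `query`.

    `pairs` is a list of (English, Twi) segments; `side` selects which
    language is searched ("en" or "tw"). Matching is case-insensitive
    and ignores punctuation attached to a word.
    """
    index = 0 if side == "en" else 1
    wanted = query.casefold()
    hits = []
    for pair in pairs:
        words = [w.strip(".,;:!?\"'()") for w in pair[index].casefold().split()]
        if wanted in words:
            hits.append(pair)
    return hits


# Illustrative aligned segments only.
corpus = [("God", "Onyame"), ("How are you?", "Wo ho te sɛn?")]
print(parallel_concordance(corpus, "onyame", side="tw"))  # [('God', 'Onyame')]
print(parallel_concordance(corpus, "you", side="en"))     # [('How are you?', 'Wo ho te sɛn?')]
```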

Fig. 4 Onyame and God parallel concordance. Source: sketchengine.eu/tagsets/english-part-of-speech

4.5 The N-grams

N-grams, also known as multi-word expressions (MWEs), are contiguous sequences of neighbouring words, symbols, or tokens in a document. Figure 5 shows 3–4-grams with 3,985 word frequencies. N-grams are the single most successful type of feature in authorship attribution; they are primarily used in text categorization problems and can easily be applied to a new language. In our Twi-2-ENG search platform, users can filter both known and unknown regular Twi expressions down to specific n-grams and generate any attribute, with word and lemma being the most frequently used. This function assists in the effective identification of symbols, texts, punctuation locations, and numbering.
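For readers unfamiliar with n-gram extraction, the following minimal sketch counts 3-grams and 4-grams over whitespace-tokenized sentences. It illustrates the general technique only and makes no claim about how Sketch Engine computes its n-gram lists; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, sizes=(3, 4)):
    """Count n-grams of the given sizes across all sentences.

    N-grams never cross a sentence boundary because each sentence is
    tokenized and processed on its own.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.casefold().split()
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts


counts = ngram_counts(["wo ho te sɛn", "Wo ho te sɛn"])
print(counts[("wo", "ho", "te")])         # 2
print(counts[("wo", "ho", "te", "sɛn")])  # 2
```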

Fig. 5 Twi-2-ENG parallel corpus 3–4-grams. Source: sketchengine.eu/tagsets/english-part-of-speech

4.6 The wordlist

The Twi-2-ENG corpus reached a maximum frequency of 65,183 and a specialized text itemization of 51,624 in the wordlist tool. The Wordlist in Fig. 6 lists the lexicon of a language found within a text corpus, for vocabulary acquisition purposes; it can produce lists of verbs, adjectives, nouns, or other parts of speech, together with tags, lemmas, and other attributes, or all of the above. The Twi-2-ENG corpus with Wordlist will help users update or upgrade their Twi vocabulary and make better use of verbs or nouns in their speech. This agrees with Brezina and Gablasova [11], who, after applying the Wordlist tool in their projects, reaffirmed that three different frequency measures can be displayed in the Wordlist: frequency, frequency per million, and ARF.
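Two of the three measures named above, raw frequency and frequency per million, are straightforward to compute, as the sketch below shows. ARF (average reduced frequency) depends on how evenly a word is spread through the corpus and is deliberately left out of this illustration; the function name and sample tokens are assumptions.

```python
from collections import Counter


def wordlist(tokens):
    """Return (word, frequency, frequency per million) rows.

    Rows are sorted by descending frequency. Frequency per million
    scales the raw count to a notional corpus of one million tokens,
    which makes corpora of different sizes comparable.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return [
        (word, freq, freq * 1_000_000 / total)
        for word, freq in counts.most_common()
    ]


rows = wordlist(["onyame", "onyame", "wo", "ho"])
print(rows[0])  # ('onyame', 2, 500000.0)
```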

Fig. 6 Twi-2-ENG wordlist. Source: sketchengine.eu/tagsets/english-part-of-speech

4.7 The keywords

Keywords are individual words or tokens that appear more frequently in the focus corpus than in the reference corpus, as revealed in Fig. 7. In cryptography, the term also refers to a word used to encipher or decipher a cryptogram, as the basis for a complex substitution or the pattern for a transposition procedure. Keywords allow one to quickly locate all essential words and phrases in large datasets. Keywords and terms have similar functions; Liu [32] found that terms are multi-word expressions that appear more frequently in the focus corpus than in the reference corpus and match the selected format of the language technology. Since our Twi-2-ENG corpus contains large volumes of complex text, words, symbols, punctuation, phrases, etc., the keywords function within the search engine helps to locate all important words, as well as multi-word expressions and related items.
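One common way to rank keywords is to compare a word's normalized frequency in the focus corpus with that in the reference corpus. The sketch below assumes the smoothed ratio (fpm_focus + n) / (fpm_ref + n), often described as a "simple maths" keyness score; whether the Twi-2-ENG platform uses exactly this formula or this smoothing value is an assumption, and the token lists are illustrative.

```python
from collections import Counter


def keyness(focus_tokens, ref_tokens, n=1.0):
    """Rank focus-corpus words by a smoothed frequency ratio.

    score = (fpm_focus + n) / (fpm_ref + n), where fpm is frequency
    per million tokens and n is a smoothing constant (assumed value)
    that avoids division by zero for words absent from the reference.
    Both token lists must be non-empty.
    """
    focus = Counter(focus_tokens)
    ref = Counter(ref_tokens)
    focus_total = len(focus_tokens)
    ref_total = len(ref_tokens)
    scores = {}
    for word, count in focus.items():
        fpm_focus = count * 1_000_000 / focus_total
        fpm_ref = ref.get(word, 0) * 1_000_000 / ref_total
        scores[word] = (fpm_focus + n) / (fpm_ref + n)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = keyness(["onyame", "onyame", "wo", "ho"], ["wo", "ho", "wo", "ho"])
print(ranked[0])  # ('onyame', 500001.0)
```

A larger smoothing constant favours common words over rare ones, so the choice of `n` changes which items surface at the top of the list.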

Fig. 7 Twi-2-ENG corpus keyword in perspective. Source: sketchengine.eu/tagsets/english-part-of-speech

5 Concluding remarks

Even though Twi, a Ghanaian language that is part of the LRLs, has made strides and is widely spoken even beyond Ghana's boundaries, primarily among Ivorian communities, and while some regard it as the best resource for Twi machine translation, the problem of a Twi-English parallel corpus in the multiple-domain dataset category persists, partly because of the limited digital Twi lexicon and complex design structure. This study relied on a parallel corpus designed for NLP applications, such as MT, to support computational linguistic research in the Twi language. It also deployed the Sketch Engine by Fagbolu et al. [15] and the corpus curation and analysis software of Ghaddar and Langlais [16] to construct the trusted corpus. Furthermore, we utilized 5,756 sentence pairs from the Ghana Parliament Hansard, NT Bible (English-Twi), Twi Medical Glossary Dictionary (Twi), Citinewsroom, Ghanaweb, Daily Graphic, Myjoyonline, Peacefmonline, Adomonline Archives, UDHR Twi-English (Crowdsourced), and Ghana Broadcasting Corporation, which together make up the Twi-2-ENG parallel corpus of 13,488 well-organized tokenized sentence pairs, giving a unique identity to our Twi-2-ENG corpus.

The significant manual contribution of qualified linguistic professionals and Twi translation specialists from across the media, academia, and well-wishers adds considerable volume to the Twi-2-ENG parallel corpus, which is freely accessible on the Sketch Engine online. This study opens up a new search platform capable of relating complex Twi words such as "Onyankopɔn" (God) to the simpler "Onyame," which has the same meaning; this capability is not available in previous research works and will add to the literature. The study also recommends that future researchers and academic bodies in Ghana and across Africa allocate funds in their education budgets to sponsor the promotion and uplifting of LRLs in Africa, using the HRLs as a case study, to ensure quality datasets for LRLs on the web, the lack of which was the major limitation of our research.