Low resource Twi-English parallel corpus for machine translation in multiple domains (Twi-2-ENG)

Agyei, Emmanuel; Zhang, Xiaoling; Bannerman, Stephen; Quaye, Ama Bonuah; Yussi, Sophyani Banaamwini; Agbesi, Victor Kwaku

doi:10.1007/s10791-024-09451-8


Brief Communication
Open access
Published: 05 July 2024

Volume 27, article number 17, (2024)

Discover Computing


Emmanuel Agyei¹,
Xiaoling Zhang¹,
Stephen Bannerman¹,
Ama Bonuah Quaye¹,
Sophyani Banaamwini Yussi¹ &
…
Victor Kwaku Agbesi¹


Abstract

Although Ghana does not have a single national language for its citizens, the Twi dialect stands a chance of fulfilling this purpose. Twi is a low-resource language, yet it is widely spoken beyond Ghana, in countries such as the Ivory Coast, Benin, and Nigeria. However, it continues to be seen as the perfect resource for Twi Machine Translation (MT) of ISO 639-3. The problem with Twi-English parallel corpora is most evident at the multiple-domain dataset level, partly due to the complex design structure and the scarcity of a digital Twi lexicon. This study introduces Twi-2-ENG, a large-scale multiple-domain Twi-to-English parallel corpus, together with a Twi digital dictionary and a lexicon version of Twi. It draws on the Ghanaian Parliamentary Hansards, crowdsourcing, and digital Ghanaian news portals to crawl the English sentences. The crawled news portals, the Twi New Testament Bible, and social media platforms together yielded 5,756 parallel sentences. The data-gathering method, covering translation, compilation, tokenization, and the final alignment of the Twi-English parallel sentences, is discussed, along with the technology employed in compiling and hosting the corpus. The results reveal that qualified linguistic professionals and Twi translation specialists from across the media, academia, and well-wishers added a considerable volume to the Twi-2-ENG parallel corpus. Finally, all the sentences were curated with the help of the corpus manager Sketch Engine, linguists, and professional translators to align and tokenize all texts, allowing professional Twi linguists to evaluate the corpus.


1 Introduction

Availability and accessibility of substantial parallel corpora are essential for training high-quality statistical and neural machine translation systems (SMT and NMT). However, these resources are often not freely available for many language translation pairs due to difficulties in creating high-quality parallel corpora and financial constraints, as well as professional translation requirements [46]. While large monolingual corpora are readily accessible online, they are typically limited to specific language domains. This is particularly true for low-resource African languages (LRAL), such as Twi [20]. Additionally, phrase-based statistical machine translation (PBSMT) models are widely regarded as the state-of-the-art (SOTA) for language pairs in multi-domain settings where ample parallel corpora exist [14].

A parallel corpus comprises a collection of texts translated into multiple languages, which, when processed in Sketch Engine, become monolingual corpora linked together for multi-language searches. These corpora serve as crucial resources for Natural Language Processing (NLP), notably Machine Translation (MT). Liu et al. [33] note that languages with high-resource datasets benefit from numerous parallel, bilingual, and monolingual corpora, facilitating the development of state-of-the-art MT systems. In contrast, low-resource datasets are common among many African languages. Schwartz [45] highlights that out of the 7,111 living languages globally, Africa alone has about 2,144 distinct languages. In Ghana, the most spoken local language is Akan, which includes Fante, Ahafo, Akyem, and Twi dialects such as Asante and Akuapem. According to Sackitey and Adomako [43], Akan has three main dialects: Asante Twi, Akuapem Twi, and Fante.

The Akan language, part of the Central Tano group of the Niger-Congo language family, is spoken by over nine million people. Other widely spoken languages in Ghana include Ewe (3.8 million speakers), Abron (1.2 million speakers), and Dagbani (1.2 million speakers) (Sssu, 2023). English remains the official language, with nearly ten million speakers. Adebara and Abdul-Mageed [2] point out that among the numerous African languages, only Zulu, Yoruba, Swahili, and Igbo have well-developed digital datasets. However, these still fall short of global standards. Ranathunga et al. [42] observe that few Low-Resource African Languages (LRALs) and High-Resource Languages (HRLs) have bilingual, monolingual, and parallel corpora, which are typically available online as open-source resources.

The development of linguistic resources for the Twi dialect in Ghana remains notably underdeveloped, resulting in most experts relying predominantly on the Akuapem-translated Bible as their primary reference for translations. Alabi et al. [5] highlighted the absence of standard corpora for Twi, with the JW300 corpus being the only exception. However, this reliance on the JW300 corpus or the Bible is ideologically problematic, as it can introduce biases and organizational issues. Beyond the Bible corpus, Twi has a limited range of standard literature, including dictionaries, storybooks, and the Holy Bible. Additionally, some freely available online and offline articles are often curated, annotated, and POS-tagged.

Machine Translation (MT), which involves translating text from one language to another through various MT models, typically requires parallel corpora to train the models. Unfortunately, such datasets are significantly lacking for the Twi dialect. Prior studies have underscored the necessity of creating language datasets and resources for low-resource African languages (LRALs) [37, 54]. Prominent scholars, such as Azunre et al. [7,8,9], Gyasi and Schlippe [19], and [39], have attempted to validate the efficacy of parallel corpus-based machine translation (MT) from English to Twi, with a particular emphasis on multi-domain translation capable of converting one specific language into another. Despite their efforts, they were unable to develop a reliable Twi-to-English word search engine that could accurately translate low-resource languages (LRL) like Twi into high-resource languages (HRL) like English.

This study uses parallel corpora designed for natural language processing (NLP) applications, such as MT, to advance computational linguistic research focused on the Twi dialect. The research draws upon the Sketch Engine by Fagbolu et al. [15] and corpus curation and analysis software by Ghaddar and Langlais [16] to build a robust corpus for this study. This study designs and develops a digital Twi-English parallel corpus consisting of approximately 5.2 k sentences (texts) sourced from the parliamentary Hansard, the Bible, medical literature, social media networks, and news sub-corpora. This corpus aims to be searchable, publicly accessible, scalable, and essential for NLP and MT, representing a significant innovation in the field.

This study addresses significant gaps in the resources available for Low-Resource African Languages (LRALs), particularly the Twi language of Ghana, by developing the first open-source Twi-English parallel corpus. Unlike previous efforts by Afram et al. [4], Azunre et al. [7,8,9], and Adjeisah et al. [3], which failed to provide comprehensive NLP and Machine Translation (MT) resources for Twi, this research creates a heterogeneous corpus spanning all genres of Ghanaian Twi-speaking culture and communities. This corpus will serve multiple user groups: researchers will benefit from a single search platform enabling accurate translations and deeper linguistic understanding; language teachers and students will access high-quality materials for improved grammar and vocabulary learning; and language learners will gain from authentic data illustrating grammar construction and word frequency patterns. Additionally, the study offers a robust wordlist, effective N-grams, a Parallel Concordance feature, and a Word Sketch Engine, collectively enhancing the utility and accessibility of Twi language resources.

2 Literature review

2.1 The orthography of Twi dialect

According to Aboagye Da-Costa and Adade-Yeboah [1], approximately 75 local languages are currently spoken in Ghana; 34 of them are well-established Ghanaian languages, and 18 are still in use. [56] also noted that, in the language vitality count, 10 are institutional, 15 are stable, 8 are endangered, and 1 is extinct. The Twi alphabet has 22 letters (a, b, d, e, ɛ, f, g, h, i, k, l, m, n, o, ɔ, p, r, s, t, u, w, y): 15 consonants /b, d, f, g, h, k, l, m, n, p, r, s, t, w, y/ and 7 vowels (a, e, i, o, u, ɔ, ɛ). According to Tuffour [52], these vowels appear before syllables with the vowels /i/ and /u/ in the Akuapem and Asante dialects. There is also a tenth vowel /æ/, which is a variant of the vowel /a/ [41].

2.2 The Twi dialect

Akan is believed to be a language that emigrated from one country to Ghana and later became a prominent means of exchange among Ghanaians. Sackitey and Adomako [43] mentioned that Akan is a Kwa language of the Niger-Congo language family and the most widely spoken language in Ghana by about 8.3 million people. Just like any other language in the world with various levels of tones, Akan has two levels of tones, which agrees with the known assertion of Tsiwah et al. [51] that Akan is a tone language with two-level tones (high and low) that has both lexical and grammatical functions (e.g., tense and aspect marking).

As of 2021, Akan was the most spoken local language in Ghana, encompassing Twi dialects such as Fante, Akuapem, Akyem, Ahafo, and Asante [21]. From Fig. 1, about 80% of Ghanaians speak Akan as their first or second language, while 10 million people speak it globally. Most of them are in the Ashanti Region, Côte d'Ivoire, Benin, Togo, Jamaica, the United States, Nigeria, etc. [17]. Twi, on the other hand, is one of the four major dialects of Akan and aims to become the principal language of Ghana. It is also a tonal language with high, mid, and low tones, in which changing the tone of a syllable changes the meaning of the word. The dialects and their respective regions are shown in Fig. 1.

According to Grindlay et al. [18], approximately 8 million people, or around 58% of Ghana's population, speak Twi, and about 30% of them can be found in the Ivory Coast. The Twi dialect and its users are spread across Ghana, but its native speakers are concentrated in the Ashanti region and its surroundings, as shown in Table 1. Akuapem Twi was the first dialect into which the Bible was translated from English, and it can be written with the help of a standard script created by the Bureau of Ghana Languages.

Table 1 The regional groupings and Twi dialect communities in Ghana


2.3 Review of related literature

Most researchers adhere to the terminology used by Alotaibi [6] and Islam et al. [23], who define a parallel corpus as a collection of texts in one or more languages, together with their translations into another language or languages, stored in a machine-readable format. This involves the translation of original texts into other languages. For instance, the English sentence "How are you?" is translated into Twi as "Wo ho te sɛn?". Hu [22], Laviosa [30], and Kenny [25] have already noted that a corpus consisting of original texts and their translations is also referred to as a translation corpus. Generally, a parallel corpus is very useful for the automatic construction of lexicons and for contrastive research into translation problems across two or more languages. Parallel corpora can also be used to train MT models (Pham et al. [40]). In corpus building for under-resourced languages, a Crubadan project reported by Afram et al. [4] tried constructing an Akan (Twi) parallel corpus and obtained close to 547,909 Twi words from 176 documents crawled on the web. The Facebook Translation App, which Facebook uses to crowdsource translators globally to convert Facebook posts, features, and language-related words into their languages, helps Facebook build the corpus needed to establish a translator for each language and reveals the data difficulties associated with LRLs in Africa. [55] also found that automatic translation systems have recently started using parallel corpora increasingly to supply desired translations.

Chatzikoumi [12], on the other hand, defined MT as a computer program designed to translate text from one language (the source language) into another (the target language) without human help. Maimaiti et al. [35], in their study entitled 'Enriching the transfer learning with pre-trained lexicon embedding for low-resource neural MT,' found that MT has achieved SOTA performance in recent years for a few HRLs. This achievement was possible because of the ready availability of parallel corpora and efficient machine translation models, such as the Transformer architecture and its variants [14]. In contrast to LRLs, which lack detailed research and online datasets, HRLs are highly studied, researched, funded, and used for NLP, especially MT, primarily because datasets and tools such as language corpora, coupled with efficient NMT models, are readily available. Specifically, HRLs make up only 2.5% of the world's living languages, as identified by Leben [31]; even so, only a handful of LRLs like Twi are covered, while OPUS and Sketch Engine contain extensive databases of parallel corpora for HRLs. Moreover, MT exists to provide a system that can translate a text from the source language into the target language while preserving its meaning.

Additionally, the Twi dialect still lags in terms of online dataset availability. According to Azunre et al. [7,8,9], no language pair exists for the numerous Ghanaian languages apart from JW300, which is part of the reason for this lag. Strassel and Tracey [48] described the LORELEI (Low Resource Languages for Emergent Incidents) program, developed by the Defense Advanced Research Projects Agency (DARPA) in conjunction with the Linguistic Data Consortium (LDC) at the University of Pennsylvania, which focuses mainly on researching and improving language technology that can overcome overreliance on manual transcription and translation, or on manually annotated corpora. Various research organizations in Africa are also putting in more effort to tackle such linguistic setbacks for LRALs. For instance, Kilgarriff and Kosem [26] built 'MASAKHANE,' a Deep Learning Indaba, which is a research group that constructs machine learning tools for African languages. MASAKHANE has established approximately 45 benchmarks and 38 language pairs for African languages, which are distributed online and spread across the African continent. Christianson et al. [13] also mentioned that at the beginning of the LORELEI program, out of 6600 LRLs of the world, 32 language samples as well as 12 incident languages were picked for research purposes, including Hausa, Zulu, Yoruba, Twi, Somali, Swahili, and Wolof. Even the Typecraft Akan corpus and the Akan/Twi corpus contain almost no data [10]. Again, the JW300 corpus is religiously skewed and ideologically biased because of its overreliance on Bible texts.

3 Research methodology

This study relied on existing Twi-translated literature, following Martinus et al. [36], and later employed the methodology used by Kovář et al. [28] to translate text into Twi in creating our Twi-2-ENG corpus. This produced our Twi-English (Twi-2-ENG) parallel corpus from multiple-domain data sources: online news portals, the Ghanaian Parliamentary Hansard, the Twi-English New Testament (NT) Bible, social media networks (crowdsourcing), and published Twi literature. We additionally downloaded English news from national news archives, including digital news hubs such as myjoyonline.com, citionline.com, and peacefmonline.com, and excerpts of the Parliamentary Hansard crawled from the Ghana Parliament's official website. All these sources were considered because their contents span Ghanaian communities, culture, and social life. Further sources were the Medical Glossary, the Twi-English New Testament Bible, and other downloaded English documents.

3.1 Corpus formation outlined

A corpus is a collection of naturally occurring language text chosen to characterize a state or variety of a language. It is commonly assumed that parallel corpus data can be acquired from related websites; this holds for HRLs but not for LRALs. Lowphansiriku et al. [34] agreed that parallel text demands a modern digital parallel corpus, crawled from publicly available online data by web crawlers. The researchers therefore settled on a two-way data collection approach. The first is self-organized, manually acquired English sentences translated into Twi by professional translators, supported by standard literature and online media portals, in line with the approach of Kashefi [24]. The second is the automatic crawling of Twi-English parallel sentences across websites. Yet after months of constant online searching for Twi-English parallel text, the results were discouraging because of the underlying dataset problems for LRALs, which experts attribute to the lack of research interest in this area and, in Ghana, to the role of English as the lingua franca. Although the Ghana Education Service has directed primary school teachers to teach in Twi at first-cycle institutions, Twi is still an unofficial language in Ghana. Currently, online crawling of parallel data is not feasible, given the limited Twi text and sentences on the web.

The next approach was to construct a Twi text form on the Google platform and share a link with language students in second- and third-cycle institutions, language lovers, and social media friends to crowdsource Twi-English sentence pairs. All the responses received were scrutinized and analyzed in MS Excel and then passed through a professional review process to ensure quality.

Additionally, the researchers crawled a few free web literature sources and public news archives: the Myjoyonline news archives, the heterogeneous Ghanaian Parliamentary Hansards, GhanaWeb news, Adomfmonline news, Peacefmonline news, Citinewsroom, Ghana Broadcasting Corporation, and Daily Graphic, together with the Twi-English New Testament Bible (downloaded in PDF format from the JW website), the Twi Medical Glossary, and the like (see Table 2). We also followed the Universal Declaration of Human Rights on the datasets with the LRLs and the best means of handling such shortfalls. Despite these approaches, the shortage of pre-aligned Twi-English text pairs online persists because almost all the readily available online text is written in English only. The researchers were therefore left with no option but to translate all the English text into Twi and align the sentences individually to produce the datasets required for this study. Beautiful Soup, a Python package that parses HTML and XML documents into a parse tree, was employed to mine data from HTML for web scraping, and the free Web Scraper extension for Google Chrome was used to crawl text from the websites.
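The HTML-mining step can be illustrated with a minimal sketch. It uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup so that it runs without third-party packages; the class `ParagraphExtractor`, the function `extract_paragraphs`, and the sample markup are hypothetical and are not the authors' script.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of <p> elements, ignoring script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None        # text chunks of the <p> being read, else None
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self._buf = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p" and self._buf is not None:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None and not self._skip_depth:
            self._buf.append(data)


def extract_paragraphs(html):
    """Return the text of every non-empty paragraph in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page fragment: navigation and scripts are dropped.
sample = (
    "<html><body><nav>Home</nav>"
    "<p>Parliament  resumed <b>today</b>.</p>"
    "<script>x = 1</script><p> </p></body></html>"
)
print(extract_paragraphs(sample))
```

Restricting extraction to paragraph elements is a simplification; a real news page would likely need site-specific selectors.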

Table 2 Twi-2-ENG corpus data source and the sentence pairs demonstrating the trusted low-resourced language achievers in Ghana and the associated links


3.1.1 Twi-2-ENG Corpus conceptual framework

Figure 2 depicts how the data crawled from websites were formulated, analyzed, and systematically loaded into Sketch Engine to generate the Twi-to-English parallel corpus.

3.2 Texts/sentences translations, alignment, and text crawling arrangements

All the received data went through rigorous verification and checks to avoid misleading information. For sentence alignment, we employed a spreadsheet program to align the Twi sentences with their English counterparts. We used MS Excel to align the Holy Bible at the verse level, which was also cost-effective. Our Twi-2-ENG corpus contains 5,756 quality sentence pairs and approximately 138 k tokens. The web-downloaded documents were aligned paragraph by paragraph, because the best corpus is one whose HRL and LRL sentence pairs align correctly.
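The spreadsheet alignment described above amounts to pairing segments that share an identifier, such as a verse reference. A minimal sketch of that idea follows; the function `align_by_key`, the identifiers `s1` to `s3`, and the tab-separated output layout are assumptions for illustration, and the sample text reuses only expressions quoted elsewhere in this article.

```python
import csv
import io


def align_by_key(english, twi):
    """Pair segments that share a key (e.g. a verse reference).

    Returns the aligned (key, english, twi) rows plus the keys that
    appear on only one side and therefore need manual review.
    """
    pairs = [(key, english[key], twi[key]) for key in english if key in twi]
    unmatched = sorted(set(english) ^ set(twi))
    return pairs, unmatched


# Hypothetical segment tables keyed by identifier.
english = {"s1": "How are you?", "s2": "God", "s3": "Good morning."}
twi = {"s1": "Wo ho te sɛn?", "s2": "Onyame"}

pairs, unmatched = align_by_key(english, twi)

# Write the aligned rows as tab-separated text, one pair per line.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["id", "english", "twi"])
writer.writerows(pairs)
print(buffer.getvalue())
print("needs translation or review:", unmatched)
```

Keeping the unmatched keys separate makes it easy to hand untranslated segments back to the translators.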

With the assistance of 16 qualified Twi translators and linguistic professionals, we translated all the monolingual text into Twi and aligned it, ideally at the sentence level. We crowdsourced most of our Twi-English sentence pairs through Google-generated forms, which yielded over 1 k varied sentences. As indicated in Table 2 above, we extracted the raw text from the HTML files of different websites, such as jw.org, using tools such as a BeautifulSoup script and Spiderling.

3.3 Text duplication and data cleaning

We relied on online sentence and text duplicate-removal tools such as Trancemyip, Textfixer, Duplicate Cleaner, and Sketch Engine, which has a deduplication interface that can automatically remove duplicated texts from the whole corpus. To clean the data, we used the jusText tool to remove boilerplate content from HTML pages, such as navigation links, headers, footers, legal text, advertising, icons, tabular data, and social handles. Unusual jargon and double spaces were deleted from the sentence pairs, leaving only the recognized Twi orthographic characters. All unrelated images, hyperlinks, titles, URLs, chats, and tables were also removed from the crawled sentences to retain only the best texts and sentences.
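The cleaning and deduplication described above were done with external tools; the sketch below only illustrates the underlying operations (URL removal, whitespace normalization, and exact-duplicate removal on sentence pairs). The helper names `clean` and `deduplicate` and the case-insensitive duplicate test are assumptions, not the behaviour of the tools named in the text.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean(text):
    """Normalize Unicode, strip URLs, and collapse repeated whitespace."""
    text = unicodedata.normalize("NFC", text)  # keeps ɛ and ɔ in one form
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


def deduplicate(pairs):
    """Drop empty pairs and pairs already seen (ignoring case)."""
    seen = set()
    unique = []
    for english, twi in pairs:
        english, twi = clean(english), clean(twi)
        key = (english.casefold(), twi.casefold())
        if english and twi and key not in seen:
            seen.add(key)
            unique.append((english, twi))
    return unique


raw_pairs = [
    ("How are  you?", "Wo ho te sɛn?"),
    ("how are you?", "wo ho te sɛn?"),        # duplicate up to case
    ("Read more at https://example.org", ""),  # no Twi side, dropped
]
print(deduplicate(raw_pairs))
```

The first occurrence of each pair is kept, so the output preserves the original ordering of the corpus.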

3.4 Lemmatization and tokenization

We further applied lemmatization to reduce complex words to their recognized roots and reveal text and word similarities. The tokenizer in Sketch Engine was used to split the text sequences into smaller units. We found, however, that the available universal tokenizer identifies token boundaries only by whitespace characters and ignores critical language-specific rules. Professional POS tagging of the Twi-2-ENG corpus is left as future work.
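The limitation of whitespace-only tokenization can be seen in a small comparison. This is a sketch under stated assumptions: the regular expression `TOKEN_RE` is a hypothetical rule that separates punctuation from words, and it is not the tokenizer used inside Sketch Engine.

```python
import re

# Words (optionally joined by a hyphen or apostrophe) or single punctuation marks.
# \w is Unicode-aware in Python 3, so Twi letters such as ɛ and ɔ count as letters.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()


def regex_tokenize(text):
    """Split words from punctuation using TOKEN_RE."""
    return TOKEN_RE.findall(text)


sentence = "Wo ho te sɛn?"
print(whitespace_tokenize(sentence))  # the question mark stays attached to the last word
print(regex_tokenize(sentence))       # the question mark becomes its own token
```

Counting punctuation marks as separate tokens is one reason a corpus can report more tokens than words.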

4 Result discussions

To ensure effective results, we utilized a verified, registered online Sketch Engine embedded with analysis tools and corpus curation to create and host our Twi-2-ENG corpus. This methodology aligns with the approach employed by Kilgarriff et al. [27] in their "Corpus Factory for Many Languages" project.

4.1 1st Move: Sketch Engine for Twi-2-ENG corpus (algorithm)

To simplify the process, the algorithm (a finite set of rules or instructions) followed in preparing the Sketch Engine text corpus is detailed in Table 3 below.

Table 3 Capitalizing on the noted Algorithm Sketch Engine and the best level of preparation, from data sourcing to corpus verification of the selected words in the text corpus arrangement

4.2 2nd Move: Twi-2-ENG corpus statistic outlook

The statistics of the Twi-2-ENG corpus are presented in Table 4.

Table 4 Twi-2-ENG corpus statistic submission of the different datasets. LRL stands for Low Resource Language, which is the Twi language of Ghana, while HRL represents the High Resource Language, which is the English language.

Table 4 shows that the HRL (English) side has 50,124 words, 61,221 tokens, and 5,756 sentences, while the LRL (Twi) side has 52,401 words, 65,183 tokens, and 5,756 sentences.

4.3 3rd Move: Twi-2-ENG corpus functions (Navigations)

The word sketch for Onyame (God) is presented in Fig. 3 below. It affirms the claim by Stewart et al. [47] that a word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behaviour.

4.4 The parallel concordance

When parallel corpora are carefully aligned, users can search for words, text types, phrases, documents, tags, or corpus structures in one language and see the results displayed together with the aligned translated segments in another language (see Fig. 4). A concordance helps to search through translation memory for a particular word or phrase to find how it was translated elsewhere. It also cites Biblical words with references to the instances in which they were used. Our Twi-2-ENG platform contains a vast number of up-to-date Twi words, phrases, tags, and text documents that support the concordance function in the search engine, bringing together all the valid Twi-translated words and their meanings.
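The idea of a parallel concordance can be sketched as a word search over aligned pairs that returns both sides of every hit. The function `parallel_concordance` and the two sample pairs are hypothetical and only illustrate the principle; Sketch Engine's own concordancer is far richer.

```python
import re


def parallel_concordance(pairs, query, side="twi"):
    """Return every (english, twi) pair whose chosen side contains the query word."""
    index = 1 if side == "twi" else 0
    wanted = query.casefold()
    hits = []
    for pair in pairs:
        # Compare whole words, ignoring case and punctuation.
        words = re.findall(r"\w+", pair[index].casefold())
        if wanted in words:
            hits.append(pair)
    return hits


corpus = [
    ("How are you?", "Wo ho te sɛn?"),
    ("God", "Onyame"),
]
print(parallel_concordance(corpus, "onyame"))             # search the Twi side
print(parallel_concordance(corpus, "you", side="english"))
```

Because the pairs are stored together, a hit on one side always brings its aligned translation with it.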

4.5 The N-grams

N-grams, also known as multi-word expressions (MWEs), are contiguous sequences of neighbouring words, symbols, or tokens in a document. Figure 5 shows the 3- to 4-grams, with 3,985 word frequencies. N-grams are the single most successful type of feature in authorship attribution; they are primarily used in text categorization problems and can easily be applied to a new language. In our Twi-2-ENG search platform, users can filter both known and unknown regular Twi expressions down to specific n-grams and generate them for any attribute, with word and lemma being the most frequently used. This function assists in identifying symbols, texts, punctuation locations, and numbering.
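A minimal sketch of counting 3- to 4-grams follows. The helper names `ngrams` and `ngram_counts`, the whitespace tokenization, and the placeholder sentences are assumptions for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n_min=3, n_max=4):
    """Count every n-gram with n_min <= n <= n_max across the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.casefold().split()
        for n in range(n_min, n_max + 1):
            counts.update(ngrams(tokens, n))
    return counts


# Placeholder sentences; real input would be the tokenized corpus.
counts = ngram_counts(["a b c d", "a b c"])
for gram, freq in counts.most_common(3):
    print(" ".join(gram), freq)
```

Sentences shorter than `n` tokens simply contribute no n-grams of that length.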

4.6 The wordlist

The Twi-2-ENG corpus reached a maximum frequency of 65,183 and a specialized text itemization of 51,624 in the wordlist tool. The Wordlist in Fig. 6 lists the lexicon found within the text corpus for vocabulary acquisition purposes; it can generate lists of verbs, adjectives, nouns, or other parts of speech, together with tags, lemmas, and other attributes. The Wordlist will help users to update their Twi vocabulary, advancing their objectives and making better use of verbs and nouns in their speech. This agrees with Brezina and Gablasova [11], who, after applying the Wordlist tool in their projects, reaffirmed that three different frequency measures can be displayed in the Wordlist: frequency, frequency per million, and ARF.
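Two of the three frequency measures named above, raw frequency and frequency per million, can be computed directly from token counts; ARF is omitted here because it also depends on how occurrences are distributed through the corpus. The function `wordlist` and the placeholder tokens are hypothetical.

```python
from collections import Counter


def wordlist(tokens):
    """Return (word, frequency, frequency per million) rows, most frequent first."""
    counts = Counter(tokens)
    total = len(tokens)
    return [
        (word, freq, freq / total * 1_000_000)
        for word, freq in counts.most_common()
    ]


# Placeholder tokens; real input would be the tokenized Twi side of the corpus.
rows = wordlist(["a", "b", "a", "c"])
for word, freq, per_million in rows:
    print(f"{word}\t{freq}\t{per_million:.1f}")
```

Frequency per million makes counts comparable between corpora of different sizes, such as the Twi and English sides.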

4.7 The keywords

Keywords are individual words or tokens that appear more frequently in the focus corpus than in the reference corpus, as revealed in Fig. 7. In cryptography, the term also denotes a word used to encipher or decipher a cryptogram, as the basis for a complex substitution or the pattern for a transposition procedure. Keywords allow one to quickly locate all essential words and phrases in large datasets. Keywords and terms have similar functions; Liu [32] noted that terms are multiple-word expressions that appear more frequently in the focus corpus than in the reference corpus and match the selected format of the language technology. Since our Twi-2-ENG corpus contains large volumes of complex text, words, symbols, punctuation, phrases, etc., the keywords function within the search engine helps to locate all important words, as well as multi-word expressions and related needs.
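One common way to rank keywords is to compare normalized frequencies in the focus and reference corpora. The sketch below uses the "simple maths" ratio, (focus frequency per million + n) / (reference frequency per million + n), which is assumed here for illustration; the article does not state which score was used, and the function names and sample counts are hypothetical.

```python
from collections import Counter


def per_million(count, total):
    """Normalize a raw count to occurrences per million tokens."""
    return count / total * 1_000_000


def keyness(focus_counts, ref_counts, n=1.0):
    """Score each focus-corpus word; higher means more typical of the focus corpus.

    Both corpora are assumed to be non-empty. The smoothing constant n
    avoids division by zero for words absent from the reference corpus.
    """
    focus_total = sum(focus_counts.values())
    ref_total = sum(ref_counts.values())
    scores = {}
    for word, count in focus_counts.items():
        focus_fpm = per_million(count, focus_total)
        ref_fpm = per_million(ref_counts.get(word, 0), ref_total)
        scores[word] = (focus_fpm + n) / (ref_fpm + n)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder counts: "x" occurs only in the focus corpus.
focus = Counter({"x": 3, "the": 1})
reference = Counter({"the": 4})
ranked = keyness(focus, reference)
print(ranked[0][0])
```

Raising `n` shifts the ranking towards more common words, while a small `n` favours rare, corpus-specific ones.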

5 Concluding remarks

Even though Twi, a Ghanaian LRL, has made strides and is widely spoken even beyond Ghana's borders, primarily among Ivorian communities, and some regard it as the best resource for Twi Machine Translation, the problem with the Twi-English parallel corpus in the multiple-domain dataset category persists, partly because of the limited digital Twi lexicon and complex design structure. This study relied on a parallel corpus designed for NLP applications such as MT to support computational linguistic research on the Twi language. It deployed the Sketch Engine, following Fagbolu et al. [15], and the corpus curation and analysis software of Ghaddar and Langlais [16] to construct the corpus. Furthermore, we used 5,756 sentence pairs from the Ghana Parliament Hansard, NT Bible (English-Twi), Twi Medical Glossary Dictionary (Twi), Citinewsroom, Ghanaweb, Daily Graphic, Myjoyonline, Peacefmonline, Adomonline Archives, UDHR Twi-English (Crowdsourced), and Ghana Broadcasting Corporation, which together make up the Twi-2-ENG parallel corpus of 13,488 well-organized tokenized sentence pairs, giving our Twi-2-ENG corpus its unique identity.

The significant contribution of qualified linguistic professionals and Twi translation specialists from across the media, academia, and well-wishers adds a considerable volume to the Twi-2-ENG parallel corpus, which is freely accessible on Sketch Engine online. This study opens up a new search platform capable of reducing complex Twi words such as "Onyankopɔn" (God) to the simpler "Onyame," which has the same meaning; such a platform is not available in previous research works and will add to the literature. The study also recommends that future researchers and academic bodies in Ghana and across Africa allocate funds in their education budgets to promote and uplift the LRLs in Africa, using the HRLs as a case study, to ensure quality datasets for LRLs on all websites, the lack of which was the major limitation of our research.

Data availability

The data used in this study will be made available upon request.

Funding

The authors did not receive any funding for this study.

Author information

Authors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Emmanuel Agyei, Xiaoling Zhang, Stephen Bannerman, Ama Bonuah Quaye, Sophyani Banaamwini Yussi & Victor Kwaku Agbesi

Contributions

Emmanuel Agyei conceived and designed the study, collected the dataset, drafted the work, and validated the results. Xiaoling Zhang supervised the study. Emmanuel Agyei, Stephen Bannerman, Victor Kwaku Agbesi, Ama Bonuah Quaye, and Sophyani Banaamwini Yussi reviewed and edited the manuscript and performed the data analysis and interpretation.

Corresponding author

Correspondence to Stephen Bannerman.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Agyei, E., Zhang, X., Bannerman, S. et al. Low resource Twi-English parallel corpus for machine translation in multiple domains (Twi-2-ENG). Discov Computing 27, 17 (2024). https://doi.org/10.1007/s10791-024-09451-8

Received: 09 May 2024
Accepted: 24 June 2024
Published: 05 July 2024
DOI: https://doi.org/10.1007/s10791-024-09451-8
