A text-to-speech (TTS) synthesizer is a computer-based system that reads text aloud automatically, whether the text arrives through a computer input stream or as scanned input passed through an optical character recognition (OCR) engine [2]. Speech synthesis consists of two main phases: text analysis, which converts the input text into a phonemic internal representation, and waveform synthesis, which converts that representation into a waveform. These two phases are usually called high-level and low-level synthesis, respectively [11]. Many papers have been published on text-to-speech synthesis for different languages around the world. The studies most relevant to this work are reviewed in the remainder of this section.
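The two-phase structure described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the toy lexicon, the function names (`text_analysis`, `waveform_synthesis`, `synthesize`), and the sine-tone placeholder for the waveform stage are all assumptions chosen only to show how a high-level stage hands a phonemic representation to a low-level stage.

```python
import math

# Hypothetical toy lexicon (word -> phoneme symbols); not taken from any cited system.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_analysis(text):
    """High-level synthesis: convert raw text into a flat phoneme sequence."""
    phonemes = []
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if not word:
            continue
        # Fall back to spelling the word letter by letter when it is not in the lexicon.
        phonemes.extend(TOY_LEXICON.get(word, list(word.upper())))
    return phonemes


def waveform_synthesis(phonemes, sample_rate=16000, unit_ms=50):
    """Low-level synthesis placeholder: emit one short sine tone per phoneme."""
    samples_per_unit = sample_rate * unit_ms // 1000
    samples = []
    for index, _phoneme in enumerate(phonemes):
        freq = 200.0 + 20.0 * (index % 10)  # arbitrary pitch, for illustration only
        for n in range(samples_per_unit):
            samples.append(math.sin(2.0 * math.pi * freq * n / sample_rate))
    return samples


def synthesize(text):
    """Run both phases: text -> phonemes -> waveform samples."""
    return waveform_synthesis(text_analysis(text))
```

A real system would replace the lexicon lookup with language-specific normalization and letter-to-sound rules, and the sine tones with recorded or modeled speech units.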
Jayanthi et al. [12] proposed a TTS system for Indian languages based on unit selection synthesis with syllable units. The main purpose of the study was to enhance synthesis quality by using larger or variable-size units, and the experiments were carried out on a set of 1180 prompts. Syllable segments were extracted from a continuous speech database and classified into begin, mid, and single units based on their position in words. The authors state that their method increased synthesis quality and reduced the search space, improving synthesis effectiveness.
Borkar and Patil [13] developed a text-to-speech system for the Konkani language using a concatenation technique. To validate the proposed method, the researchers experimented on more than 1000 commonly used Konkani words. The data set comprised around 3000 wave files of vowels, characters, and half characters, recorded from students' voices. They reported that the concatenation method performed well for Konkani because the system converted complex words to speech easily.
In a separate study, Shreekanth et al. [14] developed a unit selection-based Hindi TTS system using syllables as the basic unit. The system was implemented using .NET and MATLAB. The researchers prepared a data set of about 1540 words from a standard ‘Hindi to English dictionary’ and recorded the syllables at a sampling frequency of 16 kHz with 16 bits per sample. Moreover, they classified each syllable into one of three possible positions: the beginning of the word (start), between two syllables (middle), and the end of the word (end).
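The position-based classification of syllables described in these studies can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be an already syllabified word, the function name `label_syllable_positions` is hypothetical, and the `single` label for one-syllable words is borrowed from the description of Jayanthi et al. [12] rather than from the Hindi system.

```python
def label_syllable_positions(syllables):
    """Tag each syllable of one word with its position in that word.

    Assumes `syllables` is an already syllabified word (a list of strings).
    Labels: "start", "middle", "end", plus "single" for one-syllable words
    (the "single" label is an assumption, see the lead-in).
    """
    if len(syllables) == 1:
        return [(syllables[0], "single")]

    labels = []
    last = len(syllables) - 1
    for index, syllable in enumerate(syllables):
        if index == 0:
            position = "start"
        elif index == last:
            position = "end"
        else:
            position = "middle"
        labels.append((syllable, position))
    return labels


# Placeholder romanized syllables, used purely for illustration.
print(label_syllable_positions(["bha", "ra", "ti", "ya"]))
```

Storing the position with each unit lets a unit selection search consider only candidates recorded in the same position, which is one way the search space can be reduced.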
Some researchers have conducted TTS studies on different Ethiopian languages. For instance, Nadew [15] developed a speech synthesizer for Amharic, an Ethio-Semitic language, using a formant-based synthesis method, which is sometimes called synthesis by rule. The purpose of that study was to generate vowels from a given context, using the best selection of parameters from a decision tree. The researcher validated the proposed technique with an experiment on five hundred (500) isolated Amharic words and digit strings recorded by twelve (12) different Amharic speakers. The author concluded that the formant-based method provided high flexibility because any of the acoustic parameters can be adjusted at run time. However, that research did not consider Amharic consonants. Hence, the author recommended that both the vowels and the consonants of Amharic be considered to make the system fully functional.
In addition, Sebsibe [16] developed a unit selection voice for Amharic using Festvox, a tool that helps in the creation and analysis of large speech corpora. In this study, the author developed a unit selection concatenative speech synthesizer by defining a transliteration scheme using standard ASCII characters. The system was designed based on the orthographic ordering of the script and the sound relationships of the letters. The author also defined in Festvox the phone set of the language, with each phone's features such as voicing, tongue position, tongue height, and place of articulation, together with syllabification rules and letter-to-sound rules. The researcher validated the proposed approach with an experiment on a corpus of 29,480 diphone instances covering 801 unique diphones, and tested a corpus with a total of 12,724 syllable instances and 1317 unique syllables. The study concluded that the proposed system was suitable for Amharic TTS; performance was rated on a scale from excellent (5) to very poor (0), with a cumulative result of 2.9. Finally, the researcher recommended optimal corpus selection to produce better quality.
Tewodros [17] developed a text-to-speech synthesizer for Wolaytta, another Ethiopian language, which belongs to the Omotic family. The approach adopted was diphone concatenative synthesis based on the residual LPC technique. As the researcher indicated, several experiments were conducted to validate the proposed technique on a data set of 841 diphones, out of 1156 hypothetical diphones, collected from different Wolaytta word corpora. The experiments showed an average performance of 78%, with a mean opinion score (MOS) of 3.17 for intelligibility and 2.77 for naturalness. Finally, the researcher recommended that the phonetics and phonology of Wolaytta be considered to produce a better-quality synthesizer.
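The figures reported for these systems rest on two simple calculations, sketched here for clarity. The coverage ratio uses the 841 and 1156 diphone counts given above; the listener ratings passed to `mean_opinion_score` are hypothetical and are not data from any cited study, and both function names are assumptions.

```python
def diphone_coverage(collected, hypothetical):
    """Fraction of the hypothetical diphone inventory present in the data set."""
    if hypothetical <= 0:
        raise ValueError("hypothetical inventory size must be positive")
    return collected / hypothetical


def mean_opinion_score(ratings):
    """MOS: arithmetic mean of the listeners' ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


# Counts reported for the Wolaytta diphone data set.
coverage = diphone_coverage(841, 1156)
print(f"diphone coverage: {coverage:.1%}")  # diphone coverage: 72.8%

# Hypothetical ratings from five listeners, for illustration only.
print(mean_opinion_score([4, 3, 3, 2, 4]))
```

Because a MOS is an average over listeners and test sentences, scores from different studies are comparable only when the rating scale and the listening conditions are similar.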
Most of the reviewed studies reported that concatenative speech synthesis approaches are promising for TTS. However, only two local studies have been conducted to develop TTS for Afan Oromo, although Afan Oromo is the most widely spoken language in Ethiopia [5]. From the reviewed studies, the weaknesses of the TTS systems developed for Afan Oromo were identified. This work aims to improve the intelligibility and naturalness of TTS for Afan Oromo by using the unit selection approach and carefully considering the phonology of the language, while keeping all the advantages of the earlier prototype. In addition, a TTS system built for another language cannot be adopted easily, because TTS depends on the structure and nature of the language. Therefore, the most effective methods and tools are used to implement a text-to-speech synthesizer for Afan Oromo.