Although current speech synthesis is oriented more towards unit synthesis, there is still a need for a formant-based synthesiser. A formant-based speech synthesiser is a fundamental tool in those fields of speech research where detailed control of speech parameters is essential. For example, research on adult learners' vowel contrasts in second language acquisition may require tight control over the parameters of speech stimuli, and the same holds for the investigation of vowel categorisation development in infants [6]. Formant-based synthesis systems are also still in use for synthesising different voices and voice characteristics and for modeling emotive speech [12].
A very well known and widely used formant-based speech synthesiser is the Klatt synthesiser [7, 8]. One reason for its popularity is that the FORTRAN reference code was freely available, as were several C language implementations. In Fig. 5.1 we show a schematic diagram of this synthesiser with the vocal tract section realised with filters in cascade. Since a KlattGrid is based on the same design, this is also the diagram of a KlattGrid. The synthesiser essentially consists of four parts:
1. The phonation part generates voicing as well as aspiration. It is represented by the top left dotted box labeled with the number 1 in its top right corner.
2. The coupling part models coupling between the phonation part and the next part, the vocal tract. In the figure it is indicated by the dotted box labeled with the number 2.
3. The vocal tract part filters the sound generated by the phonation part. The top right dotted box labeled 3 shows this part as a cascade of formant and antiformant filters. The vocal tract part can also be modeled with formant filters that are in parallel instead of in cascade.
4. The frication part generates frication noise and is represented by the dotted box labeled 4.
A number of implementations of the Klatt synthesiser exist nowadays. However, they all show some of the limitations of the original design, which dates from a time when computer memory and processing power were relatively scarce. Compromises had to be made at that time in order to achieve reasonable performance.
We present the KlattGrid speech synthesiser which is based on the original description of Klatt [7, 8]. There are several new, not necessarily innovative, aspects in the KlattGrid in comparison with some other Klatt-type synthesisers.
- A Klatt synthesiser is frame-based, i.e. parameters are modeled to be constant during the interval of a frame, typically some 5 or 10 ms. As a consequence, instants of parameter change have to be synchronised on a frame basis. This poses some difficulty in modeling events where timing is important, such as a rapidly changing amplitude for plosive bursts. We have removed this limitation by modeling all parameters in a KlattGrid as tiers. A tier represents a parameter contour as a function of time by (time, value) points. Parameter values at any time can be calculated from these points by some kind of interpolation. For example, a formant frequency tier with two (time, frequency) points, namely 800 Hz at a time of 0.1 s and 300 Hz at 0.3 s, is to be interpreted as a formant frequency contour that is constant at 800 Hz for all times before 0.1 s, constant at 300 Hz for all times after 0.3 s and linearly interpolated for all times between 0.1 and 0.3 s (i.e. 675 Hz at 0.15 s, 550 Hz at 0.2 s, and so on). By leaving the frame-based approach of previous synthesisers, all parameter timings become transparent and only moments of parameter change have to be specified.
- In a Klatt synthesiser one can normally define some six to eight oral formants and one nasal and one tracheal formant/antiformant pair. In a KlattGrid any number of oral formants, nasal formants and nasal antiformants, tracheal formants and tracheal antiformants are possible.
- In a Klatt synthesiser there is only one set of formant frequencies that has to be shared between the vocal tract part and the frication part. In a KlattGrid the formant frequencies in the frication part and the vocal tract part have been completely decoupled from one another.
- In the Klatt synthesiser the glottal flow function has to be specified beforehand. A KlattGrid allows varying the form of the glottal flow function as a function of time.
- In the Klatt synthesiser only the frequency and bandwidth of the first formant can be modified during the open phase. In a KlattGrid there is no limit to the number of formants and bandwidths that can be modified during the open phase of the glottis.
- In Klatt's synthesiser all amplitude parameters have been quantised to 1 dB levels beforehand. In a KlattGrid there is no such quantisation. All amplitudes are represented according to the exact specifications. Quantisation only takes place on the final samples of a sound when it has to be played or saved to disk (playing with 16-bit precision, for example). Of course sampling frequencies can be chosen freely.
- A KlattGrid is fully integrated into the speech analysis program Praat [2]. This makes the synthesiser available on the major desktop operating systems of today: Linux, Windows and Mac OS X. At the same time all scripting, visualisations and analysis methods of the Praat program become directly available for the synthesised sounds.
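The tier interpretation described in the first item of the list above can be sketched in a few lines of Python. This is only an illustration of the stated rule (constant before the first point, constant after the last point, linear in between); the function name `tier_value` and the list-of-tuples representation are assumptions made for this sketch and are not part of Praat.

```python
from bisect import bisect_right


def tier_value(points, t):
    """Evaluate a tier given as a time-sorted list of (time, value) points.

    Hypothetical helper: constant extrapolation outside the points,
    linear interpolation between neighbouring points.
    """
    if not points:
        raise ValueError("tier has no points")
    times = [time for time, _ in points]
    i = bisect_right(times, t)      # number of points with time <= t
    if i == 0:                      # before the first point
        return points[0][1]
    if i == len(points):            # at or after the last point
        return points[-1][1]
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# The two-point formant frequency tier used as the example in the text.
formant_tier = [(0.1, 800.0), (0.3, 300.0)]
for t in (0.0, 0.15, 0.2, 0.5):
    print(f"{t:.2f} s -> {tier_value(formant_tier, t):.1f} Hz")
```

Because only the points where a parameter changes are stored, a contour that is constant over the whole sound needs a single point.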
More details on the KlattGrid can be found in the following sections, which describe the four parts of the synthesiser. The description summarises the synthesiser parameters and how they were implemented.
2.1 The Phonation Part
The phonation part serves two functions:
1. It generates voicing, which includes the timings of the glottal cycle. The part responsible for these timings is shown by the box labeled “Voicing” in Fig. 5.1. The start and end times of the open phase of the glottis serve to:
   - Generate glottal flow during the open phase of the glottis.
   - Generate breathiness, i.e. noise that occurs only during the open phase of the glottis.
   - Calculate when formant frequencies and bandwidths change during the open phase (if formant change information is present in the coupling part).
2. It generates aspiration. This part is indicated by the box labeled “Aspiration” in Fig. 5.1. In contrast with breathiness, aspiration may take place independently of any glottal timing.
The phonation parameter tiers do not all modify the glottal flow function independently: some of the parameters involved have similar spectral effects. In this article we do not go into these details and only briefly summarise each tier's function in the phonation part. For an extensive overview of the effects of several glottal flow parameters on the source spectrum see, for example, the article of Doval et al. [5]. The following 11 tiers form the phonation part:
- Pitch tier: For voiced sounds the pitch tier models the fundamental frequency as a function of time. Pitch equals the number of glottal opening/closing cycles per unit of time. In the absence of flutter and double pulsing, the pitch tier is the only determiner of the instants of glottal closure. Currently pitch interpolation happens on a linear frequency scale, but other interpolation, for example on a log scale, can be added easily.
- Voicing amplitude tier: The voicing amplitude regulates the maximum amplitude of the glottal flow in dB. A flow with amplitude 1 corresponds to approximately 94 dB. To produce a voiced sound it is essential that this tier is not empty.
- Flutter tier: Flutter models a kind of “random” variation of the pitch and is input as a number from zero to one. This random variation can be introduced to avoid the mechanical, monotonous sound that arises whenever the pitch remains constant during a longer time interval. The fundamental frequency is modified by a flutter component according to the following semi-periodic function that we adapted from [7]: \(F_0'(t) = 0.01 \cdot \mathrm{flutter} \cdot F_0 \cdot (\sin(2\pi\, 12.7t) + \sin(2\pi\, 7.1t) + \sin(2\pi\, 4.7t))\)
- Open phase tier: The open phase tier models the open phase of the glottis with a number between zero and one. The open phase is the fraction of one glottal period that the glottis is open. The open phase tier is an optional tier, i.e. if no points are defined then a sensible default for the open phase is taken (0.7). If the open phase becomes smaller, the high frequency content of the source spectrum will necessarily increase.
- Power1 and power2 tiers: These tiers model the form of the glottal flow function during the open phase of the glottis as \(\mathrm{flow}(t) = t^{\mathrm{power1}} - t^{\mathrm{power2}}\), where 0 ≤ t ≤ 1 is the relative time that runs from the start to the end of the open phase. To model a proper flow it is essential that the value of power2 is always larger than the value of power1. If these tiers have no values specified by the user, the default values power1 = 3 and power2 = 4 are used. Figure 5.2 shows the effect of the values in these tiers on the form of the flow and its derivative. As power2 mainly influences the falling part of the flow function, we see that the higher the value of this parameter, the faster the flow function reaches zero, i.e. the shorter the closing time of the glottis and, consequently, the more high frequency content the glottal spectrum will have.
- Collision phase tier: The collision phase parameter models the last part of the flow function with an exponential decay function instead of a polynomial one. A value of 0.04, for example, means that the amplitude will decay by a factor of e ≈ 2.7183 every 4 % of a period. The introduction of a collision phase will reduce the high frequency content in the glottal spectrum because of the smoother transition towards the closure.
- Spectral tilt tier: Spectral tilt represents the extra number of dB by which the voicing spectrum should be tilted down at 3,000 Hz [7]. This parameter is necessary to model “corner rounding”, i.e. when glottal closure is not simultaneous along the length of the vocal folds. If no points are defined in this tier, spectral tilt defaults to 0 dB and no spectral modifications are made.
- Aspiration amplitude tier: The aspiration amplitude tier models the (maximum) amplitude of noise generated at the glottis. The aspiration noise amplitude is, like the voicing amplitude, specified in dB. This noise is independent of glottal timings and is generated from random uniform noise which is filtered by a very soft low-pass filter.
- Breathiness amplitude tier: The breathiness amplitude tier models the maximum noise amplitude during the open phase of the glottis. The amplitude of the breathiness noise, which is plain random uniform noise, is modulated by the glottal flow. It is specified in dB.
- Double pulsing tier: The double pulsing tier models diplophonia (by a number from zero to one). Whenever this parameter is greater than zero, alternate pulses are modified. A pulse is modified with this single parameter tier in two ways: it is delayed in time and its amplitude is attenuated. If the double pulsing value is maximum (= 1), the time of closure of the first peak coincides with the opening time of the second one (but its amplitude will be zero).
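Two of the formulas in the tier descriptions above, the flutter component and the power1/power2 flow function, can be evaluated directly. The following Python sketch does only that; the function names are hypothetical, amplitude scaling to dB is left out, and the peak-time expression is simply what follows from setting the derivative of \(t^{\mathrm{power1}} - t^{\mathrm{power2}}\) to zero.

```python
import math


def flutter_delta_f0(t, f0, flutter):
    """Flutter component F0'(t) from the formula in the text (flutter in 0..1)."""
    return 0.01 * flutter * f0 * (
        math.sin(2 * math.pi * 12.7 * t)
        + math.sin(2 * math.pi * 7.1 * t)
        + math.sin(2 * math.pi * 4.7 * t)
    )


def glottal_flow(t_rel, power1=3.0, power2=4.0):
    """Unscaled flow t^power1 - t^power2 for relative time 0 <= t_rel <= 1."""
    if power2 <= power1:
        raise ValueError("power2 must be larger than power1")
    return t_rel ** power1 - t_rel ** power2


def flow_peak_time(power1=3.0, power2=4.0):
    """Relative time of maximum flow, from d/dt (t^p1 - t^p2) = 0."""
    return (power1 / power2) ** (1.0 / (power2 - power1))


# With the default powers the flow peaks three quarters of the way through
# the open phase; a larger power2 moves the peak later, so the flow has
# less time left to fall back to zero.
for p2 in (4.0, 6.0, 10.0):
    print(f"power2 = {p2:>4}: peak at relative time {flow_peak_time(3.0, p2):.3f}")
```

The later peak for larger power2 values matches the shorter closing time described in the text.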
2.2 The Vocal Tract Part
The sound generated by the phonation part of a KlattGrid may be modified by the filters of the vocal tract part. These filters are the oral formant filters, nasal formant filters and nasal antiformant filters. A formant filter boosts frequencies and an antiformant filter attenuates frequencies in a certain frequency region. For speech synthesis the vocal tract formant filters can be used in cascade or in parallel. By default these filters are used in cascade, as is shown in Fig. 5.1 in the part numbered 3, unless otherwise specified by the user. Each formant filter is governed by two tiers: a formant frequency tier and a formant bandwidth tier. In the case of parallel synthesis an additional formant amplitude tier must be specified.
Formant filters are implemented in the standard way as second order recursive digital filters of the form \(y_n = a x_n + b y_{n-1} + c y_{n-2}\) as described in [7] (\(x_i\) represents input and \(y_j\) output). These filters are also called digital resonators. The coefficients b and c at any time instant n can be calculated from the formant frequency and bandwidth values of the corresponding tiers. The a parameter is only a scaling factor and is chosen as \(a = 1 - b - c\); this makes the frequency response equal to 1 at 0 frequency. Antiformants are second order filters of the form \(y_n = a' x_n + b' x_{n-1} + c' x_{n-2}\). The coefficients a′, b′ and c′ are also determined as described in [7]. When formant filters are used in cascade all formant filters start with the same value at 0 Hz. If used in parallel this is not the case anymore since each formant's amplitude must be specified on a different tier.
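The text refers to [7] for the coefficient formulas without listing them. The Python sketch below assumes the expressions usually quoted from Klatt's description, \(c = -e^{-2\pi B T}\), \(b = 2e^{-\pi B T}\cos(2\pi F T)\) and \(a = 1 - b - c\), with the antiformant coefficients \(a' = 1/a\), \(b' = -b/a\), \(c' = -c/a\). The sampling frequency of 44,100 Hz and all function names are assumptions made for illustration, not the KlattGrid implementation.

```python
import cmath
import math


def resonator_coefficients(frequency, bandwidth, sample_rate):
    """Return (a, b, c) for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    period = 1.0 / sample_rate
    r = math.exp(-math.pi * bandwidth * period)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * frequency * period)
    a = 1.0 - b - c                      # unit gain at 0 Hz
    return a, b, c


def antiresonator_coefficients(frequency, bandwidth, sample_rate):
    """Return (a', b', c') for y[n] = a'*x[n] + b'*x[n-1] + c'*x[n-2]."""
    a, b, c = resonator_coefficients(frequency, bandwidth, sample_rate)
    return 1.0 / a, -b / a, -c / a


def resonator_gain(coeffs, f, sample_rate):
    """Magnitude response of the recursive (formant) filter at frequency f."""
    a, b, c = coeffs
    z1 = cmath.exp(-2j * math.pi * f / sample_rate)   # one-sample delay
    return abs(a / (1.0 - b * z1 - c * z1 * z1))


def antiresonator_gain(coeffs, f, sample_rate):
    """Magnitude response of the non-recursive (antiformant) filter at f."""
    a, b, c = coeffs
    z1 = cmath.exp(-2j * math.pi * f / sample_rate)
    return abs(a + b * z1 + c * z1 * z1)


def pair_gain(f, formant_bw, antiformant_bw, centre=1000.0, sample_rate=44100.0):
    """Gain of a formant and an antiformant in cascade at the same centre frequency."""
    res = resonator_coefficients(centre, formant_bw, sample_rate)
    anti = antiresonator_coefficients(centre, antiformant_bw, sample_rate)
    return resonator_gain(res, f, sample_rate) * antiresonator_gain(anti, f, sample_rate)


# Gain in dB at the centre frequency for a 25 Hz antiformant paired with
# formants of increasing bandwidth.
for bw in (50.0, 100.0, 200.0, 400.0, 800.0):
    gain_db = 20.0 * math.log10(pair_gain(1000.0, bw, 25.0))
    print(f"formant bandwidth {bw:5.0f} Hz: {gain_db:6.1f} dB at 1000 Hz")
```

With these formulas the antiformant is the exact inverse of a formant with the same frequency and bandwidth, so a pair with equal bandwidths cancels out completely.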
As an example we show in Fig. 5.3 the frequency responses of formant/antiformant pairs where both formant and antiformant have the same “formant” frequency, namely 1,000 Hz, but different bandwidths. The bandwidth of the antiformant filter was fixed at 25 Hz but the bandwidth of the formant filter doubles at each step. From top to bottom it starts at 50 Hz and then moves to 100, 200, 400 and 800 Hz. A perfect spectral “dip” results, with hardly any side-effect on the spectral amplitude. This shows that a combination of a formant and antiformant at the same frequency can model a spectral dip: the formant compensates for the effect of the antiformant on the slope of the spectrum. The best spectral dips are obtained when the formant bandwidth is approximately 500 Hz. For larger bandwidths the dip will not become any deeper, the flatness of the spectrum will disappear and especially the higher frequencies will be amplified substantially.
2.3 The Coupling Part
The coupling part of a KlattGrid models the interactions between the phonation part, i.e. the glottis, and the vocal tract. Coupling is only partly shown in Fig. 5.1: only the tracheal formants and antiformants are shown. We have displayed them in front of the vocal tract part, after the phonation part, because tracheal formants and antiformants are implemented as if they filter the phonation source signal.
Besides the tracheal system with its formants and antiformants, the coupling part also models the change of formant frequencies and bandwidths during the open phase of the glottis. With a so-called delta formant grid we can specify the amount of change of any formant and/or bandwidth during the open phase of the glottis. The values in the delta tiers will be added to the values of the corresponding formant tiers, but only during the open phase of the glottis.
In Fig. 5.4 we show two examples where extreme coupling values have been used for a clear visual effect. In all panels the generated voiced sounds had a constant 100 Hz pitch, a constant open phase of 0.5 to make the durations of the open and closed phases equal, and only one formant. In the left part of the figure the formant bandwidth is held constant at 50 Hz while the formant frequency is modified during the open phase. The oral formant frequency was set to 500 Hz. By setting a delta formant point to a value of 500 Hz we accomplish that at the start of the open phase of the glottis the formant frequency will increase by 500 Hz to 1,000 Hz. At the end of the open phase it will then decrease to the original 500 Hz value of the formant tier. To avoid instantaneous changes we let the formant frequency increase and decrease with the delta value in a short interval that is one tenth of the duration of the open phase. In a future version of the synthesiser we hope to overcome this limitation [13]. The top display at the left shows the actual first formant frequency as a function of time during the first 0.03 s of the sound. This is exactly the duration of three pitch periods; the moments of glottal closure are indicated by dotted lines. The bottom left display shows the corresponding one-formant sound signal. The 100 Hz periodicity is visible, as well as the formant frequency doubling in the second part of each period: we count almost two and a half periods of this formant in the first half of a period, the closed phase, and approximately five during the second half of a period, the open phase. In the right part of the same figure we show the effect of a bandwidth increase from 50 Hz during the closed phase to 450 Hz during the open phase for a one-formant vowel. As before, the increase and decrease occur during the first and last one tenth of the open phase interval, as is shown in the top right panel. The bottom right panel shows the corresponding synthesised sound.
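The contour in the top panels of the figure can be approximated with a small Python sketch. It assumes, following the description above, that each period starts with the closed phase and ends with the open phase, and that the delta value is ramped in and out linearly over one tenth of the open phase. The linear ramp shape and the name `effective_value` are assumptions for this sketch; the text only states the ramp duration.

```python
def effective_value(t, base=500.0, delta=500.0, pitch=100.0,
                    open_phase=0.5, ramp_fraction=0.1):
    """Formant frequency (or bandwidth) at time t with open-phase coupling.

    Hypothetical model: closed phase first, open phase last in each period,
    linear ramps over `ramp_fraction` of the open phase.
    """
    period = 1.0 / pitch
    phase = (t % period) / period            # position within the period, 0..1
    closed = 1.0 - open_phase
    if phase < closed:                       # glottis closed: tier value only
        return base
    rel = (phase - closed) / open_phase      # position within the open phase, 0..1
    if rel < ramp_fraction:                  # ramp up after opening
        weight = rel / ramp_fraction
    elif rel > 1.0 - ramp_fraction:          # ramp down before closure
        weight = (1.0 - rel) / ramp_fraction
    else:
        weight = 1.0
    return base + delta * weight


# Left panels: formant frequency 500 Hz with a 500 Hz delta.
print(effective_value(0.002), effective_value(0.0075))
# Right panels: bandwidth 50 Hz with a 400 Hz delta.
print(effective_value(0.0075, base=50.0, delta=400.0))
```

At 100 Hz pitch and an open phase of 0.5, each ramp lasts 0.5 ms, which is why the transitions in the figure look almost, but not quite, instantaneous.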
2.4 The Frication Part
The frication part is an independent section of the synthesiser that makes it possible to add frication noise completely independently of the phonation and vocal tract parts. The frication sound is added to the output of the vocal tract part. A layout of the frication part is shown at the bottom of Fig. 5.1 in the dotted box labeled 4. The following tiers specify the frication sound:
- Frication amplitude tier: This tier regulates the maximum amplitude of the noise source in dB before any filtering takes place. In Fig. 5.1 this part is represented by the rectangle labeled “Frication noise”. This noise source is uniformly distributed random noise.
- Formant frequency and bandwidth tiers: To shape the noise spectrum a number of parallel formant filters are available whose frequencies, bandwidths and amplitudes can be specified. In the figure we have limited the number of formants to five but in principle this number is not limited at all.
- Formant amplitude tiers: Each formant is governed by a separate amplitude tier with values in dB. These formant amplitudes act like multipliers and may amplify or attenuate the formant filter input. For formant amplitudes 0 dB means an amplification of 1. Formants can be increased by giving positive dB values and decreased by giving negative values.
- Bypass tier: The bypass tier regulates the amplitude of the noise that bypasses the formant filters. This noise is added straight from the noise source to the output of the formant filters. The amplitude is in dB, where 0 dB means a multiplier of 1.
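The way the formant amplitudes and the bypass amplitude act as multipliers can be written out as follows. The sketch assumes the usual amplitude convention of 20·log10 for converting dB to a linear factor, which is consistent with 0 dB meaning a multiplier of 1. The function names are hypothetical, and the formant filtering itself is taken as given.

```python
def db_to_gain(db):
    """Convert an amplitude in dB to a linear multiplier (0 dB -> 1)."""
    return 10.0 ** (db / 20.0)


def frication_sample(noise_sample, formant_outputs, formant_amplitudes_db, bypass_db):
    """Combine one output sample of the frication part.

    `formant_outputs` holds the outputs of the parallel formant filters for
    unscaled noise input. For a constant gain and a linear filter, scaling
    the filter input is equivalent to scaling its output, which is what is
    done here.
    """
    filtered = sum(
        db_to_gain(amp_db) * y
        for y, amp_db in zip(formant_outputs, formant_amplitudes_db)
    )
    return filtered + db_to_gain(bypass_db) * noise_sample


# Two parallel formants, the second attenuated by 20 dB, plus bypassed noise.
print(frication_sample(0.5, [1.0, 2.0], [0.0, -20.0], 0.0))
```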
2.5 A KlattGrid Scripting Example
As the preceding sections have shown, the KlattGrid has a large number of parameters. It is difficult to get to grips with the many ways of changing a synthesiser’s sound output. To facilitate experimenting with parameter settings, the user interface has been designed to make it easy to selectively include or exclude, in the final sound generation process, some of the parameter tiers that you have given values. For example, if the breathiness amplitude has been defined, hearing the resulting sound with or without breathiness is simply achieved by a selection or deselection of the breathiness tier option in the form that regulates this special playing mode of the KlattGrid synthesiser. The same holds true for the phonation part of the synthesiser whose sound output can be generated separately with some of its parameter tiers selectively turned on or off.
As an example of the synthesiser's interface we show a simple script to synthesise a diphthong. This script can be run in Praat's script editor. The first line of the script creates a new KlattGrid, named “kg”, with start and end times of 0 and 0.3 s, respectively. The rest of the parameters on this line specify the number of filters to be used in the vocal tract part, the coupling part and the frication part and are not especially important for now (additional filters can always be added to a KlattGrid afterwards).
The second line defines a pitch point of 120 Hz at time 0.1 s. The next line defines a voicing amplitude of 90 dB at time 0.1 s. Because we keep voicing and pitch constant in this example, the exact times for these points are not important, as long as they are within the domain on which the kg KlattGrid is defined. With the pitch and voicing amplitude defined, there is enough information in the KlattGrid to produce a sound and we can now Play the KlattGrid (line 4). For 300 ms you will hear the sound as produced by the glottal source alone. This sound would normally be filtered by a vocal tract filter, but we have not defined the vocal tract filter yet (in this case the vocal tract part will not modify the phonation sound).
In lines 5 and 6 we add a first oral formant with a frequency of 800 Hz at time 0.1 s, and a bandwidth of 50 Hz also at time 0.1 s. The next two lines add a second oral formant at 1,200 Hz with a bandwidth of 50 Hz. If you now play the KlattGrid (line 9), it will sound like the vowel /a/, with a constant pitch of 120 Hz. Lines 10 and 11 add some dynamics to this sound: at time 0.3 s the first and second formant frequencies are set to 350 and 600 Hz, the values of the vowel /u/. The bandwidths have not changed and stay constant at the values that were defined in lines 6 and 8. In the interval between times 0.1 and 0.3 s, formant frequency values will be interpolated. The result will now sound approximately like an /au/ diphthong.
This script shows that with only a few commands we can already create interesting sounds.

